jobloom


infrastructure

Posted 3 hours ago

Staff Software Engineer - Infrastructure Storage

at Lambdalabs

Remote

Responsibilities

  • Author and review design documents for new storage systems, protocols, and integrations; raise the technical bar across the team.
  • Implement and optimize storage protocol APIs across file (NFS, SMB, Lustre), block (NVMe-oF, iSCSI, Fibre Channel), and object (S3) access patterns.
  • Develop distributed systems for managing and orchestrating storage resources across multiple solutions and redundant arrays.
  • Collaborate with hardware and system architects to integrate software with storage solutions including NVMe, GPU-direct storage, and DPU-accelerated data paths.
  • Troubleshoot and resolve complex issues in production data center environments, including performance regressions, protocol mismatches, and hardware failures.
  • Build and maintain tooling for storage benchmarking, performance profiling, and capacity planning.
  • Evaluate and prototype new storage solutions, protocols, and hardware integrations, from open-source distributed filesystems to vendor-specific accelerated storage products.

Requirements

  • Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers.
  • Our customers range from AI researchers to enterprises and hyperscalers.
  • If you'd like to build the world's best AI cloud, join us.
  • Note: This position requires presence in our San Francisco/San Jose/Bellevue office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.
  • In the world of distributed AI, raw GPU and CPU horsepower is just a part of the story.
  • High-performance networking and storage are the critical components that enable and unite these systems, making groundbreaking AI training and inference possible.
  • The Lambda Infrastructure Engineering organization forges the foundation of high-performance AI clusters by welding together the latest in AI storage, networking, GPU and CPU hardware.
  • Our expertise lies at the intersection of:
      - High-Performance Distributed Storage Solutions and Protocols: We engineer the protocols and systems that serve massive datasets at the speeds demanded by modern clustered GPUs.
      - Dynamic Networking: We design advanced networks that provide multi-tenant security and intelligent routing without compromising performance, using the latest in AI networking hardware.
      - Compute Virtualization: We enable cutting-edge virtualization and clustering that allow AI researchers and engineers to focus on AI workloads, not AI infrastructure, unleashing the full compute bandwidth of clustered GPUs.
  • experience designing and deploying storage protocol solutions at scale across object, block, and file paradigms.
  • This is a unique opportunity to work at the intersection of large-scale distributed systems and the rapidly evolving field of artificial intelligence infrastructure.
  • This is an opportunity to have a significant impact on the future of AI.
  • You will be building the foundational infrastructure that powers some of the most advanced AI research and products in the world.
  • Partner with the control plane and Kubernetes teams to meet customer and product requirements for usability, reliability, and telemetry.
  • Coordinate with Networking, Compute, and Storage Engineering teams to deploy high-performance distributed storage solutions that serve AI/ML workloads.
  • Innovate:
      - Stay current with the latest research and developments in AI and HPC storage technologies, and bring relevant advances into Lambda's infrastructure.
      - Work with the Lambda product team to identify emerging trends in AI inference and training that will shape next-generation storage requirements.
      - Optimize storage protocol solutions for AI workloads, including checkpoint I/O for training, high-throughput dataset serving, and latency-sensitive inference pipelines.
  • You Have:
  • Experience:
      - 10+ years of experience in storage systems engineering, with at least 5 years in a technical lead or Staff+ IC role.
      - Proven track record designing and operating storage infrastructure at scale (multi-petabyte environments preferred) in production data center or cloud settings.
      - Experience leading technical projects end-to-end, from architecture through delivery with cross-functional stakeholders.
      - Background working in high-performance computing, AI/ML infrastructure, or large-scale cloud storage environments.
  • Systems-Level Programming:
      - Strong proficiency in one or more low-level systems programming languages: C, C++, Rust, or Go.
      - Demonstrated ability to write high-performance, concurrent, production-grade systems code and conduct thorough code reviews.
      - Experience with kernel-level storage drivers, user-space I/O frameworks, or storage daemon development is a strong plus.
      - Familiarity with DPDK and SPDK and their role in building high-performance, kernel-bypass storage and networking data paths.
  • Storage Protocol & API Expertise:
      - Deep hands-on experience with two or more storage protocols: object (S3 or similar), block (iSCSI, Fibre Channel, NVMe-oF), or file (NFS, SMB, Lustre, DAOS).
      - Experience implementing or maintaining storage protocol servers or clients in production, not just consuming them.
      - Familiarity with storage API performance characteristics such as latency, throughput, and IOPS, and the ability to diagnose and resolve bottlenecks at the protocol level.
  • Storage Performance Optimization:
      - Familiarity with tools such as fio, blktrace, perf, eBPF/bpftrace, or equivalent for storage performance analysis.
      - Understanding of I/O scheduling, caching layers, write amplification, and related performance tradeoffs.
  • Modern Storage Technologies:
      - Familiarity with NVMe, NVMe-oF, and RDMA (RoCE or InfiniBand) and their impact on storage system architecture.
      - Working knowledge of DPUs (e.g., NVIDIA BlueField) and their role in offloading storage and networking data paths.
      - Experience with GPU-direct storage or similar zero-copy data paths is a plus.
  • Physical Infrastructure & Operational Acumen:
      - Comfort working in a physical data center environment: understanding rack-scale infrastructure, storage array hardware, cabling, and failure domains.
      - Experience building and operating storage systems with strong reliability expectations: designing for failure, building runbooks, and driving incident response.
      - Familiarity with storage observability tooling: metrics pipelines (Prometheus, Grafana), log aggregation, and tracing in distributed storage environments.
  • Nice to Have:
      - Experience with NVIDIA BlueField DPUs or SuperNICs for accelerated storage data paths, including GPUDirect Storage implementation.
      - Deep production experience with enterprise or HPC storage platforms: Vast Data, Weka, NetApp, or IBM Spectrum Scale.
      - Experience deploying and operating Ceph at scale (100PB+) in an HPC or AI infrastructure environment.
      - Familiarity with emerging storage technologies such as CXL memory pooling, computational storage, or ZNS (Zoned Namespace) SSDs.
      - Experience contributing to or maintaining open-source storage projects (e.g., Ceph, DAOS, Lustre, MinIO).
  • About Lambda:
      - Founded in 2012, with 500+ employees, and growing fast
      - Our investors notably include TWG Global, US Innovative Technology Fund (USIT), Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove
      - We have research papers accepted at top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG

Experience

  • 10+ years of experience in storage systems engineering, with at least 5 years in a technical lead or Staff+ IC role.

Benefits

  • Salary Range Information: The annual salary range for this position has been set based on market data and other factors.
  • However, a salary higher or lower than this range may be appropriate for a candidate whose

Additional details

  • Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence.
  • What You’ll Do
  • Technical Leadership:
      - Set technical direction for storage software architecture across the Infrastructure Engineering organization, influencing decisions that span petabyte-scale deployments.
      - Serve as a technical anchor for cross-functional initiatives involving storage, networking, compute, and control plane teams.
      - Represent the storage software team in architectural reviews, roadmap planning, and customer-facing technical discussions where needed.
  • Execution:
      - Design, develop, and maintain high-performance storage systems software with a focus on performance, scalability, reliability, and operational simplicity.
      - Contribute across the full software development lifecycle, from requirements gathering and system design through deployment, monitoring, and long-term maintenance.
  • Collaboration:
      - Work closely with storage software and networking teams to execute cross-functional infrastructure initiatives and new data center deployments, including integration of storage protocols across a variety of on-prem solutions.
      - Partner with the control plane and Kubernetes teams to meet customer and product requirements for usability, reliability, and telemetry.
      - Work with the observability team to define, build, and track SLOs/SLIs for storage systems.
