jobloom

infrastructure

Posted 3 hours ago

Senior Software Engineer - Infrastructure Storage

at Lambdalabs

Remote

Responsibilities

  • Implement and optimize storage protocol APIs for file (e.g., NFS, SMB), block (e.g., Fibre Channel), and object (e.g., S3) access.
  • Develop distributed systems for managing and orchestrating storage resources across multiple storage solutions and redundant arrays.
  • Collaborate with hardware and system architects to integrate software with various storage solutions, including NVMe and GPU-direct storage.

Requirements

  • Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers.
  • Our customers range from AI researchers to enterprises and hyperscalers.
  • If you'd like to build the world's best AI cloud, join us.
  • Note: This position requires presence in our San Francisco office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.
  • In the world of distributed AI, raw GPU and CPU horsepower is just a part of the story.
  • High-performance networking and storage are the critical components that enable and unite these systems, making groundbreaking AI training and inference possible.
  • The Lambda Infrastructure Engineering organization forges the foundation of high-performance AI clusters by welding together the latest in AI storage, networking, GPU and CPU hardware.
  • Our expertise lies at the intersection of:
  • High-Performance Distributed Storage Solutions and Protocols: We engineer the protocols and systems that serve massive datasets at the speeds demanded by modern clustered GPUs.
  • Dynamic Networking: We design advanced networks that provide multi-tenant security and intelligent routing without compromising performance, using the latest in AI networking hardware.
  • Compute Virtualization: We enable cutting-edge virtualization and clustering that allow AI researchers and engineers to focus on AI workloads, not AI infrastructure, unleashing the full compute bandwidth of clustered GPUs.
  • Experience designing and deploying various storage protocol solutions at scale (object, block, and file).
  • This is a unique opportunity to work at the intersection of large-scale distributed systems and the rapidly evolving field of artificial intelligence infrastructure.
  • This is an opportunity to have a significant impact on the future of AI.
  • You will be building the foundational infrastructure that powers some of the most advanced AI research and products in the world.
  • Work closely with Networking, Compute, and Storage Software Engineering teams to deploy high-performance distributed storage solutions to serve AI/ML workloads.
  • Innovate:
  • Stay current with the latest trends and research into AI and HPC storage technologies.
  • Work with the Lambda product team to uncover new trends in the AI inference and training product category that will inform emerging storage solutions.
  • Optimize protocol solutions for the AI product vertical, exploring optimizations for AI inference, training, and scientific computing applications.
  • You:
  • Experience:
  • 10+ years of experience in storage engineering, with 5+ years in a management or lead role.
  • Systems-Level Programming and Architecture
  • Storage Protocol and API Mastery
  • Storage Performance Optimization
  • DPDK, SPDK
  • Physical Infrastructure Knowledge
  • Operational Acumen
  • Technical Skills:
  • Experience in serving one or more of the following storage protocols: object storage (e.g., S3), block storage (e.g., iSCSI), or file storage (e.g., NFS, SMB, Lustre).
  • Professional individual contributor experience as a storage engineer or storage SRE.
  • Familiarity with modern storage technologies (e.g., NVMe, RDMA, DPUs) and their role in optimizing performance.
  • People Management:
  • Experience building a high-performance team through deliberate hiring, upskilling, planned skills redundancy, performance management, and expectation setting.
  • Nice to Have:
  • Experience:
  • Experience driving cross-functional engineering management initiatives (coordinating events, strategic planning, coordinating large projects).
  • Experience with NVIDIA SuperNIC DPUs for edge-caching (such as implementing GPUDirect Storage).
  • Technical Skills:
  • Deep experience with Vast, Weka, and/or NetApp in an HPC or AI infrastructure environment.
  • Deep experience implementing Ceph in an HPC or AI infrastructure environment at a scale greater than 100 PB.
  • People Management:
  • About Lambda:
  • Founded in 2012, with 500+ employees, and growing fast.
  • Our investors notably include TWG Global, US Innovative Technology Fund (USIT), Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove.
  • We have research papers accepted at top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG.

Experience

  • 10+ years of experience in storage engineering.

Benefits

  • Salary Range Information: The annual salary range for this position has been set based on market data and other factors.
  • However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

Additional details

  • Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence.
  • What You’ll Do:
  • Technical Leadership:
  • At the Senior Level
  • Execution:
  • Systems-Level Programming and Architecture
  • Design, develop, and maintain software for storage systems, focusing on performance, scalability, and reliability.
  • Contribute to the full software development lifecycle, from requirements gathering and design to deployment and maintenance.
  • Collaboration:
  • Work closely with the storage software teams and networking teams to execute on cross-functional infrastructure initiatives and new data-center deployments, including integration of storage protocols across a variety of on-prem storage solutions.
  • Work closely with the control plane and MK8s teams to meet customer/product requirements for usability, reliability, and telemetry.
  • Work with the observability team to build/track SLOs/SLIs.
  • Partner with the fleet engineering team to ensure seamless deployment, monitoring, and maintenance of the distributed storage solutions.
  • We are committed to building a team with a variety of backgrounds, experiences, and skills.
  • Equal Opportunity Employer: Lambda is an Equal Opportunity employer.
