engineering
Posted 5 days ago
HPC Operations Engineer
at Lambdalabs
United States
Remote
Requirements
- Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers.
- Our customers range from AI researchers to enterprises and hyperscalers.
- If you'd like to build the world's best AI cloud, join us. *Note: This position requires presence in our San Francisco/San Jose/Bellevue office location 4 days per week; Lambda’s designated work from home day is currently Tuesday.
- Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance.
What You’ll Do
- Remotely deploy and configure large-scale HPC clusters for AI workloads (up to many thousands of nodes)
- Remotely install and configure operating systems, firmware, software, and networking on HPC clusters, both manually and using automation tools
- Troubleshoot and resolve HPC cluster issues, working closely with physical deployment teams on-site
- Provide clear and detailed requirements back to other engineering teams on gaps and improvement areas, specifically in the areas of simplification, stability, and operational efficiency
- Contribute to the creation and maintenance of Standard Operating Procedures
- Provide regular and well-communicated updates to project leads throughout each deployment
- Mentor and assist less experienced team members
- Stay up-to-date on the latest HPC/AI technologies and best practices
You
- Are a deeply experienced HPC engineer comfortable
- experience in deploying and configuring HPC clusters for AI workloads
- Have an innate attention to detail
- Are an expert in configuring and troubleshooting:
  - SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics
  - Ethernet, switching, power infrastructure, GPUDirect, RDMA, NCCL, Horovod environments
  - Linux-based compute nodes, firmware updates, driver installation
  - SLURM, Kubernetes, or other job scheduling systems
- Work well under deadlines and structured project plans, also knowing when and how
- Experience with machine learning and deep learning frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf)
- Experience with containerization technologies (Docker, Kubernetes)
- Experience working with the technologies that underpin our cloud business (GPU acceleration, virtualization, and cloud computing)
- Keen situational awareness in customer situations, employing diplomacy and tact
- Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work
About Lambda
- Founded in 2012, with 500+ employees, and growing fast
- Our investors notably include TWG Global, US Innovative Technology Fund (USIT), Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove
- We have research papers accepted at top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Our