jobloom

infrastructure

Posted Mar 9

Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform)

at Deepgram

United States, Hybrid

Responsibilities

  • Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
  • Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
  • Automate the lifecycle of single-tenant, managed deployments.
  • You’ll love this role if you are passionate about building platforms that empower developers and researchers.

Requirements

  • Company overview: Deepgram is the leading platform underpinning the emerging trillion-dollar Voice AI economy, providing real-time APIs for speech-to-text (STT), text-to-speech (TTS), and building production-grade voice agents at scale.
  • Company operating rhythm: At Deepgram, we expect an AI-first mindset. AI use and comfort aren’t optional; they’re core to how we operate, innovate, and measure performance.
  • Every team member at Deepgram is expected to actively use and experiment with advanced AI tools, and even to build their own into their everyday work.
  • We measure how effectively AI is applied to deliver results, and consistent, creative use of the latest AI capabilities is key to success here.
  • Candidates should be comfortable adopting new models and modes quickly, integrating AI into their workflows, and continuously pushing the boundaries of what these technologies can do.
  • Additionally, we move at the pace of AI.
  • Opportunity: We're looking for an experienced Site Reliability Engineer to build and operate the hybrid infrastructure foundation for our advanced AI/ML research and product development.
  • You'll architect, build, and run the platform spanning AWS and our bare metal data centers, empowering our teams to train and deploy complex models at scale.
  • This role is focused on creating a robust, self-service environment using Kubernetes, AWS, and Infrastructure-as-Code (Terraform), and orchestrating high-demand GPU workloads using schedulers like Slurm.
  • What you’ll do: Architect and maintain our core computing platform using Kubernetes on AWS and on-premises, providing a stable, scalable environment for all applications and services.
  • Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
  • Provision, manage, and maintain our on-premises bare metal server infrastructure for high-performance GPU computing.
  • Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
  • You are excited to work at the intersection of modern platform engineering and cutting-edge AI.
  • Proven, hands-on experience building and managing production infrastructure with Terraform.
  • Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment.
  • Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads.
  • Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management.
  • Strong scripting and automation skills (e.g., Python, Go, Bash).
  • Experience with CI/CD systems (e.g., GitLab CI, Jenkins, ArgoCD) and building developer tooling.
  • Familiarity with FinOps principles and cloud cost optimization strategies.
  • Knowledge of Kubernetes networking (e.g., Calico, Cilium) and storage (e.g., Ceph, Rook) solutions.
  • Experience in a multi-region or hybrid cloud environment.
  • If you're looking to work on cutting-edge technology and make a significant impact in the AI industry, we'd love to hear from you! Deepgram is an equal opportunity employer.

Experience

  • It’s important to us that you have 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).

Benefits

  • Holistic health: medical, dental, and vision benefits; annual wellness stipend; mental health support; life, STD, and LTD income insurance plans.
  • Work/life blend: unlimited PTO; parental leave; flexible schedule; 12 paid US company holidays; quarterly personal productivity stipend; one-time stipend for home office upgrades; 401(k) plan with company match; tax savings programs.
  • Continuous learning: learning/education stipend; participation in talks and conferences; Employee Resource Groups; AI enablement workshops and sessions.
  • *For candidates
  • Backed by prominent investors including Y Combinator, Madrona, Tiger Global, Wing VC and NVIDIA, Deepgram has raised over $215M in total funding.

Additional details

  • More than 200,000 developers and 1,300+ organizations build voice offerings that are ‘Powered by Deepgram’, including Twilio, Cloudflare, Sierra, Decagon, Vapi, Daily, Cresta, Granola, and Jack in the Box.
  • Deepgram’s voice-native foundation models are accessed through cloud APIs or as self-hosted and on-premises software, with unmatched accuracy, low latency, and cost efficiency.
  • Backed by a recent Series C led by leading global investors and strategic partners, Deepgram has processed over 50,000 years of audio and transcribed more than 1 trillion words.
  • There is no organization in the world that understands voice better than Deepgram.
  • Change is rapid, and you can expect your day-to-day work to evolve just as quickly.
  • This may not be the right role if you’re not excited to experiment, adapt, think on your feet, and learn constantly, or if you’re seeking something highly prescriptive with a traditional 9-to-5.
  • Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
  • You’ll love this role if you enjoy creating elegant, automated solutions for complex infrastructure challenges in both cloud and data center environments.
  • You thrive on optimizing hybrid infrastructure for performance, cost, and reliability.
  • You love to treat infrastructure as a product, continuously improving the developer experience.
