- Develop embedding systems that cleanly factorize the codec latent space into interpretable dimensions of speaker, content, style, environment, and channel effects -- enabling precise control over each aspect and the ability to massively amplify an existing seed dataset through “latent recombination”.
- Design model architectures, training schemes, and inference algorithms that are adapted to the hardware at the bare-metal level, enabling cost-efficient training on billion-hour datasets and powering real-time inference for hundreds of millions of concurrent conversations.
Requirements
COMPANY OVERVIEW Deepgram is the leading platform underpinning the emerging trillion-dollar Voice AI economy, providing real-time APIs for speech-to-text (STT), text-to-speech (TTS), and building production-grade voice agents at scale.
COMPANY OPERATING RHYTHM At Deepgram, we expect an AI-first mindset—AI use and comfort aren’t optional, they’re core to how we operate, innovate, and measure performance.
Every team member at Deepgram is expected to actively use and experiment with advanced AI tools, and even to build their own, in their everyday work.
We measure how effectively AI is applied to deliver results, and consistent, creative use of the latest AI capabilities is key to success here.
Candidates should be comfortable adopting new models and modes quickly, integrating AI into their workflows, and continuously pushing the boundaries of what these technologies can do.
Additionally, we move at the pace of AI.
However, current sequence modeling paradigms based on jointly scaling model and data cannot deliver voice AI capable of universal human interaction.
We believe that entirely new paradigms for audio AI are needed to overcome these challenges and make voice interaction accessible to everyone.
THE ROLE As a Member of the Research Staff, you will pioneer the development of Latent Space Models (LSMs), a new approach that aims to solve the fundamental data, scale, and cost challenges associated with building robust, contextualized voice AI.
THE CHALLENGE We are seeking researchers who:
- See "unsolved" problems as opportunities to pioneer entirely new approaches
- Can identify the one critical experiment that will validate or kill an idea in days, not months
- Have the vision to scale successful proofs-of-concept 100x
- Are obsessed with using AI to automate and amplify their own impact
If you find yourself energized rather than daunted by these expectations -- if you're already thinking about five ideas to try while reading this -- you might be the
IT'S IMPORTANT TO US THAT YOU HAVE
- Strong mathematical foundation in statistical learning theory, particularly in areas relevant to self-supervised and multimodal learning
- Deep expertise in foundation model architectures, with an understanding of how to scale training across multiple modalities
- Proven ability to bridge theory and practice -- someone who can both derive novel mathematical formulations and implement them efficiently
- Demonstrated ability to build data pipelines that can process and
- Experience optimizing models for real-world deployment, including knowledge of hardware constraints and efficiency techniques
- History of open-source contributions or research publications that have advanced the state of the art in speech/language AI
HOW WE GENERATED THIS JOB DESCRIPTION This job description was generated in two parts.
The “It’s Important to Us” section was automatically derived from a multi-stage LLM analysis (using o1) of key foundational deep learning papers related to our research goals.
The LLM analysis culminates in an “Ideal Researcher Profile”, which is reproduced below along with the list of foundational papers.
STATISTICAL & MATHEMATICAL FOUNDATIONS
Mastery of Core Concepts: Many papers, like Scaling Laws for Neural Language Models and Neural Discrete Representation Learning (VQ-VAE), reflect the importance of power-law analyses, derivation of novel losses, or adaptation of fundamental equations (e.g., VQ-VAE's commitment loss or the rectified flows in Scaling Rectified Flow Transformers).
Combining Existing Theories in Novel Ways: Papers such as Moshi (combining text modeling, audio codecs, and hierarchical generative modeling) and Finite Scalar Quantization (FSQ's adaptation of classic scalar quantization to replace vector-quantized representations) show how reusing but reimagining known techniques can yield breakthroughs.
Turning Theory into Practice: Whether it's direct preference optimization (DPO) for alignment in phi-3 or residual vector quantization in SoundStream, these works show that bridging design insights with implementable prototypes is essential.
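To make the residual vector quantization idea mentioned above concrete, here is a minimal sketch in plain Python. It is an illustration only: the tiny hand-written codebooks and the function names (`nearest`, `rvq_encode`, `rvq_decode`) are assumptions made for this example and are not taken from SoundStream, whose codebooks are learned jointly with the codec. Each stage quantizes whatever residual the previous stage left behind, and the decoder sums the selected codewords.

```python
# Hypothetical two-stage codebooks (a real codec learns these during training).
CODEBOOKS = [
    # Stage 1: coarse codewords.
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]],
    # Stage 2: finer codewords applied to the stage-1 residual.
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]],
]


def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def sq_dist(codeword):
        return sum((c - v) ** 2 for c, v in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen codeword so the next stage sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


if __name__ == "__main__":
    codes = rvq_encode([1.2, 0.9], CODEBOOKS)
    print(codes, rvq_decode(codes, CODEBOOKS))
```

In this toy setup the input `[1.2, 0.9]` is first matched to the coarse codeword `[1.0, 1.0]`, and the second stage then refines the leftover `[0.2, -0.1]`. Adding stages in this way raises the bit rate in small steps while reducing reconstruction error.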
Clear Impact Through Prototypes & Open-Source: Many references (Whisper, Neural Discrete Representation Learning, Mamba-2) highlight releasing code or pretrained models, enabling the broader community to replicate and build upon new methods.
DATA-DRIVEN & SCALABLE SYSTEMS
Emphasis on Large-Scale Data and Efficient Pipelines: Papers such as Robust Speech Recognition via Large-Scale Weak Supervision (Whisper) and BASE TTS demonstrate that collecting and processing hundreds of thousands of hours of real-world audio can unlock new capabilities in zero-shot or low-resource domains.
Whisper trains on multilingual tasks, BASE TTS uses subsets/stages for pretraining on speech tokens, and phi-3 deploys multiple training phases (web data, then synthetic data).
HARDWARE & SYSTEMS UNDERSTANDING
Efficient Implementations at Scale: Many works illustrate how researchers tune architectures for modern accelerators: the In-Datacenter TPU paper exemplifies domain-specific hardware design for dense matrix multiplications, while phi-3 leverages blocksparse attention and custom Triton kernels to run advanced LLMs on resource-limited devices.
Real-Time & On-Device Constraints: SoundStream shows how to compress audio in real time on a smartphone CPU, demonstrating that knowledge of hardware constraints (latency, limited memory) drives design choices.
Similarly, Moshi's low-latency streaming TTS and phi-3-mini's phone-based inference highlight that an ideal researcher must adapt algorithms to resource limits while maintaining robustness.
Architectural & Optimization Details: Papers like Transformers are SSMs (Mamba-2) and the In-Datacenter TPU work show how exploiting specialized matrix decomposition, custom memory hierarchies, or quantization approaches can lead to breakthroughs in speed or energy efficiency.
Multifold Evaluation Metrics: From MUSHRA listening tests (SoundStream, BASE TTS) to FID in image synthesis (Scaling Rectified Flow Transformers, FSQ) to perplexity or zero-shot generalization in language (phi-3, Scaling Laws for Neural Language Models), the works demonstrate the value of comprehensive, carefully chosen metrics.
Stress Tests & Edge Cases: Whisper's out-of-distribution speech benchmarks, SoundStream's evaluation on speech + music, and Mamba-2's performance on multi-query associative recall demonstrate the importance of specialized challenge sets.
SUMMARY Overall, an ideal researcher in deep learning consistently demonstrates:
- A solid grounding in theoretical and statistical principles
- A talent for proposing and validating new algorithmic solutions
- The capacity to orchestrate data pipelines that scale and reflect real-world diversity
- Awareness of hardware constraints and system-level trade-offs for efficiency
- Thorough and transparent experimental practices
These qualities surface across research on speech (Whisper, BASE TTS), language
If you're looking to work on cutting-edge technology and make a significant impact in the AI industry, we'd love to hear from you! Deepgram is an equal opportunity employer.
Benefits
HOLISTIC HEALTH
- Medical, dental, vision benefits
- Annual wellness stipend
- Mental health support
- Life, STD, LTD Income Insurance Plans
WORK/LIFE BLEND
- Unlimited PTO
- Parental leave
- Flexible schedule
- 12 Paid US company holidays
- Quarterly personal productivity stipend
- One-time stipend for home office upgrades
- 401(k) plan with company match
- Tax Savings Programs
CONTINUOUS LEARNING
- Learning / Education stipend
- Participation in talks and conferences
- Employee Resource Groups
- AI enablement workshops / sessions
*For candidates
Backed by prominent investors including Y Combinator, Madrona, Tiger Global, Wing VC and NVIDIA, Deepgram has raised over $215M in total funding.
Additional details
More than 200,000 developers and 1,300+ organizations build voice offerings that are ‘Powered by Deepgram’, including Twilio, Cloudflare, Sierra, Decagon, Vapi, Daily, Cresta, Granola, and Jack in the Box.
Deepgram’s voice-native foundation models are accessed through cloud APIs or as self-hosted and on-premises software, with unmatched accuracy, low latency, and cost efficiency.
Backed by a recent Series C led by leading global investors and strategic partners, Deepgram has processed over 50,000 years of audio and transcribed more than 1 trillion words.
There is no organization in the world that understands voice better than Deepgram.
Change is rapid, and you can expect your day-to-day work to evolve just as quickly.
This may not be the right role if you’re not excited to experiment, adapt, think on your feet, and learn constantly, or if you’re seeking something highly prescriptive with a traditional 9-to-5.
THE OPPORTUNITY Voice is the most natural modality for human interaction with machines.
The challenges are rooted in fundamental data problems posed by audio: real-world audio data is scarce and enormously diverse, spanning a vast space of voices, speaking styles, and acoustic conditions.
Even if billions of hours of audio were accessible, its inherent high dimensionality creates computational and storage costs that make training and deployment prohibitively expensive at world scale.
Your research will focus on solving one or more of the following problems:
- Build next-generation neural audio codecs that achieve extremely low bit-rate compression and high-fidelity reconstruction across a world-scale corpus of general audio.