Posted Apr 20
Machine Learning Researcher, Audio
at Bland AI
San Francisco, United States
On-site
Responsibilities
WHAT YOU WILL DO
- Build and Scale Next-Generation TTS Systems
  - Design and train large-scale text-to-speech models capable of expressive, controllable, human-sounding output.
  - Develop neural audio codec-based TTS architectures for efficient, high-fidelity generation.
  - Improve prosody modeling, question inflection, emotional expression, and multi-speaker robustness.
  - Optimize for real-time, low-latency inference in production.
- Advance Speech-to-Text Modeling
  - Build and fine-tune large-scale ASR systems robust to accents, noise, telephony artifacts, and code-switching.
  - Leverage self-supervised pretraining and large-scale weak supervision.
  - Improve transcription accuracy for real-world enterprise scenarios, including structured extraction and conversational nuance.
- Pioneer Neural Audio Codecs
  - Research and implement neural audio codecs that achieve extreme compression with minimal perceptual loss.
  - Explore discrete and continuous latent representations for scalable speech modeling.
  - Design codec architectures that enable downstream generative modeling and controllable synthesis.
- Develop Scalable Training Pipelines
  - Curate and process massive audio datasets across languages, speakers, and environments.
  - Design staged training curricula and data filtering strategies.
  - Scale training across distributed GPU clusters, focusing on cost, throughput, and reliability.
- Run Rigorous Experiments
  - Design ablation studies that isolate the impact of architectural changes.
  - Measure improvements using both objective metrics and perceptual evaluations.
  - Validate ideas quickly through focused experiments that confirm or eliminate hypotheses.
Requirements
- MACHINE LEARNING RESEARCHER, AUDIO
  - Location: San Francisco, CA or Remote (US)
- ABOUT BLAND
  - At Bland.com, our mission is to empower enterprises to build AI phone agents at scale.
- THE ROLE: MACHINE LEARNING RESEARCHER, AUDIO
  - As a Machine Learning Researcher at Bland, you'll be working on foundational research and development across the core components of our voice stack: speech-to-text, large language models, neural audio codecs, and text-to-speech.
- You will take ideas from theory to large-scale training to production inference systems serving millions of calls per day.
- Experience with self-supervised learning, multimodal modeling, or generative modeling.
- Ability to derive new formulations and implement them efficiently.
- Expertise in Voice Modeling
  - Hands-on experience building or scaling TTS, STT, or neural audio codec systems.
  - Familiarity with large-scale speech datasets and real-world audio variability.
  - Strong intuition for audio quality, prosody, and conversational dynamics.
- Experience training and serving large models on modern accelerators.
- Knowledge of inference optimization techniques, including quantization, kernel optimization, and memory efficiency.
- Understanding of real-time constraints in telephony or streaming environments.
- BONUS POINTS
  - Experience with large-scale distributed training.
  - Research publications or open-source contributions in speech or language AI.
  - Background in real-time speech systems or telephony.
  - PhD in ML, AI, or a related field, or equivalent research impact.
Benefits
- HOW YOU SHOW UP
  - You treat unsolved problems as opportunities to invent new paradigms.
  - You identify the single experiment that can validate an idea in days, not months.
  - You measure everything and let data drive decisions.
  - You are obsessed with making voice agents sound truly human.
  - You use AI tools aggressively to amplify your own impact and accelerate research cycles.
- BENEFITS AND COMPENSATION
  - Healthcare, dental, vision, all the good stuff
  - Meaningful equity in a fast-growing company
  - Every tool you need to succeed
  - Beautiful office in Jackson Square, SF with rooftop views
  - Competitive salary: $160,000 to $250,000
- If you are energized by building and scaling TTS models, pioneering neural audio codecs, and pushing the boundaries of speech-to-text systems, we would love to hear from you.
Additional details
- Based in San Francisco, we are a fast-growing team reimagining how customers interact with businesses through voice.
- We have raised $65 million from leading Silicon Valley investors, including Emergence Capital, Scale Venture Partners, Y Combinator, and founders of Twilio, Affirm, and ElevenLabs.
- Voice is quickly becoming the primary interface between businesses and their customers.
- We are building the models and infrastructure that make those interactions feel natural, reliable, and genuinely human.
- Your work will define how our agents understand, reason, and speak in real time at enterprise scale.
- You will design new modeling approaches, validate them with rigorous experimentation, and collaborate with engineering teams to deploy them into real customer environments.
- WHAT MAKES YOU A GREAT FIT
- Deep Research Foundations
- Experimental Rigor
  - Track record of designing controlled experiments and meaningful ablations.
  - Comfortable working with both offline benchmarks and live production metrics.
  - Ability to move quickly from hypothesis to validation.
- Builder Mentality
  - Comfortable in fast-moving startup environments.
  - Strong ownership mindset from research through deployment.
  - Excited by ambiguous, unsolved problems.