Posted Nov 26, 2025
Member of Technical Staff, Synthetic Data
at Cohere
Toronto, Canada
Remote
Responsibilities
- Conduct data ablations to assess data quality and experiment with data mixtures to enhance model performance.
- Research and implement innovative synthetic data curation methods, leveraging Cohere’s infrastructure to drive advancements in natural language processing.
- Collaborate with cross-functional teams, including researchers and engineers, to ensure data pipelines meet the demands of cutting-edge language models.
We’re training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences like content generation, semantic search, RAG, and agents. We believe that our work is instrumental to the widespread adoption of AI. Join us on our mission and shape the future!

Why this role?
As a Machine Learning Engineer specializing in synthetic data, you will play a pivotal role in developing the synthetic data pipeline that is crucial to Cohere’s advanced language models. Your

By combining research and engineering, you will bridge the gap between raw data and cutting-edge AI models, directly contributing to improvements in critical training metrics like throughput and accelerator utilization. Your work will be essential to Cohere’s mission of delivering efficient and reliable language understanding and generation capabilities, driving innovation in natural language processing. If you are passionate about transforming data into the foundation of AI systems, this role offers a unique opportunity to make a meaningful impact.

You will:
- Design and build scalable inference pipelines that run on large GPU clusters.

Requirements
You may be a good fit if you have:
- Strong software engineering skills, with proficiency in Python and experience building data pipelines.
- Familiarity with data processing frameworks such as Apache Spark, Apache Beam, Pandas, or similar tools.
- Experience working with LLMs through work projects, open-source contributions, or personal experimentation.
- Familiarity with LLM inference frameworks such as vLLM and TensorRT.
- Experience working with large-scale datasets, including web data, code data, and multilingual corpora.
- A passion for bridging research and engineering to solve complex data-related challenges in AI model training.