Posted Apr 8
Research Engineer, Infrastructure
at Cognition
San Francisco, United States
On-site
Responsibilities
- Identify bottlenecks across data loading, communication overhead, memory utilization, and compute efficiency. Implement solutions that meaningfully improve step time and MFU at scale.
- Experiment Orchestration and Tooling: Design and maintain the systems researchers use to launch, track, and analyze experiments. Reduce friction in the research loop so that more time is spent on ideas and less on waiting.
- Data Pipeline Engineering: Build high-throughput, reliable data pipelines for training and evaluation. Ensure data quality, reproducibility, and efficiency at the scale our training runs demand.
- Debugging and Reliability: Diagnose and resolve training failures across GPUs, networking, numerics, and data.
WHO WE ARE
We are an applied AI lab building end-to-end software agents. We're the team behind Devin, the first AI software engineer, and Windsurf, an AI-native IDE. These products represent our vision for AI that doesn't just assist engineers, but works alongside them as a genuine teammate. Our team is small and talent-dense: world-class competitive programmers, former founders, and researchers from the frontier of AI, including Scale AI, Palantir, Cursor, Google DeepMind, and others. You will work directly alongside researchers, understand the science deeply enough to anticipate what they need next, and build systems that hold up under the pressure of training jobs running across thousands of GPUs.
WHAT YOU'LL ACCOMPLISH
- Distributed Training Infrastructure: Build and own the systems that run large-scale training jobs reliably across GPU clusters.
- Maintain a detailed understanding of failure modes and build systems that fail gracefully and recover fast.
- Parallelism and Systems Research: Implement and optimize parallelism strategies: data, tensor, pipeline, and sequence parallelism.
Requirements
- Hands-on experience building and operating distributed training systems for large models; comfortable owning infrastructure end to end, from the cluster level down to the training loop
- Strong systems engineering fundamentals: distributed systems, networking, storage, and the ability to reason about performance across the full hardware-software stack
- Proficiency in Python and C++; experience with PyTorch or equivalent deep learning frameworks at a systems level, not just API usage