AI Engineer, Quality

at Doctor Droid

San Francisco, California, United States | Remote

React, TypeScript, Python, accounting, Postgres

Responsibilities

- Translate customer problems into concrete agent behaviors and workflows
- Integrate and orchestrate LLMs, tools, retrieval systems, and logic into cohesive, reliable agent experiences

Rapid Model Evaluation

- Build automated pipelines that evaluate new models against all critical workflows within hours of release
- Design evaluation harnesses for our most complex agentic systems and workflows
- Implement comparison frameworks that measure effectiveness, consistency, latency, and cost across model versions
- Design prompts, retrieval pipelines, and agent orchestration systems that perform reliably at scale

Ownership of Quality and Large Product Areas

- Define and document evaluation standards, best practices, and processes for the engineering organization
- Advocate for evaluation-driven development and make it easy for the team to write and run evals
- Partner with product and ML engineers to integrate evaluation

Requirements

About the Role

Fieldguide is building AI agents for the most complex audit and advisory workflows. We're a San Francisco-based Vertical AI company building in a $100B+ market undergoing rapid transformation. As an AI Engineer, Quality, you will own the evaluation infrastructure that ensures our AI agents perform reliably at enterprise scale. You'll work at the intersection of ML engineering, observability, and quality assurance to ensure our agents meet the rigorous standards our customers demand.

What You'll Own

Measurable AI Agents

- Design and build a unified evaluation platform that serves as the single source of truth for all of our agentic systems and audit workflows
- Build observability systems that surface agent behavior, trace execution, and failure modes in production, and feedback loops that turn production failures into first-class evaluation cases
- Own the evaluation infrastructure stack, including integration with LangSmith and LangGraph

Ownership of Quality and Large Product Areas

- Partner with product and ML engineers to integrate evaluation requirements into agent development from day one
- Take full ownership of large product areas rather than executing on narrow tasks

Who You Are

You are an engineer who believes that evaluations are foundational to building reliable AI systems, not a nice-to-have. The following operating principles should resonate with you:

- Evaluation-first mindset: You understand that for an AI company, not being able to evaluate a new model quickly is unacceptable
- AI-native instincts: You treat LLMs, agents, and automation as fundamental building blocks and parts of the craft of engineering
- Data-driven rigor: You make decisions based on metrics and are obsessed with measuring what matters
- Production-oriented: You understand that evaluations must work on real production behavior,
