Machine Learning Engineer, Evaluation

India, Hybrid

Responsibilities

Build LLM-powered evaluation pipelines that assess AI usage skills consistently, fairly, and at production scale.
Own the evaluation methodology end to end: what the rubric is, how the model applies it, how you measure whether it is being applied correctly, and how you audit for bias.
Design and run experiments to determine what good evaluation actually looks like. The answer is not known. You will be finding it.
Build RAG pipelines and fine-tuning workflows that make evaluation models adhere reliably to the rules we set for them.
Define the benchmarking infrastructure: how we know when our evaluation quality has improved, and how we catch regressions before candidates do.

Software has entered an era where humans and AI build side by side.
Developers are now evaluated on whether they can orchestrate AI to accomplish a task while still having the fundamentals underneath.
How do you measure skill when AI is already in the room?
Software engineering has moved from writing code to using AI to solve problems.
This measurement problem spans live interviews, async assessments, AI-assisted coding environments, pair programming with agents, and every other context in which someone is trying to figure out how good a developer actually is.
Nobody has cracked how to fairly assess human skill in a world where AI assistance is ambient and invisible, where the question is no longer "can you write this function" but "how effectively do you use AI to solve a real problem."
You have shipped LLM-powered systems in production where consistency and reliability were hard constraints, not nice-to-haves.
You can defend ML judgment in plain language to people who are not ML engineers, because the translation layer is part of the job.

Even better if you have

Experience building evaluation frameworks for generative or conversational AI systems.
Background in educational assessment, psychometrics, or human-in-the-loop evaluation at scale.
Publications or open-source contributions in LLM evaluation, benchmarking, or alignment.