Posted 1 week ago
Machine Learning Engineer, Evaluation
at HackerRank
India (Hybrid)
Responsibilities
- Build LLM-powered evaluation pipelines that assess AI usage skills consistently, fairly, and at production scale.
- Own the evaluation methodology end to end: what the rubric is, how the model applies it, how you measure whether it is being applied correctly, and how you audit for bias.
- Design and run experiments to determine what good evaluation actually looks like. The answer is not known. You will be finding it.
- Build RAG pipelines and fine-tuning workflows that make evaluation models adhere reliably to the rules we set for them.
- Define the benchmarking infrastructure: how we know when our evaluation quality has improved, and how we catch regressions before candidates do.
Requirements
- Software has entered an era where humans and AI build side by side.
- Developers are now evaluated on whether they can orchestrate AI to accomplish a task while still having the fundamentals underneath.
- How do you measure skill when AI is already in the room?
- Software engineering has moved from writing code to using AI to solve problems.
- That question spans live interviews, async assessments, AI-assisted coding environments, pair programming with agents, and every other context in which someone is trying to figure out how good a developer actually is.
- Nobody has cracked how to fairly assess human skill in a world where AI assistance is ambient and invisible, where the question is no longer "can you write this function" but "how effectively do you use AI to solve a real problem."
- You have shipped LLM-powered systems in production where consistency and reliability were hard constraints, not nice-to-haves.
- You can defend ML judgment in plain language to people who are not ML engineers, because the translation layer is part of the job.
Even better if you have
- Experience building evaluation frameworks for generative or conversational AI systems.
- Background in educational assessment, psychometrics, or human-in-the-loop evaluation at scale.
- Publications or open-source contributions in LLM evaluation, benchmarking, or alignment.