User Researcher, AI Evaluations

at Notion

San Francisco, United States / Remote

Python, recruiting, SQL

Requirements

We're building one place where your knowledge, projects, meetings, and AI tools live side by side, so work is faster, clearer, and less fragmented.
Each and every team of Notinos is working to set the standard for how humans work together in the AI era.
From building a business’s system of record to making and managing AI agents to automating away the busy work, we care deeply about giving our customers more time for their life’s work.
ABOUT THE ROLE: We’re seeking an experienced UX Researcher to define and scale how we evaluate Notion’s AI-powered experiences, focusing on what “good” looks like not only for model output quality, but for the end-to-end product experience where people discover, set goals, delegate work, review results, and build trust over time with AI.
Lead qualitative studies, side-by-side comparisons, and human-in-the-loop evaluation efforts to deepen understanding of where experiences break down and how they can improve.
SKILLS YOU'LL NEED TO BRING:
- Ability to operationalize insight into measurement: You’re comfortable turning “soft” user expectations (trust, tone, usefulness, clarity) into concrete rubrics, scoring guidelines, and observable metrics.
- AI fluency and systems thinking: You’re curious and hands-on with AI products, and can reason about how model behavior, uncertainty, and system constraints shape user experience. You also have experience evaluating AI-enabled products (LLMs, agents, generative UI/workflow automation) and working with Data Science/ML partners on measurement strategy and evaluation tooling.
- Clear communication and impact orientation: You can align diverse partners around shared definitions of quality and create artifacts that enable teams to act consistently.
- Experience: 5+ years doing UX research in industry

NICE TO HAVES:
- Familiarity with LLM-as-judge methods, prompt design for evaluators, or “golden dataset” creation
- Experience using AI research tooling for rapid synthesis and communication (e.g., Dovetail, Listen Labs, Maze, Outset), as well as AI observability tooling like Braintrust
- Experience using data querying languages (e.g., SQL), scripting languages (e.g., Python), or statistical/mathematical software (e.g., R, SAS, Matlab)
- Master’s or PhD in HCI, Psychology, Behavioral Science, Anthropology, Sociology, or a related field
- You’re familiar with the work of computing heroes like Douglas Engelbart, Alan Kay, and Bret Victor, and understand why we're big fans.

Benefits

Notion is committed to providing highly competitive cash compensation, equity, and benefits.
The compensation offered for this role will be based on multiple factors such as location, the role’s scope and complexity, and the candidate’s

Additional details

Millions of individuals, small teams, and large companies run their work on Notion.
Notinos (our employees) are customer zero in bringing this future of work to life.
We care about craft, building things that last, and the belief that great work is still fundamentally human.
This role sits at the intersection of research craft and evaluation operations: you’ll run studies that uncover user mental models, expectations, and failure/recovery behaviors, then translate those insights into reusable rubrics, workflows, and measurement approaches that product, design, engineering, and data science can apply consistently.
This role can be based in either San Francisco or New York City.
We work from our offices on Mondays, Tuesdays and Thursdays (our Anchor Days) because we do our best thinking and building together in person.
We’re looking for someone who’s excited to work alongside the team during those days.
WHAT YOU'LL ACHIEVE:
- Define what “good” looks like (frameworks & rubrics): Establish clear, reusable evaluation criteria that reflect real user expectations: helpfulness, trust, tone, control, and transparency. You’ll translate qualitative insight into scoring guidance that can be applied consistently across teams and over time.
- Run recurring evals (longitudinal & feature-specific): Run recurring longitudinal and feature-specific surveys and studies to measure experience quality over time against defined rubrics.
