Posted 8 hours ago
User Researcher, AI Evaluations
at Notion
San Francisco, United States
Remote
Requirements
- We're building one place where your knowledge, projects, meetings, and AI tools live side by side, so work is faster, clearer, and less fragmented.
- Each and every team of Notinos is working to set the standard for how humans work together in the AI era.
- From building a business’s system of record to making and managing AI agents to automating away the busy work, we care deeply about giving our customers more time for their life’s work.
- ABOUT THE ROLE: We’re seeking an experienced UX Researcher to define and scale how we evaluate Notion’s AI-powered experiences, focusing on what “good” looks like not only for model output quality, but for the end-to-end product experience where people discover, set goals, delegate work, review results, and build trust over time with AI.
- Lead qualitative studies, side-by-side comparisons, and human-in-the-loop evaluation efforts to deepen understanding of where experiences break down and how they can improve.
- SKILLS YOU'LL NEED TO BRING:
  - Ability to operationalize insight into measurement: You’re comfortable turning “soft” user expectations (trust, tone, usefulness, clarity) into concrete rubrics, scoring guidelines, and observable metrics.
  - AI fluency and systems thinking: You’re curious and hands-on with AI products, and can reason about how model behavior, uncertainty, and system constraints shape user experience. You also have experience evaluating AI-enabled products (LLMs, agents, generative UI/workflow automation) and working with Data Science/ML partners on measurement strategy and evaluation tooling.
  - Clear communication and impact orientation: You can align diverse partners around shared definitions of quality and create artifacts that enable teams to act consistently.
  - Experience: 5+ years doing UX research in industry
- NICE TO HAVES:
  - Familiarity with LLM-as-judge methods, prompt design for evaluators, or “golden dataset” creation
  - Experience using AI research tooling for rapid synthesis and communication (e.g., Dovetail, Listen Labs, Maze, Outset, etc.), as well as AI observability tooling like Braintrust
  - Experience using data querying languages (e.g., SQL), scripting languages (e.g., Python), or statistical/mathematical software (e.g., R, SAS, Matlab, etc.)
  - Master’s or PhD in HCI, Psychology, Behavioral Science, Anthropology, Sociology, or a related field
  - You’re familiar with the work of computing heroes like Douglas Engelbart, Alan Kay, Bret Victor, etc., and understand why we're big fans.
Benefits
- Notion is committed to providing highly competitive cash compensation, equity, and benefits.
- The compensation offered for this role will be based on multiple factors such as location, the role’s scope and complexity, and the candidate’s
Additional details
- Millions of individuals, small teams, and large companies run their work on Notion.
- Notinos (our employees) are customer zero in bringing this future of work to life.
- We care about craft, building things that last, and the belief that great work is still fundamentally human.
- This role sits at the intersection of research craft and evaluation operations: you’ll run studies that uncover user mental models, expectations, and failure/recovery behaviors, then translate those insights into reusable rubrics, workflows, and measurement approaches that product, design, engineering, and data science can apply consistently.
- This role can be based in either San Francisco or New York City.
- We work from our offices on Mondays, Tuesdays and Thursdays (our Anchor Days) because we do our best thinking and building together in person.
- We’re looking for someone who’s excited to work alongside the team during those days.
- WHAT YOU'LL ACHIEVE:
  - Define what “good” looks like (frameworks & rubrics): Establish clear, reusable evaluation criteria that reflect real user expectations: helpfulness, trust, tone, control, and transparency. You’ll translate qualitative insight into scoring guidance that can be applied consistently across teams and over time.
  - Run recurring evals (longitudinal & feature-specific): Run recurring longitudinal and feature-specific surveys and studies to measure experience quality over time against defined rubrics.