Reading Notes

Hamel Husain and Shreya Shankar on Evals

Source: Hamel Husain and Shreya Shankar on Evals

Four questions [Adler frame]

Q1 — What is it about?
A live demo of the error analysis methodology for AI products. The answer to “how do I improve my AI product?” is: look at your traces, write notes on what fails (open coding), cluster failures into categories (axial coding), count them, build targeted binary LLM judges. Grounded in 30+ years of social science methodology; not an AI-specific invention.

Q2 — How is it argued?
Via live demo, not theory. A real AI deployment (Nurture Boss, an apartment leasing chatbot across voice/email/text) is the running example. Hamel and Shreya share screens and walk through actual traces, a real spreadsheet, pivot tables, and a real LLM judge prompt. Research backing: Shreya’s 2024 paper “Who validates the validators?” on criteria drift.

Q3 — Is it true?
High practical credibility. Their course is the highest-grossing on Maven; the methodology is grounded in established social science (Andrew Ng referenced an 8-year-old video using the same error analysis process). The core claim — you discover more by looking at traces than by building rubrics upfront — is supported by the criteria drift research and consistent with observation across multiple wiki sources.

Q4 — What of it?
The methodology is immediately actionable. The time investment is modest (3–4 days upfront, 30 min/week). The most important implication: AI teams are systematically underinvesting in looking at their own data, and this is the single highest-ROI correction available to them. The process also reframes PRDs: the LLM judge prompt is the product requirements document, derived from real failure data.


Glossary

Open coding — Manually reading traces one at a time, writing a one-line note on the first upstream error observed per trace. One person only (benevolent dictator). Stop at first upstream error; do not catalogue everything. Stops when no new failure modes are appearing (theoretical saturation).

Axial coding — Using an LLM to cluster the open codes from multiple traces into named, actionable failure mode categories. Human reviews and edits the output. The result is a taxonomy of failure modes grounded in actual trace data.

Theoretical saturation — The stopping condition for open coding: stop when you are not uncovering new failure modes. In practice, ~100 traces as a minimum habit; the real signal is conceptual exhaustion, not a number.

Benevolent dictator — The principle that open coding should be done by one domain expert, not a committee. Preserves a coherent categorisation scheme and keeps the process tractable.

LLM-as-judge — An LLM prompted to evaluate one specific, named failure mode from a trace. Output is binary (true/false). Must be calibrated against human labels before deployment.

Criteria drift — Shreya’s research finding: people’s definition of a good LLM output changes as they review more outputs. Failure modes only become visible after seeing unexpected examples. Consequence: rubrics written entirely upfront are necessarily incomplete. [§ Criteria drift: “Who validates the validators?”]

Confusion matrix (judge calibration) — The method for validating an LLM judge. Examines false positive rate (judge flags error, human says none) and false negative rate (judge misses error human caught) separately. Raw % agreement is misleading when errors are rare.

“Who validates the validators?” — Shreya’s 2024 research paper (with collaborators) on criteria drift in LLM output validation.


Open coding in practice [§ Error analysis]

The Nurture Boss demo: an AI chatbot that handles apartment leasing inquiries across voice, email, and text channels. Hamel reads individual traces and writes notes. Example first-upstream-error note: “handoff failure — agent did not transfer to human despite walk-in request.”

The instruction for open coding: write what you see, do not try to organise it, do not try to be comprehensive. The LLM handles clustering later. A beginner’s instinct is to skip straight to building evaluators; open coding forces ground-level contact with the actual failure modes.

Why first upstream error only: later errors in a trace are often consequences of the first. Coding the first error maximises signal per annotation hour.


Axial coding in practice [§ Error analysis]

The open codes from 100 traces feed into an LLM prompt asking it to cluster them. Hamel iterates on the clustering — editing categories, merging near-duplicates, splitting over-broad ones. The output for Nurture Boss included: handoff failures, conversational flow issues, tool call errors.

The pivot table: failure mode × count. Immediately shows priority order. For Nurture Boss, handoff failures were the dominant mode — worth building an evaluator for. Some failure modes were fixed by editing the system prompt; no evaluator needed for those.
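The pivot table step can be sketched in a few lines. This is a minimal illustration, assuming the axial coding output is one failure-mode label per trace; the `labels` list and the `failure_counts` helper are hypothetical, not taken from the demo spreadsheet.

```python
from collections import Counter

# Hypothetical axial-coding output: one label per trace, None where no failure was found.
labels = [
    "handoff failure", "tool call error", "handoff failure", None,
    "conversational flow issue", "handoff failure", None, "tool call error",
]

def failure_counts(labels):
    """Return (failure_mode, count) pairs, most frequent first."""
    counts = Counter(label for label in labels if label is not None)
    return counts.most_common()

# Print the failure mode x count table in priority order.
for mode, count in failure_counts(labels):
    print(f"{mode:28s} {count}")
```

The ordering is the point: the top row is the first candidate for an evaluator, unless a prompt edit fixes it first.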


LLM judge design [§ LLM-as-judge design]

The judge prompt for “handoff failure” specifies:

  • Output: true or false only.
  • Criteria: a list of conditions that should trigger a handoff (explicit human request, same-day walk-in, sensitive issue, tool unavailability, etc.).
  • Phrasing: precise, unambiguous, condition-by-condition.
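
A rough sketch of how such a prompt might be assembled and its answer parsed. The wording of the conditions, the prompt text, and the helper names `build_judge_prompt` and `parse_verdict` are illustrative assumptions, not the prompt shown in the demo; the call to an actual model is deliberately left out.

```python
# Hypothetical conditions, paraphrasing the examples listed in these notes.
HANDOFF_CONDITIONS = [
    "The user explicitly asks for a human.",
    "The user requests a same-day walk-in.",
    "The issue is sensitive.",
    "A required tool is unavailable.",
]

def build_judge_prompt(trace: str) -> str:
    """Build a single-failure-mode judge prompt with a binary answer format."""
    conditions = "\n".join(
        f"{i}. {c}" for i, c in enumerate(HANDOFF_CONDITIONS, start=1)
    )
    return (
        "You are evaluating one failure mode: handoff failure.\n"
        "A handoff to a human is required if any of these conditions holds:\n"
        f"{conditions}\n\n"
        "Answer true if a handoff was required and did not happen, otherwise false.\n"
        "Respond with exactly one word: true or false.\n\n"
        f"Trace:\n{trace}"
    )

def parse_verdict(raw: str) -> bool:
    """Accept only 'true' or 'false'; reject anything else, such as a score."""
    text = raw.strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"judge returned a non-binary answer: {raw!r}")
```

Rejecting non-binary answers outright, rather than coercing them, keeps the judge honest about the true/false contract.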

Shreya: between 4 and 7 judges is enough for most applications. Many failure modes are fixable by prompt edits and do not need a judge. Only build judges for “pesky” failures — ones that persist despite prompt improvement.

The anti-pattern: a 1–7 Likert scale. Nobody knows what a score of 4.2 means. A scale lets the judge settle on a middle value that obscures whether the product is actually passing or failing. A binary output forces a decision: is this good enough, yes or no?


Confusion matrix: calibrating a judge [§ Judge calibration]

Before deploying a judge, run it against the labelled traces (the axial coding output is the human label). Build a 2×2 confusion matrix.

The naive metric — “my judge agrees with humans 75% of the time” — is not sufficient. If a failure mode occurs in 10% of traces, a judge that always says “pass” achieves 90% agreement. The confusion matrix distinguishes: is the judge missing real errors (false negatives) or generating false alarms (false positives)? Iterate on the judge prompt until both rates are low.
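
The 10% example above can be worked through directly. A minimal sketch, assuming labels are stored as booleans where True means the failure is present; the function names are hypothetical.

```python
def confusion_counts(human, judge):
    """Count outcomes for paired binary labels, where True means 'failure present'."""
    pairs = list(zip(human, judge))
    return {
        "tp": sum(1 for h, j in pairs if h and j),
        "fp": sum(1 for h, j in pairs if not h and j),
        "fn": sum(1 for h, j in pairs if h and not j),
        "tn": sum(1 for h, j in pairs if not h and not j),
    }

def judge_rates(c):
    """Raw agreement plus the two error rates, reported separately."""
    total = sum(c.values())
    positives = c["tp"] + c["fn"]  # traces the human marked as failures
    negatives = c["fp"] + c["tn"]  # traces the human marked as fine
    return {
        "agreement": (c["tp"] + c["tn"]) / total,
        "false_negative_rate": c["fn"] / positives if positives else 0.0,
        "false_positive_rate": c["fp"] / negatives if negatives else 0.0,
    }

# 100 traces, 10 real failures, and a judge that always says "pass".
human = [True] * 10 + [False] * 90
always_pass = [False] * 100
print(judge_rates(confusion_counts(human, always_pass)))
# {'agreement': 0.9, 'false_negative_rate': 1.0, 'false_positive_rate': 0.0}
```

The judge agrees with the human 90% of the time and misses every real failure, which is exactly what the agreement figure hides.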

Practical tip for PMs: when someone reports judge agreement %, always ask for the confusion matrix.


Online monitoring [§ LLM-as-judge design]

LLM judges are not only for pre-deployment CI. Sampled production traces run through the judge daily give a continuous failure rate measurement. This is the difference between a test suite (what you catch before shipping) and a quality monitor (what is actually happening in the real world).

Teams doing this have a precise, ongoing measure of application quality. They don’t talk about it because it is a moat.
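
A sketch of the daily sampling loop, under stated assumptions: `traces` is a list of that day's production traces, `judge` is any callable returning True when the failure is present (in practice a calibrated LLM judge, here a stand-in), and `daily_failure_rate` is a hypothetical helper name.

```python
import random

def daily_failure_rate(traces, judge, sample_size, seed=None):
    """Run the judge over a random sample of traces and return the failure rate.

    Returns None when there are no traces to sample.
    """
    rng = random.Random(seed)
    sample = rng.sample(traces, min(sample_size, len(traces)))
    if not sample:
        return None
    failures = sum(1 for trace in sample if judge(trace))
    return failures / len(sample)

# Stand-in judge for illustration only; a real one would call an LLM.
def stub_judge(trace):
    return "walk-in" in trace and "transfer" not in trace

traces = [
    "user asks about walk-in today; agent keeps chatting",
    "user asks about walk-in today; agent offers transfer",
    "user asks about pricing; agent answers",
]
print(daily_failure_rate(traces, stub_judge, sample_size=3, seed=0))
```

Logging this number once per day gives the trend line; the sample size is a cost trade-off the notes do not specify.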


The coding agent exception [§ Evals debate]

The “Claude Code doesn’t do evals” narrative: Hamel’s analysis is that coding agents are a special case that does not generalise. Two reasons:

  1. The developer is the domain expert. No gap between building and evaluating — the same person does both.
  2. Developers are power users who dogfood intensively. The feedback loop is short and visceral.

Most AI products (medical, real estate, customer service) do not have this property. Domain experts are not the developers. The feedback loop must be constructed deliberately.

The deeper point: “dogfooding” is often claimed but rarely practised at the visceral level needed. A team that says it dogfoods but does not regularly read traces is not closing the feedback loop.


Criteria drift and research grounding [§ Criteria drift]

Shreya’s “Who validates the validators?” (2024): a user study of developers writing LLM judges and validating LLM outputs. Key finding: experts’ definition of what is “good” changed as they reviewed outputs. Failure modes they would never have anticipated appeared after 10 outputs. This held even for experienced practitioners.

Practical consequence: product requirements documents (PRDs) written before looking at data are necessarily incomplete. The LLM judge prompt derived from error analysis is better than a PRD written upfront, because it was shaped by real failure data. The two should inform each other iteratively.


See also