Notes — Edwin Chen on AI Data Quality

Four questions [Adler frame]

Q1 — What is it about?
Surge AI’s approach to training data quality; the evolution of post-training methodology (SFT → RLHF → rubrics → RL environments); and Edwin’s critique of the AI industry’s wrong objective functions (benchmark gaming, engagement optimisation, sycophancy).

Q2 — How is it argued?
From operational practice: Edwin has trained models for every major frontier lab and observed the failure modes directly. The benchmark critique comes from running actual evaluations; the AI slop critique comes from seeing what optimisations make models climb LLM Arena (more emojis, more bolding, longer responses, even with more hallucinations). The RL environments prediction comes from seeing current models fail in Surge’s own simulation environments.

Q3 — Is it true?
Quality-vs-checkbox claim: compelling and well-supported by the post-training literature (RLHF reward hacking is exactly the checkbox-gaming failure at the reward model level). Benchmark unreliability: documented independently in the research community; Edwin’s claim is not novel but is authoritative from his position.
AI slop thesis: directionally well-evidenced (LLM Arena gaming is documented; sycophancy is well-known). The extent to which this is slowing AI progress vs. being a parallel failure mode is unclear.
RL environments: emerging methodology; Surge’s claim to be at the frontier here is plausible given their position, but the field is moving fast.

Q4 — What of it?
The trajectory evaluation insight is immediately applicable: when evaluating AI agents, check not just final answers but the reasoning path. Reward-hacking on trajectories is a failure mode not just for training but for assessment.
The AI slop critique has product implications: build AI products that optimise for the user’s actual productivity rather than for engagement (Edwin’s email example: the model should say ‘your email is great, send it’).
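The trajectory check described under Q4 can be sketched as a small evaluator that scores the path as well as the final answer. This is a minimal illustration, not Surge's method: the `Step` fields, the `allowed` flag (standing in for a reviewer or policy check), and the `max_steps` budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One action in an agent trajectory (hypothetical schema)."""
    action: str    # e.g. "read_file", "run_tests"
    allowed: bool  # did a reviewer or policy check consider this step legitimate?


def evaluate_trajectory(steps: list[Step], final_correct: bool, max_steps: int) -> dict[str, bool]:
    """Pass only if the answer is right AND the path was legitimate and efficient."""
    path_legitimate = all(step.allowed for step in steps)
    path_efficient = len(steps) <= max_steps
    return {
        "final_correct": final_correct,
        "path_legitimate": path_legitimate,
        "path_efficient": path_efficient,
        "passed": final_correct and path_legitimate and path_efficient,
    }
```

A run that reaches the right answer by, say, editing the test file would report `final_correct` as true but `passed` as false, which is the distinction an answer-only check misses.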

Glossary

AI slop — Edwin’s term for AI outputs optimised for engagement/virality rather than accuracy or usefulness. Characterised by: excessive emojis, bolding, length, flattery, sycophancy. Modelled on social media’s clickbait problem.

SFT (supervised fine-tuning) — post-training stage: show model high-quality human-generated examples; train it to mimic them. Human analogy: learning by copying a master.

RLHF — post-training stage: generate multiple responses; human picks the best; train reward model to predict preferences; run RL against reward model. Human analogy: writing 55 essays, teacher picks the best.

Rubrics and verifiers — post-training stage: grade model outputs with detailed structured feedback; reward model learns what makes a good vs. bad response. Human analogy: getting rubric-graded feedback on where you went wrong.

RL environments — post-training stage: simulate full real-world messy contexts; model takes multi-step actions; rewarded based on end-to-end task completion. Human analogy: being thrown into the real world.

Trajectory evaluation — evaluating the path the model took to reach an answer, not just whether the answer was correct. Important because models can reward-hack their way to correct answers via inefficient or deceptive intermediate steps.

LLM Arena — the public leaderboard properly named LMArena (formerly LMSYS Chatbot Arena), where random users vote on which of two responses is better after a few seconds of reading. Edwin’s critique: users pick the flashiest response, so gaming it means adding emojis, bolding, and length rather than accuracy.

Quality as implicit, complex assessment

Edwin’s central operational claim: quality is not reducible to explicit instructions or checkboxes. The poem example illustrates two levels:

Checkbox quality: eight lines, contains ‘moon’, rhymes — all explicit instructions met.
Real quality: surprising, emotionally resonant, full of imagery, teaches you something about moonlight.

The measurement problem: implicit quality is subjective, complex, and hard to operationalise. Surge’s answer: don’t reduce it to a rubric. Instead, build ML systems that predict quality from thousands of observable proxy signals (keystroke patterns, response speed, code standards) and, most importantly, from the downstream test: does the model improve when trained on this data?

This is the Google Search analogy. Search has two problems: removing the worst of the worst (content moderation) and finding the best of the best (discovery). The second problem requires ML, not checklists. [§ quality section]

Post-training evolution

The four-stage ladder:

SFT (2015–2019, dominant): direct imitation of human examples. Limitation: you can only be as good as the best human example you have.
RLHF (2019–2022, dominant): preference learning. Advantage: exploits discriminator-generator gap (easier to compare than to generate). Limitation: reward hacking; the policy learns to exploit flaws in the reward model, so RL has to be stopped early.
Rubrics and verifiers (2022–2024, rising): structured graded feedback. Advantage: more signal per annotation; teaches specific failure modes. Limitation: rubrics themselves require significant human effort to design.
RL environments (2024–, emerging): simulate real-world tasks. Advantage: models learn from long-horizon multi-step experience; failures reveal capability gaps invisible in isolated benchmarks. Limitation: expensive to build; reward specification for complex tasks is hard.

These stages are cumulative, not sequential replacements: a fully trained model uses all four methods. [§ post-training evolution table]
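The RLHF step "train reward model to predict preferences" is commonly formalised with a Bradley-Terry pairwise loss. The sketch below assumes scalar rewards have already been computed for the chosen and rejected responses. The function names are illustrative and the interview does not specify this formulation.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's pick; lower is better."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))
```

The loss depends only on the difference between the two rewards, which is why comparisons (the discriminator side of the gap) are enough to train it. No absolute quality score is needed.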

RL environments in detail

The design principle: make the simulation as realistic as possible.

Example from Surge: a startup with Gmail inbox, Slack threads, Jira tickets, GitHub PRs, a full codebase. Then: AWS goes down. Slack goes down. What does the model do?

The model must:

Parse confusing, incomplete, real-world communication.
Use tools it may not have seen before.
Make decisions at step 1 that affect what’s possible at step 50.
Recover from mistakes made mid-trajectory.

This is fundamentally different from the isolated single-step benchmarks where current models perform well. Edwin’s observation: models that seem smart on benchmarks fail catastrophically in these environments. [§ RL environments]

Reward design: not ‘did it get the right answer?’ but ‘does cell B22 in this spreadsheet contain the correct P&L number?’ or ‘does this retro document contain these specific pieces of information?’ Grounded, verifiable, real-world rewards.
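The two reward checks quoted above (a specific spreadsheet cell, specific facts in a retro document) can be sketched as verifier functions over the final environment state. Everything here is an assumption for illustration: the `EnvironmentState` shape, the numeric tolerance, and the substring matching, which is a crude stand-in for whatever grading Surge actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class EnvironmentState:
    """Hypothetical snapshot of the simulated workspace after the agent finishes."""
    spreadsheet: dict[str, float] = field(default_factory=dict)  # cell name -> value
    retro_document: str = ""


def cell_reward(state: EnvironmentState, cell: str, expected: float, tolerance: float = 0.01) -> float:
    """1.0 if the named cell holds the expected number (within tolerance), else 0.0."""
    actual = state.spreadsheet.get(cell)
    if actual is None:
        return 0.0
    return 1.0 if abs(actual - expected) <= tolerance else 0.0


def document_reward(state: EnvironmentState, required_facts: list[str]) -> float:
    """Fraction of required facts found in the retro document (case-insensitive)."""
    if not required_facts:
        return 0.0
    text = state.retro_document.lower()
    found = sum(1 for fact in required_facts if fact.lower() in text)
    return found / len(required_facts)
```

Both checks read the end state of the environment rather than the model's own claim of success, which is what makes the reward grounded and verifiable.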

Benchmark failure modes

Edwin distinguishes two independent failures:

Wrong ground-truth: benchmarks contain incorrect ‘correct’ answers. Even popular ones. Researchers know but continue using them because alternatives are expensive to build.
Gameable structure: clean objective answers → easy hill-climbing. The same property that makes benchmarks measurable makes them gameable, which is why a model can win an IMO gold medal and still struggle to parse a PDF. The objectivity of the benchmark is precisely what disconnects it from real-world performance.

Institutional incentive problem: researchers’ promotions depend on leaderboard rank; enterprise sales depend on leaderboard position; therefore labs spend resources gaming leaderboards even when they know it’s counterproductive. [§ benchmarks section]

AI slop diagnosis

Edwin’s core worry: the training signal is polluted at the level of user preference elicitation.

LLM Arena flow: real user has conversation → model responds → user is shown two alternatives → user skims for 2 seconds → picks the flashier one. This signal is then used to train the model.

The easily-gameable signals Edwin’s team has observed: add emojis, add bold text, triple response length (even if hallucination rate goes up). These are direct ways to improve LLM Arena rank without improving accuracy.
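One way to test whether votes track these surface signals is to count them and see how often the voted winner is also the flashier response. The sketch below is illustrative only: the emoji ranges are a rough assumption, bold text is assumed to be Markdown-style, and the vote format (response A, response B, winner) is hypothetical.

```python
import re

# Rough emoji ranges (an assumption; not a complete Unicode emoji definition).
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
# Markdown-style bold spans such as **this**.
BOLD_PATTERN = re.compile(r"\*\*[^*]+\*\*")


def style_features(response: str) -> dict[str, int]:
    """Count the surface features named above: length, emojis, bold text."""
    return {
        "words": len(response.split()),
        "emojis": len(EMOJI_PATTERN.findall(response)),
        "bold_spans": len(BOLD_PATTERN.findall(response)),
    }


def flashier(response_a: str, response_b: str) -> str:
    """Return 'a', 'b', or 'tie' by which response leads on more style features."""
    features_a = style_features(response_a)
    features_b = style_features(response_b)
    a_wins = sum(features_a[k] > features_b[k] for k in features_a)
    b_wins = sum(features_b[k] > features_a[k] for k in features_a)
    if a_wins == b_wins:
        return "tie"
    return "a" if a_wins > b_wins else "b"


def style_agreement_rate(votes: list[tuple[str, str, str]]) -> float:
    """Share of non-tied votes where the voted winner is also the flashier response."""
    decided = [(flashier(a, b), winner) for a, b, winner in votes]
    decided = [(flashy, winner) for flashy, winner in decided if flashy != "tie"]
    if not decided:
        return 0.0
    return sum(flashy == winner for flashy, winner in decided) / len(decided)
```

A rate well above one half on real vote data would be consistent with the claim that style, not accuracy, drives rank. It would not prove the claim, since longer answers can also be better answers.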

Structural parallel to social media: Facebook, Twitter, and Instagram all found that engagement metrics → clickbait, conspiracy, outrage. AI engagement metrics are producing the same pattern — sycophancy, flattery, conspiracy-feeding, time-maximisation.

The email anecdote: Edwin spent 30 minutes with Claude iterating on an email that didn’t matter. The question he poses: should the model have said ‘your email’s great, just send it’? If the model is optimising for engagement, the answer is no — more iterations = more time on platform. If it’s optimising for the user’s actual productivity, the answer is yes. [§ AI slop section]

Model values differentiation

Prediction (revised from 12 months ago when Edwin expected commoditisation):

‘The values that companies have will shape the model.’

Evidence: Anthropic is singled out as most principled. The implicit contrast is with labs that train to LLM Arena and optimise for engagement.

The mechanism: every post-training data choice (what domain, what quality bar, what behaviour when user is wrong) reflects a values judgment. Thousands of these choices compound into a model with a distinctive character.

Consequence: model ‘personality’ will be a durable differentiator, not just benchmark scores. [§ differentiation section]