Notes — Karina Nguyen on Model Training and AI Product

Four questions [Adler frame]

Q1 — What is it about?
A practitioner’s view from inside two frontier AI labs, Anthropic and OpenAI: how models are actually trained, how Canvas and Tasks were built using synthetic data, why soft skills outlast hard skills, and where things are going (personal model). Grounded in hands-on experience with Claude 3 training, 100K context launches, and ChatGPT Canvas.

Q2 — How is it argued?
Concrete examples from direct experience — the Claude 3 self-knowledge contradiction bug, the three Canvas behaviours trained synthetically, the spreadsheet method for evals, early Claude-in-Slack prototypes. No theory; all practitioner observation.

Q3 — Is it true?
Strong credibility: Karina worked on Claude 3 post-training and the Canvas/Tasks launches firsthand. Her observations about synthetic data for product behaviours are corroborated by the general literature on RL in post-training and by other sources in the wiki (Edwin Chen’s post-training ladder).

Q4 — What of it?
The insight about synthetic data for product behaviours is practically valuable: any team building AI features can use a stronger model to generate behavioural training data for a weaker/targeted model. The form-follows-function insight generalises: ask not ‘what can the model do?’ but ‘what form factor makes that capability usable?’

Glossary

Post-training — training that happens after pre-training; includes SFT, RLHF, rubrics, and RL environments. Where Karina’s work lived at both labs.

Synthetic data — training data generated by a model (often a stronger/larger model) rather than collected from humans. For product features: use o1 to simulate user conversations → generate ideal model responses → train target model on these examples.

Deterministic eval — a binary pass/fail evaluation based on an exact match criterion. Example: if user says ‘7:00 PM’, model output must contain ‘7:00 PM’. No subjectivity.
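
A minimal sketch of such a check, assuming the criterion is a plain substring match on the model output. The function name deterministic_eval and the sample strings are illustrative and do not come from the talk.

```python
def deterministic_eval(model_output: str, required_text: str) -> bool:
    """Binary pass/fail: pass only if the required text appears verbatim."""
    return required_text in model_output


# Hypothetical case: the user asked for a reminder at 7:00 PM.
passed = deterministic_eval("Okay, reminder set for 7:00 PM tomorrow.", "7:00 PM")
failed = deterministic_eval("Okay, reminder set for 19:00 tomorrow.", "7:00 PM")
print(passed, failed)  # True False
```

The second case fails even though 19:00 is the same time, which is the point of an exact-match criterion: no judgment call is involved.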

Win-rate eval — human raters compare model completions from two model versions; the better one ‘wins.’ Used continuously to ensure new model versions always exceed previous ones.
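
One way the number could be computed, as a sketch only. The 'new' / 'old' / 'tie' labels and the choice to exclude ties from the denominator are assumptions made here; the talk does not specify how ties were handled.

```python
from collections import Counter


def win_rate(judgments: list[str]) -> float:
    """Fraction of decided comparisons won by the new model.

    Each judgment is 'new', 'old', or 'tie' (assumed labels).
    Ties are dropped from the denominator (an assumption).
    """
    counts = Counter(judgments)
    decided = counts["new"] + counts["old"]
    if decided == 0:
        raise ValueError("no decided comparisons")
    return counts["new"] / decided


# Hypothetical rater output for five prompt pairs.
judgments = ["new", "new", "old", "tie", "new"]
rate = win_rate(judgments)
print(f"{rate:.2f}")  # 3 wins out of 4 decided comparisons: 0.75
```

A rate above 0.5 means raters preferred the new version more often than the old one.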

Trigger behaviour — the decision about when a product feature should activate (Canvas opens) versus when to respond inline. Trained as a binary classifier.

Form factor — the UI/UX pattern that makes a model capability accessible to users. Example: file upload makes 100K context window accessible; notification/reminder makes scheduled agent tasks accessible.

Self-knowledge contradiction — Karina’s bug: training data that teaches the model it has no physical body, combined with training data about tool calls for physical actions (set alarm), caused over-refusal on legitimate tasks.

Frontier Product Research — Karina’s team at OpenAI: trains models, develops new methods, but oriented toward product outcomes rather than pure research.

Synthetic data for product behaviours [§ Synthetic data]

Canvas behaviours were defined as three decision boundaries:

Trigger — when to open the Canvas panel. Ground truth labels: ‘Write me a long essay’ → trigger; ‘Who was the first US president?’ → do not trigger. Labels → deterministic eval → synthetic training.
Edit — how to modify a document section when asked. Two sub-decisions: (a) find the right section; (b) full rewrite vs. targeted in-place edit. When launched, biased toward rewrites (higher quality); shifted based on user feedback.
Comment — how to make inline annotations. Pipeline: o1 generates document → inject ‘critique this’ prompt → o1 annotates specific spans → train target model to reproduce this comment-placement behaviour.
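
A possible shape for one comment training record, as a sketch. Representing a span by character offsets is an assumption made for illustration; the notes do not say how annotated spans were encoded. The names Comment and anchor_comment are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    start: int  # character offset where the annotated span begins
    end: int    # offset one past the last character of the span
    text: str   # the critique attached to the span


def anchor_comment(document: str, quote: str, text: str) -> Comment:
    """Locate a quoted span in the document and attach a comment to it."""
    start = document.find(quote)
    if start == -1:
        raise ValueError("quoted span not found in document")
    return Comment(start, start + len(quote), text)


# Hypothetical generated document and critique.
doc = "The results was significant. We conclude the method works."
c = anchor_comment(doc, "results was", "Subject-verb agreement: 'results were'.")
print(doc[c.start:c.end])  # results was
```

Records of this kind (document, span, comment) are what the target model would be trained to reproduce.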

Process: spec the behaviours → write ground truth labels (spreadsheet) → feed spreadsheet to o1 → o1 generates training examples → train model → measure via deterministic evals → iterate.
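
The loop above can be sketched end to end for the trigger behaviour. Everything named here is hypothetical: the CSV columns, the wording of the generation request, and toy_predict, which stands in for the trained model. No real model API is called.

```python
import csv
import io

# Hypothetical ground-truth spreadsheet: one prompt per row plus a trigger label.
LABELS_CSV = """prompt,should_trigger
Write me a long essay,yes
Who was the first US president?,no
"""


def load_labels(text):
    """Parse the spreadsheet into (prompt, should_trigger) pairs."""
    rows = csv.DictReader(io.StringIO(text))
    return [(r["prompt"], r["should_trigger"] == "yes") for r in rows]


def build_generation_request(prompt, should_trigger):
    """Text one might hand to a stronger model to get more examples like this row."""
    decision = "open Canvas" if should_trigger else "answer inline"
    return (
        f"Write 5 user messages similar to {prompt!r} "
        f"where the right decision is to {decision}."
    )


def trigger_accuracy(predict, labels):
    """Deterministic eval: share of labelled prompts where predict matches the label."""
    hits = sum(predict(p) == y for p, y in labels)
    return hits / len(labels)


def toy_predict(prompt):
    # Placeholder for the trained model: trigger on explicit writing requests.
    return prompt.lower().startswith("write")


labels = load_labels(LABELS_CSV)
requests = [build_generation_request(p, y) for p, y in labels]
accuracy = trigger_accuracy(toy_predict, labels)
print(accuracy)  # 1.0 on these two rows
```

In practice the requests would go to the stronger model, its outputs would become training data, and the same labelled rows would be re-scored after each training run.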

Why synthetic data: cheaper than human annotation, more scalable, generalises well once core behaviour is defined. After beta launch, real user data shifts the distribution.

The data wall argument [§ No data wall]

The ‘data wall’ applies to pre-training (predicting next tokens on internet text). Post-training has no comparable wall:

Any task is trainable: ‘how to search the web’, ‘how to schedule a meeting’, ‘how to write a sci-fi story’.
Infinite tasks = infinite training data.
Evidence: benchmarks like GPQA are saturating, with models exceeding PhD-level performance on many tasks. The bottleneck shifts to developing harder evals, not to finding more data.

Small new models now beat older large ones (e.g., Claude 3 Haiku > Claude 2); Karina credits distillation research. The cost of intelligence is declining.

Anthropic vs. OpenAI [§ Anthropic vs. OpenAI culture]

Not enemies; one community. Key difference Karina observed: Anthropic is more focused and craft-oriented (meticulous about model character, careful prioritisation); OpenAI is more bottom-up and experimental (more things get tried, more creative research freedom at scale). [?] This was Karina’s subjective observation, not a formal comparison.

Claude’s personality is a reflection of Anthropic’s team. The model is the output of the process; the process is shaped by the people.

Form follows function [§ Form follows function]

The 100K context window was a raw capability. File upload was the form factor that made it useful. The interface change (a familiar button to upload a PDF) unlocked enterprise use cases (financial analysis, research) that the same capability did not unlock when presented as raw API tokens.

The implication for product development: ‘What can the model do?’ is less important than ‘What form factor makes this usable?’ Canvas makes collaborative document editing accessible. Tasks makes scheduled agents accessible via notification.

Skills for the future [§ Skills for the future]

Karina’s career decision is a data point: she switched from front-end engineering to research when she saw Claude getting good at coding. Practical question: what can models not do well, and for how long?

Hard to automate:

Aesthetic/taste judgments (good visual design).
Creative writing (bottlenecked by creative reasoning, not knowledge).
Research management: allocating constrained compute to highest-conviction research paths.
Deriving human intent correctly in agentic tasks.

Already good:

Coding and front-end.
Data synthesis across sources.
Strategy recommendations given sufficient context.