Notes — Nathan Lambert and Sebastian Raschka on State of AI in 2026

Lex Fridman Podcast #490. 2026. Note: partial extraction.

Four questions [Adler frame]

Q1 — What is it about?
A year-in-review conversation on AI progress with two ML researchers: Nathan Lambert (post-training lead at the Allen Institute for AI, author of a book on RLHF) and Sebastian Raschka (ML educator, author of Build a Large Language Model from Scratch). Topics: the three scaling axes (pre-training, RL, inference-time); the open-weight model explosion driven by China; the pre-training vs mid-training vs post-training pipeline; data quality as the new differentiator; and what to expect in 2026.

Q2 — How is it argued?
Primarily practitioner testimony: both guests work hands-on with training pipelines and open tooling (Nathan on OLMo at AI2; Sebastian on open-source code such as mlxtend). Arguments are grounded in operational experience rather than theory. Nathan draws on AI2’s OLMo 3 work for pre-training data insights; Sebastian draws on his books and direct experimentation. The discussion is structured as enumerated claims (‘three scaling dimensions’), not a single thesis.

Q3 — Is it true?
The three-scaling-axes claim is empirically well-supported and echoed across the field. The claim that data quality, not quantity, drove OLMo 3’s improvements is specific and credible. The ‘pre-training is not dead, just less attractive short-term’ analysis reconciles competing narratives. The China/US framing is balanced — neither triumphalist nor dismissive. The AGI-as-task-completion-rate framing (Sebastian: ‘30–40% reliable completion today, 90–95% = practically AGI’) is operationally useful if philosophically deflating.

Q4 — What of it?
The most important structural insight: AI training is now a multi-axis optimisation problem, not a single scaling lever. Pre-training is expensive but permanent; inference scaling is cheap per capability unit but accrues per query; RL with verifiable rewards unlocks capabilities without adding knowledge. The lab that prices these trade-offs correctly wins, and that is not necessarily the lab with the most raw compute. This recasts the ‘scaling is dead’ debate as a question of ROI per axis rather than of a ceiling.

Glossary

Pre-training — next-token-prediction training on massive internet/book/code corpora. Cross-entropy loss. Produces broad world knowledge and language capability. Permanent investment: trained once, served forever.

Mid-training — continuation of pre-training on higher-quality or domain-specific data (long-context documents, premium text sources). Avoids catastrophic forgetting by preserving general knowledge while adding specialised capability. Positioned between raw pre-training and RLHF post-training.

Post-training — refinement stages after pre-training: supervised fine-tuning (SFT), DPO, RLHF. Relatively cheap. Unlocks capabilities from the base model rather than adding new knowledge. Includes RLVR for verifiable domains.

RLVR (Reinforcement Learning with Verifiable Rewards) — RL post-training where the reward signal comes from checking model outputs against known correct answers (maths, code, logic). No human preference labelling required. Fully automated feedback loop. Enabled the o1 and DeepSeek R1 reasoning breakthroughs.
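
A minimal sketch of a verifiable reward, assuming a maths task where the last number in the completion is taken as the model's answer. The function names and the answer-extraction rule are illustrative assumptions, not something described in the episode.

```python
import re
from typing import Optional

def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number that appears in the completion, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None

def verifiable_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answer matches the reference numerically, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # nothing to check, so no reward
    return 1.0 if float(answer) == float(reference) else 0.0

print(verifiable_reward("2 + 2 = 4, so the answer is 4.", "4"))
print(verifiable_reward("I think it is 5", "4"))
```

The reward needs no human label: the check itself is the feedback signal, which is what makes the loop fully automated.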

Mixture-of-experts (MoE) — transformer architecture where the feedforward layer is replaced by multiple specialised sub-networks (‘experts’), with a router selecting which experts to activate per token. Packs more knowledge into fewer active parameters per forward pass. Used by DeepSeek, GPT-5, and others for efficient serving.
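
A toy sketch of top-k routing for a single token, assuming NumPy. Each expert is reduced to one linear map, a simplification of the MLP experts used in real models; `moe_forward` and the toy weights are hypothetical names and values chosen for illustration.

```python
import numpy as np

def moe_forward(x, router_w, experts, top_k=2):
    """Route one token vector x through the top_k highest-scoring experts.

    x: (d,) token vector; router_w: (n_experts, d); experts: list of (d, d) matrices.
    """
    logits = router_w @ x                         # one router score per expert
    top = np.argsort(logits)[-top_k:]             # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates = gates / gates.sum()                   # softmax over the selected experts only
    out = sum(g * (experts[i] @ x) for g, i in zip(gates, top))
    return out, top

# Toy setup: 4 experts, where expert i simply scales its input by i.
d = 4
x = np.ones(d)
router_w = np.array([[0.0], [3.0], [1.0], [2.0]]) * np.ones((1, d))
experts = [i * np.eye(d) for i in range(4)]
out, top = moe_forward(x, router_w, experts)
print(sorted(top.tolist()))  # only these experts are evaluated for this token
```

All four experts hold parameters, but only two are multiplied against the token, which is the sense in which active parameters per forward pass stay small.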

Scaling laws — power-law relationships predicting model loss from compute and data. Now framed as three axes: pre-training (model/data size), RL (verifiable-reward steps), and inference-time compute (chain-of-thought budget). All three remain active.
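
As a point of reference that is not taken from the episode, the pre-training axis is often written in the Chinchilla-style parametric form below. Here N is the parameter count, D is the number of training tokens, E is the irreducible loss, and A, B, \alpha, \beta are fitted constants.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```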

Inference-time scaling — improving model output quality by spending more compute at inference (longer chain-of-thought, more search). The ‘thinking tokens’ axis. Claude Opus 4.5’s dominant advantage, per Sebastian.

Synthetic data — training data generated or enhanced beyond raw web scraping: AI-rephrased text, Q&A from Wikipedia, OCR-extracted PDFs, LLM-generated examples. Quality of curation determines value. AI2’s insight: OCR from openly accessible scientific PDFs (via Semantic Scholar) provides high-signal pre-training data.

Catastrophic forgetting — tendency of neural networks to lose previously learned capabilities when trained on new data. Mid-training addresses this by carefully curating data mixtures to preserve general knowledge.

Three scaling axes [§ AI Scaling Laws]

Nathan Lambert’s clearest contribution: reframing ‘scaling laws’ from a single axis (pre-training compute) to three independent axes.

Axis	Signal	Cost structure	Status
Pre-training	Model and dataset size → loss	Fixed training cost, permanent capability	Slowing ROI; not dead
RL with verifiable rewards	Correct-answer checks on maths/code/logic	Cheaper per capability unit; unlocks capability rather than adding knowledge	Active; drove 2025 breakthroughs
Inference-time compute	Thinking-token budget → output quality	Accrues per query; unbounded in principle	Active; dominant in frontier models

The key implication: pre-training is now the foundation, not the only lever. Future labs will balance ROI across all three axes based on deployment economics (training cost amortised over user lifetime × query volume).
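
The amortisation argument can be made concrete with a small cost model. This is a sketch under stated assumptions: the dollar figures are hypothetical and chosen only to make the arithmetic visible, and `total_cost` and `break_even_queries` are illustrative names.

```python
from typing import Optional

def total_cost(training_cost: float, cost_per_query: float, n_queries: float) -> float:
    """Fixed training spend plus inference spend that accrues with every query."""
    return training_cost + cost_per_query * n_queries

def break_even_queries(train_a: float, per_query_a: float,
                       train_b: float, per_query_b: float) -> Optional[float]:
    """Query volume at which option A matches option B.

    Assumes option A costs more to train (train_a > train_b).
    """
    if per_query_a >= per_query_b:
        return None  # A is dearer to train and no cheaper to serve: it never breaks even
    return (train_a - train_b) / (per_query_b - per_query_a)

# Hypothetical options: spend on pre-training, or spend on thinking tokens per query.
heavy_pretrain = {"train": 100e6, "per_query": 0.002}
heavy_thinking = {"train": 20e6, "per_query": 0.010}

n_star = break_even_queries(heavy_pretrain["train"], heavy_pretrain["per_query"],
                            heavy_thinking["train"], heavy_thinking["per_query"])
print(f"break-even at about {n_star:.2e} queries")
```

Below the break-even volume the inference-heavy option is cheaper overall; above it the larger fixed training cost pays for itself.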

Data quality as the differentiator [§ How AI is Trained]

OLMo 3 (AI2) outperformed some competitors despite using less total training data. Nathan’s explanation: data quality, not quantity.

The Common Crawl data refinement process:

Download hundreds of trillions of raw tokens from the web
Apply scientific classifiers to filter quality
Resample by source domain (GitHub, arXiv, Stack Exchange, Reddit) at tuned ratios
Train small ‘data selection’ models on candidate mixes, measure downstream evals, and fit a linear regression to find the best blend

Changing the evaluation criteria changes the optimal data mix substantially: ‘coding’ evals shift the mix toward GitHub/Stack Exchange; ‘reasoning’ evals shift it toward arXiv and maths sources.
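
The mix-selection step described above can be sketched as follows, assuming NumPy. Every number here is made up for illustration (the proxy-run mixes, the eval scores, the 0.1 grid and the 10% floor per source); it is not AI2's pipeline, only one way a linear fit over mixes could be used to pick a blend.

```python
import itertools
import numpy as np

SOURCES = ["github", "arxiv", "stack_exchange", "reddit"]

# Fraction of each source in six hypothetical proxy-model runs (rows sum to 1).
mixes = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.40, 0.10, 0.10],
])

# Made-up downstream scores for those runs, one vector per eval suite.
scores = {
    "coding": np.array([0.67, 0.37, 0.55, 0.31, 0.475, 0.52]),
    "reasoning": np.array([0.36, 0.66, 0.42, 0.36, 0.45, 0.51]),
}

def fit_source_weights(mixes, y):
    """Least-squares fit of score ~ mix @ w.

    No intercept: the fractions sum to 1, so an intercept column would be collinear.
    """
    w, *_ = np.linalg.lstsq(mixes, y, rcond=None)
    return w

def candidate_mixes(min_tenths=1):
    """All mixes on a 0.1 grid with at least min_tenths/10 of every source."""
    out = []
    for a, b, c in itertools.product(range(min_tenths, 11), repeat=3):
        d = 10 - a - b - c
        if d >= min_tenths:
            out.append([a / 10, b / 10, c / 10, d / 10])
    return np.array(out)

def best_mix(w, candidates):
    """Candidate mix with the highest predicted score."""
    return candidates[np.argmax(candidates @ w)]

candidates = candidate_mixes()
for suite, y in scores.items():
    w = fit_source_weights(mixes, y)
    print(suite, dict(zip(SOURCES, best_mix(w, candidates).tolist())))
```

Swapping the eval suite changes the fitted weights and therefore the chosen blend. A purely linear fit always pushes toward the most extreme mix it is allowed, which is why the sketch enforces a floor per source.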

High-signal sources identified: arXiv papers, PDFs from Semantic Scholar (openly accessible scientific papers), Reddit (surprisingly), Stack Exchange. Frontier labs consumed these years before smaller labs.

Open-weight landscape [§ Open Source vs Closed Source LLMs]

Nathan’s categorisation (as of 2026):

Chinese (larger, MoE-heavy, higher peak performance):

DeepSeek, Moonshot (Kimi), MiniMax, Z.ai, Alibaba (Qwen)

Western transparency-focused (smaller, fully open):

AI2 (OLMo), Hugging Face (SmolLM), NVIDIA (Nemotron), Stanford (Marin community project)
OpenAI’s gpt-oss: ‘first open-weight model trained with tool use in mind’; a notable return to open-weight releases

Nathan on Llama: ‘RIP’, implying that Meta’s Llama line has been eclipsed.

The business model for Chinese open releases: use open-weight distribution as a wedge into markets where security or payment barriers block paid API sales to US companies. Government incentives can sustain this strategy for years regardless of commercial viability.

Sebastian’s insight: tool use trained into the model (gpt-oss) tackles hallucination through verification rather than memorisation. A model that can search for ‘Who won the 1998 World Cup?’ doesn’t need to memorise the answer.
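
A minimal sketch of the verify-instead-of-memorise pattern. The search backend is replaced by a tiny dictionary, which is an assumption made for illustration; `search_tool` and `answer_with_tool` are hypothetical names, not part of gpt-oss or any real API.

```python
from typing import Callable, Optional

# Stand-in for a real search backend; this dict is an illustrative assumption.
TOY_INDEX = {"who won the 1998 world cup?": "France"}

def search_tool(query: str) -> Optional[str]:
    """Return a retrieved answer, or None when nothing is found."""
    return TOY_INDEX.get(query.strip().lower())

def answer_with_tool(question: str, tool: Callable[[str], Optional[str]]) -> str:
    """Answer from retrieved evidence, and abstain rather than guess when there is none."""
    evidence = tool(question)
    if evidence is None:
        return "unverified: no evidence found"
    return evidence

print(answer_with_tool("Who won the 1998 World Cup?", search_tool))
print(answer_with_tool("Who won the 2090 World Cup?", search_tool))
```

The design choice is that a missing lookup produces an explicit abstention instead of a plausible-sounding guess.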

The ‘no winner’ thesis [§ China vs US]

Both guests reject winner-takes-all framing. Key argument: no company currently holds exclusive technology. Ideas diffuse rapidly through researcher mobility, open papers, and open-weight models. The differentiator will be budget and hardware access, not proprietary ideas.

The leapfrogging pattern: competitors implement DeepSeek’s innovations → DeepSeek responds → the cycle continues. The most recent model is typically best, but leadership is temporary.

Company building [§ Coding and Agents]

Both guests use AI coding tools daily with different preferences:

Sebastian: Codeium + VS Code; prefers control, dislikes fully agentic workflows
Nathan: multi-session parallel queries with thinking models
Lex: Claude Code as a ‘macroscopic programming environment’ — designs in English, understands deeply

Nathan’s observation: Claude Code performs measurably better than alternatives for agentic coding. The differentiation is visible when running identical prompts across tools.

AGI as task completion rate [§ Future of AI and AGI]

Sebastian’s reframing: rather than debating AGI definitions, track the percentage of complex multi-step tasks completed reliably. Current estimate: 30–40%. At 90–95%, the philosophical distinction between ‘AGI’ and ‘very powerful AI’ becomes practically meaningless.

Nathan’s position: ‘AGI’ is semantically awkward. Systems dramatically more capable than today across most cognitive tasks will exist within years. Whether that’s AGI is definitional. The economic pressure to deploy autonomous agents is already intense enough that capability and deployment will co-evolve rather than capability preceding deployment by years.