Scaling Laws

Scaling laws describe how the performance of a large language model on the next-token prediction task changes as a function of two variables: N (number of parameters) and D (amount of training data). The relationship is smooth, well-behaved, and predictable — and shows no sign of saturation at scales reached to date.

The core finding

Given only N and D, model loss (lower loss means better next-token prediction) can be predicted with high confidence. Train a bigger model on more data and the loss will reliably fall. Algorithmic improvements are a welcome bonus but not required for progress: on this view, scaling alone is enough to keep improving.
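
The note does not state a functional form for this prediction. As a minimal sketch, assuming the parametric form and fitted constants reported by Hoffmann et al. (2022, the "Chinchilla" paper), the loss can be written as L(N, D) = E + A/N^alpha + B/D^beta. The constants and the function name `predicted_loss` are illustrative assumptions, not values taken from this note.

```python
# Illustrative sketch: parametric loss curve L(N, D) = E + A / N**alpha + B / D**beta.
# The functional form and constants follow the fit reported by Hoffmann et al. (2022);
# they are assumptions for illustration, not values stated in this note.

E = 1.69       # irreducible loss (floor the curve approaches)
A = 406.4      # coefficient on the parameter-count term
B = 410.7      # coefficient on the training-data term
ALPHA = 0.34   # exponent for N (number of parameters)
BETA = 0.28    # exponent for D (number of training tokens)


def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted next-token loss for a model with n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


if __name__ == "__main__":
    # Scale N and D together and watch the predicted loss fall smoothly.
    for n, d in [(1e9, 20e9), (10e9, 200e9), (70e9, 1.4e12)]:
        print(f"N={n:.0e}, D={d:.0e} -> predicted loss {predicted_loss(n, d):.3f}")
```

Under this assumed form, increasing either N or D always lowers the predicted loss, which is the sense in which progress becomes forecastable from the two variables alone.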

Why this matters

Scaling laws convert AI progress from unpredictable research into a reliable engineering project: spend more on compute, get a better model. This predictability drove the GPU gold rush: NVIDIA’s dominance, billion-dollar data-centre buildouts, the race for more parameters and more data.

Practically: going from GPT-3.5 to GPT-4, nearly all benchmarks improved, a gain generally attributed to scale rather than to any single algorithmic breakthrough.

Limits of the scaling law insight

Scaling laws describe improvement on the next-token prediction loss. This correlates with performance on real tasks, but the relationship is not perfect for all tasks.
They say nothing about what tasks the model will be good or bad at — that is a question of data composition, not scale alone.
They do not predict the emergence of specific abilities (e.g., multi-step reasoning), which can appear abruptly at certain scales.

The sceptic’s challenge

Gary Marcus contests the engineering-project framing at its root. His claim is that scaling the next-token-prediction loss does not buy the capabilities that matter — reliable instruction-following, reasoning, stable world models — so the smooth curve improves a metric while the real gaps (hallucination, brittleness outside the training distribution) persist. He treats the paradigm as having ‘run out of headroom’, and adds an economic corollary the GPU-economy logic ignores: if every lab scales the same architecture, the result is commoditisation with no moat and no profit, not durable advantage. Whether scale dissolves the remaining gaps or merely repackages them is the central split in The Road to AGI.

Relationship to the GPU economy

‘Elon Musk getting 100,000 GPUs in a single data centre? They’re all doing this — predicting the next token, faster.’ — Karpathy

H100 GPUs are optimised for the matrix multiplications that dominate Transformer training. Stacking them in clusters enables the parallel computation over billions of training windows that scaling laws promise will produce better models.
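
To give a feel for the arithmetic behind such clusters, here is a rough sketch using the common rule of thumb from the scaling-law literature that training takes about 6 * N * D floating-point operations. The per-GPU peak throughput (a round 1e15 FLOP/s) and the 40% utilisation figure are hypothetical placeholders, not measured H100 numbers; only the 100,000-GPU count comes from the quote above.

```python
# Rough sketch: estimate training compute with the approximation
# C ~= 6 * N * D FLOPs (forward plus backward pass), then convert to wall-clock time.
# peak_flops_per_gpu and utilisation are hypothetical placeholders, not measurements.

SECONDS_PER_DAY = 86_400


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for N parameters and D tokens."""
    return 6.0 * n_params * n_tokens


def cluster_days(total_flops: float, n_gpus: int,
                 peak_flops_per_gpu: float, utilisation: float) -> float:
    """Days needed if the cluster sustains utilisation * peak throughput."""
    effective_flops_per_second = n_gpus * peak_flops_per_gpu * utilisation
    return total_flops / effective_flops_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    flops = training_flops(n_params=70e9, n_tokens=1.4e12)
    days = cluster_days(flops, n_gpus=100_000,
                        peak_flops_per_gpu=1e15,  # assumed round number
                        utilisation=0.4)          # assumed utilisation
    print(f"Training FLOPs: {flops:.2e}")
    print(f"Cluster time:   {days:.2f} days")
```

Because the estimate is linear in both N and D, a tenfold increase in each multiplies the compute bill by a hundred, which is why cluster size becomes the binding constraint.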

Three scaling axes (Nathan Lambert, 2026)

In State of AI in 2026, Nathan Lambert and Sebastian Raschka introduced the most useful reframing of scaling laws to date: not one lever but three independent axes, all still active.

Axis	Signal	Cost structure	Status (2026)
Pre-training	Model/data size → loss	Fixed training cost, permanent capability	Slowing ROI; not dead
RL with verifiable rewards	Correct-answer checks on maths/code	Cheaper per unit of capability; unlocks latent capability rather than adding new	Active; drove 2025 breakthroughs
Inference-time compute	Thinking-token budget → output quality	Accrues per query; theoretically unlimited	Active; currently dominant

The AI lab that correctly prices these trade-offs — not necessarily the one with the most raw compute — wins. Pre-training is the permanent foundation; the other two axes decide how much capability is extracted from it per dollar.

Nathan Lambert’s observation: serving costs for massive models (billions of dollars for hundreds of millions of users) now dwarf training costs (OLMo 3: $2M; DeepSeek R1: $5M at cloud rates). This changes the ROI calculus for pre-training investment substantially.
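
The comparison can be made concrete with back-of-envelope arithmetic. In the sketch below, only the $5M training figure comes from the text above; the user count, queries per user, and cost per query are hypothetical values chosen purely to illustrate how a per-query cost accumulates against a one-off training cost.

```python
# Back-of-envelope sketch: one-off training cost versus recurring serving cost.
# Only TRAINING_COST_USD is taken from the note; every other number is a
# hypothetical placeholder chosen for illustration.

TRAINING_COST_USD = 5e6          # from the note (DeepSeek R1 at cloud rates)
USERS = 300e6                    # hypothetical: "hundreds of millions of users"
QUERIES_PER_USER_PER_DAY = 10    # hypothetical
COST_PER_QUERY_USD = 0.002       # hypothetical


def daily_serving_cost(users: float, queries_per_user_per_day: float,
                       cost_per_query_usd: float) -> float:
    """Total inference spend per day across all users."""
    return users * queries_per_user_per_day * cost_per_query_usd


def days_until_serving_exceeds_training(training_cost_usd: float,
                                        daily_cost_usd: float) -> float:
    """Days of serving after which cumulative inference spend passes training."""
    return training_cost_usd / daily_cost_usd


if __name__ == "__main__":
    daily = daily_serving_cost(USERS, QUERIES_PER_USER_PER_DAY, COST_PER_QUERY_USD)
    breakeven = days_until_serving_exceeds_training(TRAINING_COST_USD, daily)
    print(f"Daily serving cost:  ${daily:,.0f}")
    print(f"Annual serving cost: ${daily * 365:,.0f}")
    print(f"Serving spend passes training spend after {breakeven:.2f} days")
```

Under these assumed inputs, the recurring term overtakes the fixed term within days, which is the shape of the argument even if the real per-query figures differ.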

The policy consequence

The scaling-laws bet is also the foundation of US technology policy toward China. As Chris McGuire recounts in U.S.-China AI Race Escalates, Chip Bans Aren't Working, and a Lesson from Nuclear Proliferation, the Biden administration wagered in 2022 that if capability tracks compute, then denying China the most advanced chips would deny it the frontier — the logic behind Compute Export Controls. The three-axis reframe above complicates that bet: if inference-time compute and RL, not pre-training scale, increasingly decide capability, the choke point may bind less tightly than the original one-lever theory assumed.