Scaling Laws
Scaling laws describe how the performance of a large language model on the next-token prediction task changes as a function of two variables: N (number of parameters) and D (amount of training data). The relationship is smooth, well-behaved, and predictable — and shows no sign of saturation at scales reached to date.
The core finding
Given only N and D, model loss (lower loss means better next-token prediction) can be predicted with high confidence. Train a bigger model on more data and performance will reliably improve. Algorithmic improvements are a welcome bonus but not required for progress: at the scales reached to date, scaling alone has been enough to lower the loss.
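One common way to make "predictable from N and D" concrete is a parametric fit in which loss falls as a power law in each variable toward an irreducible floor. The sketch below assumes that functional form; the name `predicted_loss` and every constant in it are illustrative placeholders, not fitted values from this note.

```python
def predicted_loss(n_params: float, n_tokens: float,
                   e: float = 1.7, a: float = 400.0, b: float = 400.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Assumed form: L(N, D) = E + A / N**alpha + B / D**beta.

    e is the irreducible loss floor; a, b, alpha, beta set how quickly the
    parameter and data terms shrink. All defaults are placeholder values.
    """
    return e + a / n_params**alpha + b / n_tokens**beta


if __name__ == "__main__":
    # Scale N and D together by 10x at each step and watch the loss fall.
    for n, d in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
        print(f"N={n:.0e}  D={d:.0e}  predicted loss={predicted_loss(n, d):.3f}")
```

Under this assumed form the curve is smooth and monotone, which is why a handful of small training runs can be used to extrapolate to a larger one. It also never drops below `e`, so each tenfold increase buys a smaller absolute gain than the last.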
Why this matters
Scaling laws convert AI progress from unpredictable research into a reliable engineering project: spend more on compute, get a better model. This predictability drove the GPU gold rush: NVIDIA’s dominance, billion-dollar data-centre buildouts, the race for more parameters and more data.
Practically: going from GPT-3.5 to GPT-4, nearly all benchmarks improved — not because of a single algorithmic breakthrough, but because of scale.
Limits of the scaling law insight
- Scaling laws describe improvement on the next-token prediction loss. This correlates with performance on real tasks, but the relationship is not perfect for all tasks.
- They say nothing about what tasks the model will be good or bad at — that is a question of data composition, not scale alone.
- They do not predict when specific abilities (e.g., multi-step reasoning) will emerge; such abilities can appear abruptly at certain scales.
Relationship to the GPU economy
“Elon Musk getting 100,000 GPUs in a single data centre? They’re all doing this — predicting the next token, faster.” — Karpathy
H100 GPUs are optimised for the matrix multiplications that dominate Transformer training. Stacking them in clusters enables the parallel computation over billions of training windows that scaling laws promise will produce better models.
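To see why cluster size matters, a widely used rule of thumb (not stated in this note) puts dense Transformer training compute at roughly 6 FLOPs per parameter per token. The sketch below applies it as a back-of-envelope estimate; `peak_flops_per_gpu` and `utilisation` are assumed round numbers, not measured H100 figures, and the example model size is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens


def wall_clock_days(total_flops: float, n_gpus: int,
                    peak_flops_per_gpu: float = 1e15,  # assumed peak, ~1 PFLOP/s
                    utilisation: float = 0.4) -> float:  # assumed fraction of peak
    """Days of wall-clock time if the cluster sustains the assumed throughput."""
    sustained_flops_per_s = n_gpus * peak_flops_per_gpu * utilisation
    return total_flops / sustained_flops_per_s / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical run: 70B parameters trained on 1.4T tokens.
    flops = training_flops(70e9, 1.4e12)
    for gpus in (1_000, 10_000, 100_000):
        print(f"{gpus:>7} GPUs -> {wall_clock_days(flops, gpus):.2f} days")
```

The estimate ignores communication overhead, failures, and memory limits, so real runs take longer; it only shows that wall-clock time shrinks in proportion to the number of GPUs kept busy.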
Three scaling axes (Nathan Lambert, 2026)
On State of AI in 2026, Nathan Lambert and Sebastian Raschka offered the most useful reframe of scaling laws to date: scaling is not one lever but three independent axes, all still active.
| Axis | Signal | Cost structure | Status (2026) |
|---|---|---|---|
| Pre-training | Model/data size → loss | Fixed training cost, permanent capability | Slowing ROI; not dead |
| RL with verifiable rewards | Correct-answer checks on maths/code | Cheaper per unit of capability; unlocks capability rather than adding it | Active; drove 2025 breakthroughs |
| Inference-time compute | Thinking-token budget → output quality | Accrues per query; theoretically unlimited | Active; currently dominant |
The AI lab that correctly prices these trade-offs — not necessarily the one with the most raw compute — wins. Pre-training is the permanent foundation; the other two axes decide how much capability is extracted from it per dollar.
Nathan Lambert’s observation: serving costs for massive models (billions of dollars for hundreds of millions of users) now dwarf training costs (OLMo 3: $2M; DeepSeek R1: $5M at cloud rates). This changes the ROI calculus for pre-training investment substantially.
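The arithmetic behind that observation can be sketched as a break-even count: the number of queries at which cumulative serving cost matches the one-off training cost. The training figures below are the ones quoted above; the per-query token counts and the price per million tokens are hypothetical placeholders, as are the function names.

```python
def cost_per_query(thinking_tokens: int, answer_tokens: int,
                   usd_per_million_tokens: float) -> float:
    """Serving cost of one query, assuming a flat price per generated token."""
    return (thinking_tokens + answer_tokens) * usd_per_million_tokens / 1_000_000


def breakeven_queries(training_cost_usd: float, query_cost_usd: float) -> float:
    """Queries after which total serving spend equals the training spend."""
    if query_cost_usd <= 0:
        raise ValueError("query_cost_usd must be positive")
    return training_cost_usd / query_cost_usd


if __name__ == "__main__":
    # Hypothetical workload: 2,000 thinking tokens and 500 answer tokens
    # per query, priced at $1 per million generated tokens.
    per_query = cost_per_query(2_000, 500, usd_per_million_tokens=1.0)
    for name, training_cost in [("OLMo 3", 2_000_000), ("DeepSeek R1", 5_000_000)]:
        n = breakeven_queries(training_cost, per_query)
        print(f"{name}: serving cost passes training cost after {n:,.0f} queries")
```

Because inference-time compute accrues per query, raising the thinking-token budget lowers the break-even point in direct proportion, which is the trade-off a lab has to price.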