Notes — Dr. Fei-Fei Li on World Models

Four questions [Adler frame]

Q1 — What is it about?
The history of modern AI through the lens of Fei-Fei’s career (ImageNet → deep learning → LLMs), and the argument that spatial intelligence and world models are the necessary next step beyond language: AI must understand and reason in 3D physical space, not only in text.

Q2 — How is it argued?
From research practice and first-principles reasoning about what intelligence requires. The ImageNet argument (2006–2012) was inductive from perceptual science: humans learn from vast perceptual experience, therefore AI needs equivalent data. The world models argument is deductive from capability gaps: list what language models cannot do (count chairs in a video, derive Newton’s laws, navigate a disaster scene), observe that these tasks require spatial rather than linguistic reasoning, and conclude that spatial intelligence is the missing piece.

Q3 — Is it true?
The ImageNet claim: validated decisively by history. The big-data + neural-network + GPU recipe underlies every current frontier model.
The world models claim: compelling at the level of principle. Current AI demonstrably fails at tasks requiring spatial reasoning. Whether world models (as currently formulated) are the right path is genuinely open; other approaches (robotics, simulation, video pretraining) are also being explored. The Marble product is early-stage.
The bitter lesson in robotics claim: structurally sound. The objective–data alignment argument is well-reasoned and consistent with the broader bitter lesson literature. [§ Bitter lesson in robotics]

Q4 — What of it?
The practical implication for a builder: language is not the ceiling of AI capability; spatial reasoning, physical planning, and world understanding are the next frontier. Products that require 3D understanding (architecture, robotics, simulation, design) will be unlocked in the next AI wave, much as language products were unlocked post-ChatGPT.

Glossary

World model — an AI foundation system that creates navigable 3D environments from prompts and enables reasoning and interaction within them. Not a video generator (2D temporal sequence) but a 3D spatial structure. See World Models.

Spatial intelligence — Fei-Fei’s term for the AI capability to create, reason in, and interact with 3D physical environments. The complement to linguistic intelligence.

ImageNet — dataset of 15M labelled images across 22,000 object concepts, open-sourced by Fei-Fei’s lab (2006–2009). The data breakthrough that enabled the 2012 deep learning explosion.

AlexNet — the 2012 ImageNet Challenge winner, built by Geoff Hinton’s Toronto group using 2 NVIDIA GPUs. Validated the big-data + neural-network + GPU recipe that underlies all modern AI.

Bitter lesson — Richard Sutton’s observation, as paraphrased in the talk, that simple models with large data tend to outperform complex models with small data. See Bitter Lesson. Fei-Fei notes this is harder to apply to robotics because of the objective–data alignment problem.

HAI — Human-Centered AI Institute at Stanford. Co-founded by Fei-Fei Li in 2018. Focus: research, education, policy impact on AI. Involves hundreds of faculty across all 8 Stanford schools.

Marble — World Labs’ first product. Prompt-to-navigable-3D-world. Real-time generation on a single H100. 40× production time reduction in VFX use case.

Objective–data alignment — the match between what a model is trained to predict and what it is expected to output. LLMs have perfect alignment (predict next token → output tokens). Robots need to output actions in 3D worlds but cannot be trained on actions-in-3D-worlds data from the web.

ImageNet insight in detail

The pain point Fei-Fei identified c.2006: AI researchers were building increasingly sophisticated models (neural networks, Bayesian networks, support vector machines) but the models had no data. The field was model-obsessed and data-blind.

The insight: human learning is a big-data process. Evolution is a big-data process. The mathematical models were not wrong; they were starved. So the right investment was not a better model but a better dataset.

The execution challenge: 15M images at internet scale required automation. Fei-Fei’s lab used Amazon Mechanical Turk for labelling — one of the earliest large-scale crowdsourcing operations in AI research.

The taxonomy came from WordNet (a linguistic dictionary structure), applied to visual categories. 22,000 concepts, hierarchically organised.
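A minimal sketch of what a hierarchically organised label set makes possible, assuming a toy child-to-parent map. The concept names and the helpers `ancestors` and `same_at_level` are illustrative only; they are not taken from WordNet or from the ImageNet tooling.

```python
from typing import Dict, List, Optional

# Toy child -> parent map in the spirit of a WordNet-style hierarchy.
# These entries are made up for illustration.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "husky": "dog",
    "beagle": "dog",
    "furniture": "entity",
    "chair": "furniture",
}


def ancestors(concept: str) -> List[str]:
    """Return the chain from a concept up to the root, most specific first."""
    chain = [concept]
    while PARENT[chain[-1]] is not None:
        chain.append(PARENT[chain[-1]])
    return chain


def same_at_level(a: str, b: str, level: str) -> bool:
    """True if both concepts fall under the given coarser category."""
    return level in ancestors(a) and level in ancestors(b)


print(ancestors("husky"))
print(same_at_level("husky", "beagle", "dog"))
print(same_at_level("husky", "chair", "animal"))
```

The point of the hierarchy: one fine-grained label (husky) also yields the coarser labels above it (dog, animal), so the same image can serve recognition tasks at several levels of specificity.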

The annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) created competitive pressure that drove the whole field toward the dataset. 2012 was the culmination. [§ ImageNet history]

Three-ingredient recipe

Big data + neural networks + GPUs = modern AI. Fei-Fei’s framing:

Big data: ImageNet provided the first instance at scale for visual intelligence.
Neural networks: Hinton’s group used them in 2012; the transformer is a later, more complex evolution of the same paradigm.
GPUs: 2 gaming-grade NVIDIA GPUs in 2012; hundreds of thousands of data-centre GPUs in 2024.

ChatGPT and every current frontier model use these same three ingredients. The architecture (transformer vs. earlier networks) evolved; the recipe did not. [§ AI history section]

World models vs. language models

The capability gap Fei-Fei highlights:

‘Ask AI today to count the chairs in an office-room video — something a toddler can do. AI cannot. Ask it to derive Newtonian mechanics from celestial body data it wasn’t trained on. Impossible.’

These failures are not language failures. They require spatial reasoning, visual attention, and physical world understanding, which current language models lack as a consequence of their architecture.

Plato’s cave analogy: vision is about reconstructing 3D from 2D. Spatial intelligence is deeper than video generation: it is the ability to model the underlying 3D world from which any 2D projection is derived.
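Why 3D-from-2D is hard can be shown with a standard pinhole camera model, which is my own illustration and not something from the talk. The sketch assumes an idealised camera at the origin looking along the Z axis; `project` and the sample points are hypothetical names and values.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project(point: Point3D, focal_length: float = 1.0) -> Point2D:
    """Pinhole projection: (X, Y, Z) -> (f*X/Z, f*Y/Z), defined for Z > 0."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    return (focal_length * x / z, focal_length * y / z)


near = (1.0, 2.0, 4.0)
far = (2.0, 4.0, 8.0)  # on the same ray from the camera, twice as far away

# Both 3D points land on the same 2D image point, so depth is lost.
print(project(near), project(far))
print(project(near) == project(far))
```

Every point along a ray collapses to one image point, so a single 2D view is consistent with many 3D scenes. Recovering the scene needs extra evidence (multiple views, motion, learned priors), which is the gap the analogy points at.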

Watson–Crick / DNA example: the double helix was deduced from Rosalind Franklin’s 2D X-ray diffraction photo. That deduction required 3D spatial reasoning that current AI cannot replicate. [§ World models section]

Bitter lesson in robotics

Sutton’s bitter lesson: simple model + big data > complex model + small data. Fei-Fei’s nuanced response:

Big data will play a role in robotics. The principle is directionally right.
But the alignment problem is real: you can’t just throw web videos at a robot. Web videos don’t contain 3D-grounded actions. Teleoperation data is expensive; synthetic data is approximate.
Hardware maturation is a separate bottleneck. Self-driving cars (which move on 2D surfaces and whose goal is to avoid touching things) took about 20 years to get from the 2005 DARPA Grand Challenge to commercial Waymo service. Robots operate in 3D and must manipulate objects, which is harder on both dimensions.
World models are one path to synthetic 3D action data, potentially resolving the alignment problem.

Conclusion: the bitter lesson is the right heuristic, but robotics needs new data strategies before the heuristic can fully apply. [§ Bitter lesson section]
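A toy sketch of the objective–data alignment point, under loud assumptions: the tokeniser is just a word list, and the `video_sample` dictionary with its field names is hypothetical, standing in for a web video that has frames but no recorded actions.

```python
from typing import List, Tuple


def next_token_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Turn a token sequence into (context, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


text_tokens = ["the", "robot", "lifts", "the", "cup"]
pairs = next_token_pairs(text_tokens)

# Every training target is itself a token: the same kind of thing
# the model is asked to output at inference time.
for context, target in pairs:
    print(context, "->", target)

# Hypothetical web-video sample: pixels are available, actions are not.
video_sample = {"frames": ["frame_0", "frame_1", "frame_2"], "actions": None}
has_action_labels = video_sample["actions"] is not None
print("action labels available:", has_action_labels)
```

Raw text supplies its own supervision for the thing an LLM must output. Raw web video supplies no supervision for the thing a robot must output, which is why teleoperation, synthetic data, or world models are needed to fill the gap.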

Human-centred AI

Fei-Fei’s consistent position across her 25+ year career: AI is a tool made by people, for people. Its trajectory is determined by human choices, not by the technology itself.

The ‘nothing artificial’ formulation is not a rhetorical trick — it is a policy stance: if AI impacts are human choices, humans can be held responsible. This is the philosophical foundation of HAI and of her Congressional testimony.

Everyone has a role: creators should use AI as a storytelling tool. Farmers should exercise civic voice. Nurses should expect AI augmentation. The frame rejects both utopian passivity (‘AI will fix everything’) and dystopian fatalism (‘AI will replace everyone’). [§ Humanist section]