Ilya Sutskever on the Age of Research, Continual Learning, and Alignment — Notes

Source-grounded literature notes. Citations point to the episode’s timestamped sections, e.g. [§ What are we scaling?]. Own words unless quoted.

Four questions [Adler frame]

Q1 — What is it about? A state-of-the-field conversation with Ilya Sutskever (OpenAI co-founder and former chief scientist, now CEO of SSI) making three linked claims: the field has moved from an age of scaling back to an age of research; the central unsolved problem is generalisation — models generalise dramatically worse than humans; and the path to safe superintelligence runs through continual learning plus aligning AI to care for sentient life.

Q2 — How is it argued? Dialectically and by analogy. Ilya reasons mostly through analogies (the two competitive-programming students; the stroke patient who lost emotion and could no longer decide; the ‘superintelligent 15-year-old’) and introspective conjecture about human learning, evolution, and the brain. He is conspicuously guarded — he repeatedly declines to detail his actual research ideas because ‘we live in a world where not all machine learning ideas are discussed freely.’

Q3 — Is it true? These are calibrated conjectures from arguably the field’s best research taste, not results. The generalisation-gap claim is empirically anchored (the eval/real-world disconnect is observable). The proposed solutions are explicitly withheld. The evolutionary and neuro-speculations he flags as speculation and even refutes himself (the brain-region ‘GPS coordinates’ hypothesis [§ Alignment]). The argument that SSI’s research compute is ‘comparable’ to rivals’ is self-interested. The 5–20-year timeline is a wide band. Read this as a leading researcher’s top-down beliefs, deliberately non-specific on mechanism.

Q4 — What of it? A reweighting of where value now lies: ideas over compute again (‘more companies than ideas’); generalisation and continual learning over scaling; alignment reframed as ‘care for sentient life’ plus a capability cap; and research taste understood as a top-down aesthetic conviction that sustains you when experiments contradict you. Plus a sociological prediction: as AI becomes visibly powerful, labs and governments will turn paranoid and start cooperating on safety.

Glossary

Age of scaling → age of research — Ilya’s periodisation: 2012–2020 research, 2020–2025 scaling (the recipe known, compute the differentiator), 2025+ back to research ‘just with big computers’. See Age of Research.
Generalisation gap — the crux: models generalise far worse than people, in both sample efficiency and robustness, even in domains (maths, coding) too recent for an evolutionary prior. See Generalization Gap.
Value function — in RL, an estimate that lets you assign credit before a trajectory finishes (lose a chess piece → know it was bad without playing on). Ilya argues human emotions are an evolution-hardcoded value function — simple, robust, and necessary for effective agency.
Reward-hacking the evals — Ilya’s organisational-level reward hacking: researchers, wanting good launch numbers, build RL environments inspired by the benchmarks, so eval scores climb while real-world performance lags. See Evals.
Pre-training — the scaling recipe (compute + all the data + a sized net → predictable gains); low-risk for companies, but the data is finite and ‘will run out’. See Pretraining.
Continual learning — learning from deployment, the way a human learns on the job; Ilya’s reframing of superintelligence as a fast learner, not a finished omniscient artefact. See Continual Learning.
AGI / narrow AI — terms Ilya says shaped (and distorted) thinking: ‘AGI’ was a reaction to ‘narrow AI’, and with pre-training it ‘overshot’ — a human is not an AGI; we rely on continual learning.
Straight-shot superintelligence — SSI’s strategy of building superintelligence away from market pressure and releasing only when ready; Ilya now hedges it toward gradual deployment.
Care for sentient life — Ilya’s proposed alignment target: easier than aligning to humans alone because the AI will itself be sentient, so empathy-like properties (cf. mirror neurons) may emerge.
Feel / feel the AGI — Ilya’s recurring point that future AI is too different from today’s to imagine, so you must show it; even most AI workers can’t ‘feel the AGI’.
Research taste — choosing directions by an aesthetic — beauty, simplicity, correct inspiration from the brain — held as a top-down belief. See Research Taste.

Claims by section

§ Explaining model jaggedness

The opening puzzle: models ace hard evals yet their economic impact lags, and they make absurd mistakes — the vibe-coding loop where the model fixes bug A by reintroducing bug B and oscillates forever. Two explanations. The whimsical one: RL training makes models too single-minded and narrowly aware. The structural one: pre-training never forced a choice of data (‘the answer was everything’), but RL forces teams to hand-build environments, and there are so many degrees of freedom that they ‘inadvertently’ take inspiration from the evals. Dwarkesh’s gloss, which Ilya likes: ‘the real reward hacking is the human researchers who are too focused on the evals.’ Combine eval-shaped RL with genuinely inadequate generalisation and you can explain the eval/real-world disconnect ‘that we don’t today even understand’.

§ Emotions and value functions

A value function lets you learn before a task ends. The chess example: you know losing a piece was bad without finishing the game. Today’s RL (o1, R1 ‘ostensibly’) often learns only from the final graded solution; a value function would short-circuit that. Ilya connects this to a stroke patient who lost emotional processing: still articulate and able to solve puzzles, but unable to decide (hours to pick socks, terrible financial choices). The lesson: the human value function is ‘modulated by emotions in some important way that’s hardcoded by evolution’, simple enough to perhaps map out yet robust across a world utterly unlike the ancestral one (though imperfect: hunger misfires amid abundant food). He waves off the DeepSeek-R1 worry that intermediate-trajectory value is too hard to learn as ‘such lack of faith in deep learning’.
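The chess intuition in this section can be sketched with tabular TD(0), a standard textbook way to learn a value function. This is a minimal illustration and not anything described in the episode: the toy states, the transition probabilities, and the names TRANSITIONS, TERMINAL_REWARD, step and td0_values are all assumptions made up for the example.

```python
import random

# Hypothetical toy "game": reward arrives only at the end (win = 1, loss = 0).
# TRANSITIONS maps each non-terminal state to (next_state, probability) pairs.
TRANSITIONS = {
    "start": [("even", 0.5), ("lost_piece", 0.5)],
    "even": [("win", 0.7), ("loss", 0.3)],
    "lost_piece": [("win", 0.1), ("loss", 0.9)],
}
TERMINAL_REWARD = {"win": 1.0, "loss": 0.0}


def step(state, rng):
    """Sample the next state from the toy transition table."""
    next_states, probs = zip(*TRANSITIONS[state])
    return rng.choices(next_states, weights=probs, k=1)[0]


def td0_values(episodes=5000, alpha=0.05, seed=0):
    """Estimate state values with tabular TD(0), undiscounted."""
    rng = random.Random(seed)
    values = {state: 0.0 for state in TRANSITIONS}
    for _ in range(episodes):
        state = "start"
        while state in TRANSITIONS:
            nxt = step(state, rng)
            if nxt in TERMINAL_REWARD:
                # Terminal step: the only place an actual reward is observed.
                target = TERMINAL_REWARD[nxt]
            else:
                # Bootstrapped target: use the current estimate of the next state.
                target = values[nxt]
            values[state] += alpha * (target - values[state])
            state = nxt
    return values


values = td0_values()
for name, value in values.items():
    print(f"{name}: {value:.2f}")
```

Under these assumed probabilities, the learned estimate for "lost_piece" should end up well below the one for "even". In this toy setting, that is the sense in which a value function signals that a position is bad before the game has finished.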

§ What are we scaling?

‘Scaling’ is ‘one word, but such a powerful word because it informs people what to do’ — an example of language shaping thought. Pre-training was a recipe: mix compute, data, and a sized net and you reliably improve — which companies love because it’s a low-risk way to deploy capital, unlike telling researchers to ‘go forth and research’. But pre-training data is finite and will run out (he notes rumours that Gemini squeezed more from pre-training). Now compute is huge and the question is no longer ‘scale more’ but ‘is this the most productive use of compute?’ RL already consumes more compute than pre-training at some labs (long rollouts, little learning per rollout). So we are ‘back to the age of research again, just with big computers’. See Age of Research, Scaling Laws, Pretraining, Bitter Lesson.

§ Why humans generalise better than models

The crux. Two sub-questions: sample efficiency (why so much more data?) and teachability/robustness (why is it so hard to teach a model what a mentored human picks up without verifiable rewards?). Evolution may explain priors for vision, hearing, locomotion — but not maths and coding, which are too recent; that humans are reliable and robust there ‘is more an indication that people might have just better machine learning, period’. A five-year-old recognises cars from very low-diversity data; a teenager drives after ~10 hours using their own internal value function, no external teacher. ‘The robustness of people is really staggering.’ How humans do it is the question he has ‘a lot of opinions about’ but won’t share; he flags one possible blocker — human neurons may do more compute than assumed. See Generalization Gap, Jagged Intelligence.

§ Straight-shotting superintelligence

Back in an age of research, the bottleneck shifts from compute to ideas: ‘more companies than ideas by quite a bit’, and ‘if ideas are so cheap, how come no one’s having any ideas?’ He recalls the compute history — AlexNet on two GPUs, the transformer on 8–64 GPUs of 2017 (‘two GPUs of today’) — to argue research does not require the absolute largest cluster, only enough to convince. On SSI’s $3bn: rivals’ bigger raises are largely earmarked for inference and product staff, so the gap in research compute is ‘a lot smaller’. SSI’s straight-shot (build superintelligence insulated from the ‘rat race’, release when ready) may bend for two reasons: timelines might be long, and there’s real value in powerful AI being visible in the world.

§ SSI’s model will learn from deployment

Two words ‘shaped everyone’s thinking’: AGI (a reaction to ‘narrow AI’) and pre-training (which made people expect uniform, general gains from one phase). Both overshot: ‘a human being is not an AGI’, because we have a foundation of skills plus heavy reliance on continual learning. So superintelligence is better framed as a ‘superintelligent 15-year-old’ (knows little, learns fast, eager), deployed to learn on the job through trial and error, ‘a process, as opposed to you dropping the finished thing’. Even a straight-shot would be released gradually. Dwarkesh draws out the implication: a single fast-learning model, with instances deployed across the economy and amalgamating their learnings (as humans cannot), could become functionally superintelligent without recursive self-improvement. Ilya expects rapid economic growth whose exact pace is hard to predict, faster where regulation is friendlier. See Continual Learning.

§ Alignment

Ilya’s mind has shifted toward deploying AI incrementally and in advance because ‘it’s very hard to feel the AGI’, like imagining being old while young; even AI researchers can’t picture it, so ‘you’ve got to be showing the thing’. His predictions as AI visibly gains power: fierce competitors will start collaborating on safety (the early OpenAI–Anthropic step, which he forecast ~three years prior); governments and the public will demand action; and labs will grow far more paranoid once the AI ‘starts to feel powerful’ through its capability rather than its mistakes. ‘The whole problem is the power.’

His positive proposal: instead of the field’s single locked-in idea (self-improving AI, locked in because there are fewer ideas than companies), build an AI robustly aligned to care for sentient life — plausibly easier than caring for humans alone, because the AI will itself be sentient and empathy may emerge from modelling others with the self-model (mirror neurons). He concedes the criterion may be wrong (most sentient beings would be AIs), wants a ‘short list’ of alignment ideas labs can reach for, and thinks capping the most powerful system’s power ‘would be materially helpful’. A long digression marvels that evolution reliably hard-codes high-level social desires (status, standing) — not just chemical drives like smell — and he refutes his own neat hypothesis (that evolution targets fixed brain-region ‘GPS coordinates’) with cases of cortical remapping and hemispherectomy. Alignment difficulty, he notes, may itself be ‘instances of unreliable generalisation’.

§ Age of research company

SSI’s differentiator is its technical approach — some ideas around understanding generalisation he wants to test; ‘we are squarely an age of research company’ and have made ‘quite good progress over the past year’. On his co-founder Daniel Gross leaving for Meta: he reframes it factually — SSI was fundraising at a $32bn valuation, Meta offered to acquire SSI, Ilya said no, his co-founder ‘in some sense said yes’, took near-term liquidity, and was the only one to go. He predicts convergence, first on alignment strategy and probably later on technical approach: as AI grows more powerful it will become clear to everyone that the goal is a first superintelligence that is aligned, cares for sentient life, and is democratic.

§ Self-play and multi-agent

On why models from different labs are so similar: pre-training on near-identical data; differentiation now emerges in RL/post-training. Self-play interested him as a way to make models from compute alone, without data (compelling if data is the bottleneck), but classic competitive self-play is too narrow — good only for negotiation, conflict, strategising. It found a home in a different form: debate and prover-verifier / LLM-as-judge adversarial setups. More generally, competition between agents creates a natural incentive to differentiate — a route to genuine diversity of approaches, which raising temperature (‘just results in gibberish’) does not give. He doubts massed copies of one mind help: ‘you want people who think differently’, so ‘a million Ilyas’ would hit diminishing returns.
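The temperature remark can be illustrated with a softmax over a handful of made-up logits. This is only a sketch under stated assumptions: the logits are invented and the helper names softmax_with_temperature and entropy are hypothetical. It shows that raising temperature flattens the sampling distribution toward uniform, which adds randomness to one model's outputs and does not by itself produce a differently thinking agent.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means closer to uniform."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [4.0, 2.0, 0.5, 0.0, -1.0]  # hypothetical next-token scores
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 5.0)

print("T=0.5 entropy:", round(entropy(sharp), 3))
print("T=5.0 entropy:", round(entropy(flat), 3))
print("uniform entropy:", round(math.log(len(logits)), 3))
```

At the higher temperature the entropy moves toward the uniform limit, meaning every token becomes nearly as likely as any other. That is one way to read the ‘gibberish’ comment.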

§ Research taste

His final, personal account: taste is guided by ‘an aesthetic of how AI should be, by thinking about how people are, but thinking correctly’. The artificial neuron, distributed representations, learning from experience — each a correct, simplifying inspiration from the brain (the folds probably don’t matter; the many neurons do). The test is ‘beauty, simplicity, elegance, correct inspiration from the brain’, all present at once. Crucially, this yields a top-down belief that ‘sustains you when the experiments contradict you’ — because a contradicting result might just be a bug, and conviction that ‘something like this has to work’ is what tells you to keep debugging rather than abandon the direction. See Research Taste.