Notes — Intro to Large Language Models
Source: raw/llm-intro-script.md | Author: Andrej Karpathy | Nov 2023
Four questions [Adler frame]
Q1 — What is it about as a whole? A one-hour busy-person introduction to LLMs: what they are, how they’re built, what they can do, what’s coming, and what security challenges they face. The LLM OS framing is the organising metaphor.
Q2 — How is it argued? Concrete examples throughout: Llama 2 70B (two-file model), Scale AI valuation demo (tool use), Greg Brockman website sketch (vision), AlphaGo (RL analogy), GPT App Store (customisation). The OS analogy provides structure for the final synthesis.
Q3 — Is it true, in whole or part? The technical pipeline description is accurate and consistent with published papers. Some forward-looking claims (System 2, self-improvement) were aspirational in 2023; the RL sections in “Deep Dive” (2025) show how they have since partially materialised. Scaling law claims remain empirically supported.
Q4 — What of it? The OS framing remains generative: it correctly predicted the direction of development (tool use, multi-agent, customisation). The scaling law observation continues to drive the GPU economy. The reward-criterion problem for self-improvement was correctly identified as the key bottleneck.
Glossary
- Parameters file — the trained weights; for Llama 2 70B, 140 GB (70B × 2 bytes/float16).
- Run file — the code to execute the network; ~500 lines of C.
- Lossy compression — the training process compresses the internet into the weights; unlike lossless ZIP, information is lost. ~100× compression ratio.
- Base model — network after pretraining; internet document simulator.
- Assistant model — network after fine-tuning; follows Q&A format.
- Fine-tuning — continuing training on human-labelled conversations; cheap relative to pretraining.
- Scaling laws — model performance as a smooth, predictable function of N (parameters) and D (data).
- RLHF — human preference rankings used as a proxy reward signal.
- Reversal curse — knowledge appears directional: A → B known; B → A unknown.
- System 1 / System 2 — Kahneman’s framework; LLMs (2023) are System 1 only.
- GPT App Store — OpenAI’s customisation layer for creating task-specific GPTs.
- LLM OS — LLM as kernel process of a new computing paradigm.
- RAG — retrieval-augmented generation; fetching relevant documents into context.
Key claims by section
LLM inference [§ LLM Inference]
- Two files: parameters file (140 GB for Llama 2 70B at float16) and run file (~500 lines C). [§ LLM Inference]
- Fully self-contained; runs offline on a MacBook. No internet required. [§ LLM Inference]
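The 140 GB figure follows directly from the parameter count and the storage precision. A minimal sketch of that arithmetic, assuming every parameter is stored at the same precision and ignoring any file-format overhead (the function name and the extra dtypes are illustrative, not from the talk):

```python
# Back-of-envelope size of a dense model's parameters file.
# Assumption: all parameters share one precision; headers and tokenizer
# data are ignored because the notes do not cover them.

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def params_file_gb(n_params: float, dtype: str = "float16") -> float:
    """Return the approximate parameters-file size in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"70B parameters at {dtype}: {params_file_gb(70e9, dtype):.0f} GB")
```

Only the float16 row corresponds to the figure quoted in the notes; the other precisions are shown for comparison.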
LLM training [§ LLM Training]
- ~10 TB text, ~6,000 GPUs, ~12 days, ~$2M for Llama 2 70B. [§ LLM Training]
- ~100× lossy compression of training text into parameters. [§ LLM Training]
- State-of-the-art models (GPT-4, Claude): multiply these figures by 10+. [§ LLM Training]
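The rounded figures above can be cross-checked against each other. The sketch below uses only the numbers in these notes; the derived values (raw compression ratio, GPU-hours, implied price per GPU-hour) are rough consequences of those roundings, not figures from the talk:

```python
# Cross-check of the rounded Llama 2 70B training figures quoted above.
TRAIN_TEXT_BYTES = 10e12   # ~10 TB of text
PARAMS_BYTES = 140e9       # 140 GB parameters file (float16)
N_GPUS = 6_000             # ~6,000 GPUs
DAYS = 12                  # ~12 days
COST_USD = 2e6             # ~$2M

# Ratio of training text to parameters file: same order of magnitude as ~100x.
compression_ratio = TRAIN_TEXT_BYTES / PARAMS_BYTES

# Total compute time and the price per GPU-hour these figures imply.
gpu_hours = N_GPUS * DAYS * 24
usd_per_gpu_hour = COST_USD / gpu_hours

if __name__ == "__main__":
    print(f"compression ratio  ~{compression_ratio:.0f}x")
    print(f"GPU-hours          ~{gpu_hours:,}")
    print(f"implied USD/GPU-hr ~{usd_per_gpu_hour:.2f}")
```

The raw ratio comes out nearer 70× than 100×, which is consistent with the quoted "~100×" being an order-of-magnitude statement.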
LLM dreams [§ LLM Dreams]
- Inference = sampling from the learned distribution → generates internet-like documents. [§ LLM Dreams]
- ISBN in hallucinated product listing: model knows ISBN format; fills in a plausible number. [§ LLM Dreams]
- Some output is memorised; some is novel generation consistent with training distribution. [§ LLM Dreams]
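"Sampling from the learned distribution" can be sketched with a toy stand-in for the network. Everything here is hypothetical: `TOY_MODEL` is a hand-written table of next-token probabilities, whereas a real LLM computes those probabilities from its parameters and the full preceding context.

```python
import random

# Hypothetical toy "model": next-token probabilities given the previous token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.7, "</s>": 0.3},
    "sat": {"</s>": 1.0},
}


def sample_document(model, rng, max_tokens=20):
    """Generate tokens one at a time by sampling from the model's distribution."""
    tokens = []
    prev = "<s>"
    for _ in range(max_tokens):
        dist = model[prev]
        # Draw the next token in proportion to its probability.
        nxt = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if nxt == "</s>":
            break
        tokens.append(nxt)
        prev = nxt
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(" ".join(sample_document(TOY_MODEL, rng)))
```

Each run produces text that is plausible under the table without being copied from any stored document, which is the sense in which the output is "dreamed".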
Mechanistic interpretability [§ How Do They Work?]
- We understand the mathematical operations exactly; we don’t know what the parameters are doing. [§ How Do They Work?]
- Reversal curse: GPT-4 answers “Who is Tom Cruise’s mother?” (Mary Lee Pfeiffer) but fails the reverse question, “Who is Mary Lee Pfeiffer’s son?” [§ How Do They Work?]
Fine-tuning [§ Fine-Tuning into an Assistant]
- Switch dataset: internet documents → ~100,000 human-labelled Q&A conversations. [§ Fine-Tuning into an Assistant]
- Quality over quantity; human labellers (Upwork, Scale AI) with detailed instructions. [§ Comparisons, Labeling Instructions, RLHF, and Synthetic Data]
- Pretraining: knowledge. Fine-tuning: alignment. [§ Fine-Tuning into an Assistant]
Scaling laws [§ LLM Scaling Laws]
- Next-word prediction accuracy = f(N, D) where N = parameters, D = training tokens. Smooth and predictable. [§ LLM Scaling Laws]
- No sign of saturation. A bigger model trained on more text reliably gives a better model. [§ LLM Scaling Laws]
- Driving the GPU gold rush and NVIDIA’s valuation. [§ LLM Scaling Laws]
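The notes state only that performance is a smooth function of N and D. One way to picture that is a sum of power-law terms, a parameterisation common in the scaling-law literature. The functional form is an assumption here, not something the talk specifies, and the constants are hypothetical placeholders chosen only to give a plausible curve shape.

```python
def predicted_loss(n_params, n_tokens,
                   e=1.7, a=400.0, alpha=0.34, b=400.0, beta=0.28):
    """Illustrative loss L(N, D) = e + a / N**alpha + b / D**beta.

    All five constants are hypothetical placeholders, not fitted values.
    Lower loss means better next-word prediction.
    """
    return e + a / n_params**alpha + b / n_tokens**beta


if __name__ == "__main__":
    tokens = 1e12  # hypothetical training-set size
    for n in (7e9, 70e9, 700e9):
        print(f"N={n:.0e}, D={tokens:.0e}: loss ~{predicted_loss(n, tokens):.3f}")
```

Under this form the loss falls smoothly as either N or D grows and never drops below the floor `e`. That floor is a property of the assumed formula and says nothing about whether real models saturate.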
Tool use [§ Tool Use]
- ChatGPT demo: Scale AI valuation table → plot → trend line → image. Entirely in natural language. [§ Tool Use]
- Tool use = integrating existing computing infrastructure with language. [§ Tool Use]
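A rough sketch of how a runtime might connect model output to existing code. The JSON message format, the `TOOLS` registry, and `run_tool_call` are all hypothetical, invented for illustration; this is not ChatGPT's actual protocol. The trend-line tool echoes the demo, and the data points are made up.

```python
import json


def linear_fit(xs, ys):
    """Ordinary least-squares line through the points (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


# Hypothetical registry mapping tool names to ordinary Python functions.
TOOLS = {"linear_fit": linear_fit}


def run_tool_call(model_output):
    """Run the tool requested in model_output, or return None for plain text."""
    try:
        message = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # ordinary text: nothing to execute
    if not isinstance(message, dict) or message.get("tool") not in TOOLS:
        return None
    result = TOOLS[message["tool"]](**message["args"])
    # The result would be appended to the context so the model can use it.
    return json.dumps(result)


if __name__ == "__main__":
    request = '{"tool": "linear_fit", "args": {"xs": [0, 1, 2, 3], "ys": [1, 3, 5, 7]}}'
    print(run_tool_call(request))
```

The model never does the arithmetic itself. It emits a structured request, ordinary software computes the answer, and the result comes back as text.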
Multimodality [§ Multimodality]
- Vision: Brockman sketch → working HTML/JS website. [§ Multimodality]
- Speech-to-speech: iOS app; purely conversational interface. [§ Multimodality]
System 1 vs System 2 [§ Thinking: System 1 and System 2]
- LLMs (2023) are System 1: each token gets the same compute budget. [§ Thinking: System 1 and System 2]
- System 2 aspiration: “take 30 minutes, think it through.” Not yet available (2023). [§ Thinking: System 1 and System 2]
- Goal: accuracy monotonically increasing with time allowed. [§ Thinking: System 1 and System 2]
AlphaGo / self-improvement [§ Self-Improvement and the AlphaGo Analogy]
- Stage 1 (SL): plateaus below best human level. [§ Self-Improvement and the AlphaGo Analogy]
- Stage 2 (RL): surpasses best human level in 40 days. [§ Self-Improvement and the AlphaGo Analogy]
- LLM bottleneck: no automatic reward function for general language. Narrow domains (with reward functions) may be tractable. [§ Self-Improvement and the AlphaGo Analogy]
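To make "narrow domains with reward functions" concrete, here is a minimal sketch of an automatically checkable reward, using integer addition as a stand-in domain. The function is hypothetical and not from the talk. The point is that the score needs no human judgement, which is what open-ended language lacks.

```python
def arithmetic_reward(a: int, b: int, model_answer: str) -> float:
    """Return 1.0 if the answer text parses to a + b, otherwise 0.0."""
    try:
        return 1.0 if int(model_answer.strip()) == a + b else 0.0
    except ValueError:
        # Unparseable answers earn no reward.
        return 0.0


if __name__ == "__main__":
    for answer in ("5", "6", "five"):
        print(repr(answer), arithmetic_reward(2, 3, answer))
```

For a question such as "write a good summary" no comparable checker exists, which is the bottleneck the notes describe.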
LLM OS [§ The LLM OS]
- LLM = kernel process. Context window = RAM. Disk/internet = storage. Tools = peripherals. [§ The LLM OS]
- Proprietary (GPT, Claude, Gemini) : open-source (Llama) :: Windows/macOS : Linux. [§ The LLM OS]
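The "context window = RAM, disk = storage" mapping can be sketched as a fixed-size buffer that pages documents in from a store. This is a loose illustration of the analogy only. The class, the whitespace token count, the oldest-first eviction, and the keyword-overlap retrieval are all simplifying assumptions, not how any real system is described in the talk.

```python
from collections import deque


class ContextWindow:
    """Fixed token budget; the oldest chunks are evicted when it overflows."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.chunks = deque()
        self.used = 0

    def load(self, text):
        n = len(text.split())  # assumption: one whitespace-separated word = one token
        if n > self.max_tokens:
            raise ValueError("chunk larger than the whole context window")
        while self.used + n > self.max_tokens:
            evicted = self.chunks.popleft()
            self.used -= len(evicted.split())
        self.chunks.append(text)
        self.used += n


def retrieve(query, disk):
    """Return the stored document sharing the most words with the query."""
    words = set(query.lower().split())
    return max(disk.values(), key=lambda doc: len(words & set(doc.lower().split())))


if __name__ == "__main__":
    disk = {
        "scaling": "scaling laws relate parameters and data",
        "tools": "tool use calls a calculator",
    }
    ctx = ContextWindow(max_tokens=8)
    ctx.load("user asks about scaling laws")
    ctx.load(retrieve("what are scaling laws", disk))
    print(list(ctx.chunks), ctx.used)
```

Loading the retrieved document pushes the earlier chunk out of the window, much as limited RAM forces paging. Retrieval-augmented generation corresponds to the `retrieve` step.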