Notes — Intro to Large Language Models

Notes on Andrej Karpathy — standalone talk, November 2023.

Four questions [Adler frame]

Q1 — What is it about as a whole? A one-hour busy-person introduction to LLMs: what they are, how they’re built, what they can do, what’s coming, and what security challenges they face. The LLM OS framing is the organising metaphor.

Q2 — How is it argued? Concrete examples throughout: Llama 2 70B (two-file model), Scale AI valuation demo (tool use), Greg Brockman website sketch (vision), AlphaGo (RL analogy), GPT App Store (customisation). The OS analogy provides structure for the final synthesis.

Q3 — Is it true, in whole or part? The description of the technical pipeline is accurate and consistent with published papers. Some forward-looking claims (System 2, self-improvement) were aspirational in 2023; the RL sections of Karpathy’s later ‘Deep Dive into LLMs like ChatGPT’ (2025) show how they have since partially materialised. The scaling-law claims remain empirically supported.

Q4 — What of it? The OS framing remains generative: it correctly predicted the direction of development (tool use, multi-agent, customisation). The scaling law observation continues to drive the GPU economy. The reward-criterion problem for self-improvement was correctly identified as the key bottleneck.

Glossary

Parameters file — the trained weights; for Llama 2 70B, 140 GB (70B × 2 bytes/float16).
Run file — the code to execute the network; ~500 lines of C.
Lossy compression — the training process compresses the internet into the weights; unlike lossless ZIP, information is lost. ~100× compression ratio.
Base model — network after pretraining; internet document simulator.
Assistant model — network after fine-tuning; follows Q&A format.
Fine-tuning — continuing training on human-labelled conversations; cheap relative to pretraining.
Scaling laws — model performance as a smooth, predictable function of N (parameters) and D (data).
RLHF — human preference rankings used as a proxy reward signal.
Reversal curse — knowledge appears directional: A → B known; B → A unknown.
System 1 / System 2 — Kahneman’s framework; LLMs (2023) are System 1 only.
GPT App Store — OpenAI’s customisation layer for creating task-specific GPTs.
LLM OS — LLM as kernel process of a new computing paradigm.
RAG — retrieval-augmented generation; fetching relevant documents into context.
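The RAG entry can be illustrated with a toy retriever. This is a minimal sketch, assuming plain word-overlap scoring in place of the embedding search that real systems typically use; `tokenize`, `retrieve`, `build_prompt`, and the sample documents are hypothetical names introduced here, not anything from the talk.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=1):
    """Return the k documents that share the most words with the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Place the retrieved text in the context ahead of the question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical mini-corpus, written for this example only.
DOCS = [
    "The parameters file for Llama 2 70B is about 140 GB.",
    "AlphaGo improved through self-play reinforcement learning.",
    "The run file is roughly 500 lines of C.",
]

print(build_prompt("How large is the parameters file?", DOCS))
```

The model itself is unchanged; only the text placed in its context window differs from one query to the next.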

Key claims by section

LLM inference [§ LLM Inference]

Two files: parameters file (140 GB for Llama 2 70B at float16) and run file (~500 lines C). [§ LLM Inference]
Fully self-contained: runs on a MacBook with no internet connection. [§ LLM Inference]
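The 140 GB figure follows directly from the parameter count and the storage width. A minimal sketch of the arithmetic, assuming decimal gigabytes (1 GB = 1e9 bytes); the function name is introduced here for illustration.

```python
def parameters_file_gb(n_params, bytes_per_param=2):
    """Size of the weights file in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

LLAMA_2_70B_PARAMS = 70e9  # 70 billion parameters

size_gb = parameters_file_gb(LLAMA_2_70B_PARAMS)  # float16 = 2 bytes each
print(f"{size_gb:.0f} GB")  # prints "140 GB"
```

Changing `bytes_per_param` shows how storage width scales the file: 4 bytes per parameter would double it to 280 GB.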

LLM training [§ LLM Training]

~10 TB text, ~6,000 GPUs, ~12 days, ~$2M for Llama 2 70B. [§ LLM Training]
~100× lossy compression of training text into parameters. [§ LLM Training]
State-of-the-art models (GPT-4, Claude): scale these figures up by 10× or more. [§ LLM Training]
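The training figures above can be cross-checked with a few lines of arithmetic. This is a rough sketch that treats the approximate numbers from the notes as exact; the per-GPU-hour rate is a derived quantity, not a figure quoted in the talk.

```python
TEXT_BYTES = 10e12    # ~10 TB of training text
PARAM_BYTES = 140e9   # 140 GB parameters file
GPUS = 6_000
DAYS = 12
COST_USD = 2e6

compression_ratio = TEXT_BYTES / PARAM_BYTES   # text bytes per parameter byte
gpu_hours = GPUS * DAYS * 24                   # total GPU-hours consumed
usd_per_gpu_hour = COST_USD / gpu_hours        # implied average rate

print(f"compression ~{compression_ratio:.0f}x")
print(f"{gpu_hours:,} GPU-hours at ~${usd_per_gpu_hour:.2f}/GPU-hour")
```

On these inputs the ratio comes out near 71×, so the "~100×" in the notes is best read as an order-of-magnitude figure.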

LLM dreams [§ LLM Dreams]

Inference = sampling from the learned distribution → generates internet-like documents. [§ LLM Dreams]
ISBN in hallucinated product listing: model knows ISBN format; fills in a plausible number. [§ LLM Dreams]
Some output is memorised; some is novel generation consistent with training distribution. [§ LLM Dreams]
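"Sampling from the learned distribution" can be shown at toy scale. The sketch below swaps the neural network for a bigram count table, which is an assumption made purely for illustration: it learns which word follows which, then generates text by repeatedly sampling the next word. All names and the tiny corpus are hypothetical.

```python
import random
from collections import defaultdict

def train_bigram(text):
    """Count word-to-next-word transitions in a whitespace-tokenised corpus."""
    counts = defaultdict(lambda: defaultdict(int))
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def sample(counts, start, max_words, rng):
    """Generate up to max_words by sampling each next word in proportion to its count."""
    out = [start]
    for _ in range(max_words - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # no observed continuation for this word
        words = list(followers)
        weights = [followers[w] for w in words]
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return " ".join(out)

CORPUS = "the cat sat on the mat . the dog sat on the rug ."
model = train_bigram(CORPUS)
print(sample(model, "the", 8, random.Random(0)))
```

Even here the output mixes memorised fragments with recombinations never seen verbatim in the corpus, which is the behaviour the notes describe for the ISBN example.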

Mechanistic interpretability [§ How Do They Work?]

We understand the mathematical operations exactly; we don’t know what the parameters are doing. [§ How Do They Work?]
Reversal curse: GPT-4 knows Tom Cruise → Mary Lee Pfeiffer; fails Mary Lee Pfeiffer → son. [§ How Do They Work?]
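A probe for this kind of directional knowledge can be sketched as a small harness that asks the same fact in both directions. The `stub_model` below is a hypothetical stand-in hard-coded to mimic the behaviour reported in the talk; it is not a real model call, and the prompt wording is an assumption.

```python
def reversal_check(ask, forward, reverse):
    """Ask a fact both ways; each argument is a (prompt, expected substring) pair."""
    return {
        "forward": forward[1].lower() in ask(forward[0]).lower(),
        "reverse": reverse[1].lower() in ask(reverse[0]).lower(),
    }

def stub_model(prompt):
    # Hypothetical stand-in: answers only the forward question.
    canned = {"Who is Tom Cruise's mother?": "Mary Lee Pfeiffer"}
    return canned.get(prompt, "I don't know.")

result = reversal_check(
    stub_model,
    forward=("Who is Tom Cruise's mother?", "Mary Lee Pfeiffer"),
    reverse=("Who is Mary Lee Pfeiffer's son?", "Tom Cruise"),
)
print(result)  # {'forward': True, 'reverse': False}
```

Replacing `stub_model` with a function that queries an actual model would turn this into a real measurement.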

Fine-tuning [§ Fine-Tuning into an Assistant]

Switch dataset: internet documents → ~100,000 human-labelled Q&A conversations. [§ Fine-Tuning into an Assistant]
Quality over quantity; human labellers (Upwork, Scale AI) with detailed instructions. [§ Comparisons, Labeling Instructions, RLHF, and Synthetic Data]
Pretraining: knowledge. Fine-tuning: alignment. [§ Fine-Tuning into an Assistant]
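The dataset switch can be pictured as a change in what one training document looks like. A minimal sketch, assuming a simple role-tagged text format; the `<USER>` and `<ASSISTANT>` markers, the function name, and the sample conversation are hypothetical and not any specific model's chat template.

```python
def render_conversation(turns):
    """Flatten a list of (role, text) pairs into one training document."""
    lines = [f"<{role.upper()}> {text}" for role, text in turns]
    return "\n".join(lines)

# Hypothetical labelled example of the kind a human labeller might write.
example = [
    ("user", "What is a parameters file?"),
    ("assistant", "It is the file holding the trained weights of the network."),
]

print(render_conversation(example))
```

The training objective stays next-token prediction; only the documents change from internet text to conversations like this one.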

Scaling laws [§ LLM Scaling Laws]

Performance = f(N, D) where N = parameters, D = training tokens. Smooth and predictable. [§ LLM Scaling Laws]
No sign of saturation: more compute reliably yields a better model. [§ LLM Scaling Laws]
Driving the GPU gold rush and NVIDIA’s valuation. [§ LLM Scaling Laws]
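The talk gives no formula for f(N, D). The sketch below assumes a sum-of-power-laws form of the kind used in the scaling-law literature, with constants that are illustrative placeholders rather than fitted values, purely to show what "smooth and predictable" means in practice.

```python
def predicted_loss(n_params, n_tokens, e=1.7, a=400.0, alpha=0.34, b=400.0, beta=0.28):
    """Assumed form: irreducible loss plus power-law terms in N and D.

    All constants are illustrative placeholders, not measured values.
    """
    return e + a / n_params**alpha + b / n_tokens**beta

# Sweep model size at a fixed data budget: the loss falls smoothly as N grows.
for n in (1e9, 1e10, 1e11):
    print(f"N={n:.0e}  loss={predicted_loss(n, 1e12):.3f}")
```

Under this assumed form, growing either N or D always lowers the predicted loss, which is the property that makes large training runs plannable in advance.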

Tool use [§ Tool Use]

ChatGPT demo: Scale AI valuation table → plot → trend line → image. Entirely in natural language. [§ Tool Use]
Tool use = integrating existing computing infrastructure with language. [§ Tool Use]
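Tool use amounts to the model emitting a structured request that ordinary software then executes. A minimal sketch of that dispatch step, assuming a dict-shaped request and a single calculator tool; the request format and all names are hypothetical, not ChatGPT's actual protocol.

```python
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def calculator(expression):
    """Evaluate +, -, *, / arithmetic by walking the syntax tree (no eval())."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

TOOLS = {"calculator": calculator}

def run_tool_call(call):
    """Dispatch a model-emitted request such as {'tool': 'calculator', 'input': '2*3'}."""
    tool = TOOLS.get(call["tool"])
    if tool is None:
        return f"error: unknown tool {call['tool']!r}"
    return str(tool(call["input"]))

print(run_tool_call({"tool": "calculator", "input": "(150 + 50) * 2"}))  # prints "400"
```

The returned string would be appended to the model's context so that generation can continue with the result in view.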

Multimodality [§ Multimodality]

Vision: Brockman sketch → working HTML/JS website. [§ Multimodality]
Speech-to-speech: ChatGPT iOS app; purely conversational interface. [§ Multimodality]

System 1 vs System 2 [§ Thinking: System 1 and System 2]

LLMs (2023) are System 1: each token gets the same compute budget. [§ Thinking: System 1 and System 2]
System 2 aspiration: ‘take 30 minutes, think it through.’ Not yet available (2023). [§ Thinking: System 1 and System 2]
Goal: accuracy monotonically increasing with time allowed. [§ Thinking: System 1 and System 2]
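The talk states the goal but not a mechanism. One simple way to trade extra compute for accuracy is majority voting over repeated samples, used here only as a stand-in. The sketch assumes a yes/no question and independent samples that are each correct with probability p, and computes the exact accuracy of the majority answer.

```python
from math import comb

def majority_accuracy(p, k):
    """Probability that a strict majority of k independent samples is correct (k odd)."""
    if k % 2 == 0:
        raise ValueError("use an odd k so there are no ties")
    need = k // 2 + 1
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(need, k + 1))

# More samples means more compute per question; accuracy rises when p > 0.5.
for k in (1, 5, 25):
    print(k, round(majority_accuracy(0.6, k), 3))
```

The curve rises with k only when single-sample accuracy is above one half, so this illustrates the desired shape rather than a general recipe.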

AlphaGo / self-improvement [§ Self-Improvement and the AlphaGo Analogy]

Stage 1 (SL): plateaus below best human level. [§ Self-Improvement and the AlphaGo Analogy]
Stage 2 (RL): surpasses best human level in 40 days. [§ Self-Improvement and the AlphaGo Analogy]
LLM bottleneck: no automatic reward function for general language. Narrow domains (with reward functions) may be tractable. [§ Self-Improvement and the AlphaGo Analogy]
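A narrow domain with an automatic reward looks like the sketch below: integer addition, where any candidate answer can be checked by a program without a human rater. The domain, function name, and candidate strings are assumptions chosen for illustration.

```python
def arithmetic_reward(a, b, answer):
    """Return 1.0 when the candidate text equals a + b, otherwise 0.0."""
    try:
        return 1.0 if int(answer.strip()) == a + b else 0.0
    except ValueError:
        return 0.0  # not parseable as an integer

# Hypothetical model outputs for the question "23 + 34 = ?"
candidates = ["56", "57", "fifty-seven"]
rewards = [arithmetic_reward(23, 34, c) for c in candidates]
print(rewards)  # [0.0, 1.0, 0.0]
```

For open-ended language tasks no comparable checker exists, which is the bottleneck the notes identify.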

LLM OS [§ The LLM OS]

LLM = kernel process. Context window = RAM. Disk/internet = storage. Tools = peripherals. [§ The LLM OS]
Proprietary (GPT, Claude, Gemini) : open-source (Llama) :: Windows/macOS : Linux. [§ The LLM OS]
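The "context window = RAM" part of the analogy can be made concrete with a fixed-capacity buffer. This is a minimal sketch, assuming whitespace tokens and a drop-oldest policy when the window is full; real systems may instead truncate, summarise, or reject over-long input, and the class name is hypothetical.

```python
from collections import deque

class ContextWindow:
    """Fixed-capacity token buffer: the oldest tokens fall out when it is full."""

    def __init__(self, capacity):
        self.tokens = deque(maxlen=capacity)

    def write(self, text):
        """Append whitespace-separated tokens, evicting the oldest on overflow."""
        self.tokens.extend(text.split())

    def read(self):
        """Return everything currently visible to the model."""
        return " ".join(self.tokens)

window = ContextWindow(capacity=4)
window.write("a b c")
window.write("d e")
print(window.read())  # prints "b c d e"
```

Anything evicted from the window is gone unless it is fetched again from storage (disk or internet), which is where the retrieval and tool-use parts of the analogy come in.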