Reading Notes

Intro to Large Language Models

Source: raw/llm-intro-script.md | Author: Andrej Karpathy | Nov 2023


Four questions [Adler frame]

Q1 — What is it about as a whole? A one-hour busy-person introduction to LLMs: what they are, how they’re built, what they can do, what’s coming, and what security challenges they face. The LLM OS framing is the organising metaphor.

Q2 — How is it argued? Concrete examples throughout: Llama 2 70B (two-file model), Scale AI valuation demo (tool use), Greg Brockman website sketch (vision), AlphaGo (RL analogy), GPT App Store (customisation). The OS analogy provides structure for the final synthesis.

Q3 — Is it true, in whole or part? The technical pipeline description is accurate and consistent with published papers. Some forward-looking claims (System 2, self-improvement) were aspirational in 2023; the RL sections in “Deep Dive” (2025) show how they have since partially materialised. Scaling law claims remain empirically supported.

Q4 — What of it? The OS framing remains generative: it correctly predicted the direction of development (tool use, multi-agent, customisation). The scaling law observation continues to drive the GPU economy. The reward-criterion problem for self-improvement was correctly identified as the key bottleneck.


Glossary

  • Parameters file — the trained weights; for Llama 2 70B, 140 GB (70B × 2 bytes/float16).
  • Run file — the code to execute the network; ~500 lines of C.
  • Lossy compression — the training process compresses the internet into the weights; unlike lossless ZIP, information is lost. ~100× compression ratio.
  • Base model — network after pretraining; internet document simulator.
  • Assistant model — network after fine-tuning; follows Q&A format.
  • Fine-tuning — continuing training on human-labelled conversations; cheap relative to pretraining.
  • Scaling laws — model performance as a smooth, predictable function of N (parameters) and D (data).
  • RLHF — human preference rankings used as a proxy reward signal.
  • Reversal curse — knowledge appears directional: A → B known; B → A unknown.
  • System 1 / System 2 — Kahneman’s framework; LLMs (2023) are System 1 only.
  • GPT App Store — OpenAI’s customisation layer for creating task-specific GPTs.
  • LLM OS — LLM as kernel process of a new computing paradigm.
  • RAG — retrieval-augmented generation; fetching relevant documents into context.

Key claims by section

LLM inference [§ LLM Inference]

  • Two files: parameters file (140 GB for Llama 2 70B at float16) and run file (~500 lines of C). [§ LLM Inference]
  • Fully self-contained: runs on a MacBook with no internet connection. [§ LLM Inference]
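
The 140 GB figure follows directly from the parameter count and the width of a float16 value. A minimal sketch of that arithmetic (the function name is illustrative and not from the talk; 1 GB is taken as 10^9 bytes):

```python
# Back-of-the-envelope size of a parameters file.
# The inputs (70B parameters, 2 bytes per float16 value) come from the notes;
# the helper name is a hypothetical one chosen for this sketch.

def parameters_file_gb(n_params: int, bytes_per_param: int = 2) -> float:
    """Return the approximate file size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


llama2_70b_gb = parameters_file_gb(70_000_000_000)
print(f"Llama 2 70B at float16: ~{llama2_70b_gb:.0f} GB")
```

The same helper shows why lower-precision formats shrink the file: passing `bytes_per_param=1` halves the result.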

LLM training [§ LLM Training]

  • ~10 TB text, ~6,000 GPUs, ~12 days, ~$2M for Llama 2 70B. [§ LLM Training]
  • ~100× lossy compression of training text into parameters. [§ LLM Training]
  • State-of-the-art models (GPT-4, Claude): multiply these figures by 10+. [§ LLM Training]
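
The training figures above can be cross-checked with simple arithmetic. The sketch below uses only the approximate numbers quoted in these notes; the variable names are illustrative. Dividing ~10 TB by 140 GB gives roughly 70×, which is the same order of magnitude as the rounded ~100× quoted above.

```python
# Approximate figures quoted in the notes for Llama 2 70B.
TRAIN_TEXT_BYTES = 10e12   # ~10 TB of training text
PARAM_FILE_BYTES = 140e9   # ~140 GB parameters file
N_GPUS = 6_000
DAYS = 12
COST_USD = 2_000_000

# How much smaller the weights are than the text they were trained on.
compression_ratio = TRAIN_TEXT_BYTES / PARAM_FILE_BYTES

# Total GPU time and the implied price per GPU-hour.
gpu_hours = N_GPUS * DAYS * 24
usd_per_gpu_hour = COST_USD / gpu_hours

print(f"compression ratio: ~{compression_ratio:.0f}x")
print(f"GPU-hours: {gpu_hours:,}")
print(f"implied cost: ~${usd_per_gpu_hour:.2f} per GPU-hour")
```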

LLM dreams [§ LLM Dreams]

  • Inference = sampling from the learned distribution → generates internet-like documents. [§ LLM Dreams]
  • ISBN in hallucinated product listing: model knows ISBN format; fills in a plausible number. [§ LLM Dreams]
  • Some output is memorised; some is novel generation consistent with training distribution. [§ LLM Dreams]
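
To make "sampling from the learned distribution" concrete, here is a toy sketch that swaps the neural network for a bigram word table. This is a deliberate simplification and not how Llama 2 works; the corpus, function names, and seed are all made up for illustration. The loop has the same shape as LLM inference: look at the text so far, sample a next word, append it, and repeat.

```python
import random
from collections import defaultdict


def train_bigram(text: str) -> dict[str, list[str]]:
    """Record, for every word, the words that followed it in the training text."""
    followers = defaultdict(list)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return dict(followers)


def dream(model: dict[str, list[str]], start: str, max_words: int,
          rng: random.Random) -> str:
    """Generate text by repeatedly sampling a follower of the last word."""
    words = [start]
    for _ in range(max_words - 1):
        candidates = model.get(words[-1])
        if not candidates:  # the last word never had a follower in training
            break
        # Duplicates in the list make frequent followers proportionally likelier.
        words.append(rng.choice(candidates))
    return " ".join(words)


corpus = "the model reads the web and the model writes the web back"
model = train_bigram(corpus)
sample = dream(model, "the", 8, random.Random(0))
print(sample)
```

Every adjacent word pair in the output was seen in training, but the full sentence need not appear in the corpus. That is a small-scale version of the mix of memorised and novel output described above.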

Mechanistic interpretability [§ How Do They Work?]

  • We understand the mathematical operations exactly; we don’t know what the parameters are doing. [§ How Do They Work?]
  • Reversal curse: GPT-4 knows Tom Cruise → Mary Lee Pfeiffer; fails Mary Lee Pfeiffer → son. [§ How Do They Work?]

Fine-tuning [§ Fine-Tuning into an Assistant]

  • Switch dataset: internet documents → ~100,000 human-labelled Q&A conversations. [§ Fine-Tuning into an Assistant]
  • Quality over quantity; human labellers (Upwork, Scale AI) with detailed instructions. [§ Comparisons, Labeling Instructions, RLHF, and Synthetic Data]
  • Pretraining: knowledge. Fine-tuning: alignment. [§ Fine-Tuning into an Assistant]

Scaling laws [§ LLM Scaling Laws]

  • Performance = f(N, D) where N = parameters, D = training tokens. Smooth and predictable. [§ LLM Scaling Laws]
  • No sign of saturation: a larger model trained on more text is reliably a better model. [§ LLM Scaling Laws]
  • Driving the GPU gold rush and NVIDIA’s valuation. [§ LLM Scaling Laws]
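
One common way to write a "smooth and predictable" f(N, D) is a sum of power laws in N and D. The sketch below assumes that functional form and uses placeholder constants chosen only for illustration; neither the form nor the numbers come from the talk, and they are not fitted values. It models loss, so lower is better.

```python
def predicted_loss(n_params: float, n_tokens: float, *,
                   irreducible: float = 1.5,
                   a: float = 400.0, alpha: float = 0.3,
                   b: float = 400.0, beta: float = 0.3) -> float:
    """Hypothetical power-law loss: falls smoothly as N or D grows.

    All constants are illustrative placeholders, not measured values.
    """
    return irreducible + a / n_params ** alpha + b / n_tokens ** beta


# Same data budget, ten times the parameters: predicted loss goes down.
small = predicted_loss(7e9, 2e12)
large = predicted_loss(70e9, 2e12)
print(f"7B params:  {small:.3f}")
print(f"70B params: {large:.3f}")
```

Under this form the loss never drops below the `irreducible` term, but every increase in N or D still moves it in the right direction, which matches the "no sign of saturation" observation.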

Tool use [§ Tool Use]

  • ChatGPT demo: Scale AI funding rounds → valuation table (browser) → imputed values (calculator) → plot and trend line (Python) → image (DALL-E). Driven entirely in natural language. [§ Tool Use]
  • Tool use = integrating existing computing infrastructure with language. [§ Tool Use]
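
A minimal sketch of what "integrating existing computing infrastructure with language" can look like on the runtime side. The JSON request format, the `TOOLS` registry, and the hard-coded model output are all hypothetical; real systems define their own protocols. The idea is that the model emits text naming a tool, ordinary code runs it, and the result is handed back as text.

```python
import json


def calculator(a: float, op: str, b: float) -> float:
    """A stand-in for existing infrastructure the model can delegate to."""
    operations = {
        "+": lambda x, y: x + y,
        "-": lambda x, y: x - y,
        "*": lambda x, y: x * y,
        "/": lambda x, y: x / y,
    }
    return operations[op](a, b)


# Hypothetical registry mapping tool names to ordinary Python functions.
TOOLS = {"calculator": calculator}


def run_tool_call(model_output: str) -> str:
    """Parse a JSON tool request and return the tool's result as text."""
    request = json.loads(model_output)
    tool = TOOLS[request["tool"]]
    return str(tool(**request["arguments"]))


# Hard-coded stand-in for text a model might emit; no model is called here.
fake_model_output = '{"tool": "calculator", "arguments": {"a": 6000, "op": "*", "b": 12}}'
observation = run_tool_call(fake_model_output)
print(observation)
```

In a full loop, `observation` would be appended to the context so the model can continue writing with the computed value available.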

Multimodality [§ Multimodality]

  • Vision: Brockman sketch → working HTML/JS website. [§ Multimodality]
  • Speech-to-speech: iOS app; purely conversational interface. [§ Multimodality]

System 1 vs System 2 [§ Thinking: System 1 and System 2]

  • LLMs (2023) are System 1: each token gets the same compute budget. [§ Thinking: System 1 and System 2]
  • System 2 aspiration: “take 30 minutes, think it through.” Not yet available (2023). [§ Thinking: System 1 and System 2]
  • Goal: accuracy monotonically increasing with time allowed. [§ Thinking: System 1 and System 2]

AlphaGo / self-improvement [§ Self-Improvement and the AlphaGo Analogy]

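  • Stage 1 (supervised learning, imitating expert human games): plateaus below best human level. [§ Self-Improvement and the AlphaGo Analogy]
  • Stage 2 (reinforcement learning via self-play): surpasses best human level in 40 days. [§ Self-Improvement and the AlphaGo Analogy]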
  • LLM bottleneck: no automatic reward function for general language. Narrow domains (with reward functions) may be tractable. [§ Self-Improvement and the AlphaGo Analogy]

LLM OS [§ The LLM OS]

  • LLM = kernel process. Context window = RAM. Disk/internet = storage. Tools = peripherals. [§ The LLM OS]
  • Proprietary (GPT, Claude, Gemini) : open-source (Llama) :: Windows/macOS : Linux. [§ The LLM OS]
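
The memory-hierarchy half of the analogy can be sketched as paging: documents live on "disk" and only the most relevant ones are loaded into a size-limited context window, which is also the basic idea behind RAG. Everything below is a toy under stated assumptions: the word-overlap scoring, the word-count budget, and the sample documents are invented for illustration, and real systems typically use embeddings and token counts instead.

```python
def page_into_context(query: str, disk: dict[str, str],
                      budget_words: int) -> list[str]:
    """Pick the documents most relevant to the query that fit in the budget."""
    query_words = set(query.lower().split())

    def overlap(text: str) -> int:
        return len(query_words & set(text.lower().split()))

    # Rank stored documents by naive word overlap with the query.
    ranked = sorted(disk.items(), key=lambda item: overlap(item[1]), reverse=True)

    loaded, used = [], 0
    for name, text in ranked:
        cost = len(text.split())
        if overlap(text) == 0 or used + cost > budget_words:
            continue  # irrelevant, or would overflow the context window
        loaded.append(name)
        used += cost
    return loaded


# Made-up "disk" contents for the sketch.
disk = {
    "scaling.txt": "performance improves smoothly with parameters and data",
    "tools.txt": "the model calls a calculator or a browser",
    "os.txt": "the context window acts like ram for the model",
}
context = page_into_context("how does the context window relate to ram", disk, 12)
print(context)
```

With a 12-word budget only the best-matching document fits; raising the budget lets lower-ranked documents in, much as more RAM lets a process keep more of its working set resident.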