Notes — Deep Dive into LLMs like ChatGPT

Notes on Andrej Karpathy's standalone talk, February 2025.

Four questions [Adler frame]

Q1 — What is it about as a whole? A comprehensive technical walkthrough of the full LLM pipeline — from internet data collection through pretraining, post-training (SFT + RL + RLHF), deployment, and security — aimed at a general audience. The goal is mental models for using LLMs effectively.

Q2 — How is it argued? Concrete examples throughout: FineWeb (data), GPT-2 (training demo), Llama 3.1 405B (base model inference), DeepSeek-R1 (RL), AlphaGo (RL analogy). The security section uses real-world attack examples and explains the mechanism of each. Technical claims are illustrated rather than proved.

Q3 — Is it true, in whole or part? The pipeline description matches the published literature (InstructGPT, DeepSeek-R1, AlphaGo). The security examples were accurate at time of recording; some specific exploits will have been patched. The RLHF reward hacking concern is well-established. The ‘models need tokens to think’ insight is empirically supported.

Q4 — What of it? The key practical upshot: know the model’s failure modes (hallucination, jagged intelligence, arithmetic errors) and compensate by using tools, putting information in context, and verifying outputs. The RL section suggests thinking models are qualitatively different from SFT-only models for hard reasoning tasks.

Glossary

Token — integer ID in a vocabulary of ~100,000; the atomic unit of text in an LLM.
Pretraining — training on internet text to predict the next token; produces a base model.
Post-training — SFT + RL + RLHF applied after pretraining to produce an assistant model.
Base model — internet document simulator; no assistant behaviour.
SFT (supervised fine-tuning) — training on human-labelled Q&A conversations.
RLHF — reinforcement learning using human preference rankings as a proxy reward signal.
Reward model — a neural network trained to simulate human preference; queried by RL.
Jailbreak — a prompt that circumvents safety training.
Prompt injection — malicious instructions embedded in content the model reads.
Data poisoning — backdoor attack via controlled training data.
Jagged intelligence — Swiss-cheese capability profile; brilliant in most places, arbitrary gaps.

Key claims by section

Pretraining data [§ Pretraining Data]

FineWeb (Hugging Face) is ~44 TB after filtering; representative of production datasets. [§ Pretraining Data]
Common Crawl has been crawling the web since 2007 and had indexed 2.7 billion web pages as of 2024. [§ Pretraining Data]
Pipeline: URL filtering → text extraction → language filtering (FineWeb: >65% English) → deduplication → PII removal. [§ Pretraining Data]
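
The stages above can be sketched as a chain of filters. This is a toy illustration and not FineWeb's actual code: the blocklist, the ASCII-ratio check standing in for a language classifier, exact-hash deduplication, and the email regex are all assumptions chosen to keep the example self-contained. Text extraction is skipped; the input is assumed to be already-extracted text.

```python
import hashlib
import re

BLOCKED_DOMAINS = {"spam.example"}   # hypothetical URL blocklist
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def looks_english(text, threshold=0.65):
    # Crude stand-in for a language classifier: share of ASCII letters and spaces.
    if not text:
        return False
    hits = sum(ch.isascii() and (ch.isalpha() or ch.isspace()) for ch in text)
    return hits / len(text) > threshold

def clean_corpus(pages):
    """pages: iterable of (url, text) pairs; returns filtered, scrubbed texts."""
    seen = set()
    kept = []
    for url, text in pages:
        domain = url.split("/")[2] if "://" in url else url
        if domain in BLOCKED_DOMAINS:            # URL filtering
            continue
        if not looks_english(text):              # language filtering
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:                       # exact deduplication only
            continue
        seen.add(digest)
        kept.append(EMAIL_RE.sub("<EMAIL>", text))   # PII removal (emails only)
    return kept
```

Production pipelines use fuzzy deduplication and trained classifiers; the point here is only the order of the stages.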

Tokenisation [§ Tokenisation]

BPE starts from 256 byte symbols; iteratively merges frequent pairs. [§ Tokenisation]
GPT-4 uses 100,277 tokens. [§ Tokenisation]
Model sees tokens, not characters; spelling and counting tasks are structurally hard without tools. [§ Tokenisation]
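
A minimal sketch of the BPE training loop described above, assuming UTF-8 bytes as the starting symbols and a simple left-to-right merge. The function names `bpe_train` and `merge` are illustrative, not taken from any tokeniser library.

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def bpe_train(text, num_merges):
    """Toy byte-pair encoding: start from raw bytes, merge the most frequent pair."""
    ids = list(text.encode("utf-8"))      # initial symbols are byte values 0..255
    merges = {}                           # (left, right) -> new token id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                         # no pair repeats, nothing worth merging
        merges[pair] = next_id
        ids = merge(ids, pair, next_id)
        next_id += 1
    return ids, merges
```

For example, `bpe_train("aaabdaaabac", 3)` shortens the 11-byte input to 5 tokens. After merging, the model no longer sees the individual letters, which is why character-level tasks become awkward.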

Training [§ Neural Network I/O]

15-trillion-token sequence from FineWeb. [§ Neural Network I/O]
Network learns to predict next token; loss decreases over billions of updates. [§ Neural Network I/O]
GPT-2 (1.6B params, 1,024-token context): training cost was originally ~$40,000; reproducing it now costs ~$600 with llm.c. [§ GPT-2: Training and Inference]
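
The notes do not name the loss. The sketch below assumes cross-entropy, the usual choice for next-token prediction: the loss at one position is the negative log of the probability the network assigns to the token that actually came next.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)                            # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_id):
    """Cross-entropy for one position: -log p(target token)."""
    return -math.log(softmax(logits)[target_id])
```

With uniform scores over a vocabulary of size V the loss is log(V). Training lowers it by raising the score of the correct token relative to the others.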

Base model behaviour [§ Llama 3.1 Base Model Inference]

Base model is stochastic; same prompt → different result each time. [§ Llama 3.1 Base Model Inference]
High-quality sources (Wikipedia) are oversampled → near-verbatim recall. [§ Llama 3.1 Base Model Inference]
Hallucination beyond training cutoff: Llama 3 (cutoff end-2023) invented parallel-universe 2024 election outcomes. [§ Llama 3.1 Base Model Inference]
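
The stochastic behaviour comes from sampling: the model outputs a probability for every token and one is drawn at random. A minimal sketch follows. The `temperature` parameter is an assumption added for illustration; it is not mentioned in these notes.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Draw one token id from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                                   # stabilise the exponentials
    weights = [math.exp(x - m) for x in scaled]       # unnormalised probabilities
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

Calling this repeatedly on the same logits can return different token ids, which is why the same prompt produces different continuations.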

Post-training [§ Post-Training Data]

Conversation format: special tokens (<|im_start|>, <|im_sep|>, <|im_end|>). [§ Tokenizing Conversations]
InstructGPT (OpenAI, 2022): seminal paper on human-labelled assistant training. [§ What These Datasets Look Like]
What you’re talking to: ‘a statistical simulation of a human labeler’ — not a magical AI. [§ What You’re Actually Talking To]
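
A sketch of how a conversation might be flattened into one string before tokenisation, using the special tokens listed above. The exact layout (role name, then separator, then text) is an assumption and differs between models; `render_conversation` and `render_prompt` are hypothetical helper names.

```python
def render_conversation(turns):
    """turns: list of (role, text) pairs; returns the flat string to be tokenised."""
    parts = []
    for role, text in turns:
        parts.append(f"<|im_start|>{role}<|im_sep|>{text}<|im_end|>")
    return "".join(parts)

def render_prompt(turns):
    """Leave an assistant turn open so the model continues from that point."""
    return render_conversation(turns) + "<|im_start|>assistant<|im_sep|>"
```

At inference time the model simply continues the open assistant turn token by token until it emits the end token.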

Hallucination [§ Hallucinations]

Models learn the style of confident answers; the style applies even when knowledge is absent. [§ Hallucinations]
Llama 3 mitigation: generate Q&A from known documents; interrogate model; add refusal examples for consistently wrong answers. [§ Hallucinations]
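
The mitigation above can be sketched as a probing loop. Everything here is illustrative: `ask_model` is a hypothetical callable, the substring check stands in for whatever grader is actually used, and the refusal wording is invented.

```python
def build_refusal_examples(qa_pairs, ask_model, n_samples=3):
    """
    qa_pairs: (question, reference_answer) pairs generated from known documents.
    ask_model: hypothetical callable, question -> answer string.
    Returns training examples that teach the model to decline when it is
    wrong on every sampled attempt.
    """
    examples = []
    for question, reference in qa_pairs:
        answers = [ask_model(question) for _ in range(n_samples)]
        # Naive grading by substring match; an assumption for this sketch.
        if not any(reference.lower() in a.lower() for a in answers):
            examples.append({"user": question,
                             "assistant": "I'm sorry, I don't know."})
    return examples
```

Questions the model answers correctly at least once are left alone, so the refusal behaviour is tied to what the model actually does not know.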

Knowledge vs working memory [§ Knowledge vs. Working Memory]

Parametric knowledge: vague recollection of training corpus.
Context window knowledge: direct access to current content.
Implication: put specific information in context rather than asking the model to recall it. [§ Knowledge vs. Working Memory]
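
In practice this means pasting the source material into the prompt. A minimal sketch, with a hypothetical helper name and an arbitrary prompt wording:

```python
def build_prompt(question, reference_text):
    """Place the source material in the context window ahead of the question."""
    return (
        "Use the reference text below to answer.\n\n"
        f"Reference:\n{reference_text}\n\n"
        f"Question: {question}\n"
    )
```

The model can then read the reference directly instead of relying on its recollection of the training corpus.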

Tokens to think [§ Models Need Tokens to Think]

Each token gets a fixed compute budget (one forward pass).
Forcing the answer into the first token fails for anything requiring multi-step computation, because a single forward pass cannot carry out all the steps.
Chain-of-thought distributes computation; intermediate tokens are genuinely used. [§ Models Need Tokens to Think]

Reinforcement learning [§ Reinforcement Learning]

SFT teaches the model to imitate human solutions; RL lets the model discover its own.
DeepSeek-R1: RL on maths → accuracy increases, response length grows, chains of thought emerge without being programmed. [§ DeepSeek-R1]
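
The core idea of RL on checkable problems can be sketched as: sample many attempts, check the final answer, and reinforce the attempts that got it right. This is a simplified illustration and not the algorithm DeepSeek-R1 uses; `sample_solution` is a hypothetical callable, and the actual weight update is left out.

```python
def collect_rl_batch(problems, sample_solution, n_attempts=8):
    """
    problems: list of (prompt, correct_answer) pairs.
    sample_solution: hypothetical callable, prompt -> (reasoning_text, final_answer).
    Returns the attempts that reached the correct answer; these are the ones
    a training step would make more likely.
    """
    reinforced = []
    for prompt, correct in problems:
        for _ in range(n_attempts):
            reasoning, answer = sample_solution(prompt)
            reward = 1.0 if answer == correct else 0.0   # reward from a checkable answer
            if reward > 0:
                reinforced.append((prompt, reasoning, answer))
    return reinforced
```

Nothing in the loop dictates how the reasoning should look. Only the final answer is checked, which leaves the model free to discover its own solution strategies.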

AlphaGo analogy [§ AlphaGo]

Supervised-learning (SL) model plateaus below top human level; RL surpasses it.
Move 37: a move estimated to have only a 1-in-10,000 chance of being played by a human; in retrospect brilliant. Discovered, not taught. [§ AlphaGo]

RLHF [§ Reinforcement Learning from Human Feedback]

For human labellers, ranking is easier than scoring, which is easier than generating (discriminator-generator gap). [§ RLHF]
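Reward hacking: a nonsensical output such as ‘the the the’ can score 1.0 on a reward model, so RLHF training must be stopped early, before the policy learns to exploit the reward model. [§ RLHF]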
RLHF ≠ scalable RL. It polishes; true RL requires a rigid verifier. [§ RLHF]
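
The notes do not say how the reward model is fitted to human rankings. One common choice, assumed here, is a pairwise loss that is small when the preferred response receives the higher score:

```python
import math

def pairwise_ranking_loss(reward_preferred, reward_rejected):
    """-log sigmoid(r_preferred - r_rejected), written as log(1 + exp(-diff))."""
    diff = reward_preferred - reward_rejected
    return math.log1p(math.exp(-diff))
```

The loss only constrains scores on responses similar to those humans ranked. Outputs far from that data can receive arbitrary scores, which is the opening that reward hacking exploits.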

Security [§ LLM Security]

Jailbreak examples: roleplay framing, base64 encoding, adversarial suffix, adversarial images. [§ Jailbreaks]
Prompt injection: white-on-white text in images; attacker-controlled web pages fetched by Bing; Google Doc exfiltration via Markdown image tag + Apps Script. [§ Prompt Injection]
Data poisoning: ‘James Bond’ trigger in fine-tuning data → misclassification. [§ Data Poisoning]
Each new modality (vision, audio) opens a new attack surface. [§ Security Conclusions]
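
The Markdown image exfiltration route works because rendering an image makes the client request a URL, and that URL can carry data. A defensive sketch is below, assuming a host allowlist as the policy. The regex, the function name, and the allowlist approach are all illustrative, not a described mitigation from the talk.

```python
import re
from urllib.parse import urlparse

# Matches Markdown images of the form ![alt](url) or ![alt](url "title").
IMAGE_RE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")

def untrusted_image_urls(markdown, allowed_hosts):
    """Return image URLs in model output whose host is not on the allowlist."""
    flagged = []
    for url in IMAGE_RE.findall(markdown):
        host = urlparse(url).hostname or ""
        if host not in allowed_hosts:
            flagged.append(url)
    return flagged
```

An allowlist alone is probably not a complete defence: the Apps Script variant noted above suggests that an attacker can sometimes route data through a host that is already trusted.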