JEPA
Joint-Embedding Predictive Architecture (JEPA) is an AI architecture proposed by Yann LeCun as an alternative to autoregressive generative models (including LLMs). The core idea: learn by predicting in representation space rather than output space, eliminating the need to model unpredictable low-level details.
The problem JEPA solves
Generative models (which predict pixels or tokens) must model everything in the input, including high-entropy, unpredictable components such as exact pixel values, leaf positions in wind, or transcription noise. This forces model capacity to be spent on irrelevant detail rather than structure.
LeCun argues that roughly ten years of training video generation models to predict pixel-level frames produced poor representations, because the error signal from predicting exact pixels is dominated by noise irrelevant to understanding.
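The noise-dominated error signal can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a result from the JEPA literature: the "frame" is a sine wave (standing in for predictable structure) plus Gaussian noise (standing in for unpredictable detail), and `noise_std` is an arbitrary illustrative value.

```python
import math
import random

rng = random.Random(0)
n_pixels = 10_000
noise_std = 5.0  # assumption: unpredictable detail is much larger than structure

# Predictable structure plus unpredictable per-pixel noise.
structure = [math.sin(2 * math.pi * i / 50) for i in range(n_pixels)]
frame = [s + rng.gauss(0.0, noise_std) for s in structure]

def mse(pred, target):
    """Mean squared error in pixel (output) space."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

# A predictor that knows all of the structure still pays for the noise.
perfect_structure = mse(structure, frame)
# A predictor that knows none of the structure (always predicts zero).
no_structure = mse([0.0] * n_pixels, frame)

# Fraction of the pixel loss that learning the structure can remove.
structure_share = (no_structure - perfect_structure) / no_structure
print(f"loss with structure:    {perfect_structure:.2f}")
print(f"loss without structure: {no_structure:.2f}")
print(f"share explained by structure: {structure_share:.1%}")
```

In this setup the two losses are close: most of the pixel-space error comes from noise that no predictor can remove, so the part of the loss that rewards learning structure is small by comparison.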
How it works
- Full input → encoder → full representation
- Corrupted input (masked or degraded version) → encoder → partial representation
- Predictor: partial representation → predicted full representation
- Loss: distance in representation space between predicted and actual full representation
No pixel or token reconstruction required. The encoder learns to retain structure relevant to prediction and discard unpredictable details.
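The four steps above can be sketched as a single loss computation. This is a minimal illustration under assumptions, not a reference implementation: the encoder and predictor are single random linear maps, corruption is zero-masking, and the names `linear`, `random_matrix`, and `jepa_loss` are hypothetical.

```python
import random

def linear(weights, x):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for a trained network."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def jepa_loss(x, mask, enc_w, pred_w):
    """Representation-space loss for one input (hypothetical toy version)."""
    # Full input -> encoder -> full representation (the prediction target).
    target = linear(enc_w, x)
    # Corrupted input: masked positions are zeroed out before encoding.
    corrupted = [0.0 if hidden else xi for xi, hidden in zip(x, mask)]
    partial = linear(enc_w, corrupted)
    # Predictor: partial representation -> predicted full representation.
    predicted = linear(pred_w, partial)
    # Loss: mean squared distance between predicted and actual representation.
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(8)]
mask = [i >= 4 for i in range(8)]      # hide the second half of the input
enc_w = random_matrix(4, 8, rng)       # 8-dim input -> 4-dim representation
pred_w = random_matrix(4, 4, rng)
loss = jepa_loss(x, mask, enc_w, pred_w)
print(f"representation-space loss: {loss:.4f}")
```

Note that nothing in the loss touches the raw input: only the two representations are compared. Published JEPA variants typically add safeguards against representation collapse (for example, a separately updated target encoder); this sketch omits them and does no training.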
V-JEPA
V-JEPA is Meta’s application of JEPA to video. It masks spatiotemporal regions of a video and is trained to predict the representations of the masked regions from the unmasked ones. The system learns representations that:
- Capture physical plausibility and temporal dynamics
- Distinguish physically possible from impossible event sequences
- Form the basis of implicit world modelling
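One way to picture the masking of video regions is a "tube" mask: the same spatial block is hidden in every frame, so the model cannot recover it by copying from a neighbouring frame. The sketch below is an assumption-laden illustration, not V-JEPA's actual masking code; the patch-grid sizes and the names `tube_mask` and `split_patches` are hypothetical.

```python
import random

def tube_mask(n_frames, height, width, block_h, block_w, rng):
    """Boolean mask over a (frames, height, width) patch grid.

    True marks a masked patch. The same spatial block is masked in
    every frame, forming a tube through time.
    """
    top = rng.randrange(height - block_h + 1)
    left = rng.randrange(width - block_w + 1)
    frame_mask = [
        [top <= r < top + block_h and left <= c < left + block_w
         for c in range(width)]
        for r in range(height)
    ]
    # Copy the per-frame mask so each frame owns its own rows.
    return [[row[:] for row in frame_mask] for _ in range(n_frames)]

def split_patches(mask):
    """Return (context, targets): coordinates of unmasked and masked patches."""
    context, targets = [], []
    for t, frame in enumerate(mask):
        for r, row in enumerate(frame):
            for c, masked in enumerate(row):
                (targets if masked else context).append((t, r, c))
    return context, targets

mask = tube_mask(n_frames=4, height=6, width=6, block_h=3, block_w=3,
                 rng=random.Random(0))
context, targets = split_patches(mask)
print(len(context), "context patches;", len(targets), "target patches")
```

The context patches would be fed to the encoder, and the predictor would be trained to output the representations of the target patches.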
Contrast with LLMs
| Property | LLMs (autoregressive) | JEPA |
|---|---|---|
| Prediction target | Next token (output space) | Abstract representation (latent space) |
| High-entropy details | Must model | Discarded by encoder |
| World model | Implicit, unverified | Explicit objective |
| Planning support | Minimal | Designed for hierarchical planning |
Where mainstream views differ
LeCun’s claim that JEPA will outperform LLMs for world modelling is contested. Critics note:
- Large-scale generative video models (e.g., Sora) have produced surprisingly strong world-model-like behaviour from generative training
- JEPA has not yet been demonstrated to produce better downstream task performance at comparable scale
- The bandwidth argument (that language carries far less information than sensory data) doesn’t rule out that language is a sufficiently compressed signal from which to reconstruct world models
See also: Large Language Models; World Models; Yann LeCun on Meta AI, LLMs, and the Path to AGI.