JEPA

Joint-Embedding Predictive Architecture (JEPA) is an AI architecture proposed by Yann LeCun as an alternative to autoregressive generative models (including LLMs). The core idea: learn by predicting in representation space rather than output space, eliminating the need to model unpredictable low-level details.

The problem JEPA solves

Generative models, which predict pixels or tokens, must model everything in the input, including high-entropy, unpredictable components such as exact pixel values, leaf positions in wind, or transcription noise. This forces model capacity to be spent on irrelevant detail rather than on structure.

By LeCun’s account, roughly ten years of training video generation models to predict pixel-level frames produced poor representations, because the error signal from predicting exact pixels is dominated by noise that is irrelevant to understanding.
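
The noise-domination point can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions, not an experiment from the JEPA work: the one-dimensional "frame", the noise level, and the block-averaging encode function are all hypothetical stand-ins, and a real JEPA encoder is learned rather than hand-built. The predictor here knows the structure perfectly, yet its pixel-space error stays near the noise variance, while the error measured after encoding is much smaller.

```python
import random

def make_frame(structure, noise_std, rng):
    # Each pixel is a predictable structural value plus unpredictable noise.
    return [s + rng.gauss(0.0, noise_std) for s in structure]

def encode(frame, block=8):
    # Hypothetical hand-built encoder: block averages keep coarse structure
    # and suppress per-pixel detail. A JEPA encoder would be learned instead.
    return [sum(frame[i:i + block]) / block for i in range(0, len(frame), block)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
structure = [float(i // 8) for i in range(4096)]  # piecewise-constant "scene"
frame = make_frame(structure, noise_std=1.0, rng=rng)

# A predictor that knows the structure perfectly but cannot know the noise.
prediction = structure

pixel_error = mse(prediction, frame)                   # stays near the noise variance
latent_error = mse(encode(prediction), encode(frame))  # noise is averaged down
print(f"pixel-space error:  {pixel_error:.3f}")
print(f"latent-space error: {latent_error:.3f}")
```

Because the toy encoder averages eight pixels, the independent noise in each block is reduced by roughly a factor of eight in variance, so most of the remaining error signal reflects structure rather than noise.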

How it works

Full input → encoder → full representation
Corrupted input (masked or degraded version) → encoder → partial representation
Predictor: partial representation → predicted full representation
Loss: distance in representation space between predicted and actual full representation

No pixel or token reconstruction required. The encoder learns to retain structure relevant to prediction and discard unpredictable details.
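
The four steps above can be sketched in a few lines. This is an illustrative sketch only: the linear-plus-tanh encoder, the zero-masking corruption, the linear predictor, and the mean squared distance are assumptions chosen for brevity, and all names (encode, corrupt, jepa_loss, W_enc, W_pred) are hypothetical. It shares one encoder across both branches, as the outline does; published JEPA variants commonly use a separate, slowly updated target encoder, which is not modelled here.

```python
import math
import random

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def encode(W_enc, x):
    # Toy encoder (assumption): one linear layer followed by tanh.
    return [math.tanh(v) for v in matvec(W_enc, x)]

def corrupt(x, masked_idx):
    # Masking as the corruption: hidden input positions are zeroed out.
    return [0.0 if i in masked_idx else v for i, v in enumerate(x)]

def jepa_loss(x, masked_idx, W_enc, W_pred):
    full_repr = encode(W_enc, x)                          # full input -> encoder
    partial_repr = encode(W_enc, corrupt(x, masked_idx))  # corrupted input -> encoder
    predicted = matvec(W_pred, partial_repr)              # predictor
    # Loss: mean squared distance in representation space; x is never reconstructed.
    return sum((p - t) ** 2 for p, t in zip(predicted, full_repr)) / len(full_repr)

rng = random.Random(0)
in_dim, latent_dim = 6, 4
W_enc = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(latent_dim)]
# Start the predictor at the identity map (an arbitrary choice for the sketch).
W_pred = [[1.0 if i == j else 0.0 for j in range(latent_dim)] for i in range(latent_dim)]
x = [rng.uniform(-1, 1) for _ in range(in_dim)]

loss = jepa_loss(x, masked_idx={2, 3}, W_enc=W_enc, W_pred=W_pred)
print(f"latent prediction loss: {loss:.4f}")
```

With nothing masked and an identity predictor, the two branches coincide and the loss is zero; training would adjust the encoder and predictor so the loss stays small even when parts of the input are hidden.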

V-JEPA

V-JEPA is Meta’s application of JEPA to video. It masks spatiotemporal regions of a video and is trained to predict the representations of the masked regions from the unmasked ones. The system learns representations that:

Capture physical plausibility and temporal dynamics
Distinguish physically possible from impossible event sequences
Form the basis of implicit world modelling
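
The masking step can be sketched as a split of video patches into visible and hidden sets. This is a hypothetical scheme for illustration, not V-JEPA’s actual masking code: it hides one rectangular block of patches over a contiguous span of frames, and the function name, grid sizes, and parameters are all assumptions. The predictor would then be trained to output the representations of the target patches given only the context patches.

```python
def spacetime_block_mask(num_frames, grid_h, grid_w, t_span, rows, cols):
    """Split patch indices (t, row, col) into context (visible) and target (masked)."""
    t0, t1 = t_span
    r0, r1 = rows
    c0, c1 = cols
    # Masked region: a spatial block held over a contiguous range of frames.
    target = {
        (t, r, c)
        for t in range(t0, t1)
        for r in range(r0, r1)
        for c in range(c0, c1)
    }
    every_patch = {
        (t, r, c)
        for t in range(num_frames)
        for r in range(grid_h)
        for c in range(grid_w)
    }
    context = every_patch - target
    return sorted(context), sorted(target)

context, target = spacetime_block_mask(
    num_frames=8, grid_h=4, grid_w=4, t_span=(2, 6), rows=(1, 3), cols=(1, 3)
)
print(len(context), "visible patches,", len(target), "masked patches")
```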

Contrast with LLMs

Property	LLMs (autoregressive)	JEPA
Prediction target	Next token (output space)	Abstract representation (latent space)
High-entropy details	Must model	Discarded by encoder
World model	Implicit, unverified	Explicit objective
Planning support	Minimal	Designed for hierarchical planning
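
The difference in prediction target shows up directly in the loss each approach minimises. The sketch below is illustrative only: the function names are hypothetical, and squared Euclidean distance is an assumed choice of latent distance rather than one specified above.

```python
import math

def next_token_loss(logits, target_id):
    # Cross-entropy over the vocabulary: the model is scored on the probability
    # it assigns to the exact next token in output space.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_id]

def latent_loss(predicted, target):
    # Squared distance between representation vectors: nothing in output
    # space is scored, so details the encoder drops carry no penalty.
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

print(next_token_loss([2.0, 0.5, -1.0, 0.1], target_id=0))
print(latent_loss([0.2, -0.4, 0.9], [0.1, -0.5, 1.0]))
```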

Where mainstream views differ

LeCun’s claim that JEPA will outperform LLMs for world modelling is contested. Critics note:

Large-scale generative video models (e.g., Sora) have produced surprisingly strong world-model-like behaviour from generative training
JEPA has not yet been shown to outperform generative models on downstream tasks at comparable scale
The bandwidth argument (language carries far less information than sensory data) does not rule out the possibility that language is compressed enough for world models to be reconstructed from it