JEPA

Tags: concept, ai-architecture, meta, deep-learning

Joint-Embedding Predictive Architecture (JEPA) is an AI architecture proposed by Yann LeCun as an alternative to autoregressive generative models (including LLMs). The core idea: learn by predicting in representation space rather than output space, eliminating the need to model unpredictable low-level details.


The problem JEPA solves

Generative models (predict pixels, predict tokens) must model everything in the input — including high-entropy, unpredictable components like exact pixel values, leaf positions in wind, or transcription noise. This forces model capacity to be spent on irrelevant detail rather than structure.

LeCun argues that roughly ten years of training video models to predict frames at the pixel level produced poor representations: the error signal from predicting exact pixels is dominated by noise that is irrelevant to understanding the scene.
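The noise argument can be illustrated with a toy example. This is only a sketch under stated assumptions: the "frame" is a 1-D step edge plus Gaussian noise, and `encode` is a hypothetical block-averaging encoder standing in for a learned one. Even a predictor that knows the structure perfectly still pays the full noise variance in pixel space, while the same comparison in the averaged representation is much smaller.

```python
import random

def make_frame(structure, noise_scale, rng):
    """Toy 'frame': predictable structure plus unpredictable per-pixel noise."""
    return [s + rng.gauss(0.0, noise_scale) for s in structure]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def encode(frame, block=50):
    """Hypothetical encoder: block averages, which suppress per-pixel noise."""
    return [sum(frame[i:i + block]) / block for i in range(0, len(frame), block)]

rng = random.Random(0)

# Predictable structure: a step edge across 200 'pixels'.
structure = [0.0] * 100 + [1.0] * 100
frame = make_frame(structure, noise_scale=0.5, rng=rng)

# A perfect predictor of the structure still cannot predict the noise,
# so its pixel-space error has expected value noise_scale**2 = 0.25.
pixel_error = mse(structure, frame)

# In the averaged representation most of the noise cancels out.
latent_error = mse(encode(structure), encode(frame))

print(f"pixel-space error:  {pixel_error:.4f}")
print(f"latent-space error: {latent_error:.4f}")
```

A fixed averaging encoder is chosen here only because its effect on noise is easy to reason about; in JEPA the encoder is learned.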


How it works

  1. Full input → encoder → full representation
  2. Corrupted input (masked or degraded version) → encoder → partial representation
  3. Predictor: partial representation → predicted full representation
  4. Loss: distance in representation space between predicted and actual full representation

No pixel or token reconstruction required. The encoder learns to retain structure relevant to prediction and discard unpredictable details.
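The four steps above can be sketched as a single forward pass. This is a minimal illustration, not a reference implementation: the linear maps stand in for deep networks, masking is modelled as zeroing input positions, and the names `linear`, `random_matrix`, `mask_input`, and `jepa_loss` are hypothetical. Published JEPA variants also keep the target branch out of the gradient path to avoid collapsed representations; training is omitted here.

```python
import random

def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def random_matrix(rows, cols, rng):
    """Random weights scaled by 1/sqrt(cols); a stand-in for a trained network."""
    return [[rng.gauss(0.0, 1.0 / cols ** 0.5) for _ in range(cols)]
            for _ in range(rows)]

def mask_input(x, masked_idx):
    """Corrupt the input by zeroing the masked positions (assumed corruption)."""
    return [0.0 if i in masked_idx else v for i, v in enumerate(x)]

def jepa_loss(x, masked_idx, enc_w, pred_w):
    # 1. Full input -> encoder -> full representation (the prediction target).
    target = linear(enc_w, x)
    # 2. Corrupted input -> encoder -> partial representation.
    context = linear(enc_w, mask_input(x, masked_idx))
    # 3. Predictor maps the partial representation to a predicted full one.
    predicted = linear(pred_w, context)
    # 4. Loss is a distance in representation space (mean squared error here).
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

rng = random.Random(0)
input_dim, latent_dim = 16, 4
enc_w = random_matrix(latent_dim, input_dim, rng)
pred_w = random_matrix(latent_dim, latent_dim, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(input_dim)]

loss = jepa_loss(x, masked_idx={4, 5, 6, 7}, enc_w=enc_w, pred_w=pred_w)
print(f"latent prediction loss: {loss:.4f}")
```

Note that nothing in the loss refers to the raw input `x` directly: the comparison happens only between two vectors of size `latent_dim`.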


V-JEPA

JEPA applied to video (Meta’s V-JEPA). It masks spatiotemporal regions of a video and trains the model to predict the representations of the masked regions from the unmasked ones. The system learns representations that:

  • Capture physical plausibility and temporal dynamics
  • Distinguish physically possible from impossible event sequences
  • Form the basis of implicit world modelling
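The masking step can be sketched on a grid of video patch tokens. This assumes one simple masking scheme, a "tube" that hides the same spatial block in every frame; the function name `tube_mask` and the grid sizes are hypothetical and not taken from any V-JEPA code.

```python
def tube_mask(num_frames, height, width, top, left, block_h, block_w):
    """Split a (frames x height x width) grid of patch tokens into two sets.

    The same spatial block is masked in every frame, so the gap cannot be
    filled by copying the same location from a neighbouring frame.
    Returns (context, target) as lists of (frame, row, col) token indices.
    """
    context, target = [], []
    for t in range(num_frames):
        for r in range(height):
            for c in range(width):
                inside = (top <= r < top + block_h) and (left <= c < left + block_w)
                if inside:
                    target.append((t, r, c))   # representation to be predicted
                else:
                    context.append((t, r, c))  # visible to the context encoder
    return context, target

context, target = tube_mask(num_frames=4, height=6, width=6,
                            top=1, left=2, block_h=3, block_w=3)
print(len(context), "context tokens,", len(target), "target tokens")
```

The context tokens would feed the encoder, and the predictor would be asked for the representations at the target positions.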

Contrast with LLMs

Property              | LLMs (autoregressive)      | JEPA
Prediction target     | Next token (output space)  | Abstract representation (latent space)
High-entropy details  | Must model                 | Discarded by encoder
World model           | Implicit, unverified       | Explicit objective
Planning support      | Minimal                    | Designed for hierarchical planning
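The "prediction target" row can be written as two objectives. The notation is assumed for illustration and does not come from the source: $f_\theta$ is the encoder, $g_\phi$ the predictor, $\tilde{x}$ the corrupted input, and $d$ a distance in representation space.

```latex
\mathcal{L}_{\text{AR}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
\qquad
\mathcal{L}_{\text{JEPA}}(\theta, \phi) = d\left(g_\phi\left(f_\theta(\tilde{x})\right),\; f_\theta(x)\right)
```

The autoregressive loss scores every token of the output, while the JEPA loss only compares representations, so detail the encoder discards never enters the error.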

Where mainstream views differ

LeCun’s claim that JEPA will outperform LLMs for world modelling is contested. Critics note:

  • Large-scale generative video models (e.g., Sora) have produced surprisingly strong world-model-like behaviour from generative training
  • JEPA has not yet been demonstrated to produce better downstream task performance at comparable scale
  • The bandwidth argument (language carries far less information than sensory data) doesn’t rule out that language is compressed enough to allow world models to be reconstructed from it

See also: Large Language Models; World Models; Yann LeCun on Meta AI, LLMs, and the Path to AGI.