JEPA: Why Predicting in Pixel Space Was the Wrong Goal All Along
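Self-supervised learning in vision has been dominated by two ideas: reconstruct masked pixels (MAE), or force the representations of different augmented views of the same image to agree (DINO, BYOL, SimCLR). JEPA (Joint-Embedding Predictive Architecture) rejects both pixel reconstruction and augmentation-driven invariance. Given the visible part of an image, it predicts the abstract representations of the masked regions rather than their pixels, so the model is never asked to spend capacity on unpredictable low-level detail. In the reported I-JEPA results, this single architectural choice produces more semantic features than MAE with roughly 10x less pretraining compute at the ViT-H/14 scale, and with zero hand-crafted augmentations. Yann LeCun has long argued for prediction in representation space, and formalized JEPA in a 2022 position paper. The empirical results are now here.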
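To make "predict representations, not pixels" concrete, here is a minimal NumPy sketch of a JEPA-style objective. It is an illustration under stated assumptions, not the I-JEPA implementation: `encode`, `jepa_loss`, and `ema_update` are hypothetical names, the encoders are single linear layers rather than Vision Transformers, and the predictor is a pooled-context linear map conditioned on a position embedding. The exponential-moving-average target encoder follows the I-JEPA recipe as I understand it. No gradients are computed here; in real training the target branch would be treated as a constant (stop-gradient).

```python
import numpy as np

def encode(patches, weights):
    # Toy stand-in for a ViT encoder: a linear map plus tanh, applied per patch.
    return np.tanh(patches @ weights)

def jepa_loss(patches, pos_emb, target_idx, ctx_w, tgt_w, pred_w):
    # patches: (num_patches, patch_dim); pos_emb: (num_patches, embed_dim)
    context_idx = np.setdiff1d(np.arange(patches.shape[0]), target_idx)
    # Target encoder embeds the full image; keep only the masked positions.
    targets = encode(patches, tgt_w)[target_idx]
    # Context encoder embeds only the visible patches.
    context = encode(patches[context_idx], ctx_w)
    # Toy predictor: pooled context plus each target's position, then a linear map.
    preds = (context.mean(axis=0) + pos_emb[target_idx]) @ pred_w
    # The loss compares embeddings with embeddings; pixels never appear in it.
    return float(np.mean((preds - targets) ** 2))

def ema_update(tgt_w, ctx_w, momentum=0.996):
    # Target weights trail the context weights via an exponential moving average.
    return momentum * tgt_w + (1.0 - momentum) * ctx_w

rng = np.random.default_rng(0)
num_patches, patch_dim, embed_dim = 16, 48, 8  # assumed toy sizes
patches = rng.normal(size=(num_patches, patch_dim))
pos_emb = rng.normal(size=(num_patches, embed_dim))
ctx_w = rng.normal(size=(patch_dim, embed_dim)) * 0.1
tgt_w = ctx_w.copy()
pred_w = rng.normal(size=(embed_dim, embed_dim)) * 0.1
target_idx = np.array([5, 6, 9, 10])  # a masked 2x2 block in a 4x4 patch grid

loss = jepa_loss(patches, pos_emb, target_idx, ctx_w, tgt_w, pred_w)
tgt_w = ema_update(tgt_w, ctx_w)
print(f"latent prediction loss: {loss:.4f}")
```

The contrast with MAE sits in the last line of `jepa_loss`: an MAE-style objective would compare a decoder's output against `patches[target_idx]` in pixel space, whereas here both sides of the squared error live in the `embed_dim`-dimensional representation space.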