Vision Research Tracker

**LeCun: Beyond LLMs - multimodal world-models, latent planning and video SSL** [developing]

Key Questions

What is Yann LeCun's vision for AI beyond large language models?

Yann LeCun advocates for multimodal world-models that integrate text, image, and video pretraining, using techniques like Mixture of Experts (MoE), conditional compute, and latent geometry for physical planning. Key advancements include joint-embedding predictive architectures and latent planning enabled by self-supervised learning on video data.
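
As a rough illustration of the MoE/conditional-compute idea mentioned above, the sketch below routes an input to its top-k experts and evaluates only those. It is a toy under stated assumptions, not any specific model's router: `moe_forward`, `make_expert`, the gate weights, and the scalar "experts" are all hypothetical names and values chosen for the example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_weights, experts, k=2):
    """Route input vector x to the top-k experts and mix their outputs."""
    # One gate logit per expert: dot product of x with that expert's gate row.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    # Conditional compute: keep only the k highest-scoring experts.
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalise the gate over the selected experts only.
    probs = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for p, i in zip(probs, top):
        y = experts[i](x)  # unselected experts are never called
        out = [o + p * yi for o, yi in zip(out, y)]
    return out, top

# Hypothetical toy experts: each one just scales its input and logs the call.
calls = []

def make_expert(idx, scale):
    def expert(x):
        calls.append(idx)
        return [scale * xi for xi in x]
    return expert

experts = [make_expert(i, s) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # assumed values
output, chosen = moe_forward([2.0, 1.0], gate_weights, experts, k=2)
print(chosen, sorted(calls))
```

With four experts and k=2, only two expert functions run per input, which is where the compute saving comes from; real routers add load balancing and batching that this sketch omits.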

What is SIGReg/LeWorldModel?

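SIGReg/LeWorldModel is a new 15M-parameter Joint-Embedding Predictive Architecture (JEPA) model reported to achieve a 48x speedup on Push-T tasks; the baseline for that speedup is not stated here. The tracker notes list YouTube and arXiv alongside it, and it is unclear from the notes whether these are where the work was posted or its training data. The model focuses on world modeling for efficient latent planning.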
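
To make "latent planning" concrete, here is a minimal sketch of planning by random shooting in a latent space: candidate action sequences are rolled out through a latent predictor and scored by distance to a goal latent. Everything here is an assumption for illustration (the identity `encode`, the additive `predict` dynamics, the 2D latent, and the choice of random shooting); the source does not describe the planner these models actually use.

```python
import random

def encode(obs):
    # Hypothetical encoder: maps an observation to a 2D latent (identity here).
    return (float(obs[0]), float(obs[1]))

def predict(z, action):
    # Hypothetical latent dynamics: the action displaces the latent state.
    return (z[0] + action[0], z[1] + action[1])

def rollout(z, actions):
    # Imagine the future entirely in latent space, without decoding to pixels.
    for a in actions:
        z = predict(z, a)
    return z

def cost(z, z_goal):
    # Squared Euclidean distance between a latent and the goal latent.
    return (z[0] - z_goal[0]) ** 2 + (z[1] - z_goal[1]) ** 2

def plan(obs, goal_obs, horizon=5, n_candidates=500, seed=0):
    """Random-shooting planner: sample action sequences, keep the cheapest."""
    rng = random.Random(seed)
    z0, z_goal = encode(obs), encode(goal_obs)
    best_actions, best_cost = None, float("inf")
    for _ in range(n_candidates):
        actions = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(horizon)]
        c = cost(rollout(z0, actions), z_goal)
        if c < best_cost:
            best_actions, best_cost = actions, c
    return best_actions, best_cost

start, goal = (0.0, 0.0), (2.0, -1.0)
actions, final_cost = plan(start, goal)
print(len(actions), final_cost < cost(encode(start), encode(goal)))
```

Because each candidate is scored with a few cheap latent predictions rather than full video generation, a small predictor can evaluate many candidates quickly, which is the general argument for planning in latent space.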

What are Joint-Embedding Predictive World Models?

These models, highlighted in LeCun's repost, enable physical planning through predictive architectures that learn latent representations from multimodal data. They emphasize self-supervised learning for building internal world models.
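
The core idea of a joint-embedding predictive objective can be sketched as follows: the loss compares a predicted embedding with a target embedding, not predicted pixels with target pixels. This is a toy with hand-set linear maps, not the training code of any model named here; `jepa_loss`, `ema_update`, and all weights are hypothetical. The EMA target encoder shown is one common way to discourage representation collapse in JEPA-style training, and the source does not say which mechanism the listed models use.

```python
def linear(weights, x):
    # Apply a weight matrix (list of rows) to a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def jepa_loss(context, target, enc_w, target_enc_w, pred_w):
    """Mean squared error between predicted and target embeddings."""
    z_ctx = linear(enc_w, context)         # embed the visible context
    z_tgt = linear(target_enc_w, target)   # embed the target (treated as a constant)
    z_hat = linear(pred_w, z_ctx)          # predict the target embedding
    return sum((a - b) ** 2 for a, b in zip(z_hat, z_tgt)) / len(z_tgt)

def ema_update(target_w, online_w, tau=0.99):
    # Target encoder slowly tracks the online encoder (exponential moving average).
    return [[tau * t + (1 - tau) * o for t, o in zip(t_row, o_row)]
            for t_row, o_row in zip(target_w, online_w)]

# Assumed toy setup: 3D "frames", 2D latents; the encoders ignore the third input.
enc_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
target_enc_w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pred_w = [[1.0, 0.0], [0.0, 1.0]]

context_frame = [0.5, -0.2, 9.0]   # third value stands in for unpredictable detail
target_frame = [0.5, -0.2, -4.0]

latent_loss = jepa_loss(context_frame, target_frame, enc_w, target_enc_w, pred_w)
pixel_loss = sum((a - b) ** 2 for a, b in zip(context_frame, target_frame)) / 3
new_target_w = ema_update(target_enc_w, [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]], tau=0.9)
print(latent_loss, pixel_loss)
```

In this toy the latent loss is zero while the input-space error is large, because the encoders discard the component that differs between frames; that is the intuition behind predicting in embedding space instead of reconstructing every pixel.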

What are V-JEPA 2.1 and ThinkJEPA?

V-JEPA 2.1 is an advanced video JEPA model, while ThinkJEPA extends it for reasoning. Both contribute to efficient video understanding and planning in world models.

What is Lyra 2.0?

Lyra 2.0 builds persistent 3D worlds from video generation, keeping spatial and temporal structure consistent. It is listed among the high-value reproduction targets, to be evaluated in MoE settings and on benchmarks such as GameplayQA.

Yann LeCun (Apr 2026): joint text+image+video pretraining, MoE/conditional compute, latent geometry.

New: SIGReg/LeWorldModel (15M JEPA, 48x speedup on Push-T, YouTube/arXiv), Joint-Embedding Predictive World Models (physical planning, LeCun repost), Temporal Straightening, delta tokens (CVPR26, Argoverse2, 1-token video compression), HyDRA/Out-of-Sight dynamic memory (HM-World), V-JEPA 2.1, ThinkJEPA, Stereo WM/WorldAgents, WildWorld (108M-frame game), WorldCache/PackForcing, Omni-WorldBench/QuantiPhy/GameplayQA, Yilun Du, Pulse, DiT animal motion (300h dataset), TrackMAE motion-aware MAE (SOTA on 6 datasets), VOID physics-aware editing, Phantom physics-infused video gen, CT-1 VLM-to-video control, Prompt Relay/Uni-ViGU unified gen/und, Lyra 2.0 persistent 3D worlds.

High-value repro: ablations, V-JEPA/SIGReg/LeWM/ThinkJEPA/Pulse/HyDRA/Out-of-Sight/DiT/TrackMAE/Joint-Embedding/VOID/Phantom/CT-1/delta tokens/Prompt Relay/Uni-ViGU/Lyra 2.0 in MoE/TTT/GameplayQA w/ latency/power/QuantiPhy/WildBench.

Sources (5)
Updated Apr 17, 2026