LeCun: Beyond LLMs — multimodal world-models, latent planning and video SSL [developing]

Key Questions

What is Yann LeCun's vision for AI beyond large language models?

Yann LeCun advocates for multimodal world-models that integrate text, image, and video pretraining, using techniques like Mixture of Experts (MoE), conditional compute, and latent geometry for physical planning. Key advancements include joint-embedding predictive architectures and latent planning enabled by self-supervised learning on video data.

What is SIGReg/LeWorldModel?

SIGReg/LeWorldModel is a new 15M-parameter Joint-Embedding Predictive Architecture (JEPA) model that achieves a 48x speedup on Push-T tasks, trained on YouTube and arXiv data. It focuses on world modeling for efficient latent planning.

What are Joint-Embedding Predictive World Models?

These models, highlighted in LeCun's repost, enable physical planning through predictive architectures that learn latent representations from multimodal data. They emphasize self-supervised learning for building internal world models.
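
As a rough illustration of what planning in a learned latent space can look like, the sketch below encodes the current and goal observations, rolls candidate action sequences forward with a latent predictor, and keeps the sequence whose final latent lands closest to the goal latent. Everything in it is an assumption made for illustration: the functions `encode`, `predict`, `cost` and `plan` are hypothetical toy stand-ins, the random-shooting search is a generic planner, and none of it is taken from the models named above.

```python
import random

def encode(obs):
    # Hypothetical encoder: the "latent" is just a copy of the 2-D observation.
    return list(obs)

def predict(z, action):
    # Hypothetical latent predictor: the action shifts the latent additively.
    return [zi + ai for zi, ai in zip(z, action)]

def cost(z, z_goal):
    # Squared Euclidean distance between two latents.
    return sum((a - b) ** 2 for a, b in zip(z, z_goal))

def plan(obs, goal_obs, horizon=5, n_candidates=256, seed=0):
    # Random-shooting planner: sample action sequences, roll each one out
    # entirely in latent space, and return the lowest-cost sequence.
    rng = random.Random(seed)
    z0, z_goal = encode(obs), encode(goal_obs)
    best_actions, best_cost = None, float("inf")
    for _ in range(n_candidates):
        actions = [[rng.uniform(-1, 1), rng.uniform(-1, 1)]
                   for _ in range(horizon)]
        z = z0
        for a in actions:
            z = predict(z, a)
        c = cost(z, z_goal)
        if c < best_cost:
            best_actions, best_cost = actions, c
    return best_actions, best_cost

if __name__ == "__main__":
    actions, final_cost = plan([0.0, 0.0], [2.0, 1.0])
    print(len(actions), final_cost)
```

In a real system the encoder and predictor would be learned networks, and the search would typically be something stronger than random shooting; the point here is only that no pixels are generated during planning, since every rollout stays in latent space.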

What are V-JEPA 2.1 and ThinkJEPA?

V-JEPA 2.1 is an advanced video JEPA model, while ThinkJEPA extends it for reasoning. Both contribute to efficient video understanding and planning in world models.

What is Lyra 2.0?

Lyra 2.0 builds persistent 3D worlds from video generation, enabling consistent spatial and temporal structures. It is listed among the high-value reproduction targets, to be run in MoE settings and on benchmarks like GameplayQA.

Yann LeCun (Apr 2026): joint text+image+video pretraining, MoE/conditional compute, latent geometry.

New: SIGReg/LeWorldModel (15M-parameter JEPA, 48x speedup on Push-T, YouTube/arXiv), Joint-Embedding Predictive World Models (physical planning, LeCun repost), Temporal Straightening, delta tokens (CVPR26, Argoverse2, 1-token video compression), HyDRA/Out-of-Sight dynamic memory (HM-World), V-JEPA 2.1, ThinkJEPA, Stereo WM/WorldAgents, WildWorld (108M-frame game dataset), WorldCache/PackForcing, Omni-WorldBench/QuantiPhy/GameplayQA, Yilun Du, Pulse, DiT animal motion (300h dataset), TrackMAE motion-aware MAE (SOTA on 6 datasets), VOID physics-aware editing, Phantom physics-infused video generation, CT-1 VLM-to-video control, Prompt Relay/Uni-ViGU unified generation and understanding, Lyra 2.0 persistent 3D worlds.

High-value reproductions: ablations; V-JEPA/SIGReg/LeWM/ThinkJEPA/Pulse/HyDRA/Out-of-Sight/DiT/TrackMAE/Joint-Embedding/VOID/Phantom/CT-1/delta tokens/Prompt Relay/Uni-ViGU/Lyra 2.0 in MoE/TTT/GameplayQA settings, with latency/power/QuantiPhy/WildBench measurements.

Sources (5)

Updated Apr 17, 2026

Vision Research Tracker

Lyra 2.0: Building Persistent 3D Worlds from Video Generation | Alan Hou

Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

Phantom: Physics-Infused Video Generation - PLAN Lab
