Embodied perception & temporally-coherent visual latents (V-JEPA 2.1, WorldAgents, CoS, benchmarks)

Key Questions

What is V-JEPA 2.1?

V-JEPA 2.1 reports a 20% improvement in grasping. It is listed alongside ProbeFlow, Dream2Flow, LATENT, WorldCam, and FASTER as recent work on temporally coherent visual latents for embodied perception.

What are WorldAgents?

WorldAgents treat images as 3D agents for reasoning. The work appears among the new embodied-AI benchmarks, alongside CoS (Chain of Sight) reasoning.

What is OmniVTA?

OmniVTA is a new visuo-tactile model that operates at 60 Hz, aimed at multimodal perception in robotics.
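
For a sense of what a 60 Hz rate implies, the sketch below computes the time available per cycle. It is plain arithmetic, not OmniVTA code: the function name `cycle_budget_ms` is hypothetical, and whether the 60 Hz figure refers to sensing, inference, or control is not stated here.

```python
def cycle_budget_ms(rate_hz: float) -> float:
    """Return the time available per cycle, in milliseconds, at a given rate.

    Hypothetical helper for illustration only; not part of any OmniVTA release.
    """
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return 1000.0 / rate_hz


# At 60 Hz, each cycle has roughly 16.67 ms for everything it must do.
budget = cycle_budget_ms(60.0)
print(f"{budget:.2f} ms per cycle")
```

Any per-step work (reading sensors, running the model, issuing a command) would have to fit inside that window to sustain the stated rate.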

What is VideoDetective?

VideoDetective enables graph reasoning for long videos, improving understanding of extended sequences in embodied tasks.

What is ThinkJEPA?

ThinkJEPA augments latent world models with a large vision-language reasoning model, aiming for coherent visual latents and better perception.

What is ABot-PhysWorld?

ABot-PhysWorld is an interactive world foundation model for robotic manipulation aligned with physics.

What is InfiniDepth?

InfiniDepth (CVPR 2026) advances monocular depth estimation for spatial perception in embodied AI.

What is Attend Before Attention?

Attend Before Attention uses autoregressive gazing for efficient, scalable video understanding in perception models.

New: OmniVTA (visuo-tactile, 60 Hz), Temporal Straightening, and Astrolabe (video RL). Also covered: WorldAgents (images as 3D agents), CoS reasoning, and V-JEPA 2.1 (+20% grasping), along with ProbeFlow, Dream2Flow, LATENT, WorldCam, and FASTER. Fresh additions: VideoDetective (graph reasoning for long videos), WildWorld (action dataset), ThinkJEPA (VLM reasoning for latent world models), ABot-PhysWorld (physics-aligned manipulation), InfiniDepth (monocular depth), Perceptio (spatial token generation), Attend Before Attention (autoregressive gazing), PrismAudio (V2A), and GameplayQA (multi-video 3D agents). STEVO, MMOU, and Ego2Web highlight failure cases; prototype in simulation.

Sources (11)

Updated Mar 27, 2026

Applied AI Paper Radar

Vega: Learning to Drive with Natural Language Instructions

@_akhaliq: The Pulse of Motion Measuring Physical Frame Rate from Visual Dynamics paper: https://t.co/oQ3KAPx...

EVA: MLLM Agents That Plan Before Watching Video

@EMostaque reposted: PrismAudio is open source👏👏👏 a 518M V2A model accepted at ICLR 2026, achieving S...

Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

@_akhaliq: Perceptio Perception Enhanced Vision Language Models via Spatial Token Generation paper: https://t...

@jon_barron reposted: Excited to share our work InfiniDepth (CVPR 2026) — casting monocular depth esti...

ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

VideoDetective: Graph Reasoning for Long Videos