Applied AI Paper Radar

Embodied perception & temporally-coherent visual latents (V-JEPA 2.1, WorldAgents, CoS, benchmarks)

Key Questions

What is V-JEPA 2.1?

V-JEPA 2.1 is reported to improve grasping by 20%. This digest groups it with ProbeFlow, Dream2Flow, LATENT, WorldCam, and FASTER as work on temporally coherent visual latents for embodied perception.

What are WorldAgents?

WorldAgents treat images as 3D agents for reasoning. The digest lists them among new embodied-AI benchmarks, alongside CoS (Chain of Sight) reasoning.

What is OmniVTA?

OmniVTA is a new visuo-tactile model that operates at 60 Hz, enhancing multimodal perception in robotics.

What is VideoDetective?

VideoDetective enables graph reasoning for long videos, improving understanding of extended sequences in embodied tasks.

What is ThinkJEPA?

ThinkJEPA empowers latent world models with large vision-language reasoning, generating coherent visual latents for better perception.

What is ABot-PhysWorld?

ABot-PhysWorld is an interactive, physics-aligned world foundation model for robotic manipulation.

What is InfiniDepth?

InfiniDepth (CVPR 2026) advances monocular depth estimation for spatial perception in embodied AI.

What is Attend Before Attention?

Attend Before Attention uses autoregressive gazing for efficient, scalable video understanding in perception models.

New: OmniVTA (visuo-tactile, 60 Hz), Temporal Straightening, and Astrolabe (video RL). Also covered: WorldAgents (images as 3D agents), CoS reasoning, V-JEPA 2.1 (+20% grasping), ProbeFlow, Dream2Flow, LATENT, WorldCam, and FASTER. Fresh additions: VideoDetective (long-video graph reasoning), WildWorld (action dataset), ThinkJEPA (VLM latents), ABot-PhysWorld (physics-aligned manipulation), InfiniDepth (depth estimation), Perceptio (spatial), Attend Before Attention (autoregressive gazing), PrismAudio (V2A, video-to-audio), and GameplayQA (multi-video 3D agents). STEVO, MMOU, and Ego2Web highlight failure cases; prototype in simulation.

Sources (11)
Updated Mar 27, 2026