AI & ML Daily Digest

Next-gen multimodal models for vision, video, and embodied AI

From Seeing to Simulating Worlds

This cluster tracks rapid advances in multimodal and vision-language AI, especially around long, complex video understanding, controlled video generation, and 4D world and human–scene reconstruction for simulation. New benchmarks such as MMOU, VET-Bench, SocialOmni, and gesture-based egocentric QA probe models' limits in entity tracking, social-cue perception, and extended temporal reasoning, while works like NanoVDR and redundancy-aware generation target efficiency and smaller, deployable models. On the modeling side, Gaussian-splatting SLAM, physics-in-the-loop reconstruction, and omni / "transfusion" pretraining frameworks push beyond static perception toward interactive, embodied, and socially aware agents that can both understand and generate rich audiovisual environments. Together, these reports highlight a shift from generic vision-language chatbots to specialized, evaluation-driven systems grounded in real-world video, documents, and physical scenes.

Sources (15)
Updated Mar 18, 2026