LLM Innovation Tracker

Uni-1 Luma Labs Multimodal SOTA

Key Questions

What is the state-of-the-art in multimodal models like Uni-1?

Uni-1 from Luma Labs leads the RISE/Omni benchmarks, alongside Qwen3.5-Omni and PLUME embeddings. It excels at streaming video and on VideoZeroBench.

What advancements address high-res video in VLMs?

Token warping helps MLLMs handle nearby viewpoints; human vision operates at high resolution and high FPS, inspiring corresponding VLM improvements. Video-MME-v2 advances benchmarks for video understanding.
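
A minimal sketch of the token-reuse idea that motivates approaches like token warping (the function name, the cosine-similarity heuristic, and the threshold below are illustrative assumptions, not the published method): patch tokens are re-encoded only where they differ enough from the previous frame, so near-duplicate viewpoints contribute few new tokens.

```python
import numpy as np

def select_new_tokens(prev_tokens, curr_tokens, sim_threshold=0.95):
    """Illustrative token-reuse heuristic (an assumption of this sketch).

    prev_tokens, curr_tokens: (num_patches, dim) patch embeddings for two
    nearby frames/viewpoints. Returns indices of current-frame patches whose
    embeddings changed enough to be worth re-encoding; unchanged patches can
    reuse ("warp") the previous frame's tokens.
    """
    # Cosine similarity between corresponding patches of the two frames.
    prev_n = prev_tokens / np.linalg.norm(prev_tokens, axis=1, keepdims=True)
    curr_n = curr_tokens / np.linalg.norm(curr_tokens, axis=1, keepdims=True)
    sim = (prev_n * curr_n).sum(axis=1)
    return np.where(sim < sim_threshold)[0]

# Toy usage: 16 patches, 64-dim embeddings, only a few patches change.
rng = np.random.default_rng(0)
frame_a = rng.normal(size=(16, 64))
frame_b = frame_a.copy()
frame_b[[2, 7, 11]] += rng.normal(scale=2.0, size=(3, 64))  # simulate motion
changed = select_new_tokens(frame_a, frame_b)
print(f"re-encode {len(changed)}/16 patches:", changed)
```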

What are Microsoft's contributions to multimodal AI?

Microsoft launches MAI foundation models that rival OpenAI and Google, including MAI-Image-2. They support multimodal agentic frameworks.

How do 3D foundation models like Omni123 work?

Omni123 explores 3D-native models for visual grounding. GaussianGPT enables autoregressive 3D Gaussian scene generation.
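
A minimal sketch of what autoregressive 3D Gaussian scene generation can mean in practice (the primitive parameterization and the stand-in sampler below are assumptions, not GaussianGPT's actual architecture): a scene is a sequence of Gaussian primitives, and each step emits the next primitive's parameters conditioned on those already generated.

```python
from dataclasses import dataclass
import random

@dataclass
class Gaussian3D:
    """One splat-style primitive: position, log-scale, RGB color, opacity."""
    mean: tuple        # (x, y, z)
    log_scale: tuple   # per-axis log standard deviation
    color: tuple       # (r, g, b) in [0, 1]
    opacity: float

def next_gaussian(scene, rng):
    """Stand-in for a learned autoregressive step (an assumption): condition
    on the partial scene and emit the next primitive's parameters."""
    # A real model would decode parameters from a transformer over the
    # already-generated primitives; here we just jitter around the last mean.
    last = scene[-1].mean if scene else (0.0, 0.0, 0.0)
    mean = tuple(m + rng.gauss(0.0, 0.5) for m in last)
    return Gaussian3D(
        mean=mean,
        log_scale=tuple(rng.gauss(-2.0, 0.3) for _ in range(3)),
        color=tuple(rng.random() for _ in range(3)),
        opacity=rng.random(),
    )

def generate_scene(num_primitives=8, seed=0):
    rng = random.Random(seed)
    scene = []
    for _ in range(num_primitives):
        scene.append(next_gaussian(scene, rng))  # autoregressive loop
    return scene

for g in generate_scene(3):
    print(g)
```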

What agentic multimodal benchmarks exist?

Agentic-MME, MiroEval (for deep-research agents), and CoME-VL evaluate both process and outcome. GEMS adds agent-native generation with memory and skills.
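
A minimal sketch of the process-plus-outcome evaluation pattern these benchmarks describe (the trajectory format, exact-match outcome check, and scoring weights are illustrative assumptions): the outcome score checks the final answer, while the process score checks each intermediate step against a rubric.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str      # e.g. "search", "read", "answer"
    ok: bool         # did this step satisfy the rubric? (judged elsewhere)

@dataclass
class Trajectory:
    steps: list
    final_answer: str

def score(traj, reference_answer, w_process=0.5, w_outcome=0.5):
    """Combine a step-level process score with a final-answer outcome score.
    The weights and exact-match check are assumptions of this sketch."""
    process = sum(s.ok for s in traj.steps) / max(len(traj.steps), 1)
    outcome = float(traj.final_answer.strip() == reference_answer.strip())
    return w_process * process + w_outcome * outcome

traj = Trajectory(
    steps=[Step("search", True), Step("read", True), Step("answer", False)],
    final_answer="42",
)
print(score(traj, "42"))  # 0.5 * (2/3) + 0.5 * 1.0 ~ 0.833
```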

What continual learning challenges face video LLMs?

Benchmarking Continual Learning in Video LLMs highlights open needs in the area; Omni-SimpleMem applies autoresearch to lifelong multimodal memory.
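
A minimal sketch of a lifelong multimodal memory store of the kind Omni-SimpleMem points toward (the shared embedding space, the class name, and the nearest-neighbor retrieval are illustrative assumptions, not the system's actual design): memories from any modality are written as embedding/metadata pairs and retrieved by similarity, so the store can keep growing across sessions.

```python
import numpy as np

class SimpleMultimodalMemory:
    """Toy lifelong memory: append-only store of (embedding, metadata) pairs.
    Any modality can be written, provided it is embedded into the same space
    (the shared embedding space is an assumption of this sketch)."""

    def __init__(self, dim):
        self.dim = dim
        self.embeddings = np.empty((0, dim))
        self.items = []

    def write(self, embedding, metadata):
        emb = np.asarray(embedding, dtype=float).reshape(1, self.dim)
        self.embeddings = np.vstack([self.embeddings, emb])
        self.items.append(metadata)

    def retrieve(self, query_embedding, k=3):
        if not self.items:
            return []
        q = np.asarray(query_embedding, dtype=float)
        # Cosine similarity against every stored memory, highest first.
        sims = self.embeddings @ q / (
            np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(q) + 1e-8
        )
        top = np.argsort(-sims)[:k]
        return [(float(sims[i]), self.items[i]) for i in top]

# Toy usage with random stand-in embeddings.
rng = np.random.default_rng(1)
mem = SimpleMultimodalMemory(dim=32)
mem.write(rng.normal(size=32), {"modality": "video", "note": "clip of a red car"})
mem.write(rng.normal(size=32), {"modality": "text", "note": "user prefers metric units"})
print(mem.retrieve(rng.normal(size=32), k=1))
```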

How do vision-language-action models advance?

MMaDA-VLA unifies multimodal instruction following and generation. The Think, Act, Build framework enables zero-shot 3D visual grounding.
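
A minimal sketch of the vision-language-action interface such models share (the observation/action types and the keyword-based stub below are illustrative assumptions, not MMaDA-VLA's or Think, Act, Build's actual API): the policy consumes an image plus a natural-language instruction and emits an action.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    image: np.ndarray      # H x W x 3 RGB frame
    instruction: str       # natural-language command

@dataclass
class Action:
    delta_xyz: tuple       # end-effector translation
    gripper_open: bool

class ToyVLAPolicy:
    """Stand-in policy (an assumption of this sketch): a real VLA model would
    run a multimodal backbone over the image and instruction, then decode
    action tokens; here we only show the input/output contract."""

    def act(self, obs):
        move_up = "lift" in obs.instruction.lower()
        return Action(delta_xyz=(0.0, 0.0, 0.05 if move_up else 0.0),
                      gripper_open="release" in obs.instruction.lower())

policy = ToyVLAPolicy()
obs = Observation(image=np.zeros((224, 224, 3), dtype=np.uint8),
                  instruction="lift the red block")
print(policy.act(obs))
```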

What is LeCun's role in these developments?

LeCun's JEPA influences multimodal progress, alongside PLUME embeddings and token warping for efficient video processing.
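
A minimal sketch of the joint-embedding predictive idea behind JEPA (the module sizes, random stand-in weights, and NumPy forward pass are illustrative assumptions, with no training loop): a context view and a target view are both encoded, a predictor maps the context embedding toward the target embedding, and the loss is computed in embedding space rather than pixel space.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random stand-in for a learned module (an assumption of this sketch)."""
    return rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))

# Encoders for the context view and the target view, plus a predictor that
# maps context embeddings toward target embeddings (the JEPA-style setup).
enc_context = linear(256, 64)
enc_target = linear(256, 64)
predictor = linear(64, 64)

def jepa_loss(context_view, target_view):
    z_ctx = context_view @ enc_context          # embed visible/context regions
    z_tgt = target_view @ enc_target            # embed masked/target regions
    z_pred = z_ctx @ predictor                  # predict the target embedding
    # Loss is measured in embedding space, not in pixel space.
    return float(np.mean((z_pred - z_tgt) ** 2))

context_view = rng.normal(size=256)   # e.g. features of visible frame regions
target_view = rng.normal(size=256)    # e.g. features of masked regions
print("embedding-space prediction loss:", jepa_loss(context_view, target_view))
```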

RISE/Omni leader; Qwen3.5-Omni; PLUME embeddings; streaming video; VideoZeroBench; Omni123 3D; Microsoft MAI foundation models; GaussianGPT; LeCun JEPA; token warping / CoME-VL / Agentic-MME.

Sources (15)
Updated Apr 8, 2026