LLM Innovation Tracker

Uni-1 Luma Labs Multimodal SOTA

Key Questions

What is the state-of-the-art in multimodal models like Uni-1?

Uni-1 from Luma Labs leads the RISE/Omni benchmarks, alongside Qwen3.5-Omni and PLUME embeddings. It excels at streaming video and on VideoZeroBench.

What advancements address high-res video in VLMs?

Token warping helps MLLMs handle nearby viewpoints; human vision operates at high resolution and high FPS, inspiring corresponding VLM improvements. Video-MME-v2 advances benchmarks for video understanding.
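
A minimal sketch of the token-reuse idea that motivates approaches like token warping (the function name, the cosine-similarity heuristic, and the threshold below are illustrative assumptions, not the published method): patch tokens are re-encoded only where they differ enough from the previous frame, so near-duplicate viewpoints contribute few new tokens.

```python
import numpy as np

def select_new_tokens(prev_tokens, curr_tokens, sim_threshold=0.95):
    """Illustrative token-reuse heuristic (an assumption of this sketch).

    prev_tokens, curr_tokens: (num_patches, dim) patch embeddings for two
    nearby frames/viewpoints. Returns indices of current-frame patches whose
    embeddings changed enough to be worth re-encoding; unchanged patches can
    reuse ("warp") the previous frame's tokens.
    """
    # Cosine similarity between corresponding patches of the two frames.
    prev_n = prev_tokens / np.linalg.norm(prev_tokens, axis=1, keepdims=True)
    curr_n = curr_tokens / np.linalg.norm(curr_tokens, axis=1, keepdims=True)
    sim = (prev_n * curr_n).sum(axis=1)
    return np.where(sim < sim_threshold)[0]

# Toy usage: 16 patches, 64-dim embeddings, only a few patches change.
rng = np.random.default_rng(0)
frame_a = rng.normal(size=(16, 64))
frame_b = frame_a.copy()
frame_b[[2, 7, 11]] += rng.normal(scale=2.0, size=(3, 64))  # simulate motion
changed = select_new_tokens(frame_a, frame_b)
print(f"re-encode {len(changed)}/16 patches:", changed)
```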

What are Microsoft's contributions to multimodal AI?

Microsoft launches MAI foundation models that rival OpenAI and Google, including MAI-Image-2. They support multimodal agentic frameworks.

How do 3D foundation models like Omni123 work?

Omni123 explores 3D-native models for visual grounding. GaussianGPT enables autoregressive 3D Gaussian scene generation.
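
A minimal sketch of what autoregressive 3D Gaussian scene generation can mean in practice (the primitive parameterization and the stand-in sampler below are assumptions, not GaussianGPT's actual architecture): a scene is a sequence of Gaussian primitives, and each step emits the next primitive's parameters conditioned on those already generated.

```python
from dataclasses import dataclass
import random

@dataclass
class Gaussian3D:
    """One splat-style primitive: position, log-scale, RGB color, opacity."""
    mean: tuple        # (x, y, z)
    log_scale: tuple   # per-axis log standard deviation
    color: tuple       # (r, g, b) in [0, 1]
    opacity: float

def next_gaussian(scene, rng):
    """Stand-in for a learned autoregressive step (an assumption): condition
    on the partial scene and emit the next primitive's parameters."""
    # A real model would decode parameters from a transformer over the
    # already-generated primitives; here we just jitter around the last mean.
    last = scene[-1].mean if scene else (0.0, 0.0, 0.0)
    mean = tuple(m + rng.gauss(0.0, 0.5) for m in last)
    return Gaussian3D(
        mean=mean,
        log_scale=tuple(rng.gauss(-2.0, 0.3) for _ in range(3)),
        color=tuple(rng.random() for _ in range(3)),
        opacity=rng.random(),
    )

def generate_scene(num_primitives=8, seed=0):
    rng = random.Random(seed)
    scene = []
    for _ in range(num_primitives):
        scene.append(next_gaussian(scene, rng))  # autoregressive loop
    return scene

for g in generate_scene(3):
    print(g)
```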

What agentic multimodal benchmarks exist?

Agentic-MME, MiroEval (for deep-research agents), and CoME-VL evaluate both process and outcome. GEMS adds agent-native generation with memory and skills.
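
A minimal sketch of the process-plus-outcome evaluation pattern these benchmarks describe (the trajectory format, exact-match outcome check, and scoring weights are illustrative assumptions): the outcome score checks the final answer, while the process score checks each intermediate step against a rubric.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str      # e.g. "search", "read", "answer"
    ok: bool         # did this step satisfy the rubric? (judged elsewhere)

@dataclass
class Trajectory:
    steps: list
    final_answer: str

def score(traj, reference_answer, w_process=0.5, w_outcome=0.5):
    """Combine a step-level process score with a final-answer outcome score.
    The weights and exact-match check are assumptions of this sketch."""
    process = sum(s.ok for s in traj.steps) / max(len(traj.steps), 1)
    outcome = float(traj.final_answer.strip() == reference_answer.strip())
    return w_process * process + w_outcome * outcome

traj = Trajectory(
    steps=[Step("search", True), Step("read", True), Step("answer", False)],
    final_answer="42",
)
print(score(traj, "42"))  # 0.5 * (2/3) + 0.5 * 1.0 ~ 0.833
```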

What continual learning challenges face video LLMs?

Benchmarking Continual Learning in Video LLMs highlights open needs in the area; Omni-SimpleMem applies autoresearch to lifelong multimodal memory.
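
A minimal sketch of a lifelong multimodal memory store of the kind Omni-SimpleMem points toward (the shared embedding space, the class name, and the nearest-neighbor retrieval are illustrative assumptions, not the system's actual design): memories from any modality are written as embedding/metadata pairs and retrieved by similarity, so the store can keep growing across sessions.

```python
import numpy as np

class SimpleMultimodalMemory:
    """Toy lifelong memory: append-only store of (embedding, metadata) pairs.
    Any modality can be written, provided it is embedded into the same space
    (the shared embedding space is an assumption of this sketch)."""

    def __init__(self, dim):
        self.dim = dim
        self.embeddings = np.empty((0, dim))
        self.items = []

    def write(self, embedding, metadata):
        emb = np.asarray(embedding, dtype=float).reshape(1, self.dim)
        self.embeddings = np.vstack([self.embeddings, emb])
        self.items.append(metadata)

    def retrieve(self, query_embedding, k=3):
        if not self.items:
            return []
        q = np.asarray(query_embedding, dtype=float)
        # Cosine similarity against every stored memory, highest first.
        sims = self.embeddings @ q / (
            np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(q) + 1e-8
        )
        top = np.argsort(-sims)[:k]
        return [(float(sims[i]), self.items[i]) for i in top]

# Toy usage with random stand-in embeddings.
rng = np.random.default_rng(1)
mem = SimpleMultimodalMemory(dim=32)
mem.write(rng.normal(size=32), {"modality": "video", "note": "clip of a red car"})
mem.write(rng.normal(size=32), {"modality": "text", "note": "user prefers metric units"})
print(mem.retrieve(rng.normal(size=32), k=1))
```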

How do vision-language-action models advance?

MMaDA-VLA unifies multimodal instruction following and generation. The Think, Act, Build framework enables zero-shot 3D visual grounding.
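
A minimal sketch of the vision-language-action interface such models share (the observation/action types and the keyword-based stub below are illustrative assumptions, not MMaDA-VLA's or Think, Act, Build's actual API): the policy consumes an image plus a natural-language instruction and emits an action.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    image: np.ndarray      # H x W x 3 RGB frame
    instruction: str       # natural-language command

@dataclass
class Action:
    delta_xyz: tuple       # end-effector translation
    gripper_open: bool

class ToyVLAPolicy:
    """Stand-in policy (an assumption of this sketch): a real VLA model would
    run a multimodal backbone over the image and instruction, then decode
    action tokens; here we only show the input/output contract."""

    def act(self, obs):
        move_up = "lift" in obs.instruction.lower()
        return Action(delta_xyz=(0.0, 0.0, 0.05 if move_up else 0.0),
                      gripper_open="release" in obs.instruction.lower())

policy = ToyVLAPolicy()
obs = Observation(image=np.zeros((224, 224, 3), dtype=np.uint8),
                  instruction="lift the red block")
print(policy.act(obs))
```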

What is LeCun's role in these developments?

LeCun's JEPA influences multimodal progress, alongside PLUME embeddings and token warping for efficient video processing.
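
A minimal sketch of the joint-embedding predictive idea behind JEPA (the module sizes, random stand-in weights, and NumPy forward pass are illustrative assumptions, with no training loop): a context view and a target view are both encoded, a predictor maps the context embedding toward the target embedding, and the loss is computed in embedding space rather than pixel space.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random stand-in for a learned module (an assumption of this sketch)."""
    return rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))

# Encoders for the context view and the target view, plus a predictor that
# maps context embeddings toward target embeddings (the JEPA-style setup).
enc_context = linear(256, 64)
enc_target = linear(256, 64)
predictor = linear(64, 64)

def jepa_loss(context_view, target_view):
    z_ctx = context_view @ enc_context          # embed visible/context regions
    z_tgt = target_view @ enc_target            # embed masked/target regions
    z_pred = z_ctx @ predictor                  # predict the target embedding
    # Loss is measured in embedding space, not in pixel space.
    return float(np.mean((z_pred - z_tgt) ** 2))

context_view = rng.normal(size=256)   # e.g. features of visible frame regions
target_view = rng.normal(size=256)    # e.g. features of masked regions
print("embedding-space prediction loss:", jepa_loss(context_view, target_view))
```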

RISE/Omni leader; Qwen3.5-Omni; PLUME embeddings; streaming video; VideoZeroBench; Omni123 3D; Microsoft MAI foundation models; GaussianGPT; LeCun JEPA; token warping / CoME-VL / Agentic-MME.

Sources (15)
Updated Apr 8, 2026