AI Innovation Radar

Multimodal and Audio Generation Breakthroughs

LLaVA-OneVision-2 achieves 74.9 JumpScore mAP, versus 30.1 for Qwen3-VL, using codec-stream tokenization for efficient long-video understanding. StepAudio 2.5 Realtime sweeps all five voice AI benchmarks, beating GPT Realtime 1.5 and Gemini Live. Stable Audio 3 is released with open weights, a SAME autoencoder with 4096x downsampling, variable-length generation, and a three-stage training pipeline. LongAV-Compass is a benchmark for evaluating minute-scale audio-visual generation. LocateAnything (NVlabs) uses parallel box decoding for VLM grounding.
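
To give a rough sense of what a 4096x downsampling factor means for sequence length, the sketch below counts latent frames for clips of a few durations. The 44.1 kHz sample rate, the round-up of the final partial window, and the function name `latent_frames` are illustrative assumptions, not details taken from the Stable Audio 3 release.

```python
import math


def latent_frames(duration_s: float,
                  sample_rate_hz: int = 44_100,  # assumed, not from the release
                  downsample: int = 4096) -> int:
    """Count latent frames for a clip at a fixed temporal downsampling factor.

    Rounds up so the final partial window still gets a frame (an assumption
    about padding; a real model may truncate instead).
    """
    num_samples = round(duration_s * sample_rate_hz)
    return math.ceil(num_samples / downsample)


if __name__ == "__main__":
    for seconds in (10, 60, 180):
        print(f"{seconds:>4} s -> {latent_frames(seconds)} latent frames")
```

Under these assumptions, one second of audio maps to roughly 10 to 11 latent frames, which is why a high downsampling factor keeps minute-scale, variable-length generation tractable.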

Updated May 27, 2026