Multimodal efficiency/long-video/speech (Tuna-2, World-R1, IAM)
Key Questions
What does WBench evaluate in video world models?
WBench is a multi-turn benchmark for evaluating interactive video world models. It tests persistent visual memory and long-horizon multimodal tasks.
How does Pantheon360 advance 3D-aware video generation?
Pantheon360 introduces 3D-aware 360° video diffusion for improved digital twin generation. It unifies pixel and physics-based generation at multiple scales.
What roadmap does native multimodal modeling propose?
"Toward Native Multimodal Modeling" outlines a roadmap for handling text, audio, and visual inputs in a single unified model. Reported gains are +10% on audio benchmarks and +18% on visual benchmarks, with models such as Tuna-2.
How does adversarial flow distillation benefit video generation?
On-policy adversarial flow distillation is a distillation technique aimed at improving autoregressive video generation. It is listed together with ParaVT, which addresses parallel tool use in video RL scenarios.
What role does OceanPile play in multimodal training?
OceanPile serves as a large corpus supporting multimodal efficiency research alongside Persona and MATHNET. It aids persistent visual memory and unified CoPD frameworks for speech and long-video tasks.
Summary: CoPD unifies; Audio AKB +10%; Visual +18%; Persona > 4o; MATHNET; UniVidX; Persistent Visual Memory LVLMs; OceanPile corpus.
New: Pantheon360 (3D-aware 360° video diffusion); Native Multimodal Modeling roadmap; WBench (multi-turn world model evaluation); adversarial flow distillation for video; ParaVT (parallel tool use in video RL).
Pixel and physics-based generation are unified across scales.