Multimodal world-models + embodied planning probes

Key Questions

What is Nemotron-3 Nano Omni?

Nemotron-3 Nano Omni is an open-source multimodal model from NVIDIA with a 256K-token context window. It handles vision, audio, and GUI inputs and has been demoed on DGX and Jetson hardware. It is reported to deliver a 9x throughput gain and state-of-the-art results on the EVS and MediaPerf benchmarks.

How does Tuna-2 improve vision processing?

Tuna-2 uses pixel embeddings that are reported to outperform Vision Transformer (ViT) encoders on multimodal tasks.

What is Meta Sapiens2?

Meta Sapiens2 is a high-resolution, human-centric vision model covering pose estimation, segmentation, and the prediction of surface normals, pointmaps, and albedo. It targets challenges in motion capture.

What does CoPE-VideoLM achieve in video encoding?

CoPE-VideoLM encodes video for LLMs more efficiently, cutting the number of video tokens by 93% while maintaining performance.
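To make the 93% figure concrete, the arithmetic can be sketched as below. This is only an illustration of what a 93% token reduction means for a context budget, not CoPE-VideoLM's encoding method; the function name, the 512-frame clip, and the 196 tokens per frame are hypothetical values chosen for the example.

```python
def tokens_after_reduction(baseline_tokens: int, reduction_percent: int = 93) -> int:
    """Return how many tokens remain after cutting `reduction_percent` of them.

    Integer arithmetic is used so the result is exact (rounded down).
    """
    if not 0 <= reduction_percent <= 100:
        raise ValueError("reduction_percent must be between 0 and 100")
    return baseline_tokens * (100 - reduction_percent) // 100


# Hypothetical clip: 512 frames at 196 tokens per frame (assumed numbers).
baseline = 512 * 196                      # 100352 tokens before reduction
remaining = tokens_after_reduction(baseline)
print(baseline, remaining)                # prints "100352 7024"
```

Under these assumed numbers, a clip that would otherwise take roughly 100K tokens fits in about 7K, which is the practical appeal of the reduction for long-video inputs.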

What are the key features of Gemma4?

Gemma4 is a 31B-parameter mixture-of-experts (MoE) multimodal model that supports vision and other modalities and has been demoed on DGX and Jetson hardware.

What is PersonaVLM?

PersonaVLM is a framework for long-term personalization of multimodal LLMs, so that personalization is sustained across interactions rather than reset each time.

What progress has AGIBOT GO-2 made?

AGIBOT GO-2 achieves 98.5% on the LIBERO benchmark for embodied planning.

What is the status of multimodal world-models development?

The field is still developing. Recent advances include Long-VITA (1M-token vision-language context), Schmidhuber HY-World 2.0, and emerging post-training techniques for video generation.

Gemma4/Nemotron-3 Nano Omni: 256K context, vision/audio/GUI, DGX/Jetson demos (9x throughput, open source, EVS/MediaPerf SOTA); Tuna-2 pixel embeddings beat ViT; Meta Sapiens2 human-centric vision; SketchVLM annotations; Long-VITA 1M-token vision-language context; OmniHuman; OneVL; Schmidhuber HY-World 2.0; GlobalSplat/HiVLA; AGIBOT GO-2 98.5% on LIBERO; JEPA; PersonaVLM; Semantic Progress Function; CoPE-VideoLM 93% token reduction; emerging post-training tweaks for video generation.
