Vision Research Tracker

**DeepRoute.ai 40B VLA, NVIDIA Nemotron/GR00T & embodied VLM push** [developing]

Key Questions

What is DeepRoute.ai's 40B VLA?

DeepRoute.ai unveiled a 40B-parameter Vision-Language-Action (VLA) model at GTC, targeting autonomous driving and embodied AI. The model couples visual perception, language understanding, and action generation in a single network.
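
As a rough illustration of the VLA pattern (vision and language tokens conditioning an action output), a minimal PyTorch sketch is shown below; the module names, dimensions, and action head are assumptions for illustration, not DeepRoute.ai's architecture.

```python
# Minimal sketch of a generic Vision-Language-Action (VLA) forward pass.
# All shapes, module names, and the action head are illustrative assumptions,
# NOT DeepRoute.ai's actual 40B architecture.
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    def __init__(self, d_model=256, n_actions=7, vocab=1000):
        super().__init__()
        # Vision encoder: patchify camera frames into tokens (stand-in for a ViT).
        self.vision = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Language encoder: embed instruction tokens.
        self.text = nn.Embedding(vocab, d_model)
        # Shared transformer fuses vision + language tokens.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Action head: regress a continuous control vector (e.g., steering/accel
        # for driving, or an end-effector delta for manipulation).
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, image, instruction_ids):
        v = self.vision(image).flatten(2).transpose(1, 2)  # (B, N_patches, d)
        t = self.text(instruction_ids)                      # (B, N_tokens, d)
        fused = self.fusion(torch.cat([v, t], dim=1))
        # Pool fused tokens and decode an action vector.
        return self.action_head(fused.mean(dim=1))          # (B, n_actions)

model = TinyVLA()
action = model(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 12)))
print(action.shape)  # torch.Size([1, 7])
```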

What are Nemotron and GR00T?

Nemotron and GR00T are NVIDIA model families pushing embodied VLMs into robotics, supporting humanoid control and agentic tasks.

What is UniDriveVLA?

UniDriveVLA, from Xiaomi, uses a Mixture of Transformers (MoT) to decouple spatial, semantic, and action processing for autonomous driving. It is open-source.
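
The exact UniDriveVLA design is not reproduced here; the sketch below illustrates the general MoT idea of routing spatial, semantic, and action token streams through separate transformer experts rather than one shared stack. All names and the cross-stream mixer are illustrative assumptions.

```python
# Sketch of a Mixture-of-Transformers (MoT) style split: spatial, semantic, and
# action token streams each get their own transformer stack, decoupling the three
# kinds of processing. Illustrative only; not the actual UniDriveVLA implementation.
import torch
import torch.nn as nn

def expert(d_model=256, layers=2):
    layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

class MoTBlock(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        # One expert per stream, rather than a single shared transformer.
        self.spatial_expert = expert(d_model)   # e.g., BEV / geometry tokens
        self.semantic_expert = expert(d_model)  # e.g., language / scene semantics
        self.action_expert = expert(d_model)    # e.g., trajectory / control tokens
        # Lightweight cross-stream mixing after the experts (one design choice
        # among many; real MoT variants may share attention instead).
        self.mixer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

    def forward(self, spatial, semantic, action):
        s = self.spatial_expert(spatial)
        m = self.semantic_expert(semantic)
        a = self.action_expert(action)
        mixed = self.mixer(torch.cat([s, m, a], dim=1))
        # Split back into the three streams by their original lengths.
        ns, nm = s.size(1), m.size(1)
        return mixed[:, :ns], mixed[:, ns:ns + nm], mixed[:, ns + nm:]

block = MoTBlock()
out = block(torch.randn(1, 64, 256), torch.randn(1, 16, 256), torch.randn(1, 8, 256))
print([o.shape for o in out])
```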

What is OpenClaw/LaViRA?

OpenClaw/LaViRA enables zero-shot, natural-language mobile manipulation covering navigation, grasping, and delivery. An arXiv paper and code are available, though API access blocks are pushing users toward local open-source alternatives.
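
A minimal sketch of the language-to-primitives idea follows; `plan_with_llm`, the `Step` type, and the navigate/grasp/deliver signatures are hypothetical placeholders, not OpenClaw/LaViRA's actual API.

```python
# Sketch of a zero-shot language-to-primitives pipeline: an instruction is
# decomposed into navigate/grasp/deliver primitives and dispatched to controllers.
# `plan_with_llm` and the primitive signatures are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Step:
    primitive: str   # one of: "navigate", "grasp", "deliver"
    target: str      # object or location referred to in the instruction

def plan_with_llm(instruction: str) -> list[Step]:
    # Stub: a real system would call an LLM/VLM to decompose the instruction.
    # A local open-source model could be swapped in here if hosted APIs are blocked.
    return [
        Step("navigate", "kitchen counter"),
        Step("grasp", "red mug"),
        Step("deliver", "desk"),
    ]

def execute(steps: list[Step]) -> None:
    for step in steps:
        # A real robot would dispatch to navigation / grasping / placing controllers.
        print(f"[robot] {step.primitive} -> {step.target}")

execute(plan_with_llm("Bring the red mug from the kitchen counter to my desk."))
```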

What is HandX?

HandX is a 54-hour bimanual interaction dataset that scales up realistic two-hand task data for advanced robotic manipulation.

What is SMASH?

SMASH demonstrates humanoid ping-pong using egocentric onboard vision. It showcases real-time embodied AI capabilities.

What is QIVD?

QIVD is a real-time VLM benchmark for live QA using camera/mic inputs. It tests face-to-face question answering in the real world.
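
A minimal sketch of such a live camera/mic QA loop follows, assuming OpenCV for frame capture; `transcribe_question` and `answer_with_vlm` are placeholder stubs, not QIVD's actual harness.

```python
# Minimal sketch of a live camera/mic QA loop of the kind a real-time VLM benchmark
# such as QIVD might exercise. The stubs below stand in for an ASR model and a VLM;
# they are assumptions, not QIVD's actual evaluation harness.
import time
import cv2

def transcribe_question(audio_chunk) -> str:
    # Stub: a real harness would run speech recognition on microphone audio.
    return "What object am I holding?"

def answer_with_vlm(frame, question: str) -> str:
    # Stub: a real harness would call a vision-language model on frame + question.
    return "(model answer)"

def live_qa_loop(num_questions: int = 3, max_latency_s: float = 2.0) -> None:
    cap = cv2.VideoCapture(0)  # default webcam
    try:
        for _ in range(num_questions):
            question = transcribe_question(audio_chunk=None)
            ok, frame = cap.read()
            if not ok:
                break
            start = time.monotonic()
            answer = answer_with_vlm(frame, question)
            latency = time.monotonic() - start
            # A real-time benchmark scores both answer quality and response latency.
            status = "OK" if latency <= max_latency_s else "TOO SLOW"
            print(f"Q: {question!r} -> A: {answer!r} ({latency:.2f}s, {status})")
    finally:
        cap.release()

if __name__ == "__main__":
    live_qa_loop()
```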

What benchmarks are key for VLAs?

Key agentic benchmarks include QIVD, VTC, ESPIRE, ProactiveBench, BrowseComp-VL, GameplayQA, LIBERO, AndroidWorld, MEDFLOWBENCH, HandX, OpenClaw, and MiroEval, with a focus on action/skill ablations.
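
What an action/skill ablation might look like in practice: re-run a policy on a benchmark with individual skills disabled and compare success rates against the full system. The sketch below uses a fake evaluator; the skill names and `evaluate` callable are illustrative assumptions, not tied to any of the benchmarks above.

```python
# Sketch of an action/skill ablation: re-evaluate a policy with individual skills
# disabled and compare success rates against the full system. Benchmark, skill
# names, and the `evaluate` callable are illustrative assumptions.
from typing import Callable, Iterable

def run_ablation(evaluate: Callable[[set], float], skills: Iterable[str]) -> dict:
    skills = set(skills)
    baseline = evaluate(skills)             # success rate with all skills enabled
    results = {"full": baseline}
    for skill in sorted(skills):
        score = evaluate(skills - {skill})  # drop one skill at a time
        results[f"-{skill}"] = score
        print(f"without {skill:12s}: {score:.3f} (delta {score - baseline:+.3f})")
    return results

# Example usage with a fake evaluator standing in for a LIBERO/AndroidWorld-style run.
if __name__ == "__main__":
    def fake_evaluate(enabled: set) -> float:
        weights = {"grasp": 0.30, "navigate": 0.25, "open_drawer": 0.15}
        return 0.2 + sum(w for s, w in weights.items() if s in enabled)

    run_ablation(fake_evaluate, ["grasp", "navigate", "open_drawer"])
```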

- DeepRoute.ai 40B VLA (GTC), Nemotron/GR00T, OpenEMMA, Levine FMs, Xiaomi UniDriveVLA (MoT for AV, decoupling spatial/semantic/action; OSS)
- New: FASTER/AutoMoT/VLA-Adapter/ProbeFlow/Recurrent/SeqVLA/EVA/Unify-Agent/GEMS memory/skills (Vero open RL recipe for visual reasoning)
- DiT4DiT (LIBERO SOTA); UI-Voyager (81% AndroidWorld); Trace2Skill trajectory skills
- HandX (54h bimanual dataset)
- OpenClaw/LaViRA (zero-shot natural-language mobile manipulation: nav/grasp/deliver; arXiv/code; API blocks pushing to local OSS)
- SMASH (humanoid ping-pong, egocentric onboard vision)
- Agentic 3D grounding; GLM-5V-Turbo GUI automation (free access/RL)
- QIVD (real-time VLM live-QA bench, camera/mic)
- VLA fine-tuning options; action ablations
- RoboClaw/Look Before Acting/ESPIRE/ProactiveBench/VTC/MiroEval
- Dream2Flow/VLN; WebWatcher (BrowseComp-VL); MedOpenClaw 3D medical
- Repro: weights, agentic benches (QIVD/VTC/ESPIRE/ProactiveBench/BrowseComp-VL/GameplayQA/LIBERO/AndroidWorld/MEDFLOWBENCH/HandX/OpenClaw/MiroEval), action/skill ablations

Sources (14)
Updated Apr 8, 2026