# Edge AI Revolution Accelerates: Powering Multimodal and Autonomous Models Directly on Local Hardware
The landscape of artificial intelligence (AI) is undergoing a seismic shift. No longer confined to sprawling cloud data centers, **powerful multimodal and autonomous AI models** are now increasingly capable of **running efficiently and reliably on local devices**—ranging from personal computers and smartphones to embedded robots and Internet of Things (IoT) gadgets. This transformation is driven by rapid hardware advancements, innovative model optimization techniques, and a vibrant open-source ecosystem, collectively heralding an era where **privacy, low latency, and democratized access** are central to AI deployment.
---
## From Cloud Dependency to On-Device Autonomy: A Paradigm Shift
Historically, deploying large language models (LLMs) and multimodal systems required reliance on **cloud infrastructure** due to their significant computational and memory demands. This dependency introduced several challenges:
- **High operational costs** associated with cloud services
- **Latency issues** that hinder real-time responsiveness
- **Data privacy concerns**, especially in sectors like healthcare, finance, and personal devices
These barriers long constrained on-device AI adoption. Recent breakthroughs are now dismantling them, enabling **sophisticated AI to operate entirely locally**. The shift makes **privacy-preserving, low-latency, and broadly accessible AI applications** practical across diverse domains—from autonomous robots to personal assistants.
---
## Key Drivers Accelerating On-Device Multimodal AI
### 1. Advanced Model Compression and Optimization Techniques
To deploy large models on resource-constrained hardware, researchers have developed several strategies:
- **Quantization**: Converting weights from FP32 to lower-bit formats (e.g., int8), significantly reducing model size and inference latency with minimal accuracy loss.
- **Pruning**: Removing redundant or less impactful weights to streamline models.
- **Knowledge Distillation**: Training smaller, efficient models to emulate larger ones, preserving performance while reducing resource requirements.
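To make the quantization idea concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization of a weight vector. It is illustrative only: production toolchains such as TensorFlow Lite or PyTorch apply this per-tensor or per-channel with calibration data.

```python
# Minimal symmetric int8 quantization sketch (illustrative, not a production path).

def quantize_int8(weights):
    """Map FP32 weights to int8 codes using a single symmetric scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate FP32 values from int8 codes."""
    return [v * scale for v in q]

weights = [0.42, -1.31, 0.07, 0.95, -0.58]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)

# int8 storage is 4x smaller than FP32; rounding error per weight
# is bounded by half the quantization step (scale / 2).
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
```

The 4x size reduction comes purely from storing 8-bit codes instead of 32-bit floats; the accuracy cost is the bounded rounding error shown above, which is why quantization typically costs only a small accuracy loss.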
### 2. Specialized Hardware Accelerators
Edge devices now incorporate **dedicated AI accelerators** that facilitate **high-performance, energy-efficient inference**:
- NVIDIA Jetson modules
- Google Edge TPU
- Intel Movidius Myriad chips
- Mobile System-on-Chips (SoCs) like **Apple Silicon** and **Qualcomm Snapdragon**
These accelerators enable **real-time multimodal processing**, including visual understanding, speech recognition, and autonomous decision-making, **all on device**.
### 3. Optimized Frameworks and Runtime Environments
Frameworks such as **TensorFlow Lite**, **TensorRT**, and **ONNX Runtime** provide **model conversion, optimization, and deployment pipelines** tailored for edge hardware, while inference engines like **vLLM** serve optimized models behind local APIs. Together they enable **robust, low-latency AI applications** even on limited-resource devices.
### 4. Growing Open-Source Ecosystem
Open-source projects and community initiatives have accelerated development:
- **LM Studio** offers environments for **offline large model deployment**, fine-tuning, and customization.
- Open-weight models like **Llama 2**, **Vicuna**, and **GPT-J** give developers freely modifiable starting points for local deployment and fine-tuning.
- Practical resources—such as **"Visual Language Perspectives"** and **local multimodal retrieval-augmented generation (RAG)** pipelines—provide frameworks for building **comprehensive edge AI systems**.
---
## State-of-the-Art Multimodal and Autonomous AI Models for Edge Deployment
Recent advances have made **high-capacity multimodal and agentic models** feasible on local hardware:
### Multimodal Foundation Models
- **CLIP**: Continues to serve as a backbone for **visual question answering**, **image captioning**, and **visual search**, optimized for deployment on resource-limited devices.
- **SAM 3 (Segment Anything Model 3)**: Offers **advanced scene segmentation**, crucial for robotic perception and augmented reality.
- **Qwen-Image-2512**: An **80-billion-parameter open-source multimodal model** capable of **detailed image comprehension and visual generation**, enabling **real-time visual reasoning** directly on edge hardware.
- **Youtu-VL-4B-Instruct**: A 4-billion-parameter model designed for **visual reasoning** and **instruction following** within constrained environments.
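The CLIP-style workloads above (visual search, zero-shot classification) ultimately reduce to comparing image and text embeddings by cosine similarity. A minimal sketch with made-up placeholder embeddings; a real deployment would obtain these vectors from CLIP's image and text encoders:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Placeholder embeddings; in practice these come from CLIP's encoders.
image_emb = [0.2, 0.9, 0.1]
captions = {
    "a photo of a dog": [0.1, 0.95, 0.05],
    "a photo of a car": [0.9, 0.1, 0.3],
}

# Zero-shot classification: pick the caption whose embedding is closest.
best = max(captions, key=lambda c: cosine_similarity(image_emb, captions[c]))
```

Because the comparison is a handful of dot products, matching against thousands of candidate labels or cached image embeddings is cheap enough to run on a phone-class SoC.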
### Robotic and VLA Innovations
- **Orthogonal Composite Tokens**: Improve **modality alignment** and **reasoning robustness**.
- **Green-VLA**: Integrates **vision, language, and action**, creating a **generalist robotic architecture** capable of **autonomous perception, reasoning, and manipulation** entirely on device. By **February 2026**, this system demonstrated **edge-powered autonomous robots** operating **without cloud reliance**.
### Agentic Multimodal AI
- **Kimi K2.5** exemplifies a **paradigm shift**: a **multimodal, agentic model** that enables **autonomous decision-making** and **task execution** on limited hardware, empowering **offline robots, vehicles, and embedded systems** to **perceive, think, and act locally**.
### Large Multimodal Releases
- **Alibaba’s Qwen3.5 MoE** (Mixture of Experts): A **milestone in scalable, high-capacity multimodal AI**, utilizing **dynamic routing** across multiple experts. It **outperforms models such as GPT-5.2 and Claude on public benchmarks**, while emphasizing **edge deployability** and **scalability**.
---
## Building and Fine-Tuning Multimodal Models on the Edge
Innovative methods are transforming model development:
- **From scratch**: Approaches inspired by DeepMind’s **Flamingo** keep a **frozen vision encoder** and a largely frozen language model, training only **lightweight cross-attention adapter layers** between them to cut training costs.
- **Efficient fine-tuning**: Techniques like **ViT-LoRA (Low-Rank Adaptation for Vision Transformers)** support **resource-light domain adaptation**, **personalization**, and **continual learning** directly on edge hardware.
- **Video-to-data pipelines**: Tools such as **WAT.ai** enable **real-time video preprocessing** for structured data generation, facilitating **edge inference** and **training**.
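The resource savings behind LoRA-style fine-tuning come from learning a low-rank update `ΔW = B @ A` instead of a full weight matrix. This illustrative sketch (plain Python, hypothetical layer dimensions) counts the trainable parameters for one projection matrix:

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters of a full update vs. a rank-r LoRA update W + B @ A."""
    full = d_out * d_in                # full fine-tune of one weight matrix
    lora = d_out * rank + rank * d_in  # B is (d_out x r), A is (r x d_in)
    return full, lora

# A single 4096x4096 projection at rank 8 (hypothetical transformer sizes).
full, lora = lora_param_counts(4096, 4096, 8)
reduction = full / lora  # 256x fewer trainable parameters per adapted matrix
```

That two-orders-of-magnitude reduction in trainable (and optimizer-state) memory is what makes on-device domain adaptation and continual learning plausible at all.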
### Practical Deployment Resources
- **GutenOCR**: A **grounded OCR frontend** capable of **local deployment** to ensure **robust text recognition** without reliance on cloud services. Models like **GutenOCR-7B** integrate **vision and language**.
- **PaddleOCR-VL-1.5**: Baidu’s **multimodal document parser** offers **state-of-the-art performance** in **multimodal document understanding**, optimized for **efficient inference on edge devices**.
---
## Advances in Perception, Robustness, and Fusion
Research continues to enhance **perception robustness**, **grounding**, and **trustworthiness**:
- **Benchmarking spatial reasoning**: The paper **"Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs"** evaluates models such as **Gemini 2.5 Pro**, showing **significant progress** in **scene understanding** and **spatial reasoning**—crucial for **autonomous robotics** and **AR**.
- **Region-to-Image Distillation**: The method **"Zooming without Zooming"** improves **localized perception** without added complexity, supporting **more precise visual understanding** on limited hardware.
- **Bias and fairness**: Studies like **"Understanding Human-Like Biases in VLMs via Subjective Face Analytics"** highlight **biases** in vision-language models, prompting ongoing **mitigation efforts**.
- **Test-time robustness**: The **"WACV 2026: Test-Time Consistency in Vision Language Models"** paper proposes strategies for **robust, consistent performance** across diverse real-world scenarios.
### Perception Fusion Breakthroughs
- **RoboFlamingo-Plus** exemplifies **fusion of depth and RGB perception**, significantly enhancing **scene understanding** for **robot navigation and manipulation**.
---
## Building VLA Systems: Recipes, Best Practices, and Deployment Guides
**VLANeXt** and similar initiatives offer **comprehensive recipes** for developing **robust VLA (Vision-Language-Action)** systems:
- Designing **multi-modal architectures** optimized for edge hardware
- Implementing **resource-efficient training and fine-tuning workflows**
- Seamlessly integrating **perception, reasoning, and action modules**
Recent practical guides—such as **"Deploying Open Source Vision Language Models (VLM) on Jetson"**—demonstrate **feasibility and performance** of high-capacity models **on NVIDIA Jetson platforms**.
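Many local runtimes, including LM Studio and vLLM, expose an OpenAI-compatible HTTP endpoint, so an on-device client only needs to build a chat request carrying an inline base64 image. A hedged sketch: the endpoint URL and model name below are placeholders, not values from any specific guide.

```python
import base64
import json

def build_vlm_request(prompt, image_bytes, model="local-vlm"):
    """Build an OpenAI-compatible chat payload with an inline base64 image.

    The model name and endpoint are placeholders; adjust to the runtime in use.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

payload = build_vlm_request("Describe this scene.", b"\x89PNG placeholder")
body = json.dumps(payload)  # POST to e.g. http://localhost:8000/v1/chat/completions
```

Because the wire format matches the cloud APIs, applications can swap a hosted backend for a Jetson-hosted one by changing only the base URL.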
---
## New Development Spotlight: LLM-Driven 3D Action Reasoning for Robotics
A groundbreaking addition is the emergence of **LLM-driven 3D action reasoning systems** tailored for **robotic manipulation tasks**, such as **brick stacking**:
- These frameworks **explicitly model 3D spatial reasoning**, enabling robots to **plan and execute complex physical actions**.
- They utilize **large language models** to **generate, evaluate, and refine** action sequences **in real-time**.
- **Perception modules** provide **up-to-date environment understanding**, creating **closed-loop control** for **autonomous, low-latency robots**.
This approach **significantly enhances on-device agentic planning**, allowing robots to **perform intricate manipulation tasks confidently** **without cloud reliance**—a major step toward **privacy-preserving autonomous systems**.
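The perceive–plan–act loop described above can be sketched as a toy closed-loop controller for brick stacking. Every component here is a stub: `llm_plan` stands in for the LLM planner and `perceive` for the perception module, purely to show the control-flow shape.

```python
def llm_plan(goal, state):
    """Stub planner standing in for an LLM: propose the next stacking action."""
    for brick in goal:
        if brick not in state["stacked"]:
            return ("stack", brick)
    return ("done", None)

def perceive(world):
    """Stub perception module: return a snapshot of the environment state."""
    return {"stacked": list(world["stacked"])}

def act(world, action):
    """Execute one action and update the simulated environment."""
    verb, brick = action
    if verb == "stack":
        world["stacked"].append(brick)

def control_loop(goal, world, max_steps=10):
    """Closed loop: perceive, plan one step, act, repeat until the goal holds."""
    for _ in range(max_steps):
        state = perceive(world)
        action = llm_plan(goal, state)
        if action[0] == "done":
            break
        act(world, action)
    return world["stacked"]

stacked = control_loop(["red", "green", "blue"], {"stacked": []})
```

Replanning from fresh perception on every iteration is what makes the loop closed: if an action fails or the scene changes, the next plan is computed against reality rather than a stale script.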
---
## Challenges and Future Directions
Despite remarkable progress, several challenges remain:
- **Robustness and generalization**: Ensuring models perform reliably across unpredictable, real-world environments.
- **On-device personalization and continual learning**: Developing **resource-efficient methods** for **adapting AI systems** over time.
- **Hardware-model co-design**: Innovating hardware specifically tailored for **multimodal, agentic models** to optimize **performance and energy efficiency**.
- **Transparency and safety**: Enhancing **interpretability**, **bias mitigation**, and **hallucination reduction**—especially critical in safety-sensitive applications.
---
## Current Status and Broader Implications
The convergence of **hardware innovations**, **model breakthroughs**, and a **thriving open-source community** signals that **powerful, multimodal, autonomous AI models** are **rapidly moving from cloud to edge**. They are becoming **integral components** in devices and systems—delivering **privacy-preserving**, **instantaneous**, and **democratized** AI capabilities.
Models such as **Kimi K2.5**, **Youtu-VL-4B-Instruct**, **Qwen3.5 MoE**, and **GutenOCR** exemplify **state-of-the-art functionalities** optimized for local deployment, empowering **personal assistants, autonomous vehicles, robotic systems**, and more.
---
## Implications and the Road Ahead
This **edge AI revolution** is poised to **transform human-AI interactions profoundly**. As research continues to address **robustness**, **personalization**, and **safety**, we can expect **more capable, reliable, and accessible** intelligent systems operating **entirely locally**.
The benefits include:
- **Enhanced privacy**—data remains on-device, reducing security risks
- **Instant responsiveness**—eliminating latency bottlenecks
- **Broader democratization**—enabling individuals, small teams, and organizations to deploy advanced AI
Looking forward, the integration of **native GUI agents** trained for reasoning and action—such as **GUI-Libra**—and **dynamic object hallucination mitigation techniques** like **NoLan** will further improve **local agentic interfaces** and **model reliability**.
---
## Final Outlook
The edge AI landscape is more vibrant than ever. With continuous advancements in **hardware co-design**, **model architecture**, and **training methodologies**, **powerful multimodal and autonomous AI systems** are increasingly **embeddable, efficient, and trustworthy**. They are transforming industries—robotics, healthcare, autonomous vehicles, and personal AI—bringing us closer to a future where **intelligent, privacy-preserving, and low-latency systems** are seamlessly integrated into our daily lives.