Vision Research Tracker

**MoE & compact high-performance VLMs — Phi-4 / Mistral / Nemotron / Molmo2 / Reka / Gemma 4 + SSM + prod opts + agentic accel** [developing]

Key Questions

What are the main milestones in compact high-performance vision-language models (VLMs)?

Key milestones include Mistral Small 4, Nemotron, Phi-4 (a 15B high-resolution hybrid for STEM/OCR), the Molmo2/Reka open-source releases, and DeepMind's Gemma 4 (2B-31B plus MoE variants with 256k context, beating GPT-4o). These models emphasize MoE, pruning, and token efficiency.
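Magnitude pruning is one of the compression levers mentioned above. As a generic illustration (not any particular model's recipe; function name and shapes are assumptions), a minimal NumPy sketch that zeroes the smallest-magnitude weights of a layer:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries until `sparsity` fraction is zero."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # Threshold = k-th smallest absolute value across the whole tensor.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))          # toy 4x4 weight matrix
pruned = magnitude_prune(w, 0.5)     # half the entries become zero
print(np.count_nonzero(pruned))
```

Real pipelines typically prune per-layer or per-channel and fine-tune afterward to recover accuracy; this sketch only shows the selection criterion.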

What is Phi-4 and its capabilities?

Phi-4 is a 15B-parameter high-resolution hybrid VLM that excels at STEM and OCR tasks. It is a milestone in compact, high-performance VLMs.

How does Gemma 4 perform compared to GPT-4o?

Gemma 4 is a family of open-source vision LLMs spanning 2B to 31B parameters plus MoE variants, with a 256k context length, and it outperforms GPT-4o on various benchmarks. It supports reasoning, agentic workflows, coding, and multimodal understanding on consumer GPUs.

What is HopChain and its benefits?

HopChain, from Alibaba's Qwen team, fixes multi-step reasoning failures in vision models, improving results on 20 of 24 benchmarks. It enables better chain-of-thought reasoning for VLMs such as Qwen-VL/3.5-Omni.

What is SteerViT?

SteerViT uses text-guided visual representations to enhance ViT saliency, steering the model's attention toward image regions relevant to a text prompt.
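The general idea of text-guided saliency can be sketched generically: score each image patch by its similarity to a text embedding, then normalize the scores into an attention-like distribution. The NumPy toy below is a stand-in for illustration only, not SteerViT's actual method; all names and shapes are assumptions:

```python
import numpy as np

def text_guided_saliency(patch_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Cosine-similarity scores of each patch against a text embedding,
    normalized into a saliency distribution via a stable softmax."""
    patches = patch_emb / np.linalg.norm(patch_emb, axis=-1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb)
    scores = patches @ text                      # (num_patches,)
    exp = np.exp(scores - scores.max())          # subtract max for stability
    return exp / exp.sum()

rng = np.random.default_rng(1)
patches = rng.normal(size=(16, 32))   # 16 patches, 32-dim embeddings
text = rng.normal(size=32)            # one text-prompt embedding
saliency = text_guided_saliency(patches, text)  # sums to 1 over patches
```

Such a distribution could then reweight patch tokens or guide which tokens are kept, which is how saliency signals usually connect to token efficiency.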

What is ViGoR-Bench?

ViGoR-Bench evaluates reasoning in visual models, covering both agentic and perceptual reasoning in VLMs.

What is LLaVA-DyMoE?

LLaVA-DyMoE addresses routing drift in MoE-based VLMs, improving the stability of dynamic expert routing.
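Routing drift refers to a MoE router's expert assignments shifting under small changes to inputs or weights. The toy NumPy sketch below shows standard top-k routing and one simple way to measure drift; it is illustrative only (not LLaVA-DyMoE's method), and all names are assumptions:

```python
import numpy as np

def topk_route(x: np.ndarray, router_w: np.ndarray, k: int = 2):
    """Linear router: pick the top-k experts per token and return
    their indices plus renormalized softmax gate weights."""
    logits = x @ router_w                             # (tokens, experts)
    top = np.argsort(logits, axis=-1)[:, -k:]         # top-k expert ids per token
    top_logits = np.take_along_axis(logits, top, axis=-1)
    exp = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    gates = exp / exp.sum(axis=-1, keepdims=True)     # gates sum to 1 per token
    return top, gates

rng = np.random.default_rng(2)
tokens = rng.normal(size=(8, 16))    # 8 tokens, 16-dim features
router = rng.normal(size=(16, 4))    # 4 experts
experts, gates = topk_route(tokens, router)

# One crude drift metric: fraction of tokens whose expert set changes
# after a small perturbation of the router weights (lower = more stable).
experts2, _ = topk_route(tokens, router + 0.01 * rng.normal(size=router.shape))
drift = np.mean([set(a) != set(b) for a, b in zip(experts, experts2)])
```

Stabilization methods generally penalize or smooth exactly this kind of assignment churn during training or continual updates.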

What benchmarks are priorities for these VLMs?

Priorities include the STEM, CapArena, ProactiveBench, MIRAGE, and ViGoR benchmarks, with ablations covering SSM, pruning, foveation, and self-evolution techniques, and a focus on power and latency metrics.

Summary

- Focus: MoE, pruning, and token efficiency for compact VLMs.
- Milestones: Mistral Small 4; Nemotron; Phi-4 (15B high-res hybrid for STEM/OCR); Molmo2/Reka OSS; DeepMind Gemma 4 (2B-31B + MoE OSS vision LLMs, 256k context, beats GPT-4o).
- New: PTQ LVLM; FALCON/MIRAGE/Falcon Perception; Attention Residuals/OutRo; ARA/logit anti-hallucination; SSM vs ViTs; AwaRes; Qwen-VL/3.5-Omni ROCm (omnimodal MoE; HopChain multi-step reasoning chains, +20/24 benches); RubiCap; SpecEyes; Foveated Diffusion; self-evolution GRPO (UI-Voyager); ARTA mixed-res dense ViT (54.6 mIoU); ResAdapt dynamic resolution; Token's Dilemma drift-aware MoE continual learning; HISA sparse attention; LLaVA-DyMoE routing drift; OptiMer vector merging; ICCTP cross-modal pruning; SteerViT text-guided ViT saliency; ViGoR-Bench reasoning.
- Priorities: checkpoints; STEM/CapArena/ProactiveBench/MIRAGE/ViGoR benchmarks; ablations on SSM/pruning/sink/crop/spec/foveation/self-evo/ARTA/ResAdapt/SteerViT/HopChain; power/latency.

Sources (14)
Updated Apr 8, 2026