AI Research & Policy Brief

Evaluation, hallucinations, defenses and tooling (benchmarks & traceability)

Key Questions

What is Claw-Eval in multimodal context?

Claw-Eval is presented as a benchmark for trustworthy agent evaluation that extends to multimodal settings, pushing agent benchmarks toward more comprehensive understanding.

What does Video-MME-v2 introduce?

Video-MME-v2 is the next stage of the Video-MME video-understanding benchmark; it evaluates comprehensive multimodal capabilities across video tasks.

What gaps do Agentic-MME and others expose?

Agentic-MME tests agentic multimodal intelligence, while ViGoR, VideoZeroBench, MiroEval, and Dictatorship expose remaining evaluation gaps and persistent hallucinations.

What is PRBench?

PRBench benchmarks presentation and factuality, exposing illusions and hallucinations that conventional agent evaluations miss.

How is XAI being formalized?

According to work in npj Artificial Intelligence, explainable AI (XAI) needs formal foundations; such formalization advances the traceability of multimodal reasoning.

What role do ICML watermarks play?

Watermarking work presented at ICML lets evaluators detect and reject illicit or unattributed content, supporting defenses against hallucinations and misuse.
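The brief does not describe the watermarking scheme, so the following is a minimal sketch of one common family of text watermarks, the keyed "green-list" approach: a secret key partitions the vocabulary per token, and detection checks whether green tokens occur more often than chance. All names (`green_fraction`, `z_score`, `is_watermarked`) and the threshold are illustrative assumptions, not an API from the cited work.

```python
import hashlib

def green_fraction(tokens, key="secret-key"):
    """Fraction of token bigrams whose second token lands in the keyed 'green' half.

    Hypothetical sketch: the green list is derived by hashing the key and the
    previous token, as in green-list watermarking schemes.
    """
    hits = 0
    for prev, cur in zip(tokens, tokens[1:]):
        digest = hashlib.sha256(f"{key}:{prev}:{cur}".encode()).digest()
        if digest[0] % 2 == 0:  # token falls on the green half of the split
            hits += 1
    return hits / max(len(tokens) - 1, 1)

def z_score(frac, n, p=0.5):
    """One-proportion z-test: how far the green fraction sits above chance."""
    return (frac - p) * (n ** 0.5) / (p * (1 - p)) ** 0.5

def is_watermarked(tokens, key="secret-key", threshold=4.0):
    """Flag text whose green-token rate is statistically above chance."""
    n = max(len(tokens) - 1, 1)
    return z_score(green_fraction(tokens, key), n) > threshold
```

Unwatermarked text should hover near a 0.5 green fraction (z near 0), while watermarked generations pushed toward green tokens yield a large positive z, which an evaluator can use to reject content.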

What is FactReview?

FactReview advances agent and multimodal evaluation by checking outputs for hallucinations, and it integrates with trajectory retrieval to improve traceability.

What benchmarks focus on multimodal agents?

Benchmarks such as Agentic-MME, Video-MME-v2, and VideoZeroBench probe agentic multimodal reasoning and expose gaps in current defenses.
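None of these benchmarks' scoring protocols are described in the brief, so the sketch below shows only the generic shape of a multi-task benchmark harness: exact-match accuracy per task plus a macro average, the simplest way such suites report a headline number. The data layout and the `score_benchmark` name are assumptions, not any benchmark's actual format.

```python
def score_benchmark(predictions, gold):
    """Exact-match accuracy per task and the macro average across tasks.

    `gold` and `predictions` map task name -> {question_id -> answer string}.
    Matching is case-insensitive after stripping whitespace (an assumption).
    """
    per_task = {}
    for task, items in gold.items():
        preds = predictions.get(task, {})
        correct = sum(
            1 for qid, ans in items.items()
            if preds.get(qid, "").strip().lower() == ans.strip().lower()
        )
        per_task[task] = correct / len(items)
    macro = sum(per_task.values()) / len(per_task)
    return per_task, macro
```

A macro average weights every task equally, so a model cannot hide weak grounding or video reasoning behind strong performance on one large task split.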

In sum: Claw-Eval, FactReview, Video-MME-v2, and XAI formalization advance agent and multimodal evaluation; Agentic-MME, ViGoR, VideoZeroBench, MiroEval, and Dictatorship expose evaluation gaps; PRBench surfaces illusions; ICML watermarking work supports rejecting illicit content; and trajectory retrieval improves traceability.

Sources (22)
Updated Apr 8, 2026