AI Research & Policy Brief

Evaluation, hallucinations, defenses and tooling (benchmarks & traceability)

Key Questions

What is Claw-Eval in multimodal context?

Claw-Eval is presented as a benchmark for trustworthy agent evaluation that extends to multimodal settings, pushing agent benchmarks toward more comprehensive understanding.

What does Video-MME-v2 introduce?

Video-MME-v2 is the next stage of the Video-MME video-understanding benchmark; it evaluates comprehensive multimodal capabilities across video tasks.

What gaps do Agentic-MME and others expose?

Agentic-MME tests agentic multimodal intelligence, while ViGoR, VideoZeroBench, MiroEval, and Dictatorship expose remaining evaluation gaps and persistent hallucinations.

What is PRBench?

PRBench benchmarks presentation and factuality, exposing illusions and hallucinations that conventional agent evaluations miss.

How is XAI being formalized?

According to work in npj Artificial Intelligence, explainable AI (XAI) needs formal foundations; such formalization advances the traceability of multimodal reasoning.

What role do ICML watermarks play?

Watermarking work presented at ICML lets evaluators detect and reject illicit or unattributed content, supporting defenses against hallucinations and misuse.
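The brief does not describe the watermarking scheme, so the following is a minimal sketch of one common family of text watermarks, the keyed "green-list" approach: a secret key partitions the vocabulary per token, and detection checks whether green tokens occur more often than chance. All names (`green_fraction`, `z_score`, `is_watermarked`) and the threshold are illustrative assumptions, not an API from the cited work.

```python
import hashlib

def green_fraction(tokens, key="secret-key"):
    """Fraction of token bigrams whose second token lands in the keyed 'green' half.

    Hypothetical sketch: the green list is derived by hashing the key and the
    previous token, as in green-list watermarking schemes.
    """
    hits = 0
    for prev, cur in zip(tokens, tokens[1:]):
        digest = hashlib.sha256(f"{key}:{prev}:{cur}".encode()).digest()
        if digest[0] % 2 == 0:  # token falls on the green half of the split
            hits += 1
    return hits / max(len(tokens) - 1, 1)

def z_score(frac, n, p=0.5):
    """One-proportion z-test: how far the green fraction sits above chance."""
    return (frac - p) * (n ** 0.5) / (p * (1 - p)) ** 0.5

def is_watermarked(tokens, key="secret-key", threshold=4.0):
    """Flag text whose green-token rate is statistically above chance."""
    n = max(len(tokens) - 1, 1)
    return z_score(green_fraction(tokens, key), n) > threshold
```

Unwatermarked text should hover near a 0.5 green fraction (z near 0), while watermarked generations pushed toward green tokens yield a large positive z, which an evaluator can use to reject content.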

What is FactReview?

FactReview advances agent and multimodal evaluation by checking outputs for hallucinations, and it integrates with trajectory retrieval to improve traceability.

What benchmarks focus on multimodal agents?

Benchmarks such as Agentic-MME, Video-MME-v2, and VideoZeroBench probe agentic multimodal reasoning and expose gaps in current defenses.
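None of these benchmarks' scoring protocols are described in the brief, so the sketch below shows only the generic shape of a multi-task benchmark harness: exact-match accuracy per task plus a macro average, the simplest way such suites report a headline number. The data layout and the `score_benchmark` name are assumptions, not any benchmark's actual format.

```python
def score_benchmark(predictions, gold):
    """Exact-match accuracy per task and the macro average across tasks.

    `gold` and `predictions` map task name -> {question_id -> answer string}.
    Matching is case-insensitive after stripping whitespace (an assumption).
    """
    per_task = {}
    for task, items in gold.items():
        preds = predictions.get(task, {})
        correct = sum(
            1 for qid, ans in items.items()
            if preds.get(qid, "").strip().lower() == ans.strip().lower()
        )
        per_task[task] = correct / len(items)
    macro = sum(per_task.values()) / len(per_task)
    return per_task, macro
```

A macro average weights every task equally, so a model cannot hide weak grounding or video reasoning behind strong performance on one large task split.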

In sum: Claw-Eval, FactReview, Video-MME-v2, and XAI formalization advance agent and multimodal evaluation; Agentic-MME, ViGoR, VideoZeroBench, MiroEval, and Dictatorship expose evaluation gaps; PRBench surfaces illusions; ICML watermarking work supports rejecting illicit content; and trajectory retrieval improves traceability.

Sources (22)
Updated Apr 8, 2026