**Targeted benchmarks & diversified evaluation**

Key Questions

What is the core of Targeted benchmarks & diversified evaluation?

It centers on targeted benchmarks such as ClawArena, Agentic-MME, AgentHazard, VideoZeroBench, and Y C-Bench, which together track model reliability across multimodal, physics, agentic, video, long-horizon, safety, web, math, medical, and multilingual tasks.

What is ClawArena?

ClawArena benchmarks AI agents in evolving information environments.

What does VideoZeroBench test?

VideoZeroBench probes the limits of video MLLMs using spatio-temporal evidence verification.

What is AgentHazard?

The AgentHazard benchmark reveals high failure rates for computer-use agents in safety tests.

What is HippoCamp?

HippoCamp benchmarks contextual agents on personal computers.

What is MiroEval?

MiroEval benchmarks multimodal deep research agents on both process and outcome.

Which evaluation focuses on AI-written papers?

Paper Reconstruction Evaluation assesses presentation and hallucination in AI-generated papers.

What is the status of these benchmarks?

They are still developing. The set tracks multimodal, physics, agentic, video, long-horizon, safety, web, math, medical, and multilingual reliability, and includes ARC-AGI-3, Omni-WorldBench, and WR-Arena, along with math generalization failures and RLVR.

Items covered: ClawArena, Paper Reconstruction Evaluation (AI-written papers), Agentic-MME wild skills, AgentHazard, Video XAI, Vision2Web, Video-MME-v2, MDPBench, psychosis risk assessment, VideoZeroBench, Y C-Bench, Dictatorship, MiroEval, HippoCamp, PerceptionComp, ViGoR, Real-3DQA, MonitorBench, GEditBench, ARC-AGI-3, Omni-WorldBench, WR-Arena (including math generalization failures), and RLVR. Together they trace multimodal, physics, agentic, video, long-horizon, safety, web, math, medical, and multilingual reliability.

Sources (15)

Updated Apr 8, 2026

AI Research Digest

How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

@GaryMarcus reposted: Paper below tested a variety of base LLMs (no TTA) on generalization-focus math ...

ClawArena: Benchmarking AI Agents in Evolving Information Environments

@_akhaliq: Paper Reconstruction Evaluation Evaluating Presentation and Hallucination in AI-written Papers pap...

[PDF] Evaluating Large Language Models for Assessment of Psychosis Risk

@Miles_Brundage reposted: Today, I'm releasing the first eval meant to test whether frontier models will h...

@omarsar0: Can an AI agent run a startup for a year without going bankrupt? Turns out most can't. New benchma...

VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

Proactive Agent Research Environment: Simulating Active Users to Evaluate Proactive Assistants

Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

@MeganRisdal: Tune in now for insights from authors of DeepMind's latest AGI paper and tips for the community buil...

HippoCamp: Benchmarking Contextual Agents on Personal Computers

MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification
