**Targeted benchmarks & diversified evaluation**
Key Questions
What is the core of Targeted benchmarks & diversified evaluation?
It features benchmarks such as ClawArena, Agentic-MME, AgentHazard, VideoZeroBench, and Y C-Bench, covering multimodal, physics, agentic, video, long-horizon, safety, web, math, medical, and multilingual reliability.
What is ClawArena?
ClawArena benchmarks AI agents in evolving information environments.
What does VideoZeroBench test?
VideoZeroBench probes the limits of video MLLMs through spatio-temporal evidence verification.
What is AgentHazard?
The AgentHazard benchmark reveals high failure rates of computer-use agents under safety tests.
What is HippoCamp?
HippoCamp benchmarks contextual agents on personal computers.
What is MiroEval?
MiroEval benchmarks multimodal deep-research agents on both process and outcome.
What eval focuses on AI-written papers?
The Paper Reconstruction Evaluation assesses presentation quality and hallucination in AI-generated papers.
What is the status of these benchmarks?
They are still developing, tracing reliability across multimodal, physics, agentic, video, long-horizon, safety, web, math, medical, and multilingual tasks; newer entries include ARC-AGI-3, Omni-WorldBench, WR-Arena (including math-generation failures), and RLVR.
Full benchmark list: ClawArena; Reconstruction Evaluation (AI papers); Agentic-MME (wild skills); AgentHazard; Video XAI; Vision2Web; Video-MME-v2; MDPBench; psychosis; VideoZeroBench; Y C-Bench; Dictatorship; MiroEval; HippoCamp; PerceptionComp; ViGoR; Real-3DQA; MonitorBench; GEditBench; ARC-AGI-3; Omni-WorldBench; WR-Arena (including math-generation failures); RLVR. Together these trace multimodal, physics, agentic, video, long-horizon, safety, web, math, medical, and multilingual reliability.