AI Breakthrough Tracker

New Benchmarks and Indexes

Key Questions

What is WorldMemArena designed to evaluate?

WorldMemArena contains 400 tasks focused on diagnosing multimodal agent memory through action-world interactions. It helps identify memory-related failure modes in agents.

How does OmniInteract benchmark real-time assistants?

OmniInteract evaluates real-time streaming interaction for omnimodal assistants, with the best model reaching an IA-QTF1 score of 0.368. It emphasizes continuous, low-latency performance.

What environments does PhoneWorld provide for agents?

PhoneWorld offers scalable phone-use agent environments that simulate realistic mobile interactions. It supports large-scale training and evaluation of device agents.

What does TerminalWorld measure in agent performance?

TerminalWorld measures agent performance on terminal-based tasks, where the best current model reaches 62.5%. It serves as a challenging benchmark for coding and command-line agents.

How does automated benchmark auditing affect rankings?

Automated auditing shifts model rankings by approximately 10% on benchmarks such as SWE-bench and Terminal-Bench. It reveals inconsistencies in existing evaluation protocols.

What is unique about the WBench benchmark?

WBench is a comprehensive multi-turn benchmark for interactive video world models where no single model currently dominates. It stresses long-horizon consistency.

Which benchmark focuses on healthcare agents?

CHI-Bench contains 75 long-horizon healthcare tasks and is the first such benchmark hosted on Hugging Face. It evaluates agents on complex medical workflows.

What does Claw-Anything test in personal assistants?

Claw-Anything benchmarks always-on assistants with broad access to a user's digital world, where GPT-5.5 scores only 34.5% pass@1. It highlights gaps in persistent agent capabilities.
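
For readers unfamiliar with the pass@1 figure quoted above, the following is a minimal sketch of how pass@k is commonly estimated from repeated attempts per task. It assumes the widely used unbiased estimator 1 - C(n-c, k) / C(n, k); the source does not say how Claw-Anything computes its score, and the function names `pass_at_k` and `mean_pass_at_k` are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n attempts, c of which passed.

    Uses the common unbiased estimator 1 - C(n - c, k) / C(n, k).
    Assumption: this is the generic formula, not a benchmark-specific one.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures means every size-k sample contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each entry is (attempts, passes)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to the plain success rate c / n per task, so a benchmark-level pass@1 is the average fraction of first attempts that succeed.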

WorldMemArena (400 tasks, multimodal agent memory diagnosis); OmniInteract (real-time streaming interaction, best IA-QTF1 0.368); PhoneWorld (scalable phone-use agent environments); TerminalWorld (62.5% max); VGenST-Bench; MetaphorVU (ICML 2026 spotlight); Automated Benchmark Auditing (shifts rankings ~10% on SWE-bench/Terminal-Bench); WBench (no single model dominates); SkillEvolBench; Claw-Anything (GPT-5.5 only 34.5% pass@1); EvalVerse (cinematic video); CHI-Bench (healthcare agents, 75 tasks); LongAV-Compass (284 test cases); Trajel (trajectory-level hallucination auditing). ResearchMath-14K dataset added.

Updated May 29, 2026