New Benchmarks and Indexes
Key Questions
What is WorldMemArena designed to evaluate?
WorldMemArena contains 400 tasks focused on diagnosing multimodal agent memory through action-world interactions. It helps identify memory-related failure modes in agents.
How does OmniInteract benchmark real-time assistants?
OmniInteract evaluates real-time streaming interaction for omnimodal assistants, with the best model reaching an IA-QTF1 score of 0.368. It emphasizes continuous, low-latency performance.
What environments does PhoneWorld provide for agents?
PhoneWorld offers scalable phone-use agent environments that simulate realistic mobile interactions. It supports large-scale training and evaluation of device agents.
What does TerminalWorld measure in agent performance?
TerminalWorld measures how well agents complete terminal-based tasks; the best current model scores 62.5%. It serves as a challenging benchmark for coding and command-line agents.
How does automated benchmark auditing affect rankings?
Automated auditing shifts model rankings by approximately 10% on benchmarks such as SWE-bench and Terminal-Bench. It reveals inconsistencies in existing evaluation protocols.
What is unique about the WBench benchmark?
WBench is a comprehensive multi-turn benchmark for interactive video world models where no single model currently dominates. It stresses long-horizon consistency.
Which benchmark focuses on healthcare agents?
CHI-Bench contains 75 long-horizon healthcare tasks and is the first such benchmark hosted on Hugging Face. It evaluates agents on complex medical workflows.
What does Claw-Anything test in personal assistants?
Claw-Anything benchmarks always-on assistants with broad access to a user's digital world, where GPT-5.5 scores only 34.5% pass@1. It highlights gaps in persistent agent capabilities.
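For readers unfamiliar with the pass@1 figure quoted above, here is a minimal sketch of how a pass@k score is commonly computed in agent and code-generation evaluations. The source does not say how Claw-Anything itself scores runs, so treat this as an assumption: it uses the widely cited unbiased estimator 1 - C(n-c, k) / C(n, k), and the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n attempts, c of which passed.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k),
    not necessarily the scoring rule used by any benchmark named here.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average per-task pass@k over (n_attempts, n_passed) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single attempt per task, pass@1 reduces to the fraction of tasks solved on the first try, so a score of 34.5% would mean roughly one task in three succeeds without retries.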
WorldMemArena (400 tasks, multimodal agent memory diagnosis); OmniInteract (real-time streaming interaction, best IA-QTF1 0.368); PhoneWorld (scalable phone-use agent environments); TerminalWorld (62.5% max); VGenST-Bench; MetaphorVU (ICML 2026 spotlight); Automated Benchmark Auditing (shifts rankings ~10% on SWE-bench/Terminal-Bench); WBench (no single model dominates); SkillEvolBench; Claw-Anything (GPT-5.5 only 34.5% pass@1); EvalVerse (cinematic video); CHI-Bench (healthcare agents, 75 tasks); LongAV-Compass (284 test cases); Trajel (trajectory-level hallucination auditing). The ResearchMath-14K dataset was also added.