New Benchmarks and Indexes

Key Questions

What is WorldMemArena designed to evaluate?

WorldMemArena contains 400 tasks focused on diagnosing multimodal agent memory through action-world interactions. It helps identify memory-related failure modes in agents.

How does OmniInteract benchmark real-time assistants?

OmniInteract evaluates real-time streaming interaction for omnimodal assistants, with the best model reaching an IA-QTF1 score of 0.368. It emphasizes continuous, low-latency performance.

What environments does PhoneWorld provide for agents?

PhoneWorld offers scalable phone-use agent environments that simulate realistic mobile interactions. It supports large-scale training and evaluation of device agents.

What does TerminalWorld measure in agent performance?

TerminalWorld benchmarks agents on real-world terminal tasks, where the best current model reaches 62.5%. It serves as a challenging benchmark for coding and command-line agents.

How does automated benchmark auditing affect rankings?

Automated auditing shifts model rankings by approximately 10% on benchmarks such as SWE-bench and Terminal-Bench. It reveals inconsistencies in existing evaluation protocols.

What is unique about the WBench benchmark?

WBench is a comprehensive multi-turn benchmark for interactive video world models where no single model currently dominates. It stresses long-horizon consistency.

Which benchmark focuses on healthcare agents?

CHI-Bench contains 75 long-horizon healthcare tasks. Its announcement on Hugging Face describes it as the world's first long-horizon healthcare benchmark. It evaluates agents on complex medical workflows.

What does Claw-Anything test in personal assistants?

Claw-Anything benchmarks always-on assistants with broad access to a user's digital world, where GPT-5.5 scores only 34.5% pass@1. It highlights gaps in persistent agent capabilities.

WorldMemArena (400 tasks, multimodal agent memory diagnosis), OmniInteract (real-time streaming interaction, best IA-QTF1 0.368), PhoneWorld (scalable phone-use agent environments), TerminalWorld (62.5% max), VGenST-Bench, MetaphorVU (ICML 2026 spotlight), Automated Benchmark Auditing (shifts rankings ~10% on SWE-bench/Terminal-Bench), WBench (no single model dominates), SkillEvolBench, Claw-Anything (GPT-5.5 only 34.5% pass@1), EvalVerse for cinematic video, CHI-Bench for healthcare agents (75 tasks), LongAV-Compass (284 test cases), and Trajel for trajectory-level hallucination auditing. ResearchMath-14K dataset added.

Sources (21)

Updated May 29, 2026

AI Breakthrough Tracker

PhoneWorld: Scaling Phone-Use Agent Environments

OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

ResearchMath-14K: Scaling Research-Level Mathematics via Agents

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

@ClementDelangue reposted: Introducing CHI-Bench on @huggingface: the world’s first long-horizon healthcare...

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Automated Benchmark Auditing for AI Agents and Large Language Models

MetaphorVU: Towards Metaphorical Video Understanding

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

@fchollet reposted: We saw our first meaningful jump in the ARC-AGI-3 competition today @tufalabs w...

@GaryMarcus reposted: Essential context on OpenAI’s Erdos result

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

Benchmarking Spatial Intelligence under Visual Degradation

Benchmarking Large Language Models and Prompt Engineering ...

@huggingface reposted: 🌍Today we release Mosaic, a probabilistic weather model that shifts the Pareto f...

LLM Peer Reviewers Beat Humans on Nature Papers