Explosion of Agent Benchmarks and Architectures

Key Questions

What new agent benchmarks have been introduced recently?

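MLPerf Mobile v6.0, CODA-BENCH, Orchestra-o1, VisualClaw, GameCraft-Bench, and RNG-Bench extend evaluation to on-device GenAI, data-intensive code agents, omnimodal multi-agent orchestration, real-time physical agents, game-building agents, and multimodal LLM memory, respectively. EdgeBench (long-horizon learning), AgenticDataBench (344 tasks, 97 datasets), AgenticSTS (a bounded-memory testbed), PACE (proxy evaluation), DiscoBench (clarification-aware search), and EvoPolicyGym (autonomous policy evolution) widen coverage further.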

How are multi-agent systems demonstrating real-world performance gains?

A real-world collaboration experiment with more than 100 agents optimizing Gemma 4 inference produced emergent social norms and a 5x speed improvement. Empirical data from OpenAI/Codex indicates a broader shift from chatbots to agentic systems. On the single-model side, LoopCoder-v2 reaches 64.4% on SWE-bench with only 7B parameters.

What frameworks support skill discovery and orchestration in agents?

SkillOpt treats skills as differentiable, OpenClaw-Skill organizes them into structured skill trees, SkillWeaver routes them compositionally, and ASPIRE discovers them autonomously, with a 77% improvement on LIBERO-Pro. BioInsight applies multi-agent orchestration to interactive biomedical knowledge discovery, and DiscoPER uses iterative meta-reflection for autonomous scientific discovery.

Which memory and world-model techniques improve long-horizon agent performance?

MemGraph-RAG adds memory augmentation to graph-based RAG, while RNG-Bench tests the memory capabilities of multimodal LLMs. AgenticSTS provides a bounded-memory testbed for extended tasks, on which frontier LLMs get 0 wins. Together these target context retention in complex, sequential agent workflows.

How do local open-weight models compare to frontier systems on consumer hardware?

Sebastian Raschka tested local open-weight 30B MoE models running at 40 tokens per second on consumer hardware, reportedly matching GPT-5.5 Pro. Alibaba's Qwen-Image-Agent bridges context gaps to improve image generation. Such results point to the growing accessibility of high-performance agentic capabilities.

What benchmarks evaluate clarification and search behaviors in agents?

DiscoBench measures when search agents should request clarification during deep searches, with the aim of improving their decision-making. Bounded-memory and long-horizon scenarios are covered by AgenticSTS and EdgeBench. PACE complements these as a proxy evaluation of overall agentic capability, with a reported MAE under 4%.
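A proxy evaluation is typically judged by how closely its predicted scores track the scores from the full benchmark, and mean absolute error (MAE) is the average size of that gap. The sketch below illustrates the calculation only. The scores are made-up values, the function name is hypothetical, and treating the error as percentage points is an assumption; none of it is taken from PACE itself.

```python
from statistics import mean


def mean_absolute_error(predicted, actual):
    """Average absolute gap between proxy predictions and full-benchmark scores."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    return mean(abs(p - a) for p, a in zip(predicted, actual))


# Hypothetical scores for four agents, in percentage points (not real data).
proxy_scores = [61.0, 48.5, 72.0, 35.5]  # estimated by a cheap proxy task set
full_scores = [64.0, 46.0, 70.5, 39.0]   # measured on the full benchmark

mae = mean_absolute_error(proxy_scores, full_scores)
print(f"MAE: {mae:.3f} percentage points")
```

With these illustrative numbers the absolute gaps are 3.0, 2.5, 1.5, and 3.5, so the MAE is 2.625 points, which would fall under a 4-point threshold.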

How is autonomous post-training being applied to improve language models?

AutoTrainess autonomously improves LMs through agentic post-training workflows, without heavy human intervention. Related agentic systems include PaperOrchestra for research paper pipelines and Data Journalist Agent for multimodal data stories. The aim is to reduce bottlenecks in iterative model refinement.

What robotics-focused agent benchmarks show sim-to-real transfer?

ASPIRE achieves a 77% improvement on LIBERO-Pro via autonomous skill discovery for robotics, with demonstrated zero-shot generalization. VisualClaw and GameCraft-Bench target real-time physical agents and game-building agents, respectively. Separately, empirical results indicate emerging social norms in large-scale agent collectives.

Rapid expansion of agent evaluation and frameworks. New today: MLPerf Mobile v6.0 with GenAI on-device benchmarks, CODA-BENCH for data-intensive code agents, Orchestra-o1 for omnimodal multi-agent orchestration, VisualClaw for real-time physical agents, PaperOrchestra for research paper pipelines, Data Journalist Agent for multimodal data stories, SkillOpt treating skills as differentiable, GameCraft-Bench for game-building agents, OpenClaw-Skill for structured skill trees, and LoSoNA for LLM social norms. Also LoopCoder-v2 (7B, 64.4% on SWE-bench), MemGraph-RAG for memory-augmented graph RAG, RNG-Bench for multimodal LLM memory, a VLM camera-control benchmark, and SkillWeaver for compositional skill routing.

Also new today: AgenticDataBench (344 tasks, 97 datasets), AgenticSTS (bounded-memory testbed, frontier LLMs get 0 wins), EdgeBench (long-horizon learning), DiscoBench (clarification-aware search), EvoPolicyGym (autonomous policy evolution), and PACE (proxy evaluation with <4% MAE).

A real-world multi-agent collaboration experiment with 100+ agents optimizing Gemma 4 inference showed emergent social norms and a 5x speed improvement. Empirical data from OpenAI/Codex indicates a shift from chatbots to agentic systems. Sebastian Raschka tests local open-weight LLMs (30B MoE) at 40 tok/sec on consumer hardware, matching GPT-5.5 Pro. Alibaba's Qwen-Image-Agent bridges context gaps for image generation.

New additions: AutoTrainess autonomously improves LMs via agentic post-training workflows; BioInsight applies multi-agent orchestration to interactive biomedical knowledge discovery; ASPIRE achieves a 77% improvement on LIBERO-Pro via autonomous skill discovery for robotics; DiscoPER uses iterative meta-reflection for autonomous scientific discovery, recovering 8/9 ecological patterns.
