**Agent evaluation & traceability/self-improvement (OpenResearcher, WildWorld, ARC-AGI-3, MemFactory, Proactive Env, ADeLe, Xpertbench)**
Key Questions
What is test-time learnable adaptation for language agents?
Learning to Learn-at-Test-Time introduces learnable adaptation policies that let language agents keep learning during inference. The aim is better performance on dynamic tasks without retraining the underlying model.
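To make the idea of adapting during inference concrete, here is a minimal toy sketch. It is not the method from Learning to Learn-at-Test-Time: the `StrategyAdapter` class, the strategy names, and the reward values are all hypothetical. It only illustrates the general pattern of updating a small piece of state from feedback at test time while model weights stay fixed.

```python
from dataclasses import dataclass, field


@dataclass
class StrategyAdapter:
    """Toy agent state that adapts at inference time (hypothetical).

    No model weights change; only per-strategy statistics are updated
    from feedback observed while the agent is running.
    """
    strategies: list
    counts: dict = field(default_factory=dict)
    means: dict = field(default_factory=dict)

    def __post_init__(self):
        for s in self.strategies:
            self.counts[s] = 0
            self.means[s] = 0.0

    def choose(self):
        # Try every strategy once, then exploit the best running mean.
        for s in self.strategies:
            if self.counts[s] == 0:
                return s
        return max(self.strategies, key=lambda s: self.means[s])

    def update(self, strategy, reward):
        # Incremental mean update from feedback seen at test time.
        self.counts[strategy] += 1
        self.means[strategy] += (reward - self.means[strategy]) / self.counts[strategy]


def run_episode(adapter, reward_fn, steps):
    """Run a fixed number of steps and return the strategies chosen."""
    chosen = []
    for _ in range(steps):
        s = adapter.choose()
        adapter.update(s, reward_fn(s))
        chosen.append(s)
    return chosen


# Assumed environment: made-up, deterministic rewards per strategy.
rewards = {"direct": 0.2, "plan_first": 0.9, "ask_user": 0.5}
adapter = StrategyAdapter(strategies=list(rewards))
history = run_episode(adapter, rewards.get, steps=6)
print(history)
```

After trying each strategy once, the toy agent settles on the one with the highest observed reward. A learned adaptation policy would replace the hand-written `choose` and `update` rules with something trained, which this sketch does not attempt.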
What does ByteRover achieve in long-horizon tasks?
ByteRover reports a score of 96.1% on long-horizon agent tasks, a marker of progress on agent benchmarks.
How do agents perform on ARC-AGI-3?
Reported scores on ARC-AGI-3 are under 1%, which points to persistent difficulty on general-intelligence benchmarks. The benchmark probes the limits of agent reasoning.
What is Xpertbench?
Xpertbench evaluates agents on expert-level tasks using rubric-based scoring. Its focus is on traceability and self-improvement.
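As a rough illustration of rubric-based scoring in general, the sketch below combines per-criterion points into one weighted score and keeps a per-criterion breakdown for traceability. The `Criterion` type, the criterion names, the weights, and the point scales are assumptions made for this example and are not taken from Xpertbench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric line (hypothetical schema)."""
    name: str
    weight: float
    max_points: int


def rubric_score(rubric, awarded):
    """Return a weighted score in [0, 1] and a per-criterion breakdown."""
    total_weight = sum(c.weight for c in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    breakdown = {}
    score = 0.0
    for c in rubric:
        # Criteria the grader did not score count as zero points.
        points = awarded.get(c.name, 0)
        if not 0 <= points <= c.max_points:
            raise ValueError(f"points for {c.name!r} are out of range")
        fraction = points / c.max_points
        breakdown[c.name] = fraction
        score += c.weight * fraction
    return score / total_weight, breakdown


# Assumed rubric and grader output, for illustration only.
rubric = [
    Criterion("correctness", weight=0.5, max_points=4),
    Criterion("evidence", weight=0.3, max_points=2),
    Criterion("clarity", weight=0.2, max_points=2),
]
awarded = {"correctness": 3, "evidence": 2, "clarity": 1}
score, breakdown = rubric_score(rubric, awarded)
print(round(score, 3), breakdown)
```

Returning the breakdown alongside the total lets a reviewer trace a low score back to the criterion that caused it.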
What are WildWorld and OpenResearcher?
WildWorld benchmarks agent skills in realistic settings, while OpenResearcher (with OmniMEM) evaluates research agents. Both reveal performance gaps.
How do single agents compare with multi-agent systems under a token budget, according to Stanford?
Stanford finds that single agents outperform multi-agent systems when both are held to the same token budget. The result informs agent design trade-offs and argues for budget-matched evaluation.
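The sketch below shows one way budget-matched accounting could work: both setups draw from the same token limit, and the multi-agent setup also pays for coordination messages. It does not reproduce the Stanford result. The `TokenBudget` class, the per-step and per-message costs, and the assumption that coordination overhead is what eats the budget are all hypothetical.

```python
class TokenBudget:
    """Shared token allowance (hypothetical helper)."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def spend(self, tokens):
        # Refuse any spend that would exceed the limit.
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True


def single_agent_steps(budget, tokens_per_step):
    """Count reasoning steps one agent can afford."""
    steps = 0
    while budget.spend(tokens_per_step):
        steps += 1
    return steps


def multi_agent_steps(budget, n_agents, tokens_per_step, tokens_per_message):
    """Count reasoning steps when each step is followed by a message."""
    steps = 0
    while True:
        for _ in range(n_agents):
            if not budget.spend(tokens_per_step):
                return steps
            steps += 1
            # Assumed overhead: each agent reports back after its step.
            if not budget.spend(tokens_per_message):
                return steps


# Made-up costs; only the matched limit matters for the comparison.
single = single_agent_steps(TokenBudget(10_000), tokens_per_step=500)
multi = multi_agent_steps(
    TokenBudget(10_000), n_agents=3, tokens_per_step=500, tokens_per_message=200
)
print(single, multi)
```

Under these assumed costs the single agent gets more reasoning steps from the same budget. Whether extra steps translate into better task performance is an empirical question this toy does not answer.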
What is Cog-DRIFT?
Cog-DRIFT addresses exploration barriers in RLVR (reinforcement learning with verifiable rewards) using drift correction, with the goal of stronger agent reasoning. It targets inefficient exploration during reinforcement learning.
What self-improvement techniques are highlighted?
Self-Exec uses simulated execution to verify and improve the output of coding LLMs, LightThinker++ handles memory management, and ThinkTwice optimizes reasoning and self-refinement. MemFactory and Proactive Env support traceability.
Test-time learnable adaptation policies for language agents; Agent Harness survey highlights infra limits; ByteRover 96.1% long-horizon; OmniMEM; ARC-AGI-3 <1%; Xpertbench rubrics; WildWorld/OpenResearcher; MIT cloned workers 50%; AMA-Bench/BIGMAS; Cog-DRIFT RLVR exploration fix; Stanford single>multi agents on token budget; LightThinker++ memory mgmt; Self-Exec coding verification.