AI Research Radar

Agent evaluation & traceability/self-improvement (OpenResearcher, WildWorld, ARC-AGI-3, MemFactory, Proactive Env, ADeLe, Xpertbench)

Key Questions

What is test-time learnable adaptation for language agents?

Learning to Learn-at-Test-Time introduces adaptation policies that let agents learn during inference, improving performance on dynamic tasks without retraining.
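The idea can be illustrated with a minimal sketch (names and mechanism are hypothetical, not from the paper): an agent keeps a lightweight memory that it updates from feedback during inference, with no weight updates.

```python
# Illustrative toy of test-time adaptation: the agent records which
# strategy succeeded on a task type and reuses it on later queries.
# AdaptiveAgent, solve, and feedback are invented names for this sketch.

class AdaptiveAgent:
    def __init__(self):
        # In-context memory built at inference time: task_type -> strategy.
        self.memory = {}

    def solve(self, task_type, strategies):
        # Prefer a strategy that already worked on this task type.
        if task_type in self.memory:
            return self.memory[task_type]
        # No experience yet: fall back to the first candidate.
        return strategies[0]

    def feedback(self, task_type, strategy, success):
        # Test-time update: remember strategies that succeeded.
        if success:
            self.memory[task_type] = strategy


agent = AdaptiveAgent()
first = agent.solve("sorting", ["bubble", "merge"])   # no memory yet
agent.feedback("sorting", "merge", success=True)      # learn during inference
second = agent.solve("sorting", ["bubble", "merge"])  # adapted choice
print(first, second)  # bubble merge
```

The key property, mirrored in the toy, is that adaptation happens entirely at inference time: behavior changes without any retraining of the underlying model.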

What does ByteRover achieve in long-horizon tasks?

ByteRover reaches 96.1% on long-horizon agent tasks, marking notable progress on agent evaluation and self-improvement benchmarks.

How does ARC-AGI-3 perform?

Current agents score under 1% on ARC-AGI-3, indicating that general-intelligence benchmarks remain far from solved. The benchmark probes the limits of agent reasoning.

What is Xpertbench?

Xpertbench uses rubrics-based evaluation for expert-level tasks, assessing agent capabilities rigorously. It focuses on traceability and self-improvement.
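Rubric-based evaluation can be sketched as a weighted checklist (the rubric items and weights below are invented for illustration and are not taken from Xpertbench):

```python
# Hypothetical rubric scorer: each criterion is a pass/fail check with a
# weight; the score is the weighted fraction of criteria satisfied.

def score_with_rubric(response_checks, rubric):
    """Weighted sum of passed rubric criteria, normalized to [0, 1]."""
    total = sum(weight for _, weight in rubric)
    earned = sum(weight for name, weight in rubric
                 if response_checks.get(name, False))
    return earned / total

# Example rubric for an expert-level research answer (illustrative only).
rubric = [("cites_evidence", 2.0), ("correct_method", 3.0), ("clear_writing", 1.0)]
checks = {"cites_evidence": True, "correct_method": True, "clear_writing": False}
print(round(score_with_rubric(checks, rubric), 2))  # 0.83
```

Because each criterion is scored separately, a rubric makes the evaluation traceable: a low score can be attributed to specific failed checks rather than a single opaque grade.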

What are WildWorld and OpenResearcher?

WildWorld benchmarks agent skills in realistic settings, while OpenResearcher (with OmniMEM) evaluates research agents. Both reveal large performance gaps between agents and the demands of real-world tasks.

How effective are single vs. multi-agents on token budgets per Stanford?

A Stanford study finds that single agents outperform multi-agent systems under fixed token budgets, informing efficiency trade-offs in agent design.

What is Cog-DRIFT?

Cog-DRIFT breaks exploration barriers in reinforcement learning with verifiable rewards (RLVR) using drift correction, enhancing agent reasoning by addressing exploration inefficiencies.

What self-improvement techniques are highlighted?

Self-Exec simulation improves coding LLMs, LightThinker++ manages agent memory, and ThinkTwice optimizes reasoning and self-refinement. MemFactory and Proactive Env support traceability.
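Execution-based self-verification, the general technique behind approaches like Self-Exec, can be sketched as follows (this is an illustrative pattern, not the paper's implementation; `verify_by_execution` is an invented name):

```python
# Hedged sketch of execution-based verification for generated code:
# run a candidate definition and keep it only if it passes the tests.

def verify_by_execution(candidate_src, test_cases, func_name):
    """Exec candidate code, then check it against (args, expected) pairs."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # define the generated function
        fn = namespace[func_name]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False  # any crash counts as a failed verification

candidate = "def add(a, b):\n    return a + b\n"
tests = [((1, 2), 3), ((-1, 1), 0)]
print(verify_by_execution(candidate, tests, "add"))  # True
```

The verification signal (pass/fail) can then drive self-improvement, e.g. by resampling or refining candidates that fail, without requiring a human grader in the loop.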

- Test-time learnable adaptation policies for language agents
- Agent Harness survey highlights infrastructure limits
- ByteRover: 96.1% on long-horizon tasks
- OmniMEM
- ARC-AGI-3: agents score <1%
- Xpertbench: rubric-based evaluation
- WildWorld / OpenResearcher
- MIT cloned workers: 50%
- AMA-Bench / BIGMAS
- Cog-DRIFT: RLVR exploration fix
- Stanford: single agents outperform multi-agents on a token budget
- LightThinker++: memory management
- Self-Exec: coding verification

Sources (21)
Updated Apr 8, 2026