RL Advances for Agents and Reasoning

Key Questions

What new benchmarks evaluate agent capabilities?

EvoPolicyGym tests autonomous policy evolution with trajectory diagnostics. DiscoBench evaluates clarification-aware deep search. PACE offers a cheap proxy for agentic capability, reporting 4% mean absolute error (MAE) and a Spearman correlation of 0.80 or higher.
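To make the two PACE figures concrete, here is a minimal sketch of how MAE and Spearman rank correlation could be computed for a proxy score against a full benchmark score. The score lists and function names are hypothetical illustrations, not data or code from PACE.

```python
from statistics import mean


def mean_absolute_error(predicted, actual):
    """Average absolute gap between proxy predictions and true scores."""
    return mean(abs(p - a) for p, a in zip(predicted, actual))


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five models (fractions of tasks solved).
proxy_scores = [0.42, 0.55, 0.61, 0.70, 0.78]      # cheap atomic evaluations
benchmark_scores = [0.40, 0.58, 0.57, 0.74, 0.80]  # full agentic benchmark

print("MAE:", mean_absolute_error(proxy_scores, benchmark_scores))
print("Spearman:", spearman(proxy_scores, benchmark_scores))
```

MAE measures how far the proxy is from the true score in absolute terms, while Spearman only checks whether the proxy ranks models in the same order, so a proxy can do well on one and poorly on the other.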

How does DiscoBench assess search agents?

It tests clarification-aware deep search and shows that repeated searching can sometimes perform worse than simple guessing on certain tasks.

What performance results are reported for SU-01?

SU-01, a 30B model, achieved IMO gold-medal-level performance on mathematical reasoning tasks.

What does research say about multi-agent teams?

Studies indicate that multi-agent teams can sometimes degrade overall performance compared to single agents, depending on task coordination.

How does Valdi combine world models with reinforcement learning?

Valdi integrates value diffusion with world models to enable faster model predictive control (MPC) for agent decision-making.
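For readers unfamiliar with MPC over a world model, the following is a generic random-shooting sketch with a terminal value estimate, which is one common way to keep the planning horizon short. It does not reproduce Valdi's value-diffusion method; the dynamics, reward, and value functions are hypothetical toy stand-ins.

```python
import random


def world_model(state, action):
    """Hypothetical learned dynamics: a 1-D point shifted by the action."""
    return state + action


def step_reward(state):
    """Hypothetical reward: being closer to the origin is better."""
    return -abs(state)


def value_estimate(state):
    """Hypothetical value function scoring the state at the end of the horizon."""
    return -abs(state)


def plan(state, horizon=3, n_candidates=256, rng=None):
    """Return the first action of the best sampled action sequence."""
    rng = rng or random.Random(0)
    best_return, best_first_action = float("-inf"), 0.0
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s = world_model(s, a)   # imagine the next state
            total += step_reward(s)
        total += value_estimate(s)  # bootstrap beyond the short horizon
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action


# Receding-horizon control: replan at every step, execute only the first action.
state = 2.0
for _ in range(5):
    state = world_model(state, plan(state))
```

The terminal value term lets the planner roll out only a few imagined steps while still accounting for long-term return, which is the general reason value estimates are paired with world models in MPC.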

New benchmarks and tools: EvoPolicyGym evaluates autonomous policy evolution through feedback, with trajectory diagnostics. DiscoBench tests clarification-aware deep search and finds that repeated searching can be worse than guessing. PACE provides a proxy for agentic capability built from cheap atomic evaluations (4% MAE, 0.80+ Spearman).

Also covered: SU-01, a 30B model, at IMO gold-medal level; a scaling harness; evidence that multi-agent teams can degrade performance; AGORA (59.4%), Terminal-Bench 2.0, and CODA-BENCH; an Alibaba world model that improves agents; SkillWeaver; Confidence-Aware Tool Orchestration (56.4%); Why Multi-Step Tool-Use RL collapses; Diagnosing Task Insensitivity in Language Agents; Task-Perturbed NLL Optimization; AutoTrainess, which automates LM post-training; and Valdi, which combines value diffusion with world models for MPC.
