Agentic reasoning, self-evolution & scaling

Key Questions

What advances are seen in self-evolving AI agents?

MLEvolve reports state-of-the-art results, SkillOpt gains 23.5 points, and Retrospective Harness Optimization lifts SWE-Bench Pro from 59% to 78%, showing that agents can improve through self-evolution and harness optimization.

How do benchmarks like EdgeBench and AgenticSTS help?

EdgeBench studies agent learning over long runs (12-72h) and finds a log-sigmoid performance curve. AgenticSTS is a bounded-memory testbed for long-horizon agents; frontier LLMs score zero wins on it, against a 16% human success rate.
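To make the curve shape concrete, here is a minimal sketch that reads "log-sigmoid" as a sigmoid in the logarithm of elapsed time. That reading is an assumption, since the digest does not give EdgeBench's functional form, and the function name and every parameter value below are hypothetical.

```python
import math


def log_sigmoid_performance(hours, ceiling, midpoint_hours, steepness):
    """Sigmoid in log-time: slow start, fast middle, plateau near `ceiling`.

    Assumed form (not taken from EdgeBench):
        ceiling / (1 + exp(-steepness * (ln(hours) - ln(midpoint_hours))))
    """
    if hours <= 0 or midpoint_hours <= 0:
        raise ValueError("hours and midpoint_hours must be positive")
    z = steepness * (math.log(hours) - math.log(midpoint_hours))
    return ceiling / (1.0 + math.exp(-z))


# Hypothetical parameters, chosen only to show the shape over a 12-72h run.
for hours in (12, 24, 48, 72):
    score = log_sigmoid_performance(
        hours, ceiling=0.8, midpoint_hours=24, steepness=2.0
    )
    print(f"{hours:>2}h -> {score:.3f}")
```

Under this form the score is exactly half the ceiling at the midpoint, and each further doubling of run time buys less improvement than the one before, which is the practical consequence of a curve that saturates in log-time.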

What is DuoMem and its performance?

DuoMem uses dual-space distillation for on-device memory agents, enabling a 4B model to reach 77.9% success on ALFWorld while running 3x faster than its 72B teacher model.

What does TAC achieve in RLVR training?

TAC builds an automated curriculum for multi-domain RLVR from gradient geometry, yielding a 2.8-point gain on small models and better cross-domain transfer from code to math.

How does Sheaf-ADMM improve multi-agent systems?

Sheaf-ADMM from Sakana AI (ICML 2026) is a sheaf-theoretic decentralized optimization method for multi-agent coordination, offered as a principled approach to distributed multi-agent systems.
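For readers unfamiliar with ADMM, the sketch below shows the standard global-consensus form on a toy problem. It is not the sheaf-theoretic method and makes no claim about how Sheaf-ADMM works; it only illustrates the basic pattern of local updates followed by an agreement step. The function name, the quadratic objective, and the numbers are all illustrative assumptions.

```python
def consensus_admm(targets, rho=1.0, iterations=100):
    """Plain scaled-form consensus ADMM on a toy problem.

    Agent i privately holds target a_i and minimizes 0.5 * (x_i - a_i)^2,
    subject to every x_i agreeing with a shared value z. The optimum of
    the summed objective is the mean of the targets.
    """
    n = len(targets)
    xs = [0.0] * n   # local variables, one per agent
    us = [0.0] * n   # scaled dual variables, one per agent
    z = 0.0          # shared consensus variable
    for _ in range(iterations):
        # Local step: closed-form minimizer of
        # 0.5 * (x - a)^2 + (rho / 2) * (x - z + u)^2
        xs = [(a + rho * (z - u)) / (1.0 + rho) for a, u in zip(targets, us)]
        # Agreement step: average the local proposals.
        z = sum(x + u for x, u in zip(xs, us)) / n
        # Dual step: accumulate each agent's disagreement with z.
        us = [u + x - z for u, x in zip(us, xs)]
    return z, xs


# Hypothetical private targets held by three agents.
z, xs = consensus_admm([1.0, 4.0, 7.0])
print(round(z, 6), [round(x, 6) for x in xs])
```

Each agent only ever touches its own target and dual variable; the averaging step is the single point of communication. In this plain version that step is a global average, which is the part a decentralized method would need to replace.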

What practical tips exist for coding agent infrastructure?

Snapshots and forks can reduce coding agent infrastructure costs, and harness choice significantly affects efficiency: Claude Code uses 2x the tokens of Codex in some setups.

What new findings address RL generalization?

Research shows SFT introduces brittle specialized features while RL preserves base model representations, with causal experiments confirming better generalization from RL approaches.

How effective are proxies like PACE for agent evaluation?

PACE predicts agentic performance from cheap atomic benchmarks with under 4% MAE and over 0.80 Spearman correlation, enabling fast model selection without full runs.
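The two quoted metrics answer different questions: MAE measures how far the predicted scores are from the real ones, while Spearman correlation measures whether the proxy ranks models in the right order. Below is a minimal stdlib-only sketch of both. The scores are hypothetical, not PACE results, and the helper names are illustrative.

```python
from statistics import mean


def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and measured scores."""
    return mean(abs(p - a) for p, a in zip(predicted, actual))


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores (percent) for five models; not taken from PACE.
proxy_predicted = [41.0, 55.5, 62.0, 70.5, 48.0]
agentic_measured = [43.0, 53.0, 65.0, 69.0, 55.5]

print(mean_absolute_error(proxy_predicted, agentic_measured))
print(spearman(proxy_predicted, agentic_measured))
```

For model selection the rank correlation is usually the more important of the two, since a proxy can be off by a few points on every model and still pick the right one.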

Merged highlight covering autoresearch/self-evolving agents and verifiable reasoning scaling/agent harnesses.

MLEvolve SOTA; Stanford study 2 agents 50% worse; SkillOpt +23.5 pts; Retrospective Harness Optimization 59%->78% SWE-Bench Pro. @rasbt finds local 30B MoE models match GPT 5.5 speed; harness choice matters: Claude Code burns 2x tokens vs Codex.

New: EvoArena; Self-Evolving Multi-Agent Systems via Decentralized Memory; EP250 'subterranean agent' 100x cost reduction; Orchestra-o1 72.8% OmniGAIA; PreAct reduces redundant reasoning; SU-01 IMO/IPhO gold; MaxProof RL scaling; RandOpt rivals GRPO; MiniMax Sparse Attention; GLM-5.2; Schulman explains PPO; new agency definition framework.

Practical tip: use snapshots/forks for coding agent infrastructure cost reduction. Practical blog on AI memory levels for engineers.

NVIDIA Nemotron 3 Super with RLVR: practical deep-dive into scalable RL for domain-specific agents. Decision framework (SFT vs RLHF vs RLVR) and NeMo Gym infrastructure. 1.2M rollouts and 21 verifiers.

'Why Does Reinforcement Learning Generalize?' paper finds SFT introduces brittle specialized features while RL preserves base model representations; causal intervention experiments.

Practical: Train Specialized AI Agents With Practical Reinforcement Learning guide (GRPO, NeMo Gym, iterative improvement).

AutoTrainess automates LM post-training by externalizing human experience as explicit workflows, outperforming CLI-only baselines.

DiscoPER (Autonomous Scientific Discovery via Iterative Meta-Reflection) recovers 8/9 known patterns on ecological benchmark using second-order meta-reflection.

Ctrl-R introduces tractable trajectory control for RL-based reasoning exploration, enabling diverse reasoning patterns.

Daily Papers feed from Hugging Face covers video retrieval, GUI automation, tool-use agents, and coding agent rewards.

AI Native Daily Paper Digest (20260702) covers multimodal eval, reasoning, video serving, Seed2.0, state-prediction separation, robotics, biomedical discovery, and coding agents.

PACE proxy for agentic capability evaluation predicts performance from cheap atomic benchmarks (<4% MAE, >0.80 Spearman). Practical for fast model selection without full agentic runs.

AgenticDataBench comprehensive benchmark for data agents covering 15 domains with skill-level granularity, including real B2B cases. Practical for evaluating data agents.

AgenticSTS bounded-memory testbed for long-horizon agents using Slay the Spire 2; isolates memory layers via typed retrieval. Frontier LLMs get zero wins vs 16% human. Practical for agent memory design.

Flexion Reflect v1.0 shows RL dramatically improving long-horizon robot autonomy (38% SFT vs 90% SFT+RL). Off-the-shelf VLMs act too eagerly; RL needed for reliability. Practical for autonomous agents.

Grounded autonomous research pipeline (fault-tolerant LLM) for computational physics: uses redundancy, fresh-context isolation, adversarial review. Practical for building reliable autonomous research agents.

New: @polynoamial flags hidden test-time compute budgets in agent scores from AISecurityInst work; crucial for evaluation rigor. Practical for agent benchmarking.

New: Gradient efficiency gains of 20-40% for CWM 32B with best accuracy-diversity tradeoff on LiveCodeBench; reasoning generalizes from code to math without math training. Practical for efficient training and cross-domain transfer.

New: EdgeBench benchmark for agent learning over long runs (12-72h) reveals log-sigmoid performance curve; practical for understanding agent improvement dynamics.

New: Sheaf-ADMM (Sakana AI, ICML 2026) for multi-agent coordination via sheaf-theoretic decentralized optimization; principled method for distributed multi-agent systems.

New: Coding agents can replicate scientific ML papers (dair_ai); concrete signal of agentic research reproduction capability.

New: TAC (Transferability for General Reasoning): automated curriculum for multi-domain RLVR using gradient geometry; 2.8 point gain on small models, principled alternative to hand-tuning. Practical for scaling RLVR across math/code/science.

New: DuoMem: dual-space distillation for on-device memory agents; 4B model achieves 77.9% success on ALFWorld, 3x faster than 72B teacher. Practical for edge deployment of capable agents.
