AI Innovation Tracker

RL unification + agent evals + long-horizon

Key Questions

What is TELBench and how does it improve agent evaluation?

TELBench enables span-level error localization in agent trajectories, delivering up to 30-point gains in identifying where deep-research agents fail.

How does MMG2Skill help LLM agents?

It distills web guides into self-evolving skills, yielding 12-25% performance improvements for agents operating in open-world settings.

What is NVIDIA Nemotron 3 Ultra and its key specs?

It is an openly released hybrid Mamba-Transformer mixture-of-experts (MoE) model with 550B total parameters, 55B of them active, and a million-token context window. It is reported to deliver 5x throughput and a 30% cost reduction on agentic tasks.
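The parameter counts quoted above imply how sparse the model is at inference time. The short sketch below only restates that arithmetic; the constant and function names are illustrative and do not come from NVIDIA.

```python
# Figures quoted in the summary above; nothing else is assumed about the model.
TOTAL_PARAMS_B = 550   # total parameters, in billions
ACTIVE_PARAMS_B = 55   # active parameters, in billions


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters that are active (hypothetical helper)."""
    return active_b / total_b


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"{fraction:.0%} of parameters are active")  # prints "10% of parameters are active"
```

Roughly one parameter in ten is active, which is the usual reason an MoE model can be cheaper to serve than a dense model of the same total size.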

What does LEAP achieve on Putnam problems?

LEAP solves all Putnam 2025 problems through a Lean verifier loop that enables reliable mathematical reasoning in agents.
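The summary does not describe how LEAP's loop is built, so the following is only a generic sketch of a propose-and-verify loop, not LEAP's implementation. The names `verifier_loop`, `toy_propose`, and `toy_verify` are hypothetical, and the toy verifier is a plain Python stand-in for a real proof checker such as Lean.

```python
from typing import Callable, Optional


def verifier_loop(
    propose: Callable[[list[str]], str],
    verify: Callable[[str], tuple[bool, str]],
    max_rounds: int = 5,
) -> Optional[str]:
    """Request candidates until one passes the verifier or rounds run out."""
    feedback: list[str] = []
    for _ in range(max_rounds):
        candidate = propose(feedback)      # the generator sees past failures
        ok, message = verify(candidate)    # the checker accepts or rejects
        if ok:
            return candidate
        feedback.append(message)           # keep the error for the next attempt
    return None


# Toy stand-ins (assumptions for illustration only).
def toy_propose(feedback: list[str]) -> str:
    attempts = ["x = 1", "x = 2", "x = 3"]
    return attempts[min(len(feedback), len(attempts) - 1)]


def toy_verify(candidate: str) -> tuple[bool, str]:
    if candidate == "x = 3":
        return True, "accepted"
    return False, f"rejected: {candidate}"


result = verifier_loop(toy_propose, toy_verify)
print(result)  # prints "x = 3"
```

The point of the pattern is that only candidates accepted by the checker are returned, so the reliability of the final answer rests on the verifier rather than on the generator.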

What is the Meta-Agent Challenge focused on?

It evaluates how well current agents perform self-improvement, revealing issues such as reward hacking and limited autonomous progress.

How does streaming communication benefit multi-agent systems?

It improves multi-agent reasoning performance by 7.3 percentage points through more efficient information exchange during collaborative tasks.

What benchmarks address long-horizon and adaptive planning?

AdaPlanBench tests adaptive planning under constraints; the best reported score is 67.75%. On the method side, Meta-Cognitive Memory Policy Optimization uses belief entropy as a proxy signal and reaches 97.1% accuracy at 1.75M tokens.
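The summary names belief entropy but does not define it. One plausible reading, assumed here, is the Shannon entropy of a discrete belief distribution over candidate answers. The function `belief_entropy` and the example distributions are hypothetical and are not taken from the paper.

```python
import math


def belief_entropy(probs: list[float]) -> float:
    """Shannon entropy in bits of a discrete belief (assumed definition)."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("belief must have positive total mass")
    # Normalise so unnormalised weights are also accepted; skip zero entries.
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)


confident = belief_entropy([0.97, 0.01, 0.01, 0.01])  # mass on one answer
uncertain = belief_entropy([0.25, 0.25, 0.25, 0.25])  # uniform over four answers

print(confident < uncertain)  # prints "True"
```

Under this reading, low entropy indicates a settled belief and high entropy indicates uncertainty. How the method actually turns that signal into a memory policy is not stated in the summary.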

Which new resources support self-evolving agents?

OpenSkill, Socratic-SWE, SIA, and SubtleMemory provide frameworks and benchmarks for self-evolution, memory discrimination, and dynamic replanning in long-horizon agents.

New agent evaluation and learning paradigms. TELBench/DRIFT provides span-level error localization (up to a 30-point improvement). MMG2Skill distills web guides into self-evolving skills (+12-25%). Streaming communication improves multi-agent reasoning (+7.3 pp). LEAP (Google) solves all Putnam 2025 problems via a Lean verifier loop. NVIDIA Nemotron 3 Ultra is an open-release hybrid Mamba-Transformer MoE (550B total, 55B active, million-token context) with 5x throughput and a 30% cost reduction for agentic tasks. New: Meta-Cognitive Memory Policy Optimization (belief-entropy proxy, 97.1% at 1.75M tokens); AdaPlanBench (adaptive planning benchmark, best score 67.75%); Meta-Agent Challenge (agents struggle with self-improvement and show reward hacking). Also covered: FluxMem, LiteCoder-Terminal, the Maestro RL orchestrator, π-Bench, ACC, BeSafe-Bench, model-based RL on real hardware, and Internet for AI Agents.

Sources (13)
Updated Jun 8, 2026