Agentic AI: Self-Evo RL + Memory/Planning + Evals + New Benchmarks + SWE-Bench Collapse

Key Questions

What are self-evolution RL methods in agentic AI?

Methods such as RLVR (reinforcement learning with verifiable rewards), CEPO, AVSD, OPSD, and SCRL let agents improve through reinforcement learning. Harness engineering and single-rollout techniques such as SAO support them.
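As a rough illustration of the "verifiable reward" idea behind RLVR, the sketch below scores a rollout with a programmatic check rather than a learned reward model. The function names, the `Answer:` output convention, and the exact-match check are assumptions made for illustration; none of them come from the papers listed here.

```python
import re


def extract_final_answer(rollout_text):
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is a hypothetical output format.
    """
    matches = re.findall(r"Answer:\s*(.+)", rollout_text)
    return matches[-1].strip() if matches else None


def verifiable_reward(rollout_text, reference):
    """Binary reward: 1.0 if the extracted answer equals the reference, else 0.0."""
    answer = extract_final_answer(rollout_text)
    if answer is not None and answer == reference.strip():
        return 1.0
    return 0.0


def group_advantages(rewards):
    """Center rewards on the group mean, a simple baseline for policy-gradient updates."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

For coding agents, the exact-match check would typically be swapped for something like a unit-test run, but the shape stays the same: the reward comes from a check that can be executed, not from a model's opinion.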

What is RHO's performance on SWE-Bench Pro?

RHO achieves a 19-point improvement on SWE-Bench Pro. This reflects progress in coding agent capabilities alongside tools like LoopCoder-v2.

What new benchmarks evaluate proactive agents?

Benchmarks include OpenHands Index, CEO-Bench, and UniClawBench. They test real-world task performance beyond traditional leaderboards.

How do frameworks like LangGraph compare for agent development?

Practical comparisons of LangGraph, CrewAI, and AutoGen highlight differences in orchestration and memory handling. Separately, AutoMem reports 2x-4x gains in memory efficiency.

What caused the SWE-Bench collapse?

Benchmark contamination: when benchmark tasks leak into training data, scores overstate real model performance. A published model selection risk guide urges teams to prioritize internal evaluations.
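One common heuristic for spotting this kind of leakage in an internal evaluation is n-gram overlap between benchmark items and a reference corpus. The sketch below is a minimal version of that heuristic, not the method used in the SWE-Bench analysis or the risk guide; the function names, the 8-token window, and the 0.5 threshold are all illustrative assumptions.

```python
def ngrams(text, n=8):
    """Return the set of whitespace-token n-grams in text (lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, corpus_docs, n=8):
    """Fraction of the item's n-grams that also appear somewhere in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


def flag_contaminated(items, corpus_docs, n=8, threshold=0.5):
    """Return the items whose overlap ratio meets or exceeds the threshold."""
    return [item for item in items
            if overlap_ratio(item, corpus_docs, n) >= threshold]
```

A check like this only works against text you can actually inspect, which is part of why held-out internal tasks tend to be a more dependable signal than public leaderboards.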

What is LLM-as-a-Verifier?

LLM-as-a-Verifier is a general-purpose verification framework reported to achieve state-of-the-art (SOTA) results. It supports reliable evaluation in agentic and reasoning tasks.

What does SAO contribute to agentic RL?

Single-Rollout Asynchronous Optimization improves training efficiency for agentic models. It has been deployed in systems like GLM-5.2.

What is Light-Omni designed for?

Light-Omni focuses on reflex over reasoning in agentic video understanding. It incorporates long-term memory for improved performance.

Self-evolution RL papers (RLVR/CEPO/AVSD/OPSD, SCRL, harness engineering). RHO (19-point improvement on SWE-Bench Pro). OpenAI Symphony. LoopCoder-v2 (7B) scores 64.4 on SWE-bench. New benchmarks: OpenHands Index, CEO-Bench, UniClawBench (proactive agents). Practical comparison of LangGraph/CrewAI/AutoGen. AutoMem 2x-4x improvement. LLM-as-a-Verifier SOTA. SkillOpt-Lite. Light-Omni. Lilian Weng's harness engineering post. SWE-Together benchmark. TurnOPD. SAO (Single-Rollout Asynchronous Optimization) for agentic RL, deployed in GLM-5.2. New: SWE-Bench collapse due to contamination; a model selection risk guide has been published urging internal evals.
