Software Tech Radar

Research: benchmarks, self-improving agents, state management, debugging

Key Questions

What benchmarks evaluate long-horizon agent performance?

WildClawBench evaluates agents on real-world, long-horizon tasks and complements AgentLens for assessing complex tasks.

How are self-improving agents being developed?

Approaches include Moss's self-evolving source rewriting, self-distillation for continual learning, and persistent memory setups that pair Hermes with Obsidian.

What memory management techniques are advancing for agents?

Memdex (reusable local memory), Tencent's 4-tier memory pipeline, Δ-Mem, and MemEye each address state management challenges.

How do belief state models improve agent reliability?

Agent-BRACE models agent belief states, while STALE, PREPING, and formal methods help debug agents and keep their behavior consistent.

What research focuses on multi-agent collaboration?

Surveys of LLM-based multi-agent systems cover collaboration, failure attribution, and self-evolution, looking beyond the intelligence of individual agents.

How is testing of distributed systems using agents evolving?

AI agents are being applied to test distributed systems, alongside harnesses and RL techniques drawn from 18 months of MCP development.

What persistent memory solutions exist for agents?

Hermes integrated with Obsidian, along with cross-model solutions such as Memdex, lets agents retain and reuse conversation history.

How does self-distillation support continual learning in agents?

Self-distillation allows models to learn continuously without catastrophic forgetting, as shown in recent research papers.
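
The general idea can be sketched as a loss that mixes a new-task term with a penalty for drifting away from the model's own earlier predictions. This is a minimal illustration of generic distillation-based regularization, not the method of any specific paper summarized here; the function names, the `alpha` weight, and the `temperature` value are assumptions chosen for the example.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with optional temperature scaling.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, label):
    # Standard task loss on the new example.
    return -math.log(softmax(student_logits)[label])

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def self_distillation_loss(student_logits, teacher_logits, label,
                           alpha=0.5, temperature=2.0):
    # teacher_logits come from a frozen earlier copy of the same model
    # (hypothetical setup); alpha and temperature are illustrative values.
    task = cross_entropy(student_logits, label)
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    distill = kl_divergence(teacher_probs, student_probs)
    return (1 - alpha) * task + alpha * distill
```

When the student's outputs match the frozen copy's, the distillation term is zero and only the task loss remains; the further the student drifts from its earlier predictions, the larger the penalty, which is what discourages forgetting.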

AgentLens and WildClawBench; new: Agent-BRACE belief state modeling; Δ-Mem, STALE, PREPING, and MemEye; self-distillation for continual learning; testing distributed systems with agents; SpecBench on reward hacking; Hermes+Obsidian persistent memory; Moss self-evolving source rewriting; 18 months of MCP (RL and harnesses); Tencent's 4-tier memory pipeline; Memdex cross-model memory.

Updated May 24, 2026