AI Agent Traps, Memory & Evaluation Advances

Key Questions

What new agent benchmarks target memory and sycophancy issues?

MemSyco-Bench evaluates memory-induced sycophancy in agents. PACE is a proxy for agentic capability evaluation that reduces evaluation overhead, achieving an MAE under 4% and a Spearman correlation above 0.80. AgenticDataBench benchmarks data agents on fintech B2B cases.
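The two numbers quoted for PACE are standard agreement metrics. As a rough illustration of what they measure, the sketch below computes mean absolute error and Spearman rank correlation between a proxy score and a full-benchmark score. The scores are invented for illustration, the function names are hypothetical, and treating the MAE as percentage points is an assumption; none of this is taken from the PACE paper.

```python
from statistics import mean


def mean_absolute_error(proxy, full):
    """Average absolute gap between proxy and full-benchmark scores."""
    return mean(abs(p - f) for p, f in zip(proxy, full))


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores (in percent) for five models; made up for illustration.
full_scores = [62.0, 48.5, 71.0, 55.0, 39.0]
proxy_scores = [60.0, 50.0, 68.5, 47.5, 41.0]

mae = mean_absolute_error(proxy_scores, full_scores)
rho = spearman(proxy_scores, full_scores)
print(f"MAE = {mae:.1f} points, Spearman rho = {rho:.2f}")
```

On these made-up scores the proxy is off by 3.1 points on average and swaps one adjacent pair of models in the ranking, giving a Spearman correlation of 0.9. A low MAE says the proxy predicts absolute scores well; a high Spearman correlation says it preserves the ordering of models.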

How do new architectures improve agentic video understanding?

Light-Omni achieves a 12.1x speedup and 2.6x memory efficiency by using a dual-state design (a global script plus a latent state) as a memory system for MLLMs. It emphasizes reflex over reasoning for long-term agentic tasks.

What advances exist in agent self-evolution and evaluation?

SkillOpt-Lite uses ZO optimization to tune agent harnesses, enabling nano models to outperform larger ones, and is integrated into VSCode Copilot. SkillCoach uses self-evolving rubrics to evaluate agent skill use across selection, following, composition, and reflection.
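For readers unfamiliar with the term, and assuming "ZO" here stands for zeroth-order optimization, the sketch below shows the generic idea: improving parameters using only function evaluations, with no analytic gradient. It is a textbook two-point random-direction estimator on a toy objective, not SkillOpt-Lite's actual procedure; `toy_loss`, the target values, and all hyperparameters are hypothetical.

```python
import random


def zo_gradient(f, x, rng, mu=1e-3, samples=20):
    """Estimate the gradient of f at x from function values only.

    Each sample probes f along a random Gaussian direction u and uses a
    central difference to approximate the directional derivative.
    """
    grad = [0.0] * len(x)
    for _ in range(samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_plus) - f(x_minus)) / (2 * mu)
        for i, ui in enumerate(u):
            grad[i] += slope * ui / samples
    return grad


def zo_minimize(f, x0, steps=200, lr=0.05, seed=0):
    """Plain gradient descent driven by the zeroth-order estimate."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(f, x, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def toy_loss(params):
    """Hypothetical stand-in for a black-box score, e.g. a harness setting.

    The optimizer only ever sees the returned number, never the target.
    """
    target = [0.7, 0.2]
    return sum((p - t) ** 2 for p, t in zip(params, target))


best = zo_minimize(toy_loss, [0.0, 0.0])
print([round(v, 3) for v in best], toy_loss(best))
```

The appeal of this family of methods for agent tuning is that the objective can be anything that returns a score, such as a task success rate, since no backpropagation through the model is required. The cost is more evaluations per update than a true gradient would need.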

What persistent challenges remain in AI agent safety?

Domain-camouflaged injections and sycophancy continue to affect agents. Legal agents show failure rates above 90%, and misalignment appears to worsen in newer models.

Which papers advance multi-agent and compositional routing?

SkillWeaver formalizes compositional routing for agents. Multi-agent systems demonstrate emergent self-organization under resource constraints, achieving 5x speed with Gemma 4 models.

Alibaba's new Qwen-Image-Agent bridges the context gap in real-world image generation by combining planning, reasoning, search, and memory. The new Light-Omni architecture achieves a 12.1x speedup and 2.6x memory efficiency for agentic video understanding through a dual-state design (global script plus latent state) that serves as a memory system for MLLMs.

π-Bench, ACC, HarnessBridge, WeaveBench, FluxMem, and AgingBench advance agent evaluation. The new MemSyco-Bench benchmark targets memory-induced sycophancy in agents. PACE, a new proxy for agentic capability evaluation (MAE <4%, Spearman >0.80), reduces evaluation overhead. AgenticDataBench provides a comprehensive benchmark for data agents with fintech B2B cases. SkillCoach introduces self-evolving rubrics for evaluating agent skill use (selection, following, composition, reflection). Anima Anandkumar's paper on self-report versus behavior evaluation was selected for an ICML oral.

SkillOpt-Lite proposes minimal viable agent self-evolution using ZO optimization, enabling nano models to outperform larger ones through optimized harnesses; it is integrated into VSCode Copilot with 'one line of vibe'. The MetaSkill-Evolve paper introduces adaptive meta-evolution for self-improving agents, addressing the bottleneck of a frozen improvement procedure.

SkillWeaver formalizes compositional routing. Multi-agent collaboration shows emergent self-organization under resource constraints, reaching 5x speed with Gemma 4 models. PreAct reduces the cost of computer-using agents.

Misalignment appears to be worsening in newer models, which challenges evaluation. Domain-camouflaged injections and sycophancy persist, and legal agents show failure rates above 90%.
