AI Safety & Governance Digest

**Agent benchmarks & eval tooling: ClawArena/Claw-Eval/BeSafe/ARC-AGI-3, wild skills/tool inefficiency/trajectories, DAB/Omni/Proactive/Nemotron, noisy supervision; multimodal evals, AgentHazard, learnable agents, multi-agent realities, world models, Cog-DRIFT** [developing]

Key Questions

What are ClawArena and Claw-Eval in agent benchmarking?

ClawArena and Claw-Eval provide trustworthy evaluations for autonomous agents operating in evolving information environments. They identify harms, coding issues, and skill gaps in realistic settings, and are evolving to test tool inefficiency and full trajectories.
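
The digest does not describe ClawArena/Claw-Eval internals, so here is a minimal sketch of what a trajectory-level scorer for harms and tool inefficiency could look like. All names (`Step`, `Trajectory`, `score_trajectory`) are hypothetical, not the benchmarks' real API.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str          # tool the agent invoked
    succeeded: bool    # whether the call achieved its sub-goal
    flagged_harm: bool # whether a safety classifier flagged the step

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)

def score_trajectory(traj: Trajectory) -> dict[str, float]:
    """Score one agent trajectory on harm rate and tool inefficiency."""
    n = len(traj.steps)
    if n == 0:
        return {"harm_rate": 0.0, "tool_inefficiency": 0.0}
    harms = sum(s.flagged_harm for s in traj.steps)
    wasted = sum(not s.succeeded for s in traj.steps)  # failed/redundant calls
    return {"harm_rate": harms / n, "tool_inefficiency": wasted / n}

# Example: a 3-step trajectory with one wasted tool call.
traj = Trajectory(steps=[
    Step("search", True, False),
    Step("search", False, False),   # redundant retry counts as inefficiency
    Step("code_exec", True, False),
])
print(score_trajectory(traj))  # harm_rate 0.0, tool_inefficiency ~0.33
```

Per-trajectory scores like these would then be aggregated across tasks to produce leaderboard numbers.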

How do agentic skills perform in wild settings?

AgentSocialBench, together with evaluations of AWS Strands agents, shows significant gaps in wild agentic skills despite tool integration. In Stanford's tests, single agents outperform multi-agent setups, and inefficiency patterns persist in realistic scenarios.

What is BeSafe and its findings on agent safety?

The BeSafe benchmark reveals unsafe behaviors in over 40% of agent episodes. It tests harms alongside ARC-AGI-3, on which agents score under 1%, highlighting the need for better safety evaluations in multi-agent settings.
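
The digest reports the >40% figure without methodology; one plausible reading is an episode-level rate, as in this sketch. `unsafe_rate` and the action labels are illustrative, not BeSafe's actual API.

```python
def unsafe_rate(episodes: list[list[str]], unsafe_actions: set[str]) -> float:
    """Fraction of episodes containing at least one unsafe action."""
    flagged = sum(
        any(a in unsafe_actions for a in episode) for episode in episodes
    )
    return flagged / len(episodes) if episodes else 0.0

episodes = [
    ["read_file", "send_email"],
    ["read_file", "rm -rf /tmp/data"],   # destructive call -> unsafe episode
    ["browse", "summarize"],
    ["exfiltrate_secrets"],              # unsafe episode
    ["browse"],
]
print(unsafe_rate(episodes, {"rm -rf /tmp/data", "exfiltrate_secrets"}))  # 0.4
```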

What is Cog-DRIFT in RLVR exploration?

Cog-DRIFT breaks the zero-reward pitfall in RLVR on hard problems by using zero-reward exploration, extracting a learning signal even from rollouts that all fail. This improves agent adaptation under noisy supervision.
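
Cog-DRIFT's exact mechanism is not given in this digest. A minimal sketch, assuming a GRPO-style group-relative baseline in which all-zero-reward groups fall back to a novelty bonus so failed rollouts still carry gradient signal; the names and the fallback rule are my illustration, not the paper's method.

```python
def rlvr_advantages(rewards: list[float], novelty: list[float],
                    bonus_weight: float = 0.1) -> list[float]:
    """Group-relative advantages for RLVR rollouts.

    If every rollout in the group gets zero verifier reward (the
    hard-problem pitfall), fall back to a novelty bonus so the policy
    gradient is not identically zero.
    """
    if any(r > 0 for r in rewards):
        mean = sum(rewards) / len(rewards)
        return [r - mean for r in rewards]        # standard group baseline
    # Zero-reward group: reward exploration instead of correctness.
    shaped = [bonus_weight * n for n in novelty]
    mean = sum(shaped) / len(shaped)
    return [s - mean for s in shaped]

# Four failing rollouts: the verifier gives 0 to all, but novelty still ranks them.
print(rlvr_advantages([0, 0, 0, 0], novelty=[0.2, 0.9, 0.1, 0.4]))
```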

How do world models feature in agent evals?

Benchmarks like WR-Arena, OpenWorldLib, and Nemotron-Cascade test world action models and spatial understanding. They extend multimodal evaluations to learnable agents and reveal gaps in proactive behaviors probed by suites such as DAB and Omni.
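
As an illustration of the simplest metric such world-action-model suites might report, here is exact-match next-state accuracy on a toy grid world. `action_model_accuracy` and `toy_model` are hypothetical, not any benchmark's real interface.

```python
def action_model_accuracy(model, transitions):
    """Exact-match accuracy of a world action model.

    transitions: iterable of (state, action, next_state) ground-truth tuples.
    model(state, action) -> predicted next_state.
    """
    hits = total = 0
    for state, action, next_state in transitions:
        hits += model(state, action) == next_state
        total += 1
    return hits / total if total else 0.0

# Toy grid world: states are (x, y); actions move one cell.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
def toy_model(state, action):
    dx, dy = MOVES[action]
    return (state[0] + dx, state[1] + dy)

data = [((0, 0), "up", (0, 1)), ((2, 2), "left", (1, 2)), ((1, 1), "down", (1, 2))]
print(action_model_accuracy(toy_model, data))  # 2/3: last transition mismatches
```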

What issues arise with noisy supervision in LLMs?

LLMs exhibit noisy supervision in reasoning and self-execution, per recent self-execution simulation papers. Test-time learnable adaptation methods such as ThinkTwice mitigate this, while FactReview verifies claims amid RAG decay.
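
ThinkTwice's actual procedure is not described in this digest; a common stand-in for denoising a noisy supervision signal is self-consistency voting over repeated samples, sketched below with hypothetical names.

```python
from collections import Counter

def majority_label(samples: list[str]) -> tuple[str, float]:
    """Denoise a noisy signal by sampling the model (or annotators)
    several times and taking the majority answer with its agreement rate."""
    counts = Counter(samples)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(samples)

# Five noisy reasoning samples for the same question:
print(majority_label(["42", "42", "41", "42", "7"]))  # ('42', 0.6)
```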

What do multi-agent benchmarks reveal?

Stanford papers show that more agents do not always yield better results; single agents can outperform multi-agent systems. AgentHazard and trajectory-learning work highlight irrational behavior in AWS Strands agents, while signals and retrieval from trajectories improve evaluations.
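
To make "single can outperform multi-agent" concrete, one standard way to test such a claim is a bootstrap comparison of per-task success rates, sketched here with toy data; this is an illustration, not the Stanford papers' method.

```python
import random

def bootstrap_diff(single: list[int], multi: list[int],
                   iters: int = 10_000, seed: int = 0) -> tuple[float, float]:
    """Compare single- vs multi-agent success rates (1 = task solved).

    Returns the observed difference and the bootstrap probability that
    the single agent is at least as good as the multi-agent system.
    """
    rng = random.Random(seed)
    diff = sum(single) / len(single) - sum(multi) / len(multi)
    wins = 0
    for _ in range(iters):
        s = [rng.choice(single) for _ in single]   # resample with replacement
        m = [rng.choice(multi) for _ in multi]
        wins += sum(s) / len(s) >= sum(m) / len(m)
    return diff, wins / iters

single = [1, 1, 0, 1, 1, 0, 1, 1]   # 75% success on 8 toy tasks
multi  = [1, 0, 0, 1, 0, 1, 0, 1]   # 50% success
print(bootstrap_diff(single, multi))
```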

What is the role of learnable adaptation in agent evals?

Learning-to-Learn-at-Test-Time equips language agents with adaptation policies. It supports test-time training amid the scaling limits identified in MIT work, countering wild skill gaps and inefficiency in tool-integrated reasoning.
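
The digest names the technique but not its interface. A minimal sketch of a test-time adaptation loop with retrieval from trajectories, assuming a toy `Memory` store and an agent callable; all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    """Toy trajectory store; `nearest` stands in for real similarity retrieval."""
    entries: list = None
    def __post_init__(self):
        self.entries = self.entries or []
    def nearest(self, task, k):
        return self.entries[-k:]            # naive: most recent k trajectories
    def add(self, task, trajectory, success):
        self.entries.append((task, trajectory, success))

def test_time_adapt(act, task, memory, k=3, max_tries=4):
    """Retrieve past trajectories, act, and store each new attempt so later
    attempts within the same session learn from earlier failures."""
    for _ in range(max_tries):
        examples = memory.nearest(task, k)   # retrieval from trajectories
        trajectory, success = act(task, examples)
        memory.add(task, trajectory, success)
        if success:
            return trajectory
    return None  # budget exhausted; caller may escalate or abstain

# Toy agent: succeeds once it has seen at least two prior failed attempts.
def toy_act(task, examples):
    return f"attempt-{len(examples)}", len(examples) >= 2

print(test_time_adapt(toy_act, "fix-bug", Memory()))  # 'attempt-2'
```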

In brief

- ClawArena/Claw-Eval: trustworthy evaluations in evolving information environments
- AgentHazard: harms and coding failures; wild agentic skill gaps in realistic settings
- Tool-integrated inefficiency patterns; learning retrieval from trajectories; Signals trajectory work
- AgentSocialBench; AWS Strands irrationality; Stanford: single agents > multi-agent
- BeSafe: >40% unsafe behaviors; ARC-AGI-3: <1% scores
- DAB/Omni/Proactive/HippoCamp; Nemotron-Cascade; SpatialLM
- World Action Models / OpenWorldLib / WR-Arena
- Cog-DRIFT: RLVR zero-reward exploration
- LLMs: noisy supervision in reasoning/self-execution; test-time learnable adaptation
- FactReview verification; no sensitive retention; RAG decay; MIT scaling limits

Sources (25)
Updated Apr 8, 2026