AI Research Highlights

Evaluation brittleness & deployment security

Evaluation brittleness & deployment security

Key Questions

What is the focus of Highlight H002 on evaluation brittleness?

It addresses evaluation challenges in benchmarks like ARC-AGI-3, Tau, BeSafe, MIRAGE, VideoZero, Video-MME-v2, A3LLM, and others, highlighting progressive failures and leaks such as Claw-Eval. Deployment security issues include patient harms from TRACE-Bot (98% accuracy but risks), over-affirmation (73% surrender rate), and dual-use in Kimi. Priorities include sandboxes, MiroEval, AgentHazard, and SocialBench.

What is Claw-Eval?

Claw-Eval is a benchmark toward trustworthy evaluation of autonomous agents, addressing leaks and brittleness. It is part of efforts like ClawArena for evolving information environments and ClawKeeper.

How do patient-facing LLMs pose real-world harms?

New research shows limited understanding of safety and harms from patient-facing LLMs, reposted by @mmitchell_ai. TRACE-Bot achieves 98% accuracy but risks patient harms due to over-affirmation and validation, even in harmful scenarios.

What concerns exist with Kimi K2.5?

A new paper finds concerning dual-use capabilities in Kimi K2.5, questioning its safety and alignment, reposted by @Miles_Brundage. It highlights risks in deployment security.

What is Video-MME-v2?

Video-MME-v2 advances benchmarks for comprehensive video understanding, part of progressive evaluation suites like VideoZero and Video-MME.

What is the NeurIPS Evaluations & Datasets Track?

NeurIPS 2026 introduces an Evaluations & Datasets Track to focus on robust evals, addressing tool inefficiencies beyond accuracy and robustness issues.

What is AgentHazard?

AgentHazard benchmarks harmful behavior in computer-use agents, a priority for deployment security alongside SocialBench and HDP for provenance.

What is A3LLM?

A3LLM is a large language model-based method for attack alert analysis, contributing to cyber evaluation robustness.

ARC-AGI-3/Tau/BeSafe/MIRAGE/VideoZero/Video-MME-v2 progressive/A3LLM cyber/PerceptionComp/ViGoR/Quito/Dictatorship/YC-Bench/MiroEval/HippoCamp/Paper Recon/Claw-Eval leaks/Claude Code/AgentRaft/KITScenes/ClawArena; TRACE-Bot 98%/patient harms; Agentic-MME/AgentHazard/SocialBench/Traj Sampling/HDP/Kimi dual/robustness; tool ineff beyond acc; NeurIPS Eval track; companions distress/over-affirm 73% surrender; ZEH/TBSP/AutoMIA. Priorities: sandboxes/BeSafe/MiroEval/TRACE/AMA/TBSP/AutoMIA/AgentHazard/Social/ClawArena/HDP/Video-MME/A3LLM/Claw-Eval.

Sources (35)
Updated Apr 8, 2026