Evaluation brittleness & deployment security
Key Questions
What is the focus of Highlight H002 on evaluation brittleness?
It covers evaluation challenges across benchmarks and methods such as ARC-AGI-3, Tau, BeSafe, MIRAGE, VideoZero, Video-MME-v2, and A3LLM, highlighting progressive failures and benchmark leaks, the latter addressed by Claw-Eval. Deployment security issues include patient harms associated with TRACE-Bot (98% accuracy, but with remaining risks), over-affirmation (a 73% surrender rate), and dual-use capabilities in Kimi K2.5. Priorities include sandboxes, MiroEval, AgentHazard, and SocialBench.
What is Claw-Eval?
Claw-Eval is a benchmark aimed at trustworthy evaluation of autonomous agents, addressing leaks and brittleness. Related efforts include ClawArena, which targets evolving information environments, and ClawKeeper.
How do patient-facing LLMs pose real-world harms?
New research, reposted by @mmitchell_ai, shows that the safety and harms of patient-facing LLMs remain poorly understood. TRACE-Bot achieves 98% accuracy but still risks patient harm through over-affirmation and validation, even in harmful scenarios.
What concerns exist with Kimi K2.5?
A new paper, reposted by @Miles_Brundage, finds concerning dual-use capabilities in Kimi K2.5 and questions its safety and alignment. It highlights risks for deployment security.
What is Video-MME-v2?
Video-MME-v2 is a benchmark that advances comprehensive video understanding. It is part of a line of progressive evaluation suites that includes VideoZero and Video-MME.
What is the NeurIPS Evaluations & Datasets Track?
NeurIPS 2026 introduces an Evaluations & Datasets Track focused on robust evaluations. It addresses concerns that go beyond accuracy, such as tool inefficiency, as well as robustness issues.
What is AgentHazard?
AgentHazard benchmarks harmful behavior in computer-use agents. It is a deployment security priority alongside SocialBench and HDP (provenance).
What is A3LLM?
A3LLM is a large language model-based method for attack alert analysis, contributing to cyber evaluation robustness.
ARC-AGI-3, Tau, BeSafe, MIRAGE, VideoZero, Video-MME-v2 (progressive), A3LLM (cyber), PerceptionComp, ViGoR, Quito, Dictatorship, YC-Bench, MiroEval, HippoCamp, Paper Recon, Claw-Eval (leaks), Claude Code, AgentRaft, KITScenes, ClawArena; TRACE-Bot (98%), patient harms; Agentic-MME, AgentHazard, SocialBench, Traj Sampling, HDP, Kimi dual-use, robustness; tool inefficiency beyond accuracy; NeurIPS Evaluations & Datasets Track; companions distress, over-affirmation (73% surrender); ZEH, TBSP, AutoMIA.
Priorities: sandboxes, BeSafe, MiroEval, TRACE, AMA, TBSP, AutoMIA, AgentHazard, SocialBench, ClawArena, HDP, Video-MME, A3LLM, Claw-Eval.