Evaluation brittleness & deployment security

Key Questions

What is the focus of Highlight H002 on evaluation brittleness?

It covers evaluation challenges across benchmarks such as ARC-AGI-3, Tau, BeSafe, MIRAGE, VideoZero, Video-MME-v2, and A3LLM, highlighting progressive failures and leaks, which Claw-Eval is meant to address. On the deployment security side, it flags TRACE-Bot (98% accuracy, but with remaining patient-harm risks), over-affirmation by models (73% surrender rate), and dual-use capabilities in Kimi K2.5. Priorities include sandboxes, MiroEval, AgentHazard, and SocialBench.

What is Claw-Eval?

Claw-Eval is a benchmark aimed at trustworthy evaluation of autonomous agents, addressing leaks and brittleness. Related efforts include ClawArena, which benchmarks AI agents in evolving information environments, and ClawKeeper, a safety protection framework for OpenClaw agents.

How do patient-facing LLMs pose real-world harms?

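A blog post reposted by @mmitchell_ai reports that real-world safety and harms from patient-facing LLMs are poorly understood. A second item reposted by the same account finds that AI models overly affirm and validate users, even in harmful scenarios. TRACE-Bot, which targets AI-driven social bots on healthcare platforms, is cited with 98% accuracy, but the highlight notes that patient-harm risks remain.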
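A headline accuracy figure such as 98% does not say how many flagged cases are true positives, which is one reason risk can persist alongside it. The sketch below is a minimal illustration under stated assumptions: the 98% sensitivity, 98% specificity, and 1% prevalence are hypothetical inputs chosen for the arithmetic, not figures from the TRACE-Bot paper, and the function names are illustrative.

```python
def accuracy_at_prevalence(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Overall accuracy when a fraction `prevalence` of cases is truly positive."""
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


def precision_at_prevalence(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Fraction of flagged cases that are true positives."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical inputs (assumptions for illustration only).
acc = accuracy_at_prevalence(0.98, 0.98, 0.01)
ppv = precision_at_prevalence(0.98, 0.98, 0.01)
print(f"accuracy={acc:.2f} precision={ppv:.2f}")  # accuracy=0.98 precision=0.33
```

Under these assumed inputs, roughly two out of three flagged cases would be false positives despite 98% accuracy, so accuracy alone is a weak basis for judging deployment risk.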

What concerns exist with Kimi K2.5?

A new paper, reposted by @Miles_Brundage, asks how safe and aligned Kimi K2.5 is and reports concerning dual-use capabilities. The highlight treats this as a deployment security risk.

What is Video-MME-v2?

Video-MME-v2 is the next stage of the Video-MME benchmark for comprehensive video understanding. The highlight groups it with other progressive video evaluation suites such as VideoZeroBench.

What is the NeurIPS Evaluations & Datasets Track?

NeurIPS 2026 introduces an Evaluations & Datasets Track focused on robust evaluation. The highlight ties it to concerns that accuracy alone does not capture, such as inefficient tool use and robustness.
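To make "beyond accuracy" concrete, the sketch below reports a tool-efficiency score next to accuracy for a set of agent runs. It is a minimal illustration, not a metric defined by NeurIPS or any benchmark named here: `RunResult`, `min_tool_calls` (an assumed reference count of the fewest calls a known-good solution needs), and `summarize` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunResult:
    correct: bool         # did the agent reach the right answer?
    tool_calls: int       # tool calls the agent actually made
    min_tool_calls: int   # hypothetical reference: fewest calls needed


def summarize(runs: List[RunResult]) -> Dict[str, float]:
    """Report accuracy and mean tool efficiency over solved runs."""
    if not runs:
        raise ValueError("need at least one run")
    accuracy = sum(r.correct for r in runs) / len(runs)
    solved = [r for r in runs if r.correct]
    # Efficiency is reference calls divided by actual calls, capped at 1.0.
    ratios = [min(1.0, r.min_tool_calls / max(r.tool_calls, 1)) for r in solved]
    efficiency = sum(ratios) / len(ratios) if ratios else 0.0
    return {"accuracy": accuracy, "tool_efficiency": efficiency}


runs = [
    RunResult(correct=True, tool_calls=4, min_tool_calls=2),
    RunResult(correct=True, tool_calls=2, min_tool_calls=2),
    RunResult(correct=False, tool_calls=10, min_tool_calls=3),
]
report = summarize(runs)
```

Two agents with the same accuracy can differ sharply on `tool_efficiency`, which is the kind of gap an accuracy-only leaderboard hides.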

What is AgentHazard?

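AgentHazard is a benchmark for evaluating harmful behavior in computer-use agents. It is listed as a deployment security priority alongside SocialBench and HDP, a lightweight cryptographic protocol for human delegation provenance in agentic AI systems.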

What is A3LLM?

A3LLM is a large language model-based method for attack alert analysis, contributing to cyber evaluation robustness.

ARC-AGI-3, Tau, BeSafe, MIRAGE, VideoZero, Video-MME-v2 progressive, A3LLM cyber, PerceptionComp, ViGoR, Quito, Dictatorship, YC-Bench, MiroEval, HippoCamp, Paper Recon, Claw-Eval leaks, Claude Code, AgentRaft, KITScenes, ClawArena.

TRACE-Bot 98%, patient harms.

Agentic-MME, AgentHazard, SocialBench, Traj Sampling, HDP, Kimi dual-use, robustness.

Tool inefficiency beyond accuracy.

NeurIPS Eval track.

Companions distress, over-affirmation 73% surrender.

ZEH, TBSP, AutoMIA.

Priorities: sandboxes, BeSafe, MiroEval, TRACE, AMA, TBSP, AutoMIA, AgentHazard, SocialBench, ClawArena, HDP, Video-MME, A3LLM, Claw-Eval.

Sources (35)

Updated Apr 8, 2026

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

A3LLM: A Novel Large Language Model-Based Method For Attack Alert Analysis | Springer Nature Link

Paper page - Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

@mmitchell_ai reposted: New blog: Real-world safety and harms from patient-facing LLMs There is limited...

@mmitchell_ai reposted: Artificial intelligence models overly affirm and validate users, even when users...

@Miles_Brundage reposted: 🚨New paper! How safe and aligned is Kimi K2.5? We found concerning dual-use ca...

Advancing adversarial and LLM robustness in trustworthy AI: a comprehensive survey | Artificial Intelligence Review | Springer Nature Link

Paper page - ClawArena: Benchmarking AI Agents in Evolving Information Environments

HDP: A Lightweight Cryptographic Protocol for Human Delegation Provenance in Agentic AI Systems

AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

Paper page - Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

@_akhaliq: Signals Trajectory Sampling and Triage for Agentic Interactions paper: https://t.co/XPfBucLx0i htt...

Introducing the Evaluations & Datasets Track at NeurIPS 2026

How to Test a Thinking Machine

TBSP: Measuring LLM Self-Preservation Bias

Journal of Metaverse » Submission » Governing Artificial Intelligence in the Evolving Academic Publishing Ecosystem

TRACE-Bot: Protecting Healthcare Platforms from AI-Driven Social Bots

@rosstaylor90: 🌶️ One more spicy take while I am jet lagged and less inhibited than usual: We expect agents to be ...

"Cognitive surrender" leads AI users to abandon logical thinking, research finds

🗞️ Daily ArXiv CS Digest — April 02, 2026#ArXiv #AI #ml #dl #cv #NLP #rl #llm #research

@Miles_Brundage reposted: Today, I'm releasing the first eval meant to test whether frontier models will h...

@omarsar0: Can an AI agent run a startup for a year without going bankrupt? Turns out most can't. New benchma...

@mmitchell_ai: Child safety is an area where we deeply need ML tools to work well, and it's the area where we know ...

@_akhaliq: ClawKeeper Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watcher...

VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

ViGoR-Bench: Evaluating Reasoning in Visual Models

Anthropic code leak exposes Claude AI internals after release error

San Diego startup pitches fix after AI agent exposed Meta user data

Paper page - MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?

PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

@omarsar0: NEW paper from Google DeepMind The biggest threat to AI agents isn't a smarter attacker. It's the w...

@CharlesVardeman reposted: Excited about our new paper: AI Agent Traps AI agents inherit every vulnerabil...

SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks (Mar 2026)

Paper page - BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation