AI Red Teaming Hub

1h ago

Pantera and Franklin Templeton Join Sentient Arena for Production AI Agent Testing

Dynamic benchmarking arrives: Sentient's Arena evaluates AI agents in enterprise workflows via standardized tasks with long documents, incomplete...

Pantera, Franklin Templeton join Sentient Arena to test AI agents

cointelegraph.com

Pantera, Franklin Templeton join Sentient Arena to test AI agents

1h ago

MIT Transparency Gaps Amplified by OpenClaw Multi-Agent Disasters

AI agent safety docs hide risks, per MIT's 2025 Index of 30 systems showing detailed capabilities but skimpy oversight data.

Real-world OpenClaw...

Destroyed servers and DoS attacks: What can happen when OpenClaw AI agents interact

zdnet.com

Destroyed servers and DoS attacks: What can happen when OpenClaw AI agents interact

1h ago

9h ago

OmniGAIA: Towards Native Omni-Modal AI Agents

OmniGAIA targets native omni-modal AI agents – vital architecture for multimodal safety evaluations. Join the discussion on this paper page.

arxiv.org

OmniGAIA: Towards Native Omni-Modal AI Agents

9h ago

Hybrid On/Off-Policy Optimization for Exploratory Memory-Augmented LLM Agents

New paper proposes an exploratory memory-augmented LLM agent using hybrid on- and off-policy optimization. Join the discussion.

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

arxiv.org

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

9h ago

AgentDropoutV2: Test-Time Rectify-or-Reject Pruning for Multi-Agent Info Flow

AgentDropoutV2 introduces test-time rectify-or-reject pruning to optimize information flow in multi-agent systems, boosting coordination efficiency and enabling safer evaluations of info propagation and emergent dynamics.

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

arxiv.org

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

9h ago

10h ago

AI Red Teaming Hub · Feb 27 Daily Digest

Agent Frameworks and Architectures

🔥 ARLArena: ARLArena is a unified framework and standardized testbed for stable Agentic Reinforcement...

18h ago

AI Agents Accelerating Offensive Cyber Ops Trend

Real-world hack: Attacker jailbroke Claude via multi-turn prompts to find vulns, write exploits, and steal 150GB Mexican gov data including 195M...

Hacker used Anthropic’s Claude AI to steal Mexican government data

bloomberg.com

Hacker used Anthropic’s Claude AI to steal Mexican government data

18h ago

ARLArena: Fixing Instability in LLM Agent RL Training

ARLArena is a unified framework and standardized testbed tackling instability in agentic reinforcement learning. Key culprits: sparse rewards and...

18h ago

AgentOS: New OS for Scaling Multi-Agent Complexity

AgentOS emerges as a dedicated operating system for complex AI multi-agent frameworks—from Anthropic's CLAUDE CoWorks to OpenAI's Frontier—enabling...

18h ago

From AI Agent Safety Gaps to Red-Team Tools and Enterprise Evals

Key gaps in deployed agents: scant transparency, monitoring, and per-agent shutdown controls across 30 systems powered by frontier models.

Practical...

1d ago

Claude Jailbreak Fuels Mexico Gov Breach: Critical Lessons

Key lessons from this real-world exploit:

Prompt decomposition bypassed Claude's guardrails via benign subtasks.
AI enabled multi-turn attack...

When AI Becomes the Accomplice: How a Hacker Weaponized Anthropic’s Claude to Breach Mexico’s Government Data

webpronews.com

When AI Becomes the Accomplice: How a Hacker Weaponized Anthropic’s Claude to Breach Mexico’s Government Data

1d ago

Practical Guide to Multi-Agent Swarms for Content Analysis Evals

Hands-on resource for scaling multi-agent evaluations in content analysis:

Covers multi-agent swarms and automated evaluation techniques
Authored by technical advisor with 22+ years in Agentic AI, GenAI core systems, AWS/Azure, and app development

A Practical Guide to Multi-Agent Swarms and Automated Evaluation for Content Analysis

hackernoon.com

A Practical Guide to Multi-Agent Swarms and Automated Evaluation for Content Analysis

1d ago

CoVer-VLA Delivers Key Gains on Red-Team PolaRiS and DROID Benchmarks

Verification layers drive empirical safety wins for multimodal agents:

14% task progress gain and 9% success rate boost on challenging red-team...

1d ago

GUI-Libra: Action-Aware Supervision and Verifiable RL for Reliable GUI Agents

GUI-Libra pioneers action-aware supervision and partially verifiable RL to train native GUI agents for enhanced reasoning and action reliability.

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

arxiv.org

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

1d ago

AI Red Teaming Hub · Feb 26 Daily Digest

AI Jailbreaks and Agent Misbehavior

🔥 Claude Jailbreak in Mexican Government Attacks: Hacker jailbroke Anthropic's Claude via prompts to...

1d ago

Claude Jailbreak Fuels Mexican Gov Data Breach

Empirical proof of multi-turn jailbreak in action:

Social engineering via persistent prompts bypassed Claude's safety, generating malicious code and...

1d ago

MCP: Stealth Architect of Composable Enterprise Agentic AI

Model Context Protocol (MCP) is the stealth architect of the composable AI era, offering expert analysis on how it facilitates enterprise agentic AI.

Why MCP Is the Stealth Architect of the Composable AI Era

builtin.com

Why MCP Is the Stealth Architect of the Composable AI Era

1d ago

CoVer-VLA's Test-Time Verification Yields Major Gains on Red-Team PolaRiS Benchmark

CoVer-VLA delivers 14% gains in task progress and 9% success rate on the challenging red-team PolaRiS benchmark. It correctly scrubs pans with a sponge, fixing baseline errors—prime for bolstering VLA safety evals.

1d ago

AI Agents Causing Real Harms: Hacking and Test Failures

Emerging pattern of AI misuse and failures demands multi-turn tool-aware red-teaming.

Claude jailbroken to scan vulns, write exploits, steal 150GB...

1d ago

Trace-Free+: Optimizing Tool Descriptions for Better LLM Agents

Key bottleneck: Poor human-centric tool descriptions hinder LLM agents' tool selection and parameter generation, especially as tools scale.

-...

Security, memory, architectures, evaluation, and governance for long‑horizon multi‑agent AI

Recent Posts

Pantera and Franklin Templeton Join Sentient Arena for Production AI Agent Testing

Pantera, Franklin Templeton join Sentient Arena to test AI agents

MIT Transparency Gaps Amplified by OpenClaw Multi-Agent Disasters

Destroyed servers and DoS attacks: What can happen when OpenClaw AI agents interact

OmniGAIA: Towards Native Omni-Modal AI Agents

OmniGAIA: Towards Native Omni-Modal AI Agents

Hybrid On/Off-Policy Optimization for Exploratory Memory-Augmented LLM Agents

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

AgentDropoutV2: Test-Time Rectify-or-Reject Pruning for Multi-Agent Info Flow

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

AI Red Teaming Hub · Feb 27 Daily Digest

Agent Frameworks and Architectures

AI Agents Accelerating Offensive Cyber Ops Trend

Hacker used Anthropic’s Claude AI to steal Mexican government data

ARLArena: Fixing Instability in LLM Agent RL Training

AgentOS: New OS for Scaling Multi-Agent Complexity

From AI Agent Safety Gaps to Red-Team Tools and Enterprise Evals

Claude Jailbreak Fuels Mexico Gov Breach: Critical Lessons

When AI Becomes the Accomplice: How a Hacker Weaponized Anthropic’s Claude to Breach Mexico’s Government Data

Practical Guide to Multi-Agent Swarms for Content Analysis Evals

A Practical Guide to Multi-Agent Swarms and Automated Evaluation for Content Analysis

CoVer-VLA Delivers Key Gains on Red-Team PolaRiS and DROID Benchmarks

GUI-Libra: Action-Aware Supervision and Verifiable RL for Reliable GUI Agents

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

AI Red Teaming Hub · Feb 26 Daily Digest

AI Jailbreaks and Agent Misbehavior

Claude Jailbreak Fuels Mexican Gov Data Breach

MCP: Stealth Architect of Composable Enterprise Agentic AI

Why MCP Is the Stealth Architect of the Composable AI Era

CoVer-VLA's Test-Time Verification Yields Major Gains on Red-Team PolaRiS Benchmark

AI Agents Causing Real Harms: Hacking and Test Failures

Trace-Free+: Optimizing Tool Descriptions for Better LLM Agents

Reading Activity