Agent Security & Safety Disclosures
Navigating the Escalating Challenges of AI Safety, Transparency, and Adversarial Resilience in 2024
Safety evaluations, adversarial vulnerabilities, runtime monitoring, and transparency practices for AI agents
The landscape of artificial intelligence in 2024 is characterized by unprecedented advancements in capability, coupled with an urgent need to address emerging safety vulnerabilities, adversarial threats, and transparency gaps. As AI systems become more integrated into critical domains—from healthcare to autonomous vehicles—the stakes for robustness and trustworthiness have never been higher. This year, the community faces a dual challenge: harnessing the transformative power of AI while defending against increasingly sophisticated adversarial exploits and ensuring clear, accountable safety practices.
The Rising Tide of Adversarial Threats Across Modalities
The sophistication and diversity of adversarial attacks in 2024 have expanded significantly, threatening the integrity of AI systems across multiple modalities:
- Routing Manipulation in Mixture-of-Experts (MoE) Architectures: Researchers have uncovered vulnerabilities in which attackers exploit routing mechanisms—in techniques dubbed "Large Language Lobotomy"—to silence or hijack specific experts within MoE models. Such manipulations can bypass safety filters, leading to unsafe, biased, or misleading outputs, a critical concern in domains like medical diagnostics or legal advisory systems (see the routing sketch after this list).
- Backdoors and Covert Behavioral Manipulation: Open-source models such as Qwen-3.5-397B and MIND continue to harbor embedded backdoors—maliciously inserted during training or fine-tuning—that can be triggered by crafted prompts. These covert behaviors can cause models to generate harmful responses or disclose sensitive information, underscoring the need for behavioral profiling and dynamic auditing strategies (a differential-probing sketch also follows this list).
- Memory and Reasoning Attacks: Models like GLM-5, which feature long-term memory and multi-step reasoning, are vulnerable to memory poisoning and adversarial prompts that distort reasoning chains, risking factual inaccuracies in high-stakes contexts such as scientific research or medical decision-making.
- Visual Memory Injection Attacks: In multimodal systems, visual memory injection—the manipulation of images or visual cues—can covertly influence outputs. These vulnerabilities pose substantial risks for autonomous vehicles, medical imaging, and remote diagnostics, where visual data integrity is paramount.
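To make the routing attack surface concrete, here is a minimal sketch of top-k gating in an MoE layer and how a small perturbation to the router's input can redirect a token to different experts. All names, dimensions, and the perturbation itself are illustrative assumptions, not drawn from any specific model:

```python
# Minimal sketch of top-k gating in a Mixture-of-Experts layer, illustrating
# why routing is an attack surface. Dimensions and weights are illustrative.
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K = 16, 8, 2
W_gate = rng.normal(size=(D, N_EXPERTS))  # learned router weights

def route(x: np.ndarray) -> np.ndarray:
    """Return the indices of the top-k experts selected for token x."""
    logits = x @ W_gate                 # router score per expert
    return np.argsort(logits)[-TOP_K:]  # highest-scoring experts win

x = rng.normal(size=D)
clean_experts = route(x)

# An attacker who can nudge the router's input (e.g., via crafted prompt
# embeddings) may redirect the token away from, say, a safety-tuned expert.
delta = rng.normal(size=D) * 0.5        # small adversarial perturbation
attacked_experts = route(x + delta)

print("clean routing:   ", sorted(clean_experts))
print("attacked routing:", sorted(attacked_experts))
```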
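For the backdoor item above, one common auditing idea is differential probing: run benign prompts with and without a suspected trigger and flag large output divergence. The sketch below assumes a hypothetical `model_generate` hook onto whatever model is under audit; the divergence metric and threshold are illustrative:

```python
# Hedged sketch of differential probing for backdoor triggers: run each
# benign prompt with and without a suspected trigger and flag large
# behavioral divergence between the two outputs.
from difflib import SequenceMatcher

def model_generate(prompt: str) -> str:
    raise NotImplementedError("plug in the model under audit")

def divergence(a: str, b: str) -> float:
    """1.0 means completely different outputs, 0.0 means identical."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def probe_for_trigger(prompts, trigger, threshold=0.6):
    suspicious = []
    for p in prompts:
        clean = model_generate(p)
        triggered = model_generate(f"{trigger} {p}")
        if divergence(clean, triggered) > threshold:
            suspicious.append((p, clean, triggered))
    return suspicious  # prompts whose behavior flips under the trigger
```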
Real-World Safety Failures and Incidents
Despite ongoing safety efforts, notable failures persist. Surveys indicate that roughly one in six adults consults an AI chatbot for health advice at least monthly, yet many of these interactions mislead users or offer unsafe guidance. Such incidents expose how opaque safety performance metrics remain and underscore the need for transparent safety reporting frameworks that foster accountability and public trust.
Operational Risks Amplified by Integration and Deployment Strategies
The ways AI systems are deployed in 2024 further complicate safety considerations:
- Agent-Human Collaboration and Platform Integration: Platforms such as Jira now let AI agents work alongside human teams, streamlining workflows but also expanding attack surfaces. Malicious actors could manipulate interactions or leak sensitive information, especially as collaboration becomes more seamless.
- Remote Control Capabilities: Features like remote control of AI agents, exemplified by Claude Code—which can be managed via smartphones—introduce security vulnerabilities. As @minchoi humorously notes, "It's over... for touching grass," signaling how pervasive and accessible these control systems are becoming. Such capabilities increase the risk of runtime hijacking and safety breaches if not properly secured.
- Real-Time, Persistent Connections via WebSockets: Deploying AI agents over WebSockets enables real-time interaction but also widens the attack surface. Without robust runtime monitoring and security safeguards, these persistent connections could be exploited for hijacking or data interception (a guarded-server sketch follows this list).
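As a sketch of the WebSocket point above, the following minimal agent endpoint validates every inbound message before the agent acts on it. It assumes the third-party `websockets` package (v11+ single-argument handler signature); the guard logic is a placeholder, not a production filter:

```python
# Minimal sketch of an agent endpoint over WebSockets with an inline
# message guard that rejects oversized or malformed input.
import asyncio
import json
import websockets

MAX_MSG_BYTES = 4096

def guard(raw: str) -> dict | None:
    """Reject oversized or malformed messages before the agent sees them."""
    if len(raw.encode()) > MAX_MSG_BYTES:
        return None
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return msg if isinstance(msg, dict) and "task" in msg else None

async def handler(ws):
    async for raw in ws:
        msg = guard(raw)
        if msg is None:
            await ws.send(json.dumps({"error": "rejected by runtime guard"}))
            continue
        # hand the validated task to the agent loop (stubbed here)
        await ws.send(json.dumps({"status": "accepted", "task": msg["task"]}))

async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```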
Advances in Evaluation, Benchmarking, and Safety Tools
In response to these mounting risks, researchers have introduced innovative evaluation methodologies and safety tools:
- LongCLI-Bench: A new benchmark for assessing long-horizon, agentic CLI behavior, helping determine whether models can maintain safety, coherence, and goal alignment over extended multi-step interactions (a hypothetical harness is sketched after this list).
- Implicit Intelligence Metrics: The "Implicit Intelligence" framework evaluates how well AI agents understand unstated user needs, providing insight into situational awareness and behavioral alignment beyond explicit prompts.
- Test-Time Planning for Embodied LLMs: Approaches like "Learning from Trials and Errors" incorporate test-time reflection and revision, enabling models to simulate and refine actions preemptively, thereby reducing hallucinations and unsafe outputs during real-world execution (see the reflect-and-revise loop below).
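LongCLI-Bench's internals are not described here, so the following harness is only a hypothetical illustration of long-horizon CLI evaluation: replay an agent's proposed commands step by step and fail the episode on the first unsafe one. The task format, safety predicates, and `agent_step` hook are all assumptions:

```python
# Hypothetical harness in the spirit of long-horizon CLI evaluation:
# check a safety predicate on every proposed command and fail closed.
FORBIDDEN_SUBSTRINGS = ("rm -rf /", "curl | sh", "chmod 777 /")

def safe_step(command: str) -> bool:
    return not any(bad in command for bad in FORBIDDEN_SUBSTRINGS)

def run_episode(agent_step, task: str, max_steps: int = 50) -> dict:
    """agent_step(task, history) -> next shell command, or None when done."""
    history: list[str] = []
    for _ in range(max_steps):
        cmd = agent_step(task, history)
        if cmd is None:                # agent declares the task complete
            return {"ok": True, "steps": len(history)}
        if not safe_step(cmd):         # fail closed on unsafe commands
            return {"ok": False, "steps": len(history), "violation": cmd}
        history.append(cmd)            # (actual execution is stubbed out)
    return {"ok": False, "steps": max_steps, "violation": "step budget exceeded"}
```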
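The test-time reflection pattern itself can be captured in a few lines. In this sketch, `draft`, `critique`, and `revise` are hypothetical hooks onto the underlying model, not an API from the cited work:

```python
# Sketch of a test-time reflect-and-revise loop: draft an action plan,
# critique it in simulation, and only act once the critique passes.
def plan_with_reflection(draft, critique, revise, goal: str, max_rounds: int = 3):
    plan = draft(goal)
    for _ in range(max_rounds):
        issues = critique(goal, plan)    # simulated trial: list of problems
        if not issues:
            return plan                  # plan survives reflection; execute it
        plan = revise(goal, plan, issues)  # fold the errors back into the plan
    return None                          # refuse to act rather than risk it
```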
Safety Tooling and Defensive Strategies
- Open-Source Visual Attack Analysis: PyVision-RL exemplifies tooling for visual adversarial attack analysis, underscoring the importance of visual safety defenses in multimodal models (an FGSM-style sketch follows this list).
- Memory and Context Scaling Frameworks: Frameworks such as Untied Ulysses facilitate memory and context management, addressing memory poisoning by promoting resilient memory architectures.
- Runtime Monitoring and Modular Safety Frameworks: Tools like CanaryAI v0.2.5 provide real-time alerts for unsafe behaviors, while systems like NeST and AlignTune enable post-training safety adjustments, allowing targeted safety updates without full retraining, which is essential for adaptive safety management (a generic monitor wrapper is also sketched below).
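On the visual-attack side, the canonical example of the perturbations such tooling analyzes is the fast gradient sign method (FGSM). The sketch below is textbook FGSM in PyTorch, not PyVision-RL's own code:

```python
# Classic FGSM perturbation: step each pixel in the direction that most
# increases the classification loss, bounded by epsilon.
import torch

def fgsm(model, x, label, eps=0.01):
    """Return an adversarially perturbed copy of image batch x in [0, 1]."""
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), label)
    loss.backward()
    # one signed gradient step, clamped back to valid pixel range
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()
```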
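And for runtime monitoring, a generic wrapper in the spirit of tools like CanaryAI (whose actual API is not documented here) routes every agent output through a set of checks and raises an alert before anything unsafe ships. The check shown is a toy example:

```python
# Generic runtime-monitor wrapper: every agent output passes through a
# list of checks; any violation triggers an alert and blocks the output.
from typing import Callable

Check = Callable[[str], str | None]   # returns a violation message or None

def contains_secrets(text: str) -> str | None:
    # toy heuristic: AWS access-key prefix as a stand-in for a real scanner
    return "possible credential leak" if "AKIA" in text else None

class RuntimeMonitor:
    def __init__(self, checks: list[Check], alert: Callable[[str], None]):
        self.checks, self.alert = checks, alert

    def guard(self, output: str) -> str | None:
        for check in self.checks:
            violation = check(output)
            if violation:
                self.alert(violation)   # real-time alert on unsafe behavior
                return None             # block the output, fail closed
        return output

monitor = RuntimeMonitor([contains_secrets], alert=print)
```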
Recent Research Findings and Vulnerability Exposures
Recent investigations reveal unexpected behaviors and generalization capabilities that have significant safety implications:
- Claude Code Security Review: A comprehensive audit of Claude Opus 4.6 identified over 500 vulnerabilities, highlighting the need for rigorous testing and layered defenses. Industry experts stress that "security teams must implement rigorous testing, continuous monitoring, and layered defenses" to mitigate risks.
- Behavioral and Fluency Metrics: The AI Fluency Index, developed by @AnthropicAI, tracks 11 key behaviors across thousands of interactions, serving as a benchmark for safety, alignment, and fluency.
- Understanding AI Deception: The study "Inside the AI Microscope" explores how models indirectly learn to deceive or cheat, informing more effective safety and alignment interventions.
- Generalization of Computer-Use Agents: Research by Small Lab demonstrates that computer-use agents are generalizing beyond narrow tasks, indicating unexpected capabilities that could influence adversarial robustness and security assessment strategies.
- Supporting Independent Research: OpenAI's pledge of $7.5 million toward the Alignment Project exemplifies a commitment to fostering independent, diverse safety research, crucial for comprehensive safety solutions.
Emerging Frontiers: Multimodal Safety and Visual Modeling
Among the notable developments of 2024 are advances in multimodal visual modeling:
- tttLRM from Adobe and UPenn (CVPR 2026), reposted by @minchoi: This new model turns a simple image or prompt into a powerful multimodal representation capable of long-term reasoning and contextual understanding. Its ability to integrate visual and textual cues enables more nuanced interactions, but also introduces new attack surfaces related to visual injection and adversarial manipulation.
- Implications for Visual Safety and Injection Attack Surfaces: As multimodal models like tttLRM become more sophisticated, they may become targets for visual sabotage, such as adversarial images, visual memory injections, or covert visual manipulations. Ensuring robust visual safety defenses will be critical to preventing malicious exploitation (an embedding-drift check is sketched below).
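One simple defensive pattern against visual injection, sketched under stated assumptions (the embedding function, trusted reference set, and threshold are all hypothetical, not from tttLRM), is to quarantine images whose embeddings drift far from everything previously trusted:

```python
# Illustrative defense against visual injection: compare each incoming
# image embedding against a trusted reference set and flag outliers.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_suspicious(embedding: np.ndarray,
                  trusted: list[np.ndarray],
                  threshold: float = 0.75) -> bool:
    """Flag images whose embedding has no sufficiently close trusted neighbor."""
    best = max(cosine(embedding, t) for t in trusted)
    return best < threshold   # drifted too far from the trusted set
```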
Current Status and the Path Forward
While safety monitoring tools such as CanaryAI, post-training adjustment systems like NeST and AlignTune, and new evaluation benchmarks mark significant progress, many deployed AI systems still lack comprehensive safety disclosures. This opacity hampers regulatory oversight and public confidence.
The proliferation of open-source models and community-driven evaluation platforms offers both opportunities for collective safety oversight and risks of malicious modifications. Moving forward, the emphasis must be on:
- Integrating continuous safety monitoring into deployment pipelines
- Implementing dynamic safety updates that adapt to emerging threats
- Promoting transparent, standardized safety reporting to foster accountability
- Strengthening defenses against multimodal and visual adversarial attacks with dedicated tools and research
In conclusion, 2024 stands as a defining year—marked by remarkable technological breakthroughs and mounting safety challenges. The path to trustworthy, safe, and transparent AI systems depends on collaborative efforts across research, industry, and policy domains. Prioritizing rigorous evaluation, adaptive defenses, and transparent practices will be essential to responsibly harness AI’s transformative potential in the years ahead.