Prompt injection, jailbreaks, model extraction, and red-teaming techniques against agents

Attacks and Red Teaming for AI Agents

The Evolving Landscape of Prompt Injection, Jailbreaks, and Model Extraction Attacks Against AI Agents

As AI systems, particularly large language models (LLMs) and multimodal agents, become increasingly integrated into critical societal infrastructure and personal applications, adversaries have developed sophisticated techniques to exploit their vulnerabilities. This article explores the core attack methodologies—prompt injection, jailbreaks, model distillation, and subtle bypasses—and examines red-teaming practices and safety failures that aim to defend against these threats.

Concrete Attack Techniques on LLMs and Agents

Prompt Injection and Jailbreaks

Prompt injection involves inserting malicious or carefully crafted inputs that manipulate the model's behavior, often bypassing safety filters or leading the AI to produce unintended outputs. For example, attackers craft prompts that override system instructions, effectively "jailbreaking" the model to reveal sensitive information or perform unsafe actions.

A notable example is the Claude Opus 4.6 jailbreak, which demonstrated how prompt engineering can circumvent built-in safety measures within advanced models. Such exploits exploit the model's sensitivity to context and the fact that many models process inputs without robust validation.

Jailbreak techniques often involve multi-step prompt chaining or embedding harmful instructions within seemingly innocuous interactions, making detection challenging. Attackers leverage the models’ reliance on context and their probabilistic nature to slip malicious prompts past filters.

Subtle Bypass Methods and Distillation Attacks

Beyond overt prompt injections, attackers employ subtle bypasses that exploit the model's training data, architecture, or interpretability weaknesses. These include:

Model distillation attacks, where adversaries extract knowledge from a deployed model by querying it extensively, effectively creating a surrogate that can be analyzed to uncover proprietary information or vulnerabilities.
Prompt injection via external tools or code snippets, where attackers embed malicious code in inputs that the model then executes or exposes.

Attack Surface Expansion: Automated and Swarm Attacks

Recent efforts, such as Scale 23x, have scaled attack simulations exponentially, revealing vulnerabilities like prompt injections, backdoors, and adversarial trigger phrases. These large-scale red-teaming efforts expose weak points in current defenses, emphasizing the need for layered, adaptive security measures.

Defensive Red-Team Practices and Safety Failures

Adversarial Testing and Formal Verification

To mitigate these threats, security teams conduct adversarial testing using tools like DREAM, which simulate malicious prompts and behaviors to identify systemic weaknesses before deployment. Formal verification initiatives, such as TorchLean, have made strides in providing mathematically grounded safety guarantees by formalizing neural networks within proof assistants, addressing the opacity and unpredictability that enable prompt bypasses.

Real-Time Monitoring and Anomaly Detection

Platforms like MUSE integrate real-time safety monitoring, performance tracking, and anomaly detection. These systems can flag unusual model outputs or internal activations indicative of prompt injections or malicious manipulations, allowing swift intervention.

Layered Defense Strategies

Behavioral classifiers analyze internal neural activations to detect malicious prompts during operation.
Ontology firewalls deployed rapidly in response to exploits, such as the Claude Opus jailbreak, exemplify dynamic mitigation strategies.
Norm monitoring tools like GHOSTCREW facilitate norm drift detection, ensuring emergent agent behaviors do not diverge into unsafe or unintended patterns.

Limitations and Failures

Despite advances, systemic safety failures occur when attackers succeed in bypassing filters or exploiting emergent behaviors. For instance, self-organizing agent societies—where agents develop their own languages and norms—can unexpectedly diverge from intended safety protocols, as exemplified by incidents where norm divergence led to safety collapse.

Emerging Threats and Ongoing Challenges

Model Extraction and Knowledge Distillation

LLM distillation attacks threaten proprietary models by extracting valuable knowledge through query-based methods. These attacks undermine intellectual property and can be used to craft more effective jailbreaks or adversarial prompts.

On-Device and Embedded Attack Vectors

With frameworks like OpenJarvis and Perplexity’s Personal Computer, AI agents operate directly on user devices, creating new attack vectors. Malicious actors can exploit local access to files and memory, increasing the difficulty of detection and mitigation.

Multi-Agent and Swarm Vulnerabilities

Research into multi-agent reinforcement learning (MARL) and swarm intelligence reveals that coordinated agent behaviors can be manipulated or hijacked, especially when emergent norms or shared languages are involved. These collective systems, while robust and scalable, introduce systemic risks if compromised.

The Road Ahead: Balancing Innovation and Security

The rapid evolution of attack techniques underscores the importance of comprehensive red-teaming, layered defenses, and formal verification to safeguard AI systems. As adversaries develop more subtle and scalable methods, defenders must adapt through:

Proactive adversarial testing to uncover vulnerabilities early.
Dynamic, real-time monitoring for prompt detection of malicious inputs.
Rigorous norm and behavior regulation in multi-agent systems to prevent emergent safety failures.
On-device security protocols that preserve privacy while resisting exploitation.

In conclusion, the ongoing arms race between attack methodologies and defensive strategies highlights the critical need for continuous innovation in AI safety, security, and robustness. Only through layered, adaptive, and formally grounded defenses can we ensure that AI agents operate safely and ethically amid increasingly sophisticated threats.

Sources (13)

Updated Mar 16, 2026

AI Red Teaming Hub

Prompt injection, jailbreaks, model extraction, and red-teaming techniques against agents

The Evolving Landscape of Prompt Injection, Jailbreaks, and Model Extraction Attacks Against AI Agents

Concrete Attack Techniques on LLMs and Agents

Prompt Injection and Jailbreaks

Subtle Bypass Methods and Distillation Attacks

Attack Surface Expansion: Automated and Swarm Attacks

Defensive Red-Team Practices and Safety Failures

Adversarial Testing and Formal Verification

Real-Time Monitoring and Anomaly Detection

Layered Defense Strategies

Limitations and Failures

Emerging Threats and Ongoing Challenges

Model Extraction and Knowledge Distillation

On-Device and Embedded Attack Vectors

Multi-Agent and Swarm Vulnerabilities

The Road Ahead: Balancing Innovation and Security

Top AI Security Concerns | Episode 43

Agentic attack chains advance as infostealers flood criminal markets

Cloudflare’s AI Security Tools Are Now Generally Available — Here’s What That Means for Your Apps

Q&A - AI's Journey Through Zero-Days And A Thousand Bugs

@Miles_Brundage reposted: 1/n Today we're releasing the first public draft of the Security Level 5 (SL5) s...

The Future of AI Security: Detecting Risks, Jailbreaks, and Vulnerabilities

Test Your AI Agents Like a Hacker - Automated Prompt Injection Attacks

Black Hat USA 2025 | Advanced Bypass Techniques and a Novel Detection Approach

LLM Distillation Attacks — The New AI Extraction Economy | by Adnan Masood, PhD. | Mar, 2026 | Medium

Scale 23x - Red Teaming the Robot: Practical Open Source Security for LLMs by Karol Piekarski

Week in Review: Safety Backfires, Scrapping AGI & Agents Fight Back — Week of Mar 2–6, 2026

Prompt Injection Attacks: Risks for Chatbots and How to Prevent Them | XHack Blog

Building Tougher AI Agents: Why Smart Restraint Beats a Straightjacket | by Aditya Inamdar | Mar, 2026 | Medium