Security incidents, attacks, and failures in agentic and LLM systems

Security, Attacks & Safety Failures

Escalating Security Incidents and Strategic Risks in Agentic and Large Language Model Systems: Recent Developments and Implications

The integration of artificial intelligence (AI), especially large language models (LLMs) and agentic systems, into critical sectors—such as infrastructure, defense, healthcare, and finance—has transformed operational capabilities. However, this rapid deployment has exposed an alarming rise in security vulnerabilities, operational failures, and malicious exploits that threaten the safety, privacy, and strategic stability of these powerful systems. Recent developments underscore the urgency of advancing multi-layered safety and security measures to keep pace with evolving threats.

Continued Escalation in Incidents and Failure Modes

Technical vulnerabilities are becoming more sophisticated and widespread, revealing systemic fragility:

Data Leakage and Model Extraction Attacks: Attackers are employing advanced techniques like model extraction and knowledge distillation to steal proprietary data embedded within models. For example, recent reports have documented successful extraction of sensitive healthcare and financial datasets. Such breaches not only compromise individual privacy but also enable adversaries to reconstruct training data, facilitating targeted attacks and further exploitation.
Prompt Injection and Hallucinations: Exploiting prompt injection vulnerabilities remains a significant concern. Malicious actors craft prompts that manipulate models into generating biased, false, or sensitive information—sometimes with serious real-world consequences. Notably, hallucinations—where models produce fabricated or misleading responses—persist in high-stakes domains. For instance, medical diagnostic models have fabricated diagnoses or treatment plans, risking patient safety and undermining trust.
Operational Failures and Data Breaches: Deployment of complex AI systems has occasionally led to critical failures. Microsoft’s Copilot, for instance, experienced a bug that inadvertently summarized confidential emails, exposing sensitive information to unintended recipients. Such incidents reveal vulnerabilities in safeguarding mechanisms, especially when models handle sensitive operational data. Furthermore, multimodal language models (MLLMs) face latent reasoning failures, such as latent-token reasoning failures, impairing their ability to perform reliably in safety-critical contexts like autonomous navigation or medical decision-making.

Systemic safety and reasoning limitations also persist:

Despite their impressive capabilities, many models struggle with complex reasoning and context understanding. Failures like latent-token reasoning errors highlight the models’ difficulty in handling nuanced or multi-step tasks, elevating risks in applications demanding high precision and safety.

Geopolitical and Regulatory Responses

The geopolitical landscape reflects growing concerns over AI security, prompting regulatory actions and strategic partnerships:

Legal and Procurement Disputes: The U.S. government, under directives from President Trump, recently blacklisted the AI startup Anthropic from federal contracts, citing safety concerns or political considerations. Anthropic has challenged this move legally, raising questions about how safety standards influence access to government AI contracts. These disputes exemplify how national security and economic interests intersect with AI safety policies.
Defense Sector Integration: Defense agencies are rapidly integrating AI into operational frameworks. OpenAI announced a groundbreaking deal to embed its models into the U.S. Department of Defense’s classified networks, aiming to enhance strategic capabilities. While promising, such integration raises risks related to insider threats, access controls, and the security of highly sensitive information.
International Efforts and Standards: Globally, efforts are underway to enhance AI safety and prevent misuse:
- Export Controls and Safety Standards: Countries are implementing tighter export controls and safety protocols to prevent malicious or unintended proliferation.
- Shared Transparency and Accountability: Initiatives such as Article 12 logging infrastructure—crucial for compliance with the EU AI Act—are being launched to promote transparency and accountability, ensuring organizations meet safety standards.

Advances in Evaluation, Safety, and Defense Strategies

In response to rising risks, the AI research community is developing sophisticated tools and frameworks:

Evaluation Platforms and Benchmarks:
- Contamination-Resistant Benchmarks: Recognizing that many datasets share overlaps with training data, new protocols are emerging to ensure assessments genuinely reflect model capabilities without contamination.
- MobilityBench: An innovative platform evaluates route-planning agents in dynamic, real-world scenarios, crucial for autonomous vehicle safety validation under realistic conditions.
- Skill-Inject: Recently introduced as a security benchmark for agentic systems, Skill-Inject measures an agent’s vulnerability to payload or skill injection attacks—manipulations that can alter behavior, leak sensitive data, or compromise safety. An accompanying video, "Skill-Inject: New LLM Agent Security Benchmark", offers insights into attack vectors and mitigation strategies, emphasizing the importance of robust, secure agent design.
Behavioral Steering and Formal Verification:
- Compositional Steering Techniques: These enable behavioral adjustments without retraining, allowing operators to dynamically steer models away from unsafe behaviors.
- Safety Constraints and Formal Methods: Frameworks like CodeLeash embed safety constraints directly into models—preventing misinformation or manipulation. Formal verification approaches, such as TLA+ and TorchLean, are employed to mathematically prove safety properties before deployment, especially critical in high-stakes environments like healthcare or autonomous vehicles.
Runtime Monitoring and Watermarking:
- Real-Time Monitoring: Tools like Cekura facilitate continuous observability, enabling early detection of anomalies, misuse, or malicious behaviors.
- Watermarking and Fingerprinting: These techniques serve to verify model ownership, detect unauthorized reuse, and deter malicious deployment.
Defense Against Exploits: Efforts are underway to strengthen models against prompt injections, extraction attempts, and hallucination mitigation, aiming to maintain trustworthiness under adversarial conditions.

Emerging and Notable Developments

Practical agent onboarding and security lessons are increasingly recognized as essential for safe deployment (N1). The growing field of 'agentic engineering' is emerging as a discipline, focusing on designing agents with inherent security considerations (N2).

Skill brittleness—where agent capabilities degrade or fail unpredictably—is a persistent challenge, leading to a cat-and-mouse dynamic where attackers and defenders constantly adapt (N4). To address this, researchers are exploring unified evaluations of model controllability, such as "How Controllable Are Large Language Models?", which assesses the ability to steer or constrain model behavior effectively.

In addition, NDSS 2025 plans to feature a comparative evaluation of LLMs focused on vulnerability detection, highlighting ongoing efforts to improve security assessment tools (N9).

Ongoing Research Directions

Federated Agent Reinforcement Learning (FARL): While promising for robustness and privacy, FARL introduces risks like poisoning attacks and malicious coordination, emphasizing the need for robust safeguards.
Inter-Head Attention (IHA): This technique enhances reasoning fidelity by enabling cross-head information exchange within models, significantly reducing hallucinations and improving reliability in complex reasoning tasks.
Multi-layered Safety-by-Design: The consensus remains that formal verification, comprehensive testing, runtime monitoring, and regulatory compliance must operate as an integrated security architecture to effectively mitigate escalating risks.

Current Status and Implications

The landscape of agentic and LLM security is increasingly characterized by escalating incidents, regulatory pressures, and technological innovations. High-profile failures—such as data breaches, adversarial exploits, and deployment mishaps—serve as stark reminders that security cannot be an afterthought.

The community's response—through advanced benchmarks, formal safety frameworks, and real-time monitoring tools—demonstrates a clear recognition of these challenges. However, the dynamic nature of threats demands continuous vigilance, cross-sector collaboration, and proactive safety engineering.

Implications for stakeholders include the necessity of adopting safety-by-design principles, ensuring transparency, and building resilience into systems from the ground up. As AI systems become more autonomous and integrated into critical infrastructure, rigorous security measures will be fundamental to safeguarding societal trust and strategic stability.

Conclusion

The proliferation of agentic and large language models has unlocked extraordinary capabilities but has concurrently exposed a complex web of vulnerabilities. Recent incidents and strategic disputes underscore that security is an ongoing, evolving challenge—one that requires multi-layered defenses, formal guarantees, and transparent evaluation.

The AI community’s innovations—such as Skill-Inject, Cekura, and formal verification tools—represent vital steps toward resilient, trustworthy systems. Yet, the escalating threat environment underscores the imperative for continued research, collaborative standards, and safety-first development practices.

Only through holistic, proactive security strategies can society responsibly harness AI’s transformative potential while mitigating the strategic and operational risks inherent in increasingly autonomous systems—ensuring they serve humanity reliably and safely in the years ahead.

Sources (45)

Updated Mar 4, 2026

Security incidents, attacks, and failures in agentic and LLM systems

Escalating Security Incidents and Strategic Risks in Agentic and Large Language Model Systems: Recent Developments and Implications

Continued Escalation in Incidents and Failure Modes

Geopolitical and Regulatory Responses

Advances in Evaluation, Safety, and Defense Strategies

Emerging and Notable Developments

Ongoing Research Directions

Current Status and Implications

Conclusion

@DataScienceHarp reposted: Not onboarding your agent is on you. @richmondalake, Director of AI Developer E...

The Man Who Coined 'Vibe Coding' Says The Next Big Thing Is 'Agentic Engineering'

@svpino: Skills in Claude Code right now are a cat-and-mouse game. Today, they work. Tomorrow, they fail. T...

How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities

NDSS 2025 – A Comparative Evaluation Of Large Language Models In Vulnerability Detection

Show HN: Open-Source Article 12 Logging Infrastructure for the EU AI Act

India's top court angry after junior judge cites fake AI-generated orders

Launch HN: Cekura (YC F24) – Testing and monitoring for voice and chat AI agents

@jaseweston: Continual learning in production FTW (with humans-in-the-loop) – a detailed report on methods to it...

TorchLean: Formalizing Neural Networks in Lean

[PDF] Shall We Play a Game? Language Models for Open-ended Wargames

CoVe: Training Interactive Tool-Use Agents via Constraint-Guided Verification

hack::soho | Safety-Neuron-Based Attacks on LLMs | Stjepan Picek

@weaviate_io: 𝗠𝗖𝗣 𝗼𝗿 𝗔𝗴𝗲𝗻𝘁 𝗦𝗸𝗶𝗹𝗹𝘀? Here's the difference: 𝗠𝗖𝗣 (𝗠𝗼𝗱𝗲𝗹 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗣𝗿𝗼𝘁𝗼𝗰𝗼𝗹) connects agents to extern...

Off-the-Shelf Large Language Models Are Unreliable Judges – Jonathan Choi (USC / WashU)

CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

Skill-Inject: New LLM Agent Security Benchmark

[PDF] FEDERATED AGENT REINFORCEMENT LEARNING

IHA: Enhancing LLM Reasoning via Cross-Head Mixing

These federal agencies may have a Claude problem now

Anthropic to take Trump's Pentagon to court over AI dispute

OpenAI reaches deal to deploy AI models on U.S. Department of War classified network

Study: MLLM Latent Tokens Fail to Reason

Show HN: CodeLeash: framework for quality agent development, NOT an orchestrator

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns

From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models

Spilled Energy: Training-Free LLM Error Detection

[PDF] ATGEN: ADVERSARIAL REINFORCEMENT LEARNING

AIs can't stop recommending nuclear strikes in war game simulations

Paper page - PyVision-RL: Forging Open Agentic Vision Models via RL

@omarsar0: This new paper on agent failure makes an interesting claim. This is particularly important for long...

How to Manage Misinformation in Large Language Models

@srush_nlp: This has been really fun to use. Also interesting to see people exploring tools for verifying agent ...

DREAM: Deep Research Evaluation with Agentic Metrics

Implicit Intelligence -- Evaluating Agents on What Users Don't Say

@omarsar0 reposted: Be careful what you put in your AGENTS dot md files. This new research evaluate...

Judge Reliability Harness | RAND

Chinese companies distilled Claude to improve own models, Anthropic says | Reuters

Detecting and Preventing Distillation Attacks

Researchers Demonstrate New Internal Steering Technique for LLMs

Siteline

Enterprises are racing to secure agentic AI deployments