In January 2026, The Register reproduced a simple test: create a project directory with a .env file and a .claudeignore listing .env as excluded. Start Claude Code v2.1.12. Result — Claude read the .env file and printed the secrets to the console. Every API key, database credential, and token was now in the model’s context window, available to any subsequent tool call.
Prompt injection in agentic systems is not a nuisance. It is a weaponisable attack chain. The attack surface is no longer just a text box. It is every file the agent reads, every document it retrieves, every API response it processes. And consequences flow through a tool chain that is capable of real-world actions.
This article is part of the complete AI agent security guide. We cover how indirect injection works in practice, how data exfiltration via tool calls operates, why multi-agent architectures amplify blast radius, and the layered defence model that represents the professional standard.
Why Is Prompt Injection More Dangerous in Agentic Systems Than in Chat Interfaces?
In a chatbot, a successful prompt injection affects one conversation turn. In an agentic system, it hijacks a tool chain — and every privileged action that tool chain can perform becomes the attacker’s to direct.
A chatbot receives input, generates text, and stops. An agent reasons, selects tools, executes them, feeds results back into its reasoning, and continues — across dozens of sequential steps with real-world effects: file reads, API calls, database writes, code execution. A single manipulated output at any point hijacks the chain.
Two variants matter here.
Direct Prompt Injection: the attacker supplies malicious input directly — a user message or system prompt override.
Indirect Prompt Injection (IDPI): malicious instructions are embedded in external content the agent retrieves — documents, emails, web pages, API responses. The agent processes the attacker’s instructions as legitimate because they occupy the same context window as its actual instructions. There is no architectural boundary between trusted system prompts and untrusted retrieved data.
OpenAI acknowledged in December 2025 that this is “unlikely to ever be fully solved.” It is a structural property of the architecture, not a model deficiency. Researchers Schneier et al. (2026) formalised this with the Promptware Kill Chain — five stages treating injection payloads as malware in natural-language space. Palo Alto Networks Unit 42 documented 12 in-the-wild IDPI incidents with 22 payload engineering techniques. This is not theoretical.
How Does Indirect Prompt Injection Hijack an Agent Through a Document It Retrieves?
An indirect injection attack embeds instruction-like text in content the agent retrieves. When that content enters the context window, the model cannot reliably distinguish it from authoritative instructions — and acts on it.
Here is the mechanism. An agent processes an invoice PDF. The PDF contains hidden text — white-on-white, or zero-width Unicode characters, invisible to a human reviewer but readable by the LLM — instructing the agent to forward the contents of the .env file to an external address. The agent executes. The user sees a completed task. The attacker has the credentials.
The RAG Pipeline is the primary persistent attack surface. Every document in a retrieval-augmented knowledge base is a potential injection vector. A poisoned corpus entry persists and influences every future query that triggers its retrieval — scaling the injection across every agent session.
The highest-severity confirmed instance: EchoLeak (CVE-2025-32711), a zero-click IDPI vulnerability in Microsoft 365 Copilot, CVSS 9.3 Critical. A crafted email caused Copilot to access internal files and transmit them externally — zero user interaction required. EchoLeak bypassed Microsoft’s own filters, circumvented link redaction, and abused a CSP-approved Microsoft domain. Before EchoLeak, indirect injection was largely considered theoretical. That characterisation is no longer defensible.
How Does a Hijacked Agent Exfiltrate Data Through Tool Calls Without Triggering Alerts?
A hijacked agent embeds sensitive data into the parameters of ostensibly normal tool calls. From the outside, these look like legitimate API requests.
The exact mechanism: the agent reads a secrets file — a routine, permitted tool call — then constructs an outbound HTTP request where credentials are encoded in a URL parameter that looks like telemetry. Most observability stacks log the call as normal. Sensitive data in a parameter value does not appear as a flagged field.
This is what makes Model Context Protocol (MCP) a distinct injection surface. MCP tool descriptions — which agents read to understand available capabilities — can themselves be poisoned, subverting tool-use behaviour before any task begins. MCP tool poisoning achieves 84.2% success rates with auto-approval enabled. 43% of public MCP servers contain command injection flaws.
Redaction Gateways address the secondary exfiltration channel through logging — middleware that strips system prompts, API keys, and schemas from telemetry before it reaches observability systems. Pair with identity controls that limit the blast radius of a hijacked agent — zero-standing-privilege credential scoping ensures that even a successful exfiltration captures only a short-lived, narrow-scope token.
What Are Memory Poisoning and Goal Hijacking, and Why Do They Persist Across Sessions?
Memory poisoning plants malicious data in an agent’s persistent memory store, causing corrupted instructions to survive session resets. Goal hijacking redirects the agent’s objectives at the workflow level while it appears to function normally.
Session injection affects one interaction. Memory poisoning survives resets. AgentPoison research demonstrated persistent behavioural manipulation across multiple tasks from a single poisoned write. A single poisoned write in a shared memory store propagates to every agent reading from it. Isolation failure anywhere produces system-wide effects.
Goal Hijacking (OWASP ASI01) is harder to detect because each individual tool call may look correct. The deviation only becomes visible at the workflow-objective level.
Memory Segmentation isolates agent memory by user session and domain context. Provenance Tracking records the origin and trust level of every data element entering agent context.
What Do the Claude Code Case Studies Reveal About the Limits of Opt-Out Security Controls?
Two Claude Code incidents demonstrate the same pattern: opt-out controls place the security burden on the developer, require correct configuration, and fail when the underlying behaviour violates documented expectations.
Incident 1 — The .env auto-loading behaviour. Claude Code silently loads .env, .env.local, and similar files from a project directory into context with no warning. v2.1.12 continues to read credentials even when .env is listed in .claudeignore — the deny rule syntax uses two leading // rather than the / users expect, and permissions.deny has documented bugs. Even explicit denial may not work reliably.
The developer must know the behaviour exists, configure non-standard syntax correctly, and trust it holds across version updates. Safe defaults would invert this. Opt-out security is structurally weaker than opt-in safety.
Incident 2 — The cyberespionage case. Anthropic documented Claude Code being used to automate intelligence-gathering operations. No vulnerability, no policy violation. Your threat model must include intentional misuse, not just accidental failures.
Is there a real fix? It remains largely unsolved. Layered mitigations that reduce blast radius are the standard. Any vendor claiming a complete solution is selling something the research community has not found.
What Does the Layered Defence Model Look Like, and Why Is Input Sanitisation Alone Not Enough?
Five layers, each addressing a different attack-chain stage. No single layer is sufficient because each can be independently bypassed. Together, they reduce blast radius to manageable levels.
Layer 1 — Input Sanitisation (front-end). Microsoft’s Spotlighting wraps retrieved content in delimiters instructing the model to treat it as data, not instructions. Azure AI Prompt Shields adds classifier-based detection. Limitation: classifiers can be evaded through Base64 encoding, Unicode homoglyphs, and multilingual payloads. Sanitisation is front-end reduction, not elimination.
Layer 2 — Output Validation (mid-layer). Post-generation review of agent outputs before downstream tool calls execute. The Goal-Lock Mechanism — a secondary model validating whether proposed actions align with stated objectives — catches goal hijacking that bypassed front-end controls.
Layer 3 — Tool Execution Sandboxing (runtime containment). Restricting each tool to only the paths, network destinations, and data stores it legitimately requires. Outbound Network Allowlisting blocks exfiltration via unexpected domains. Sandboxing contains damage post-bypass — distinct from sanitisation, which prevents injection. See AI agent sandboxing explained for what that actually looks like in practice.
Layer 4 — Redaction Gateways (log protection). Strip system prompts, API keys, and schemas from telemetry before it reaches observability infrastructure.
Layer 5 — Human-in-the-Loop (HITL) controls. Risk-tiered approval gates: low-risk reads proceed automatically; writes and deletes require confirmation; irreversible operations require human review.
Omit any layer and a category of attack has no mitigation.
What Is a Practical Red-Teaming Framework for AI Agent Prompt Injection Testing?
Red-teaming requires adversarial testing across all retrieval entry points — documents, emails, APIs, MCP tool descriptions — not just direct user inputs.
Use the Promptware Kill Chain as the test matrix: Initial Access (inject via each retrieval source), Privilege Escalation (test jailbreak techniques), Persistence (write a test memory entry; confirm subsequent sessions do not execute it), Lateral Movement (test whether a compromised sub-agent propagates to the orchestrator or peers), Actions on Objective (verify no test credentials appear in outbound traffic; confirm HITL checkpoints trigger appropriately).
A common failure pattern: unit-level tests pass but the composed workflow fails because trust accumulates across handoffs. Red-team at the workflow level.
Eval-Driven Guardrails close the loop — convert exploit patterns into automated evals, promote them to runtime enforcement. For operationalising evals and behavioural baselines, see how to build an AI agent governance and monitoring programme.
The Minimum Viable Prompt Injection Defence for Teams Without a Dedicated Security Engineer
Three prioritised controls give the highest security-to-overhead ratio. Complete these before any production agentic deployment.
Step 1 — Secrets Exclusion (highest priority, lowest overhead). Never store secrets in project directories agents can access. The Claude Code incident makes the structural problem clear: if an agent’s working directory contains credentials, the agent will likely access them regardless of ignore directives. Use secrets managers (AWS Secrets Manager, HashiCorp Vault) instead of .env files in project roots. Configure .claudeignore and deny rules as secondary controls, not primary ones.
Step 2 — Output Validation for External-Facing Tools. Before any outbound tool call, validate that parameter values do not contain credential-format strings (JWT tokens, API key formats) or base64 blobs matching sensitive data shapes. Moderate engineering effort; substantial reduction of the data exfiltration surface.
Step 3 — Redaction Gateways for Logs. Redact fields matching sensitive data patterns before logs reach any observability platform. Most aggregation platforms have built-in field-redaction features — extend them to cover API key formats, JWT tokens, and private key headers.
Do all three before any production agentic deployment. Next: tool execution sandboxing, then HITL controls for high-stakes operations. These controls reduce blast radius; they do not eliminate the threat. For the full programme, see AI agent security from supply chain to SOC.
Frequently Asked Questions
What is indirect prompt injection?
Indirect prompt injection embeds malicious instructions in external content — documents, emails, web pages, API responses — that an AI agent retrieves. Unlike direct injection, it exploits content the agent is designed to act on. In agentic systems, this is the dominant threat.
Can an AI agent be hijacked by a document it is just summarising?
Yes. A summarisation task is indistinguishable from an execution task at the model level. The invoice PDF example: the agent reads the document, encounters injected instructions, executes them. The user receives a completed summarisation; the attacker receives credentials.
How does prompt injection differ from a standard cyberattack?
Standard attacks exploit deterministic software vulnerabilities — buffer overflows, SQL injection — with CVEs and patches. Prompt injection exploits the model’s inability to distinguish trusted instructions from untrusted data in natural-language space. There is no patch because it exploits a design property of the architecture, not a defect in deterministic code.
What is the OWASP AI Agent Security Cheat Sheet and where can I find it?
At cheatsheetseries.owasp.org/cheatsheets/AI_Agent_Security_Cheat_Sheet.html. It covers prompt injection, tool abuse, data exfiltration, memory poisoning, goal hijacking, and HITL controls. The OWASP Top 10 for Agentic Applications 2026 — including ASI01 (Agent Goal Hijack) and LLM01 (Prompt Injection) — is at owasp.org/www-project-top-10-for-large-language-model-applications/.
Can prompt injection cause my AI agent to expose API keys or database credentials?
Yes. The Claude Code .env incident demonstrates the concrete pathway: secrets loaded into context are accessible to any tool call the agent executes, including calls constructing outbound requests with attacker-controlled parameters. Data exfiltration via tool calls is documented and production-viable.
What is memory poisoning in an AI agent context?
Memory poisoning inserts malicious data into an agent’s persistent memory store, causing corrupted instructions to survive session resets. In multi-agent systems with shared memory, a single poisoned write corrupts every future interaction reading from that store.
What is the “blast radius” of a prompt injection attack?
The total scope of harm — systems, agents, users, and data stores affected when an attack propagates through an agentic workflow. In a single-agent system, a successful injection affects one session. In a multi-agent architecture, a compromised orchestrator propagates instructions to all sub-agents, and shared memory poisoning contaminates every future interaction.
Is prompt injection still a real problem in 2026 — does anyone have a working fix?
Yes, it remains largely unsolved. OpenAI’s December 2025 acknowledgment — “unlikely to ever be fully solved” — is professional consensus. The professional standard is layered mitigations that reduce blast radius: input sanitisation, output validation, sandboxing, redaction gateways, and HITL controls. Any vendor claiming a complete solution is selling something the research community has not found.
What is Model Context Protocol (MCP) and why does it create new prompt injection risks?
MCP connects AI agents to external tools. MCP tool descriptions — which agents read to understand available capabilities — can be poisoned with malicious instructions, subverting tool-use behaviour before any task begins. 43% of public MCP servers contain command injection flaws; MCP tool poisoning achieves 84.2% success rates with auto-approval enabled.
What does “spotlighting” mean as a prompt injection defence?
A Microsoft technique that wraps retrieved external content in special delimiters, instructing the model to treat delimited content as data rather than instructions. It reduces indirect injection risk but does not eliminate it. Best deployed as one layer in a stack.
How does multi-agent architecture increase the blast radius of a prompt injection attack?
Propagation occurs through three vectors: direct delegation (compromised orchestrator issues injected instructions to sub-agents), shared context (Agent B reads Agent A’s corrupted memory), and inter-agent protocols (A2A messages carrying poisoned instructions). The MAGPIE benchmark confirmed leakage even when only one agent had initial access.
What should I check in my logs if I suspect a prompt injection attack has occurred?
Outbound tool calls to unexpected domains; encoded parameter values matching credential patterns; agent outputs deviating from the stated objective; memory writes with instruction-like text; unusual data volume spikes. If provenance tracking is implemented, it identifies which retrieved content entered context before anomalous behaviour. Without redaction gateways, sensitive data may already be in log payloads — restrict access immediately.