What I Found When I Started Breaking AI Agents

Six months ago I started taking AI agent security engagements. Not theoretical research. Not CTF challenges. Production systems where companies had deployed LLM-powered agents that could read databases, call APIs, send emails, and execute code.

I expected to find novel attack classes. I did. But the thing that surprised me most was how many of the vulnerabilities were familiar. The same bug categories that have haunted web applications for twenty years are showing up in agent architectures, wearing slightly different clothes.

The familiar bugs in unfamiliar places

The OWASP Top 10 for LLM Applications exists because the security community recognized that LLM systems share structural similarities with traditional application security problems. After working through several agent security assessments, I can confirm: the overlap is real and the consequences are worse.

Injection is still injection. Prompt injection is the SQL injection of the LLM world, and it’s roughly where SQL injection was in 2005: everybody knows about it, most systems are still vulnerable, and the mitigations are incomplete. The difference is that SQL injection has parameterized queries. Prompt injection does not have an equivalent defense that works reliably across all cases.

In one engagement, I found an agent that processed customer support tickets. Users could submit text that the agent would summarize and route. A crafted ticket body could override the agent’s system instructions and cause it to dump its tool configuration, including API keys for internal services. The fix was input sanitization and output filtering, but the vulnerability existed because nobody had threat-modeled the agent as an application that processes untrusted input. They thought of it as “AI” rather than “software.”

Authorization failures are amplified. When a web application has a broken access control issue, an attacker can access data they shouldn’t. When an agent has a broken access control issue, an attacker can instruct the agent to access data they shouldn’t, using the agent’s own credentials and permissions. The agent becomes a privilege escalation proxy.

I tested an internal agent that had read access to a company’s entire data warehouse so it could answer business intelligence questions. Any employee could ask it questions. The authorization model assumed that the agent’s responses would naturally scope to appropriate data. They did not. Asking the right questions surfaced compensation data, board materials, and unreleased financial figures. The agent had one permission level. The humans using it had dozens. Nobody had reconciled the difference.

The new attack surface

Some of what I’m finding doesn’t map cleanly to traditional application security.

Tool use is the new SSRF. Agents that can call tools — APIs, databases, file systems, other services — inherit every vulnerability in those integrations. An agent with a “search the web” tool can be directed to make requests to internal infrastructure. An agent with a “read file” tool can be directed to read sensitive configuration. This isn’t theoretical. I’ve used tool-calling to pivot from an external-facing agent to internal network resources in a production environment.

The OWASP LLM Top 10 calls this “Excessive Agency” (LLM08), and it’s the finding I report most often. Developers give agents access to powerful tools because they need them to be useful. But “useful” and “secure” require different permission models, and the frameworks don’t enforce least privilege by default.

Context window poisoning is a persistence mechanism. In multi-turn conversations, earlier messages influence later behavior. If an attacker can inject content into an agent’s context window — through a document it ingests, a database record it reads, a previous conversation turn — that content shapes every subsequent action. This is not a one-shot exploit. It’s a persistent backdoor that lives in the agent’s working memory.

Chain-of-thought manipulation bypasses safety controls. Many agent frameworks use chain-of-thought reasoning to decide which tools to call and how to respond. The reasoning itself can be influenced by crafted inputs. I’ve seen cases where an agent’s safety filters correctly flagged a direct malicious request but approved the same action when it was embedded in a reasoning chain that made it appear legitimate.

Why your current pentest vendor probably misses this

Most penetration testing firms staff engagements with people who understand network security, web application security, or cloud security. Those are necessary skills but insufficient for agent security assessments.

Testing an LLM agent requires understanding:

How transformer architectures process and prioritize instructions
How system prompts, user inputs, and tool outputs interact in the context window
How different model providers handle safety filtering (and where the gaps are)
How agent frameworks like LangChain, CrewAI, and AutoGen implement tool calling
How retrieval-augmented generation introduces document-based injection vectors

This is why I got the GIAC Machine Learning Engineer (GMLE) certification. Not because a credential proves competence, but because the underlying knowledge — ML model architectures, training pipelines, inference optimization — is prerequisite to understanding how these systems fail. You cannot effectively attack what you do not understand mechanistically.

What to do about it

If you’re deploying agents in production, here’s the minimum:

Threat model the agent as an application. It takes untrusted input, processes it, and produces output that has side effects. Every principle from application security applies, plus the LLM-specific attack surface.

Apply least privilege to tool access. The agent should have the minimum permissions needed for each task. Not one set of credentials that works for everything. Implement per-tool authorization that considers who is asking, not just what is being asked.

Monitor agent actions, not just outputs. Log every tool call, every database query, every API request the agent makes. You cannot detect abuse by reading the agent’s text responses. You detect it by watching what the agent does.

Test with adversarial inputs. Not just “does the agent refuse to say bad things.” Test whether crafted inputs can cause the agent to take unauthorized actions, access unauthorized data, or leak system configuration. This is a pentest, not a content moderation review.

Assume prompt injection will succeed. Design the system so that a successful injection has limited blast radius. Defense in depth applies here the same way it applies everywhere else.

The window is closing

Right now, AI agent security is where cloud security was in 2015: most organizations know it matters, few have a structured approach, and the attack surface is growing faster than the defenses. The companies that invest in agent security assessments now will have a significant advantage over those that wait for the first major incident to force the conversation.

If you’re shipping agents and want an honest assessment of where the vulnerabilities are, I’m taking engagements. No pitch deck, no generic scanner output. Hands-on testing by someone who understands both the ML and the security side. [email protected]