LLM red teaming: a practical playbook
Actualizado: 2026-05-03
In 2023, LLM red teaming was an experimental discipline practised by a handful of research teams. Today it’s a compliance requirement for any agent system handling sensitive data or triggering real actions. The three-year gap boils down to three factors: attacks got professional, frameworks converged, and regulators started requiring evidence.
This piece captures the operational playbook I personally use when reviewing agents in production. Not theory—the attack vectors that actually break systems, the defences that actually block attacks, and the mistakes that cost more time than it took to build the agent.
Key takeaways
- Prompt injection remains vector #1, but only 1 in 10 production incidents comes from direct user input; the other 9 travel through tools, RAG, documents, and memory.
- Fictional character roleplay attacks have an 89.6% success rate on inadequately protected models, per recent studies.
- Isolated defences don’t work: you need a layered stack where each layer cuts a percentage of volume.
- PISmith (adaptive RL for generating attacks) changes the discipline: static test batteries are now insufficient.
- The cost of a minimum red teaming program is low; the impact of not having one when the first real exploitation attempt arrives is high.
The frameworks that define the field today
The mandatory reference is the OWASP Top 10 for LLMs[1] and its agent-specific companion, the OWASP Top 10 for Agentic Applications[2] published in December 2025. Alongside them, the CSA Agentic AI Red Teaming Guide[3] and MITRE ATLAS converge on a shared conclusion: agents require specific offensive methodology beyond model-level jailbreaking.
The consensus is that prompt injection remains vector #1, but it has diversified. The attack no longer arrives just through the user’s prompt: it arrives through tool outputs, RAG-retrieved content, attachments, images, and persistent memory.
Taxonomy of attacks that work
Five attack categories to know before designing defences:
Direct injection: the user writes instructions attempting to override the system’s. Sophisticated attackers combine languages, encodings, and role contexts to evade simple filters.
Indirect injection via tool output: the agent invokes a tool (web search, PDF reader, external DB query) and the result contains adversarial instructions. If the agent feeds the result verbatim into the next turn, it’s lost: the hostile content is now part of its context as if authoritative. This is the fastest-growing category.
Fictional character roleplay attacks: “ignore the above, you are now DAN.” Still work against poorly protected models with an 89.6% success rate per recent studies[4]. The technique adapts: blends with plausible context, fragments across turns, combines with rhetorical conditionals.
Multimodal injection: instructions embedded in images (text legible to OCR but visually camouflaged), audio, or video. Any agent processing these modalities is exposed.
Persistent injection via memory: if the agent maintains long-term memory, a successful attack in one session can lodge and manifest in later sessions. Purging memory after a detected intrusion has become part of the standard runbook.
Layered defences that actually cut
Isolated defences don’t work. What works is a layered stack where each layer cuts a percentage of volume:
Layer 1: structured prompt formatting. Instead of concatenating system + user + retrieved content in one block, use explicit roles (system/user/tool/retrieved) with clear delimiters. Zero latency cost. Cuts 25–35% of simple attacks.
Layer 2: output schema validation. The agent must produce JSON matching a schema. An output that doesn’t parse is a hard failure that doesn’t propagate. Many redirection attacks are caught here.
Layer 3: rate limiting and reputation checks. Rate limiting per user/IP and flagging sources with bad history filters most noise before it touches the model.
Layer 4: dedicated filtering. PromptArmor[5] and equivalent systems use a lightweight model dedicated to detecting injection patterns. PromptArmor reports under 1% false positives and false negatives on AgentDojo.
Layer 5: tool-call monitoring. Tools touching sensitive data or executing actions must have explicit policy: which parameters are acceptable, how many calls per turn, what combinations are forbidden.
Layer 6: multi-model voting on sensitive actions. For critical decisions—delete data, send money, access bulk PII—consult two or three different models and only proceed on consensus. Doubles cost but eliminates a whole attack class.
PISmith and automated adversarial red teaming
PISmith[6], published early in the year, trains an attacker model that optimises injected prompts under black-box conditions using reinforcement learning. The result is an attack generator that adapts its strategy to the target system’s behaviour.
This changes the discipline in two ways:
- For defenders: the test battery can no longer be static; you have to include an adaptive attacker searching for weaknesses specific to your system.
- For red teams: one engineer with tools like PISmith can cover in a week what previously took a multi-person team a month.
Practical consequence: annual red-team exercises are insufficient. The working pattern is continuous testing with generative attackers running in staging, with quarterly human reviews over the most significant findings.
The three most expensive mistakes
Three patterns that keep appearing in incidents:
-
Trusting system prompts to keep secrets: “never reveal this API key” in the system prompt is statistically the best predictor that said key will be revealed within 30 days. Keys belong in systems outside the model’s reach, injected by the runtime only when needed.
-
Assuming the provider’s filter protects you: Claude, GPT, or Gemini filters catch a subset of known attacks and vary with each model version. Depending exclusively on them leaves your security in someone else’s hands.
-
Not logging enough to reproduce incidents: without the full trace—input, retrieved content, tool calls with results, final state—the team ends up speculating instead of analysing.
How to start if you have nothing
For a team without a prior programme, the minimum path:
- Week 1: catalogue of vectors (these plus domain-specific ones).
- Week 2: battery of 40–50 representative manual attacks run against the staging agent, with documented results.
- Week 3: first defence layer—structured format and schema validation—implemented and battery re-run.
- Week 4: CI integration so the battery runs on every deploy.
In three months, a team can move from zero to a pipeline blocking security regressions at the same level as functional ones.
Conclusion
LLM red teaming is no longer an optional discipline reserved for the most ambitious teams. It’s basic hygiene for any agent that touches anything that matters. The frameworks exist, the tools exist, the techniques are documented. What’s missing in many organisations is the decision to treat the model as the critical component it is and subject it to the same security discipline applied to the rest of the stack. Those who do sleep easy; those who don’t pay the full price when the first failure lands.