Notes — Sander Schulhoff on AI Security and Guardrails
Four questions [Adler frame]
Q1. What is it about?
A focused critique of AI security industry practices — specifically automated red teaming as a sales tool and AI guardrails as a defensive product. Followed by concrete architecture advice for builders. The core thesis: the industry has applied a classical-security mental model to a problem that requires a different model entirely.
Q2. How is it argued?
Empirical: research paper co-authored with OpenAI, Google DeepMind, and Anthropic testing all state-of-the-art guardrails against human red teamers (100% success in 10–30 attempts) and automated systems. Mathematical argument on attack space size (10^1,000,000). Mechanistic: intelligence gap explanation for why encoding attacks bypass guardrails. Practical case study: ServiceNow second-order prompt injection.
Q3. Is it true?
The mathematical argument is sound. The empirical claim (human attackers beat all guardrails) is sourced from a published paper with major lab co-authors — highly credible. The advice is appropriately graded: low-risk chatbot deployments get “do nothing”; agentic deployments get specific architecture recommendations. The “guardrails never work” framing may be too absolute — the argument is specifically that guardrails are not effective defences against determined attackers and create false confidence, which is accurate and important.
Q4. What of it?
The practical upshot: stop buying guardrails; invest in classical least-privilege architecture; hire people who understand both AI and classical cybersecurity; treat each agent capability as an attack surface. The broader upshot: AI security will require new frameworks, not patches on top of classical security assumptions.
Glossary
Guardrail. An AI model (typically lighter than the main model) that sits in front of and/or behind the main model to classify inputs/outputs as valid or malicious. Widely sold as a defence; shown to be bypassable by determined attackers.
Intelligence gap. The capability difference between a guardrail model and the main model. Encoding an attack (Base64, ROT13, Spanish) bypasses the guardrail (cannot decode) while succeeding against the main model (can).
Indirect prompt injection. A prompt injection where the adversarial instructions come from a third-party source in the agent’s environment (e.g., a webpage it reads), not directly from a user. No user participation needed at time of attack.
Adversarial robustness. Resistance to adversarial inputs. Continuous property, measured in terms of the proportion of attack attempts blocked. A meaningful measure requires sampling from a statistically significant portion of the infinite attack space — currently impossible.
Adaptive attacker. An attacker who can observe partial outcomes and iteratively refine their attack. Humans are adaptive attackers; automated RL-based systems are weaker adaptive attackers. The correct benchmark for a defence is how it performs against human adaptive attackers.
CBRN. Chemical, biological, radiological, nuclear (and explosive) — the harm category taxonomy used in advanced red teaming competitions. Represents worst-case uplift scenarios.
The two industry failures
Automated red teaming: works too well [§ The guardrails problem]
Every transformer-based or transformer-adjacent system can be broken by automated red teamers. This is always true. It means the results of a third-party red team audit are not informative about your system’s unusual risk — they inform you that you have the same properties as every other system. Selling the audit → selling the guardrail is a product funnel, not a security solution.
Guardrails: work too poorly [§ Why guardrails fail]
Three independent failure modes:
- Infinite attack space. 10^(1,000,000) possible prompts for a GPT-5-class model. “99% effective” is an unmeasurable claim; there are effectively infinite uncovered attacks regardless.
- Human attackers beat all defences. Joint research with OpenAI, Google DeepMind, Anthropic: 100% success rate for human attackers across all guardrails in 10–30 attempts. Automated attackers also eventually succeed.
- Intelligence gap. Encoding circumvents lighter guardrail while succeeding against capable main model.
None of these failures are fixable without solving adversarial robustness — an open research problem across the entire field.
What good architecture looks like
The threat model [§ Concrete advice]
For agents: the threat is not “bad output” but “bad action.” Enumerate every capability the agent has. Each capability is an attack surface. The question is not “what could it say?” but “what could a maximally motivated attacker make it do?”
Schulhoff’s heuristic: “imagine this agent as an angry god that wants to cause maximum harm — how do you contain it?” [§ Concrete advice]
Tiered response [§ When to worry]
| Deployment type | Risk | Recommendation |
|---|---|---|
| Chatbot, no tool access, user-only data | Low | Standard hygiene only; no AI-specific investment |
| Agent with read-only tool access | Medium | Classical data permissions; confirm scope boundaries |
| Agent with write access, multi-user data | High | Least-privilege from scratch; human review at action boundaries; classical cybersecurity audit |
The hybrid skill [§ Skill investment]
The gap is not “AI knowledge” or “security knowledge” — it is the person who understands both the AI model (why encoding bypasses a guardrail) and the classical security model (why least-privilege matters). This hybrid skill is currently rare; investing in it is more valuable than buying guardrail products.