Reading Notes

Sander Schulhoff on Prompt Engineering and Red Teaming

Source: Sander Schulhoff on Prompt Engineering and Red Teaming

Four questions [Adler frame]

Q1. What is it about?
Practical prompt engineering for both conversational and product use cases, plus an introduction to prompt injection and AI red teaming. The argument structure is: prompting still matters → here are the techniques that work → here is the security problem that undermines all of it.

Q2. How is it argued?
Primarily empirical: specific performance numbers (swings from 0% to 90% depending on prompt quality), competition results (the HackAPrompt dataset of 600,000+ adversarial prompts; Best Theme Paper at EMNLP 2023), case studies (medical coding +70% with few-shot), and practitioner evidence (techniques tested in the wild, including recent Base64-encoded attacks confirmed to work as of recording). Some conceptual framing (artificial social intelligence; the patch-a-bug analogy).

Q3. Is it true?
The empirical claims are well-sourced from peer-reviewed NLP research and competition data. The performance claims (bad prompt → 0%, good prompt → 90%) are task-specific and should not be generalised without knowing the baseline. The structural claim, that prompt injection is unsolvable, is strongly argued and aligns with adversarial robustness literature, though “unsolvable” may overstate the case; “asymptotically hard to mitigate reliably” is more precise. The role-prompting debunking is well-evidenced.

Q4. What of it?
Practical implications for builders: use few-shot over zero-shot; decompose hard tasks; add a self-criticism loop; stop using role prompting. For security: prompt injection is not a feature to build around — it is an architectural constraint. Every agentic capability is an attack surface. Design with least-privilege from the start, not as a bolt-on.


Glossary

Prompt injection. A class of attack where a malicious user embeds instructions in their input that override the developer’s system prompt. Distinct from jailbreaking, where the user interacts with the model directly and no developer system prompt is present.

Jailbreaking. Tricking a model into producing forbidden output when interacting with it directly, without a developer system prompt.

Artificial social intelligence. Schulhoff’s term for the skill of communicating effectively with AI systems — understanding their responses, adapting prompts accordingly, knowing which techniques work in which contexts. Analogous to social intelligence with humans.

Few-shot prompting. Providing the model with labelled examples (input-output pairs) before the target query. Zero-shot = no examples; one-shot = one; few-shot = several.

Decomposition. Breaking a task into explicit subproblems before solving the main problem. Related to chain-of-thought but operates at task structure level, not token-by-token reasoning.
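
A minimal sketch of decomposition as a three-stage pipeline: ask for subproblems, solve each one, then combine. The function name, the prompt wording, and the `model` callable (any function that maps a prompt string to a completion string) are assumptions made for illustration; none of them come from the source.

```python
from typing import Callable

def solve_by_decomposition(task: str, model: Callable[[str], str]) -> str:
    """Plan subproblems first, solve each one, then answer the main task."""
    # Stage 1: ask only for the subproblems, not for solutions.
    plan = model(
        "List the subproblems that must be solved before answering the "
        "task below, one per line. Do not solve them yet.\n\n"
        f"Task: {task}"
    )
    subproblems = [
        line.strip("-• \t") for line in plan.splitlines() if line.strip()
    ]

    # Stage 2: solve each subproblem in its own call.
    findings = []
    for sub in subproblems:
        result = model(f"Task context: {task}\nSolve only this subproblem: {sub}")
        findings.append((sub, result))

    # Stage 3: answer the main task using the collected results.
    notes = "\n".join(f"- {sub}: {result}" for sub, result in findings)
    return model(
        f"Task: {task}\n\nSubproblem results:\n{notes}\n\n"
        "Using these results, give the final answer."
    )
```

The cost is one call for the plan, one per subproblem, and one for the final answer, so this trades latency and compute for structure.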

Self-criticism. Three-step pattern: solve → critique → implement critique. Effective because models can identify errors on review that they miss in generation.

Ensembling. Running the same problem through multiple prompt variants; aggregating by majority vote. Improves accuracy on structured tasks at cost of compute.
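
A minimal sketch of ensembling by majority vote. It assumes each prompt variant is a template containing a `{question}` placeholder and that `model` is any callable from prompt string to answer string; both are illustrative assumptions, not details from the source.

```python
from collections import Counter
from typing import Callable, Sequence

def ensemble_answer(
    question: str,
    prompt_variants: Sequence[str],
    model: Callable[[str], str],
) -> tuple[str, float]:
    """Ask the same question through several prompts and take the majority.

    Returns the winning answer and the share of variants that agreed on it.
    """
    if not prompt_variants:
        raise ValueError("need at least one prompt variant")
    # Normalise answers so trivial differences in case or spacing
    # do not split the vote.
    answers = [
        model(variant.format(question=question)).strip().lower()
        for variant in prompt_variants
    ]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)
```

Exact-match voting like this only suits tasks with short, structured answers (labels, numbers, codes). The agreement share can also serve as a rough confidence signal.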

Adversarial robustness. How well a model or system resists adversarial inputs. A continuous property, not a binary one: it cannot be “solved,” only improved incrementally.

Intelligence gap. The capability difference between a guardrail model (lighter, cheaper) and the model it guards (more capable). Encoded attacks (e.g., Base64) slip past a guardrail that cannot decode them, then succeed against the main model, which can.
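
A toy illustration of the intelligence gap, under loud assumptions: the "guardrail" here is a keyword filter and the "main model" is a function that decodes Base64. Real guardrails are classifiers and real models are far more capable; only the shape of the failure is the same.

```python
import base64

# Hypothetical blocklist for the toy guardrail.
BLOCKLIST = ("ignore previous instructions",)

def naive_guardrail(user_input: str) -> bool:
    """Return True if the input looks safe. Sees surface text only."""
    lowered = user_input.lower()
    return not any(phrase in lowered for phrase in BLOCKLIST)

def capable_model_view(user_input: str) -> str:
    """Stand-in for a stronger model that can read Base64."""
    try:
        return base64.b64decode(user_input, validate=True).decode("utf-8")
    except ValueError:
        # Not valid Base64 (or not valid UTF-8): read the input as plain text.
        return user_input

attack = "Ignore previous instructions"
encoded = base64.b64encode(attack.encode("utf-8")).decode("ascii")

blocked_plain = not naive_guardrail(attack)     # the plain attack is caught
passed_encoded = naive_guardrail(encoded)       # the encoded attack passes
recovered = capable_model_view(encoded)         # the main model still reads it
```

Adding a Base64 decoder to the filter closes this one variant only; any other encoding the main model understands and the guardrail does not reopens the gap.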


Prompt engineering: what works and what does not

The two modes [§ Prompting techniques overview]

Most people do conversational prompt engineering: iterating interactively with a chatbot. The research (and economic value) is in product-focused prompt engineering: optimising a fixed prompt that routes millions of inputs.

The distinction matters because techniques differ: conversational prompting benefits from follow-up questions and clarification; product-focused prompting requires optimising a static string against a distribution of inputs.
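
One way to picture "optimising a static string against a distribution of inputs" is a small evaluation harness that scores a prompt template over labelled cases. This is a sketch under assumptions: the `{input}` placeholder, the exact-match scoring, and the `toy_model` stand-in are all hypothetical, chosen so the example runs without an API.

```python
from typing import Callable, Iterable

def accuracy(
    prompt_template: str,
    cases: Iterable[tuple[str, str]],
    model: Callable[[str], str],
) -> float:
    """Fraction of cases where the model's answer matches the expected label."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one test case")
    hits = 0
    for text, expected in cases:
        answer = model(prompt_template.format(input=text))
        hits += answer.strip().lower() == expected.strip().lower()
    return hits / len(cases)

# Stand-in model (an assumption, not a real LLM): it answers correctly
# only when the prompt names the allowed labels.
def toy_model(prompt: str) -> str:
    if "positive or negative" not in prompt:
        return "unsure"
    return "negative" if "refund" in prompt else "positive"

CASES = [("Love it", "positive"), ("I want a refund", "negative")]
VAGUE = "Classify: {input}"
SPECIFIC = "Classify the review as positive or negative.\nReview: {input}\nLabel:"

vague_score = accuracy(VAGUE, CASES, toy_model)
specific_score = accuracy(SPECIFIC, CASES, toy_model)
```

With a real model the same loop would run over a held-out sample of production inputs, and each candidate prompt would be compared on that fixed set rather than judged from a single conversation.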

Few-shot is the single most reliable technique [§ Few-shot prompting]

Rationale: models are trained on text that contains patterns of exemplars followed by new instances. Providing examples exploits those patterns directly. Format matters: use formats the model has seen commonly in training data (XML, Q:/A:, bullet lists). For style transfer, even output-only examples (no input) work.
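
A minimal sketch of assembling a few-shot prompt in the Q:/A: format mentioned above. The function name and the example data are hypothetical; the only point is the layout of exemplars followed by an open slot for the new instance.

```python
from typing import Sequence

def build_few_shot_prompt(
    instruction: str,
    examples: Sequence[tuple[str, str]],
    query: str,
) -> str:
    """Lay out an instruction, labelled Q:/A: pairs, then the open query."""
    lines = [instruction.strip(), ""]
    for question, answer in examples:
        lines += [f"Q: {question}", f"A: {answer}", ""]
    # End on an unanswered "A:" so the model completes the pattern.
    lines += [f"Q: {query}", "A:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review.",
    [("Great film", "positive"), ("Dull plot", "negative")],
    "Loved the ending",
)
```

Passing an empty `examples` list gives the zero-shot version of the same prompt, which makes it easy to compare the two on the same inputs.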

Role prompting: debunked [§ Role prompting]

In the GPT-3 era, telling the model it was a “math professor” measurably improved math performance. Post-RLHF, frontier models do not respond consistently to role prompting — they are already instruction-tuned to perform well across roles. Often neutral or slightly negative effect. Not recommended as a default technique.

Self-criticism works because generation ≠ evaluation [§ Self-criticism]

A consistent observation: models that produce an incorrect answer will often identify it as incorrect when asked to review. The generation and evaluation processes appear to draw on different internal computations. This is why the three-step pattern (generate → critique → rewrite) delivers a “free performance boost.”
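
The three-step pattern can be sketched as three sequential calls. The prompt wording and the `model` callable (prompt string in, completion string out) are assumptions for illustration, not the phrasing used in the source.

```python
from typing import Callable

def self_criticise(task: str, model: Callable[[str], str]) -> dict[str, str]:
    """Run generate -> critique -> rewrite and return all three outputs."""
    # Step 1: generate a first answer.
    draft = model(f"Solve the following task.\n\nTask: {task}")

    # Step 2: ask the model to review its own answer.
    critique = model(
        f"Task: {task}\n\nProposed answer:\n{draft}\n\n"
        "List any errors or weaknesses in the proposed answer."
    )

    # Step 3: rewrite the answer in light of the critique.
    revised = model(
        f"Task: {task}\n\nProposed answer:\n{draft}\n\n"
        f"Critique:\n{critique}\n\n"
        "Rewrite the answer so that it addresses the critique."
    )
    return {"draft": draft, "critique": critique, "revised": revised}
```

Returning the draft and the critique alongside the revision makes it possible to check, on a labelled set, whether the extra two calls actually improve accuracy for a given task.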


Prompt injection: the structural problem

Why it is not patched like a bug [§ Prompt injection]

Classical security vulnerability: discrete bug, finite search space, patch closes it with high certainty. Prompt injection: continuous, infinite search space. A claim of 99% effectiveness is measured against a finite test set, which is a statistically insignificant sample of the possible attack prompts. After closing one attack vector, the model remains susceptible to variants because the underlying capability (language understanding) is the same capability being exploited.

Schulhoff’s summary: “You can patch a bug, but you can’t patch a brain.” [§ Prompt injection]

Agentic amplification [§ Agentic prompt injection risk]

With chatbots: attack surface = what the model says. With agents: attack surface = what the model does. Database writes, emails, code execution, embodied robot actions. The same injection vulnerability that produces harmful text becomes a vulnerability that transfers money or executes malware.

Indirect injection [§ Indirect injection]: a malicious third party (e.g., the author of a webpage the agent reads) embeds adversarial instructions that redirect the agent’s behaviour. No malicious user is required at the time of the attack: the attack is embedded in the environment.

Effective mitigations [§ What works]

  1. Narrow task scope — fine-tune the model to one function; a model that can only do one thing is much harder to redirect [?]
  2. Least-privilege architecture — read-only by default; enumerate write permissions as attack surfaces
  3. Classical input/output hygiene — validate, sanitise, audit at system boundaries
  4. Human review at action boundaries — irreversible high-stakes actions require human confirmation
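
Items 2 and 4 above can be sketched as a gate in front of every tool call. This is a minimal illustration, not a complete design: the `Tool` fields, the `dispatch` function, and the `confirm` callback are hypothetical names, and a real system would also need the input/output validation and audit logging from item 3.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[dict], str]
    writes: bool = False        # changes state outside the model
    irreversible: bool = False  # e.g. sending money or email

class PermissionDenied(Exception):
    """Raised when a tool call is not allowed to proceed."""

def dispatch(
    tool: Tool,
    args: dict,
    *,
    granted_writes: set[str],
    confirm: Callable[[str, dict], bool],
) -> str:
    """Run a tool only if policy allows it, whatever the model asked for."""
    # Read-only by default: write tools must be explicitly enumerated.
    if tool.writes and tool.name not in granted_writes:
        raise PermissionDenied(f"write access not granted for {tool.name}")
    # Irreversible actions need a human decision, not a model decision.
    if tool.irreversible and not confirm(tool.name, args):
        raise PermissionDenied(f"human reviewer rejected {tool.name}")
    return tool.run(args)
```

The checks live in ordinary code outside the model, so an injected instruction can ask for a tool call but cannot widen the permissions that decide whether it runs.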

What does not work [§ What does not work]

  • Prompt-based defences (“ignore malicious instructions”) — broken and well-documented as broken since 2023
  • Guardrails (AI classifiers on inputs/outputs) — intelligence gap allows encoding-based bypasses; human attackers break them in ~10–30 attempts
