Notes — Simon Willison on Agentic Engineering and the Future of Code

Four questions [Adler frame]

Q1. What is it about?
A practitioner’s account of what the AI coding transition actually requires — not the optimistic version, not the catastrophist version, but the textured version from someone who coined key terms in the space and has been doing it professionally for 25 years. Three main claims: (1) vibe coding and agentic engineering are genuinely different things with different responsibility profiles; (2) the dark factory pattern is the next frontier and nobody fully knows how to do it yet; (3) a major AI safety disaster is coming because of normalised deviance around prompt injection risk.

Q2. How is it argued?
Primarily from first-person practitioner experience (one-person company, 4 parallel coding agents by 11 AM, projects now 95% AI-generated code) plus specific observed examples (StrongDM dark factory experiment, OpenClaw prompt injection near-misses, Red/Green TDD prompt shortcut). Some conceptual framing (normalisation of deviance from Challenger disaster literature). Light on academic citations; heavy on direct observation.

Q3. Is it true?
The vibe coding / agentic engineering distinction is analytically clean and useful — it captures something real about responsibility profiles. The dark factory pattern is described as genuinely experimental — Willison is honest that ‘we’re figuring out what that looks like.’ The Challenger disaster prediction has not materialised in three years of Willison making it; this is either a slow-moving process or the disaster is less inevitable than he suggests. The cognitive load claim (4 parallel agents → wiped by 11 AM) is consistent with reports from other practitioners. The human agency thesis is philosophically interesting but asserted rather than argued.

Q4. What of it?
For practitioners: the responsible use of coding agents requires professional practices (tests, templates, cognitive load management, review) — it is not a ‘set and forget’ tool. For safety thinkers: normalised deviance is the correct framing for AI security risk, not misalignment or superintelligence. For product builders: vibe coding and agentic engineering require different trust profiles; conflating them is a risk.

Glossary

Vibe coding. Not looking at code, not understanding it, going on vibes. Andrej Karpathy’s original term, later generalised beyond his intent. Willison’s refined definition: appropriate for personal tools where only you bear the consequences; not appropriate for production code used by others.

Agentic engineering. Willison’s preferred term for professional software development using coding agents. Requires full engineering depth and experience. Involves running multiple agents in parallel, reviewing output, applying professional quality standards.

Dark factory pattern. Software production where agents produce professional-quality code that the engineer does not directly review line-by-line. Named after ‘lights-out factories’ — so automated that lights can be off. Current frontier; not yet fully solved. Tests are the leading candidate for the verification layer.

AI slop. Willison’s term for low-quality AI-generated content produced at scale. Cheapness of production incentivises quantity over quality.

Normalisation of deviance. Academic term from the 1986 Challenger disaster post-mortem. The process by which repeated success with a known-risky practice builds institutional confidence until disaster occurs. Willison applies this to AI security/prompt injection.

Red/Green TDD. Test-driven development shorthand: write test first → watch it fail (Red) → implement → watch it pass (Green). Useful as a compact prompt for agents because they recognise the jargon and apply the pattern.

The key distinction: vibe coding vs. agentic engineering

[§ Vibe coding vs. agentic engineering]

The distinction is a responsibility distinction, not just a workflow distinction. Vibe coding produces code you are not accountable for in a professional sense — fine if only you are exposed to it. Agentic engineering produces code you are accountable for, you just produce it with agent assistance.

The confusion matters because it creates misleading narratives:

‘Vibe coding is how AI will replace engineers’ (no — agentic engineering, done well, requires more expertise than traditional coding)
‘AI coding is fast and cheap’ (yes, but fast-cheap-agentic-engineering is not the same as fast-cheap-vibe-coding-for-production)

Willison’s experience: 95% of his code is AI-generated. He works harder than ever. He is cognitively exhausted by 11 AM running 4 agents in parallel. This is not because the agents are not good — it is because directing 4 agents well requires every ounce of his 25 years of experience. [§ Cognitive load]

The dark factory pattern

[§ Dark factory pattern]

Current professional agentic engineering workflow: human specifies → agent implements → human reviews. The bottleneck has moved from ‘writing the code’ to ‘reviewing the code.’ The review step is still human.

Dark factory removes the human from the review step while maintaining professional quality standards. The open question: how? The leading answer is automated tests — if the agent writes tests and all tests pass, you have some confidence the code is correct without reviewing it directly.

StrongDM experiment: $10,000/day simulating employee AI interactions 24/7 to test their AI access management system. Built their own API simulations of Slack/Jira/Okta because real SaaS platforms have rate limits. This is what dark factory looks like in practice today: extremely expensive, built on simulation, and still experimental. [§ Dark factory pattern, StrongDM]

Normalisation of deviance and the Challenger prediction

[§ Challenger disaster]

The Challenger disaster happened not because nobody knew the O-rings were unreliable in cold weather — many people did — but because repeated successful launches despite the known risk normalised the risk institutionally.

Willison applies this to AI: everyone knows prompt injection is unreliable. AI systems are being deployed in increasingly agentic contexts. Every deployment that does not end in a disaster (a major theft, a large-scale data breach, a harmful action from an injected agent) increases institutional confidence. Eventually, this will end badly.

Note: Willison has made this prediction every six months for three years. It has not materialised yet. [?] Either: (a) the risk is real but lower than he estimates; (b) the disaster is coming but slowly; (c) incremental model improvements are reducing the risk faster than deployment is increasing it.

This is a genuinely hard prediction to evaluate because ‘no disaster yet’ is consistent with both ‘the risk is manageable’ and ‘we’re accumulating institutional overconfidence before the eventual failure.‘

Human agency as the irreplaceable human contribution

[§ Human agency vs. AI agency]

Willison’s formulation: agents have no agency in the meaningful sense because they have no human motivations. You can program an agent to optimise a goal, but it cannot self-determine which goals are worth pursuing or what matters.

This makes human agency — the capacity to decide what problems to take on, what direction to go — more valuable, not less, in an agentic world. The agents are powerful instruments; humans must provide the direction.

Cross-reference: this complements Truell's 'logic designer' framing but at a more philosophical level. Truell says engineers become logic designers specifying intent. Willison says the agency that drives that specification is fundamentally human and irreplaceable.

Practical agentic engineering techniques

[§ Practical techniques]

Testing is not optional. Tests play three roles with agents:

Confirm the code actually ran and the basic assertion holds
Catch regressions when new features are built
Enable safe parallelisation (change one thing; tests verify nothing else broke)

Red/Green TDD: write test first, watch fail, implement, watch pass. A five-word prompt shortcut agents understand.

Starting templates matter. A repository with a single example test in the preferred style causes agents to conform to that style throughout. The cheapest way to establish code standards.

Cognitive load is the real constraint. Code generation is now cheap; direction and review are the expensive parts. Finding personal limits for parallel agent work is a genuine skill.

Context hoarding pays off. Maintaining a research repository of prior work (75+ public projects) that agents can search gives them access to prior solutions. Agents use tool calls to search; they are not limited to the context window.