AI Safety Levels
AI Safety Levels (ASL) are Anthropic’s framework for categorising the risk posed by model capabilities, and form part of its Responsible Scaling Policy. The framework defines which deployment and security commitments are required at each level.
The levels
| Level | Capability threshold | Risk characterisation |
|---|---|---|
| ASL-1 | Below GPT-2 | Negligible risk |
| ASL-2 | Basic capability; early GPT era | Limited risk; standard safeguards sufficient |
| ASL-3 | Meaningful uplift for creating weapons of mass destruction | Moderate risk from misuse; current Claude models (2025) |
| ASL-4 | Potential for significant loss of human life from misuse | High risk; requires substantially stronger safety measures |
| ASL-5 | Potential extinction-level outcomes if misaligned or misused | Catastrophic risk; may require halting development until safety is proven |
The if-then commitment structure
The key design choice in the ASL framework: rather than imposing burdens on all models regardless of capability, requirements are triggered when a model demonstrably crosses a threshold. This avoids two failure modes — crying wolf (burdening safe models, eroding credibility) and complacency (vague pledges with no accountability).
Each new model is tested empirically against the trigger conditions before deployment. See Responsible Scaling Policy for the broader framework.
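The if-then structure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ASL` enum, `EvalResult`, `may_deploy`, and the safeguard names are hypothetical, loosely paraphrasing the notes on this page rather than Anthropic's actual tooling or policy text. Levels with no safeguard set in the sketch (ASL-4 and ASL-5) are treated as "do not deploy".

```python
from dataclasses import dataclass
from enum import IntEnum


class ASL(IntEnum):
    ASL_1 = 1
    ASL_2 = 2
    ASL_3 = 3
    ASL_4 = 4
    ASL_5 = 5


# Hypothetical safeguard names, paraphrased from this page for illustration.
REQUIRED_SAFEGUARDS = {
    ASL.ASL_1: set(),
    ASL.ASL_2: {"standard_safeguards"},
    ASL.ASL_3: {"standard_safeguards", "enhanced_security", "cbrn_deployment_filters"},
}


@dataclass(frozen=True)
class EvalResult:
    threshold: ASL  # the level whose trigger condition was tested
    crossed: bool   # True if the model demonstrably crossed that threshold


def triggered_level(results):
    """Return the highest level whose trigger was crossed (ASL-1 if none)."""
    crossed = [r.threshold for r in results if r.crossed]
    return max(crossed, default=ASL.ASL_1)


def may_deploy(results, safeguards_in_place):
    """If a threshold is crossed, then its safeguards must already be in place."""
    level = triggered_level(results)
    required = REQUIRED_SAFEGUARDS.get(level)
    if required is None:
        # No safeguard set is defined for this level in the sketch: block deployment.
        return False, level, set()
    missing = required - set(safeguards_in_place)
    return not missing, level, missing


results = [EvalResult(ASL.ASL_2, True), EvalResult(ASL.ASL_3, True)]
ok, level, missing = may_deploy(results, {"standard_safeguards"})
print(ok, level.name, sorted(missing))
# prints: False ASL_3 ['cbrn_deployment_filters', 'enhanced_security']
```

The point of the design is that a model which crosses no trigger carries no extra burden, while one that does cross a trigger cannot ship until the matching safeguards exist.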
ASL-3 — current status and timeline
As of late 2024, Claude models were classified ASL-2, and Dario Amodei expected ASL-3 to be triggered during 2025. The threshold is a model that gives non-state actors meaningful uplift toward CBRN (chemical, biological, radiological, nuclear) capability, beyond what Google Search provides.
At ASL-3, the required measures are enhanced security to prevent model theft by non-state actors and targeted deployment filters for CBRN-adjacent queries. The concern is specifically non-state actors, since state actors already have high proficiency in these areas on their own.
Update (from Benjamin Mann, 2025): by early 2025, Anthropic had classified Claude at ASL-3. The evidence was measurable uplift to a bad actor seeking to create a bioweapon, above the Google Search baseline. Anthropic testified to Congress about this capability.
Purpose of the framework
- Operationalises safety claims. Instead of “we care about safety,” ASL provides concrete commitments tied to measurable capability thresholds.
- Enables responsible scaling. Anthropic can continue developing models while publishing the conditions under which they would need to pause.
- Builds trust with policymakers. Publishing both the risks and the commitments gives legislators a concrete basis for evaluation.
The “God in a box” problem
Benjamin Mann frames ASL-4 and ASL-5 in terms of historical AI safety theory, where the early concern was keeping a superintelligent system contained and aligned. The irony with language models is that people are actively pulling the “God out of the box” by giving models full internet access, credentials, and broad agency. The ASL framework is Anthropic’s attempt to gate capability deployment on safety maturity.
Where mainstream views differ
Some researchers argue that capability-based risk levels are too crude: a model could in principle be very capable but well-aligned, or less capable but poorly aligned. Anthropic’s view is that capability is a necessary (if not sufficient) indicator of risk, because a badly aligned model with limited capability is less dangerous than an equally badly aligned model with high capability.
ASL-4 — autonomy and interpretability
At ASL-4, the concern shifts from humans misusing the model to the model itself behaving in misaligned ways on long-horizon tasks. Key challenge: models at this capability level may be able to sandbag capability evaluations — presenting themselves as less capable than they are. Standard interaction-based testing becomes unreliable.
This is where Mechanistic Interpretability becomes load-bearing: it is an independent verification channel that reads internal model state rather than relying on what the model says. For that channel to stay trustworthy, the interpretability tooling must not be accessible to the model during training, much as a held-out test set must stay out of the training data.
See also
- Responsible Scaling Policy — the if-then framework ASL is embedded in
- Constitutional AI
- Mechanistic Interpretability
- Evals
- Benjamin Mann on Anthropic and AGI
- Dario Amodei on Claude, AGI and the Future of AI — detailed ASL 1–5 breakdown
- Boris Cherny on Claude Code — Boris’s three-layer safety model (alignment → evals → in-the-wild)