AI Safety Levels

Tags: concept, ai-safety, anthropic, agi, alignment, multi-source

AI Safety Levels (ASL) are Anthropic’s framework for categorising the risk posed by model capabilities, defined in its Responsible Scaling Policy. For each level, the framework specifies the deployment and security commitments that must be in place before a model at that level is trained further or released.


The levels

| Level | Capability threshold | Risk characterisation |
|-------|----------------------|-----------------------|
| ASL-1 | Below GPT-2 | Negligible risk |
| ASL-2 | Basic capability; early GPT era | Limited risk; standard safeguards sufficient |
| ASL-3 | Meaningful uplift for creating weapons of mass destruction | Moderate risk from misuse; current Claude models (2025) |
| ASL-4 | Potential for significant loss of human life from misuse | High risk; requires substantially stronger safety measures |
| ASL-5 | Potential extinction-level outcomes if misaligned or misused | Catastrophic risk; may require halting development until safety proven |

The if-then commitment structure

The key design choice in the ASL framework: rather than imposing burdens on all models regardless of capability, requirements are triggered when a model demonstrably crosses a threshold. This avoids two failure modes — crying wolf (burdening safe models, eroding credibility) and complacency (vague pledges with no accountability).

Each new model is tested empirically against the trigger conditions before deployment. See Responsible Scaling Policy for the broader framework.
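The if-then structure can be illustrated with a minimal sketch. Everything here is hypothetical: the names `EvalResult`, `required_level` and `may_deploy`, the safeguard labels (paraphrased from this note), and the choice of ASL-2 as the default floor are assumptions made for illustration, not Anthropic's actual tooling or an official list of requirements.

```python
from dataclasses import dataclass
from enum import IntEnum


class ASL(IntEnum):
    ASL_1 = 1
    ASL_2 = 2
    ASL_3 = 3
    ASL_4 = 4
    ASL_5 = 5


# Hypothetical safeguard labels paraphrased from this note, not an official list.
# Levels absent from this mapping have no safeguards defined in the sketch.
REQUIRED_SAFEGUARDS = {
    ASL.ASL_1: set(),
    ASL.ASL_2: {"standard safeguards"},
    ASL.ASL_3: {
        "standard safeguards",
        "enhanced security against model theft",
        "targeted CBRN deployment filters",
    },
}


@dataclass(frozen=True)
class EvalResult:
    threshold: ASL   # the level whose trigger condition this evaluation tests
    triggered: bool  # True if the model demonstrably crossed that threshold


def required_level(results, floor=ASL.ASL_2):
    """Return the highest level whose trigger fired, or the floor if none did."""
    return max([floor] + [r.threshold for r in results if r.triggered])


def may_deploy(results, safeguards_in_place):
    """If a threshold is crossed, then that level's safeguards must be in place.

    Returns (ok, missing), where `missing` is the set of unmet safeguards.
    """
    level = required_level(results)
    required = REQUIRED_SAFEGUARDS.get(level)
    if required is None:
        # No safeguards defined for this level in the sketch: block by default.
        return False, {f"safeguards for {level.name} not defined"}
    missing = required - set(safeguards_in_place)
    return not missing, missing
```

A model that trips no trigger carries only the baseline obligations, while a model that trips the ASL-3 trigger is blocked until the additional safeguards exist. For example, `may_deploy([EvalResult(ASL.ASL_3, True)], {"standard safeguards"})` returns `False` together with the two unmet ASL-3 safeguards.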


ASL-3 — current status and timeline

As of late 2024, Claude models were classified ASL-2, and Dario Amodei expected ASL-3 to be triggered during 2025. The threshold is a model that gives non-state actors seeking CBRN (chemical, biological, radiological, nuclear) capability meaningful uplift beyond what Google Search provides.

ASL-3 requires two things: enhanced security to prevent model theft by non-state actors, and targeted deployment filters for CBRN-adjacent queries. The concern is specifically non-state actors, since state actors already have high proficiency in these areas independently.

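Separate note (from Benjamin Mann, 2025): by Mann's account, Anthropic classified Claude at ASL-3 as of early 2025. The evidence cited was measurable uplift to a bad actor seeking to create a bioweapon, above the Google Search baseline. Anthropic testified to Congress about this capability. This differs from the late-2024 status above because the two notes come from different sources at different dates.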


Purpose of the framework

  1. Operationalises safety claims. Instead of “we care about safety,” ASL provides concrete commitments tied to measurable capability thresholds.
  2. Enables responsible scaling. Anthropic can continue developing models while publishing the conditions under which they would need to pause.
  3. Builds trust with policymakers. Publishing both the risks and the commitments gives legislators a concrete basis for evaluation.

The “God in a box” problem

Benjamin Mann frames ASL-4 and ASL-5 in terms of historical AI safety theory, where the early concern was keeping a superintelligent system contained and aligned. The irony with language models is that people are actively pulling the “God out of the box” by giving models full internet access, credentials, and broad agency. The ASL framework is Anthropic’s attempt to gate capability deployment on safety maturity.


Where mainstream views differ

Some researchers argue that capability-based risk levels are too crude: a model could in principle be very capable but well aligned, or less capable but poorly aligned. Anthropic’s view is that capability is a necessary (if not sufficient) indicator of risk. Given equally poor alignment, a less capable model is less dangerous than a highly capable one.


ASL-4 — autonomy and interpretability

At ASL-4, the concern shifts from humans misusing the model to the model itself behaving in misaligned ways on long-horizon tasks. Key challenge: models at this capability level may be able to sandbag capability evaluations — presenting themselves as less capable than they are. Standard interaction-based testing becomes unreliable.

This is where Mechanistic Interpretability becomes load-bearing: it is an independent verification channel that reads internal model state rather than relying on what the model says. To stay uncorrupted, the interpretability tooling must not be accessible to the model during training. It plays the role of a held-out test set, and a model trained against it could learn to pass the check without being aligned.


See also