Reading Notes

Nicole Forsgren on DORA, SPACE, and Measuring Developer Productivity

Source: Nicole Forsgren on DORA, SPACE, and Measuring Developer Productivity

Four questions [Adler frame]

Q1. What is it about? How to measure developer productivity with scientific rigour — and why the most important finding overturns a central assumption of software engineering management: that speed and stability trade off against each other. Forsgren presents DORA (four operational metrics) and SPACE (five-dimensional framework) as complementary tools for measuring what actually matters.

Q2. How is it argued? Through the multi-year State of DevOps research programme and the Accelerate findings. The speed-stability insight is not theoretical: it is an empirical result from a large-scale longitudinal study. SPACE is argued on the grounds that any single metric can be gamed, whereas a multi-dimensional framework is harder to game.

Q3. Is it true? The speed-stability finding has been replicated across multiple annual DORA surveys with large samples [?]. The SPACE argument that multi-dimensional measurement resists gaming is sound in theory; in practice, organisations still find ways to optimise metrics independently. The elite benchmarks are descriptive rather than prescriptive, but are the best empirically grounded targets available.

Q4. What of it? For engineering leaders: the speed-stability finding should terminate the “stability requires slowness” argument permanently. For teams: start with the DORA four metrics as a baseline before adding more. For PMs: the four-box framework — start with words before data — applies to any domain where measurement choices shape behaviour.


Glossary

DORA four metrics. The four measures of software delivery performance identified by the DORA research: deployment frequency, lead time for changes, mean time to restore (MTTR), and change failure rate. Predictive of organisational performance and business outcomes.

SPACE framework. A five-dimensional framework for measuring developer productivity: Satisfaction and well-being, Performance, Activity, Communication and collaboration, Efficiency and flow. Requires at least three dimensions simultaneously to avoid gaming.

Elite performers. The top tier in DORA research: deploy on demand (multiple times per day), lead time under one day, MTTR under one hour, change failure rate 0–15%.

Four-box framework. Forsgren’s methodology for building measurement systems: (1) define the thing in words, (2) identify what evidence would indicate presence or absence, (3) choose a proxy metric, (4) validate the proxy. Prevents measuring what is easy rather than what matters.

Change failure rate. The percentage of deployments that cause a degradation requiring remediation. One of the four DORA metrics; as with lead time and MTTR, lower is better (deployment frequency is the one metric where higher is better).


Speed and stability move together

The central empirical finding of the DORA research: high-performing engineering organisations are both faster and more stable than low performers. [§ Speed-stability finding]

This finding overturned the prevailing assumption — that deployment frequency and stability trade off, and that teams choosing speed accept more risk. The research shows the opposite: the organisations that deploy most frequently also have the lowest change failure rates and fastest recovery times.

The mechanism: frequent deployment forces investment in test automation, observability, and deployment tooling, which make each individual deployment smaller and safer. The “big bang” deployment pattern — infrequent but large — is what creates instability, not the frequency of deployment itself.
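The batch-size part of this mechanism can be illustrated with a toy model. It is not taken from the DORA research: it assumes every change has the same, independent chance of introducing a defect, and the 2% defect rate and the name batch_failure_probability are hypothetical.

```python
def batch_failure_probability(changes_per_deploy: int, per_change_defect_rate: float) -> float:
    """Chance that a deployment contains at least one defective change.

    Toy assumption: changes are independent and equally risky.
    """
    if changes_per_deploy < 1:
        raise ValueError("a deployment must contain at least one change")
    return 1 - (1 - per_change_defect_rate) ** changes_per_deploy


DEFECT_RATE = 0.02  # hypothetical per-change defect rate, for illustration only

for batch_size in (1, 5, 50):
    risk = batch_failure_probability(batch_size, DEFECT_RATE)
    print(f"{batch_size:>2} changes per deploy -> {risk:.0%} chance the deploy needs remediation")
```

Under these assumptions a 50-change release is far more likely to need remediation than a single-change deploy, which is one way to read the claim that "big bang" releases, not frequency, create instability. The model says nothing about the tooling investments, which the notes treat as the other half of the mechanism.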

Implication: any organisational policy that restricts deployment frequency to protect stability is likely having the opposite effect.


DORA four metrics

Four metrics that together characterise software delivery performance. [§ DORA metrics]

Deployment frequency: how often code is deployed to production. Proxy for feedback cycle length and batch size.

Lead time for changes: time from code commit to running in production. Captures pipeline efficiency.

Mean time to restore (MTTR): time to recover from a production incident. Captures resilience and operational maturity.

Change failure rate: percentage of deployments that require remediation. Captures deployment safety.
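A minimal sketch of how the four definitions above might be computed from deployment records. The Deployment record, its field names, and the choice of median for lead time are assumptions made for illustration, not part of the DORA material; restore time is measured here from deployment to restoration, which assumes the incident starts when the change ships.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it was running in production
    failed: bool = False                    # did it require remediation?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deploys: list[Deployment], window_days: int) -> dict[str, Optional[float]]:
    """Summarise the four metrics over an observation window of `window_days` days."""
    if not deploys or window_days < 1:
        raise ValueError("need at least one deployment and a window of at least one day")
    failures = [d for d in deploys if d.failed]
    lead_times = [(d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys]
    restores = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_times),
        # None means no restored incident was recorded in the window
        "mean_time_to_restore_hours": sum(restores) / len(restores) if restores else None,
        "change_failure_rate": len(failures) / len(deploys),
    }


# Hypothetical four-day window with one failed deployment
t0 = datetime(2024, 1, 1, 9, 0)
day, hour = timedelta(days=1), timedelta(hours=1)
deploys = [
    Deployment(t0, t0 + 2 * hour),
    Deployment(t0 + day, t0 + day + 4 * hour, failed=True,
               restored_at=t0 + day + 4 * hour + timedelta(minutes=30)),
    Deployment(t0 + 2 * day, t0 + 2 * day + 6 * hour),
    Deployment(t0 + 3 * day, t0 + 3 * day + 4 * hour),
]
metrics = dora_metrics(deploys, window_days=4)
print(metrics)
```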

Elite benchmarks (empirically observed):

  • Deployment frequency: on demand (multiple per day)
  • Lead time: under one day
  • MTTR: under one hour
  • Change failure rate: 0–15%

The four metrics are deliberately balanced: high deployment frequency with high change failure rate is not high performance. All four must be tracked together to avoid optimising one at the expense of others.
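The "all four together" rule can be sketched as a check against the elite benchmarks listed above. The thresholds are the ones in these notes; the function name elite_gaps, the dictionary keys, and the reading of "multiple per day" as more than one deploy per day are illustrative assumptions.

```python
def elite_gaps(metrics: dict[str, float]) -> list[str]:
    """Return the metrics that miss the elite benchmark; an empty list means all four pass."""
    checks = {
        "deployment frequency": metrics["deploys_per_day"] > 1,       # multiple per day
        "lead time": metrics["median_lead_time_hours"] < 24,          # under one day
        "MTTR": metrics["mean_time_to_restore_hours"] < 1,            # under one hour
        "change failure rate": metrics["change_failure_rate"] <= 0.15,  # 0-15%
    }
    return [name for name, passed in checks.items() if not passed]


# Hypothetical team: fast, but nearly a third of deployments need remediation
fast_but_fragile = {
    "deploys_per_day": 5.0,
    "median_lead_time_hours": 3.0,
    "mean_time_to_restore_hours": 0.5,
    "change_failure_rate": 0.30,
}
print(elite_gaps(fast_but_fragile))
```

Returning the list of missed benchmarks, rather than a single score, keeps a strong number on one metric from masking a weak one on another.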


SPACE framework

Five dimensions for measuring developer productivity, selected to resist single-metric gaming. [§ SPACE framework]

  • S — Satisfaction and well-being: how developers experience their work; leading indicator for performance
  • P — Performance: outcomes and quality of output, not volume
  • A — Activity: measurable actions (commits, PRs, reviews) — useful for context, not as primary metrics
  • C — Communication and collaboration: how information flows across the team
  • E — Efficiency and flow: how often engineers reach sustained productive depth

The design principle: measuring fewer than three dimensions simultaneously allows teams to optimise the measured dimensions while degrading the unmeasured ones. Measuring at least three creates enough coverage that gaming one metric tends to show up as a decline in another.
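The three-dimension rule lends itself to a simple coverage check on a proposed metric set. This is a sketch only: the metric names and their assignment to dimensions are hypothetical, and a real team would have to argue for each mapping.

```python
SPACE_DIMENSIONS = {"satisfaction", "performance", "activity", "communication", "efficiency"}


def space_coverage(metric_to_dimension: dict[str, str], minimum: int = 3) -> tuple[bool, set[str]]:
    """Return (enough dimensions covered?, dimensions with no metric at all)."""
    covered = set(metric_to_dimension.values())
    unknown = covered - SPACE_DIMENSIONS
    if unknown:
        raise ValueError(f"not SPACE dimensions: {sorted(unknown)}")
    return len(covered) >= minimum, SPACE_DIMENSIONS - covered


# Hypothetical dashboard dominated by activity counts
dashboard = {
    "commits_per_week": "activity",
    "prs_merged": "activity",
    "review_turnaround_hours": "communication",
}
ok, missing = space_coverage(dashboard)
print(ok, sorted(missing))

# Adding one survey-based metric brings the set up to three dimensions
dashboard["developer_survey_score"] = "satisfaction"
ok_after, missing_after = space_coverage(dashboard)
print(ok_after, sorted(missing_after))
```

Three metrics that all sit in one or two dimensions still fail the check, which is the point: the count that matters is dimensions, not metrics.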


The four-box measurement framework

Before choosing any metric, work through four steps. [§ Four-box framework]

  1. Define in words: what does “good” look like? Write a sentence, not a number.
  2. Identify evidence: what would you observe if the definition were satisfied? If you cannot answer this, the definition is underspecified.
  3. Choose a proxy: select the most direct, least gameable measurable correlate of the evidence.
  4. Validate the proxy: confirm that optimising the proxy actually produces the outcome in the definition.

Most measurement failures happen at step 1 — teams skip directly to the number without agreeing on the underlying concept. This produces metrics that are technically measurable but strategically irrelevant or actively misleading.
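One way to make the ordering hard to skip is to record the four steps in a structure that reports the earliest unfinished one. The MeasurementPlan class and its field names are hypothetical, a sketch of the idea rather than anything Forsgren prescribes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MeasurementPlan:
    definition: str = ""                                # step 1: "good" in words
    evidence: list[str] = field(default_factory=list)   # step 2: what you would observe
    proxy: str = ""                                     # step 3: the chosen metric
    validated: bool = False                             # step 4: proxy checked against the definition

    def first_missing_step(self) -> Optional[int]:
        """Return the earliest incomplete step (1-4), or None when all four are done."""
        if not self.definition.strip():
            return 1
        if not self.evidence:
            return 2
        if not self.proxy.strip():
            return 3
        if not self.validated:
            return 4
        return None


# A team that jumped straight to a number: the proxy exists, the definition does not
skipped_ahead = MeasurementPlan(proxy="story points per sprint")
print(skipped_ahead.first_missing_step())

# A plan that has reached validation but not completed it
in_progress = MeasurementPlan(
    definition="Changes reach users quickly without breaking production.",
    evidence=["short gap between merge and release", "few rollbacks"],
    proxy="median lead time for changes",
)
print(in_progress.first_missing_step())
```

Because the check walks the steps in order, a plan with a metric but no written definition is reported as stuck at step 1, which matches the failure mode described above.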


Company size is irrelevant

High performance on the DORA metrics is not a function of company size. [§ Company size]

Small companies can be low performers; large companies can be high performers. The DORA research controls for size and finds that size does not predict performance — organisational practices and architectural choices do.

Implication: the “we’re too big to deploy frequently” argument is not supported by evidence. Large, high-performing organisations deploy frequently. The constraint is process and architecture, not scale.