Systems-Theoretic Accident Model and Processes (STAMP) at Google

Google's SRE team pioneered methods to keep failures rare by engineering reliability into every part of the stack: Service Level Objectives (SLOs), error budgets, isolation strategies, thorough postmortems, progressive rollouts, and other techniques. In the face of increasing system complexity and emerging challenges, what comes next? How can we continue to push the boundaries of reliability and safety?

To address these challenges, we are reexamining our beliefs about why incidents occur. Google is exploring a new causality model, Systems-Theoretic Accident Model and Processes (STAMP), and we are using two methods based on it. System Theoretic Process Analysis (STPA) is forward-looking: it enables us to analyze purely software systems and discover the unknown unknowns, risks we are unaware of and are not actively looking for. Causal Analysis based on System Theory (CAST) is retrospective, enabling us to supercharge our postmortems. Learn more about how we're using STPA and CAST in the following videos, articles, and podcast.

The Case of the Misnamed Cities: CAST Analysis of a Google Maps Incident

Mapping a Better Future with STPA: talk from SREcon Americas 2025

STPA for Software Systems: Illuminate the Unknown Unknowns (SREcon EMEA 2025)

The Evolution of SRE at Google

by Tim Falzone and Ben Treynor Sloss

STPA: Teaching a new way to prevent outages at Google

by Garrett Holthaus

Google SRE Prodcast
Listen to the Prodcast:

The One With STPA and Jeffrey and Theo