Systems-Theoretic Accident Model and Processes (STAMP) at Google
Google's SRE team pioneered methods to keep failures rare by engineering reliability into every part of the stack: Service Level Objectives (SLOs), error budgets, isolation strategies, thorough postmortems, progressive rollouts, and other techniques. As system complexity increases and new challenges emerge, what's next? How can we continue to push the boundaries of reliability and safety?
To address these challenges, we are reexamining our beliefs about why incidents occur. Google is exploring a new causality model, Systems-Theoretic Accident Model and Processes (STAMP), and we are using two methods based on it. System-Theoretic Process Analysis (STPA) is forward-looking: it enables us to analyze pure software systems and discover the unknown unknowns, meaning risks that you are unaware of and therefore not actively looking for. Causal Analysis based on System Theory (CAST) is retrospective, enabling us to supercharge our postmortems. Learn more about how we're using STPA and CAST in the following videos, articles, and podcast.
The Evolution of SRE at Google by Tim Falzone and Ben Treynor Sloss
STPA - Teaching a new way to prevent by Garrett Holthaus
The One With STPA and Jeffrey and Theo