Anatomy of an Incident - Google - Site Reliability Engineering

When it comes to system design, failure is inevitable. Scientists and engineers implement solutions based on the available information, without a complete knowledge of the future. You can’t always anticipate the next zero-day event, viral media trend, weather disaster, or shift in technology. But you can be prepared to respond when incidents like these affect your systems.

This report shows SRE and DevOps practitioners, IT managers, and engineering leaders how to help their organizations prepare for, respond to, and recover from incidents. With advice from Ayelet Sachto, Adrienne Walcer, and Jessie Yang, you’ll learn how to handle failure if and when it happens.

Learn the stages of the incident management lifecycle: preparedness, response, recovery, and mitigation
Deal proactively with incidents: issues that escalate beyond metrics and alerts
Be prepared: practice disaster role playing and incident response exercises
Learn the characteristics of the incident-response organizational structure
Examine steps to recovery and mitigation after an incident has occurred
Conduct postmortems to analyze what went wrong
Explore a real-world example from Google: The Mayan Apocalypse
Learn how to measure and reduce the impact of incidents
Use postmortems as a tool for prevention and psychological safety
