Anatomy of an Incident - Google - Site Reliability Engineering
When it comes to system design, failure is inevitable. Scientists and engineers implement solutions based on the available information, without a complete knowledge of the future. You can’t always anticipate the next zero-day event, viral media trend, weather disaster, or shift in technology. But you can be prepared to respond when incidents like these affect your systems.
In this report, SRE and DevOps practitioners, IT managers, and engineering leaders will explore methods to help their organizations prepare for, respond to, and recover from incidents. With advice from Ayelet Sachto, Adrienne Walcer, and Jessie Yang, you’ll learn how to handle failure when it happens.
- Learn the stages of the incident management lifecycle: preparedness, response, recovery, and mitigation
- Deal proactively with incidents: issues that escalate beyond metrics and alerts
- Be prepared: practice disaster role playing and incident response exercises
- Learn the characteristics of the incident-response organizational structure
- Examine steps to recovery and mitigation after an incident has occurred
- Conduct postmortems to analyze what went wrong
- Explore a real-world example from Google: The Mayan Apocalypse
- Learn how to measure and reduce the impact of incidents
- Use postmortems as a tool for prevention and psychological safety