Example Incident State Document
Shakespeare Sonnet++ Overload: 2015-10-21
Incident management info: https://incident-management-cheat-sheet
(Communications lead to keep summary updated.)
Summary: Shakespeare search service in cascading failure due to newly discovered sonnet not in search index.
Status: active, incident #465
Command Post(s): #shakespeare on IRC
Command Hierarchy (all responders)
- Current Incident Commander: jennifer
  - Operations lead: docbrown
  - Planning lead: jennifer
  - Communications lead: jennifer
- Next Incident Commander: to be determined
(Update at least every four hours and at handoff of Comms Lead role.)
Detailed Status (last updated at 2015-10-21 15:28 UTC by jennifer)
Exit Criteria:
- New sonnet added to Shakespeare search corpus TODO
- Within availability (99.99%) and latency (99%ile < 100 ms) SLOs for 30+ minutes TODO
TODO list and bugs filed:
- Run MapReduce job to reindex Shakespeare corpus DONE
- Borrow emergency resources to bring up extra capacity DONE
- Enable flux capacitor to balance load between clusters (Bug 5554823) TODO
Incident timeline (most recent first: times are in UTC)
- 2015-10-21 15:28 UTC jennifer
  - Increasing serving capacity globally by 2x
- 2015-10-21 15:21 UTC jennifer
  - Directing all traffic to USA-2 sacrificial cluster and draining traffic from other clusters so they can recover from cascading failure while spinning up more tasks
  - MapReduce index job complete, awaiting Bigtable replication to all clusters
- 2015-10-21 15:10 UTC martym
  - Adding new sonnet to Shakespeare corpus and starting index MapReduce
- 2015-10-21 15:04 UTC martym
  - Obtains text of newly discovered sonnet from shakespeare-discuss@ mailing list
- 2015-10-21 15:01 UTC docbrown
  - Incident declared due to cascading failure
- 2015-10-21 14:55 UTC docbrown
  - Pager storm, ManyHttp500s in all clusters