Incident Metrics in SRE - Google - Site Reliability Engineering

Written by:
Štěpán Davidovič

It is commonplace to measure the improvement that results from a process change, a product purchase, or a technological change. In reliability engineering, statistics such as mean time to recovery (MTTR) or mean time to mitigation (MTTM) are often measured. These statistics are sometimes used to evaluate improvements or to track trends.

In this report, we use a simple Monte Carlo simulation process (which can be applied in many other situations), together with statistical analysis, to demonstrate that these statistics are poorly suited for decision making or trend analysis in the context of production incidents. In their place, we propose better ways to obtain the same kinds of measurements in some contexts.
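To make the idea of a Monte Carlo simulation concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the report's actual procedure or data: the incident durations are synthetic (drawn from a log-normal distribution with arbitrarily chosen parameters), and the function name `simulate_mttr_differences`, the group size, and the 10% threshold are all hypothetical. The sketch repeatedly splits incidents from one and the same pool into a "before" and an "after" group, so any difference in their mean durations is due to chance alone.

```python
import random
import statistics


def simulate_mttr_differences(durations, incidents_per_group, trials, seed=0):
    """Draw two disjoint groups from the same pool of incident durations,
    many times over, and return the difference of group means per trial."""
    rng = random.Random(seed)
    differences = []
    for _ in range(trials):
        # Sample without replacement, then split into two equal groups.
        sample = rng.sample(durations, 2 * incidents_per_group)
        before = sample[:incidents_per_group]
        after = sample[incidents_per_group:]
        differences.append(statistics.mean(before) - statistics.mean(after))
    return differences


# Hypothetical data: 1000 synthetic incident durations in minutes.
# The log-normal shape and its parameters are assumptions for illustration.
data_rng = random.Random(42)
durations = [data_rng.lognormvariate(4.0, 1.2) for _ in range(1000)]

diffs = simulate_mttr_differences(durations, incidents_per_group=50, trials=2000)

# Fraction of trials in which the "after" group looks at least 10% better
# than the overall mean, even though nothing about the process changed.
overall_mean = statistics.mean(durations)
fraction = sum(d >= 0.1 * overall_mean for d in diffs) / len(diffs)
print(f"Apparent improvement of 10% or more by chance: {fraction:.1%} of trials")
```

Because both groups come from the same pool, the printed fraction estimates how often a mean-based statistic would suggest an improvement that did not happen. The same loop can be adapted to other statistics by swapping out `statistics.mean`.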