Measuring Reliability
...what got you here won't get you there
Štěpán Davidovič, Google
What is "reliability"? It's a fuzzy concept!
Lagging Indicator
- User's perception of how the system worked for them
- It has already impacted the business, cannot be changed
- It's user experience, and we should know about it
"The system didn't work an hour every day of the last week, that sucks."
Leading Indicator
- Risk property of the system as it stands
- Not yet realized, can be changed
"The system is at 99% of disk usage. It works now, but it might break any minute!"
- Leading Indicator is important, but out of scope here
- This slide deck will focus on the ability to measure what has already happened, and draw conclusions from that.
Measure reliability?
"Define your SLIs/SLOs!"
...let's look at that more!
The SLO Model Recap
Service Level Indicator (SLI)
Time series data which can tell us how good the level of service is. Often from logs or sampled counters.
Examples:
- tuple: {HTTP 500s, all HTTP responses}
- ratio: responses under 200ms / all responses
Service Level Objective (SLO)
Predicate on a mathematical function applied to the SLI. It has free parameters; the aim is to keep the predicate true.
Mathematical example: (responses under 200ms / all responses) ≥ target, evaluated over a chosen window
Organizational example: "We hit/missed our SLO last quarter"
Answering our reliability questions
When we say "measure reliability", we want our data to give us some insight. We are answering questions using the data available to us.
But there is no single reliability question! An engineer oncall is in a different situation than a CEO strategizing.
"The purpose of computing is insight, not numbers." (Richard Hamming)
Our reliability questions?
Some illustrative examples:
- Oncall engineer responds to and mitigates an incident. Did their action help?
- Team manager holds weekly production review team meetings: are there creeping problems?
- Customer support asks whether a specific customer is having problems.
- SVP wants to understand customer-perceived reliability of the product portfolio before meeting with the customer.
- CEO wants to know if the company's reliability is getting worse. Do we need to pivot?
Illustration #1: Oncall Engineer
The engineer got paged with the alert "SLI_Suddenly_Awful".
After an hour of debugging, the engineer tried a mitigation.
How do they know whether it helped?
If it were me:
- Wait for a while
- Look at whether the SLI has recovered to above SLO
(or maybe even to previous levels)
We built an intuitive, ad-hoc model to answer our question!
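A minimal sketch of that ad-hoc check; the SLI series, the SLO target, and the waiting period below are hypothetical:

from statistics import mean

# Sketch of the intuitive "did my mitigation help?" model described above.
def mitigation_helped(sli_series, mitigation_idx, slo_target, wait_points=10):
    """Wait for a while after the mitigation, then check whether the SLI
    has recovered to above the SLO target."""
    post = sli_series[mitigation_idx + wait_points:]
    if not post:
        return False  # not enough data yet: keep waiting
    return mean(post) >= slo_target

# Example: the SLI dips during an incident, a mitigation is applied at
# index 5, and the SLI recovers afterwards.
series = [0.999, 0.999, 0.90, 0.85, 0.84, 0.86, 0.95, 0.998, 0.999, 0.999,
          0.999, 0.999, 0.999, 0.999, 0.999, 0.999]
print(mitigation_helped(series, mitigation_idx=5, slo_target=0.999, wait_points=5))  # True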
Illustration #2: CEO wonders
Comparing how many SLOs were met month to month
CEO is wondering whether reliability of the company's product is getting worse, and new reliability work needs to be prioritized.
How do they interpret the reliability data?
Maybe they compare how many SLOs were met, month to month?
Let's say we have 200 SLOs; in the most recent month 12 were not met, compared to 3 and 4 in earlier months. Are we getting less reliable?
Naive answer: Yes, because 12 SLOs not met (6.0% of the total) is (a lot) more than 3 (1.5%) or 4 (2%).
Better answer: We can't tell by interpreting this data this way.
Illustrating using a binomial model:
- Let's say the probability of not meeting any given SLO in a month is 1/24
- Then the 95% confidence interval runs from 3 to 13 SLOs not met, even if the underlying reliability doesn't change
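A quick way to check this claim, assuming scipy is available and the same deliberately crude model (each SLO missed independently with probability 1/24 per month):

from scipy.stats import binom

n_slos, p_miss = 200, 1 / 24

# Expected number of SLOs not met in a month: about 8.3.
print(binom.mean(n_slos, p_miss))

# Probability that the monthly count of missed SLOs lands anywhere between
# 3 and 13, even though the underlying reliability never changes: roughly 95%.
print(binom.cdf(13, n_slos, p_miss) - binom.cdf(2, n_slos, p_miss))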
The naive idea was non-obviously dangerous. The more impactful (= costly) the decisions, the more important it is to check your methods!
This illustration model is very flawed (SLOs are typically not IID, etc.). It's only an illustration, prompting us to go beyond the naive answer.
SLI/SLO model got us here, but…
We need more models. The examples show that the SLO model alone isn't sufficient. In practice we build ad-hoc models in our heads; they are intuitive, but sometimes dangerously wrong.
The SLI/SLO model helped us make good progress in reliability! But what got us here won't get us there. The illustrations were strawmen, but the problem is real.
To figure out our next steps, let's understand some of the limitations and assumptions of SLIs and SLOs.
Error Budgets Have Error Margins
That's okay! But do you know what yours are?
We establish our "error budget" based on acceptable losses
- 100% availability is fanciful
- Since we make 1M USD/yr, we set a failure budget of 10K USD
We set our SLO to correspond to our error budget
- We make 1M USD/yr, so a 10K USD budget implies a 99% SLO
We measure incident impact - but inaccuracy is a problem!
- Impact assessment may be inaccurate, e.g. by an order of magnitude!
- Estimate your inaccuracy: Ask three independent incident reviewers for impact estimate, and observe variance
- This is a problem even without any black swan events!
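A small sketch of the arithmetic above, plus the suggested spread check across reviewers; the revenue and budget figures match the example, while the reviewer estimates are invented:

from statistics import mean, stdev

annual_revenue_usd = 1_000_000
acceptable_loss_usd = 10_000

# 10K of acceptable loss out of 1M revenue implies a 1% error budget,
# i.e. a 99% SLO.
error_budget = acceptable_loss_usd / annual_revenue_usd
print(f"error budget: {error_budget:.1%}, implied SLO: {1 - error_budget:.0%}")

# Estimate your inaccuracy: ask three independent reviewers for the impact
# of the same incident, and look at the spread (figures invented).
impact_estimates_usd = [2_000, 9_000, 25_000]
print(f"mean: {mean(impact_estimates_usd)}, stdev: {stdev(impact_estimates_usd):.0f}")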
SLO Model Assumes Linearity
...in time and space!
Aggregating upwards hides bad behavior
- What if your product never works for a handful of users?
- For SLI/SLO, it's the same as if it didn't work a little bit for all users!
SLIs assume all requests are equal
- They can have different costs, user utility, or revenue
- Human-curated SLI grouping helps, e.g. group by API call or location
Human-designed grouping is not always possible
- Example: Free-form SQL queries for a database, or arbitrary input video formats for video encoder
SLO model aggregates over time
- 1000x 1-minute full outages is equal to...
- ...1x 1000-minute full-outage?
- ...2x 1000-minute half-outages?
To your users and your business, this difference may matter
Aggregation to a global (or zonal) SLO can hide severe problems!
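A toy sketch of that aggregation problem; the window length and outage patterns are invented, but all three produce exactly the same aggregate SLI:

# Very different outage patterns consume the same error budget, so the
# aggregate SLI cannot tell them apart.
WINDOW_MINUTES = 40_000  # some SLO window, in minutes

def aggregate_sli(outages):
    """Each outage is (duration_minutes, fraction_of_traffic_failing)."""
    bad_minutes = sum(duration * fraction for duration, fraction in outages)
    return 1 - bad_minutes / WINDOW_MINUTES

patterns = {
    "1000x 1-minute full outages": [(1, 1.0)] * 1000,
    "1x 1000-minute full outage": [(1000, 1.0)],
    "2x 1000-minute half outages": [(1000, 0.5), (1000, 0.5)],
}
for name, outages in patterns.items():
    print(f"{name}: SLI = {aggregate_sli(outages):.3f}")  # 0.975 for all three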
SLIs Aren't The Best Data
...they are the easiest data
Practical SLIs are often limited to sources which have:
- High sample rate
- Low cost to sample (and interpret)
- Low sampling latency
We should integrate other useful sources:
- Complaints on Twitter
- Crowdsourced outage reporting
- Direct customer feedback
These are valuable signals we shouldn't ignore! Could we integrate them into our day-to-day reliability management?
Know Your Tools
This talk isn't to say SLIs or SLOs are bad - or good. The SLOs ecosystem is only a tool, and the only question for a tool is whether it fits the need.
You should understand your needs. They might be answering questions with data, shaping organizational design, or many other things!
This talk is about understanding when your tools apply for answering questions, and building new ones if you need them.
Operationalization
"...defines a fuzzy concept so as to make it clearly distinguishable, measurable, and understandable by empirical observation..."
Reliability Measurement Models
...operationalization using three "simple" steps
Identify your key reliability questions
- Some are generic (e.g. need to alert), many aren't
- Be precise, and think of cost of consequent action
Build a model for each question
- This is creative, and hard work
- Consider how hard it is to agree on a model to answer the question "given this data, should we alert someone?"
- ML techniques are tempting, but beware their caveats
Backtest your models against historical data
- For boolean questions, you can get a confusion matrix
- Identify model shortcomings, and iterate
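A minimal backtesting sketch for a boolean question such as "given this data, should we have alerted?"; the model, the hindsight labels, and the history are placeholders:

from collections import Counter

def model_should_alert(sli_window, slo_target=0.999):
    """Placeholder model: alert if the windowed SLI dips below the SLO target."""
    return min(sli_window) < slo_target

def backtest(history):
    """history: list of (SLI window, 'should we have alerted?' judged in hindsight)."""
    outcomes = Counter()
    for sli_window, should_have_alerted in history:
        outcomes[(model_should_alert(sli_window), should_have_alerted)] += 1
    return outcomes  # (predicted, actual) counts: the confusion matrix

history = [
    ([0.999, 0.999, 0.999], False),  # quiet period, no alert needed
    ([0.999, 0.950, 0.980], True),   # real incident, alert was warranted
    ([0.999, 0.998, 0.999], False),  # brief dip, alert not warranted
]
print(backtest(history))  # one true positive, one true negative, one false positive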
Good news: We're doing it already!
"SLO alerting" is an example of building a fresh model
- Input is SLI data, output is a boolean answer
- Frequent topic of articles and discussions
- Alerting decision is made frequently
Models for identifying unusual behaviors, such as:
- Anomaly detection in monitoring solutions
- SRECon'21 talk "Beyond Goldilocks Reliability"
- But beware: "unusual" is not automatically "bad"!
- Cost of being wrong drives accuracy requirements
However, models for high-level decisions are hard
- Typically very infrequent decisions
- Not always clear what should've been done, even in hindsight
Conclusion
What got you here won't get you there!
SLI/SLO model is a helpful hammer, but not everything is a nail
- Understand what questions you need answered!
- Match your tool to that, don't start with a tool
Build models, and backtest them (...and publish them?)
- Start with just three reliability questions
- Backtesting is sometimes hard; "what should we have done?" is not always accurate or available
- Think of the cost of the answer being wrong, and be ready
Include new or external data in your reliability day-to-day practice
- Complaints on Twitter as a regularly measured quantity? :-)
- Ideally: used as input to your regularly exercised models
See also:
- Incident Metrics in SRE (O'Reilly, 2021)
- The VOID Report (Verica, 2024)
- ML for Operations (USENIX ;login:, 2020)
- How to Measure Anything (Wiley, 2020)
Thanks to: Ben Appleton, Kristina Bennett, Brent Bryan, Brendan Gleason, Paul Holden, Jennifer Mace, Jake McGuire, Niall Richard Murphy, Courtney Nash, Alex Rodriguez, Dylan Vener, Salim Virji
For their review and thoughts on this (or preceding) material
Illustrations by: Allyssa Jill Olivan (@kleinebean)
allyssajillolivan.myportfolio.com