Measuring Reliability

What got you here won't get you there

The SLOs ecosystem is just one tool we can use to measure reliability... This talk is about understanding when your tools apply for answering questions, and building new ones if you need them.

Štěpán Davidovič, Google

for SREcon EMEA 2022, Oct 25th-27th

What is "reliability"? It's a fuzzy concept!

Lagging Indicator

User's perception of how the system worked for them
It has already impacted the business, cannot be changed
It's user experience, and we should know about it

"The system didn't work an hour every day of the last week, that sucks."

Leading Indicator

Risk property of the system as it stands
Not yet realized, can be changed

"The system is at 99% of disk usage. It works now, but it might break any minute!"

Leading Indicator is important, but out of scope here
This slide deck will focus on the ability to measure what has already happened, and draw conclusions from that.

Measure reliability?

"Define your SLIs/SLOs!"

...let's look at that more!

The SLO Model Recap

Service Level Indicator (SLI)

Time series data which can tell us how good the level of service is. Often from logs or sampled counters.

Examples:

tuple: {HTTP 500s, all HTTP responses}
ratio: responses under 200ms / all responses

Service Level Objective (SLO)

Predicate on a mathematical function applied to the SLI. Has free parameters. The aim is to keep this predicate true.

Mathematical example:

Organizational example:

"We hit/missed our SLO last quarter"
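
A minimal sketch of the recap in Python. The names (`Sample`, `availability`, `slo_met`), the targets, and the sample data are assumptions for illustration, not taken from the talk. The SLI is the tuple {HTTP 500s, all HTTP responses} per interval; the SLO is a predicate over it, with the target and the window as free parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One SLI data point: the tuple {HTTP 500s, all HTTP responses}."""
    errors: int
    total: int


def availability(samples):
    """Mathematical function applied to the SLI: fraction of good responses."""
    total = sum(s.total for s in samples)
    errors = sum(s.errors for s in samples)
    return 1.0 if total == 0 else (total - errors) / total


def slo_met(samples, target):
    """The SLO predicate. Free parameters: `target`, and the window
    (i.e. which samples the caller passes in)."""
    return availability(samples) >= target


# Hypothetical window of three intervals.
window = [
    Sample(errors=2, total=10_000),
    Sample(errors=0, total=12_000),
    Sample(errors=25, total=8_000),
]

print(f"availability: {availability(window):.4f}")
print("SLO met at 99.9%: ", slo_met(window, target=0.999))
print("SLO met at 99.95%:", slo_met(window, target=0.9995))
```

The same SLI data passes one target and fails the other: "we hit/missed our SLO" always depends on the chosen parameters.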

Answering our reliability questions

When we say "measure reliability", we want our data to give us some insight. We are answering questions by using our available data.

But there is no single reliability question! An engineer oncall is in a different situation than a CEO strategizing.

The purpose of computing is insight, not numbers.

- Richard Hamming

Our reliability questions?

Some illustrative examples:

Oncall engineer responds to and mitigates an incident.
Did their action help?
Team manager holds weekly production review meetings.
Are there creeping problems?
Customer support asks whether a customer has problems.
SVP wants to understand customer-perceived reliability of product portfolio before meeting with the customer.
CEO wants to know if company's reliability is getting worse.
Do we need to pivot?

Illustration #1: Oncall Engineer

Engineer got paged with the alert "SLI_Suddenly_Awful".
After an hour of debugging, engineer tried a mitigation.

How do they know if it helped?

If it were me:

Wait for a while
Look at whether the SLI has recovered to above SLO
(or maybe even to previous levels)

We built an intuitive, ad-hoc model to answer our question!
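
Written down, the ad-hoc model might look like the sketch below. Everything in it is an assumption: the function name, the settling period, and the rule "every point after settling must be at or above the SLO". Making those hidden choices explicit is the point.

```python
def mitigation_helped(sli_after, slo_target, settle_points=3):
    """Ad-hoc model: wait for a while (skip `settle_points` data points),
    then require every remaining SLI point to be at or above the SLO.

    Returns False when there is no data left after the settling period.
    """
    settled = sli_after[settle_points:]
    return bool(settled) and all(value >= slo_target for value in settled)


# Hypothetical SLI readings taken after the mitigation, oldest first.
recovering = [0.90, 0.95, 0.985, 0.992, 0.995, 0.996]
flapping = [0.90, 0.95, 0.985, 0.992, 0.970, 0.996]

print(mitigation_helped(recovering, slo_target=0.99))  # True
print(mitigation_helped(flapping, slo_target=0.99))    # False
```

How long to wait, and whether one dip should count as failure, are judgment calls that this model makes silently.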

Illustration #2: CEO wonders

Comparing how many SLOs were met month to month

CEO is wondering whether reliability of the company's product is getting worse, and whether new reliability work needs to be prioritized.

How do they interpret the reliability data?

Maybe they compare how many SLOs were met, month to month?

Let's say we have 200 SLOs. Are we getting less reliable?

Naive answer: Yes, because 12 SLOs not met (6.0% of total) is a lot more than 3 (1.5%) or 4 (2.0%).

Better answer: We can't tell by interpreting this data this way.

Illustrating using a binomial model:

Let's say each SLO independently has a 1/24 probability of not being met in any given month
Then a roughly 95% interval is from 3 to 13 SLOs not met per month, even if the average reliability doesn't change
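
The binomial illustration can be checked numerically. A minimal sketch, assuming independent SLOs with an identical miss probability (the same simplification the illustration makes); the function and variable names are illustrative.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(exactly k of n independent SLOs are missed), each missed with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_between(lo, hi, n, p):
    """P(lo <= number of missed SLOs <= hi), inclusive on both ends."""
    return sum(binom_pmf(k, n, p) for k in range(lo, hi + 1))


N_SLOS = 200
P_MISS = 1 / 24

expected_misses = N_SLOS * P_MISS
mass_3_to_13 = prob_between(3, 13, N_SLOS, P_MISS)

# Monthly miss counts quoted in the naive answer.
observed_misses = [3, 4, 12]

print(f"expected misses per month: {expected_misses:.1f}")
print(f"P(3 <= misses <= 13):      {mass_3_to_13:.3f}")
print("all observed months inside the interval:",
      all(3 <= m <= 13 for m in observed_misses))
```

Under this model a month with 12 misses is unremarkable, so the month-to-month counts alone do not show that reliability changed.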

The naive idea was dangerous in a non-obvious way. The more impactful (= costly) the decision, the more important it is to check your methods!

This illustrative model is very flawed (SLOs are typically not IID, etc.). It's only an illustration, meant to prompt going beyond the naive answer.

Illustration by Allyssa Jill Olivan

SLI/SLO model got us here, but…

We need more models. The examples show that the SLO model alone isn't sufficient. In practice we build ad-hoc models in our heads, which are intuitive but sometimes dangerously wrong.

The SLI/SLO model helped us make good progress in reliability! But what got us here won't get us there. The illustrations were strawmen, but the problem is real.

To figure out our next steps, let's understand some of the limitations and assumptions of SLIs and SLOs.

Error Budgets Have Error Margins

That's okay! But do you know what yours are?

We establish our "error budget" based on acceptable losses

100% availability is fanciful
Since we make 1M USD, we set a failure budget of 10K USD

We set our SLO to correspond to our error budget

We make 1M USD/yr, so a 10K USD budget implies a 99% SLO

We measure incident impact - but inaccuracy is a problem!

Impact assessment may be inaccurate, e.g. by an order of magnitude!
Estimate your inaccuracy: Ask three independent incident reviewers for an impact estimate, and observe the variance
This is a problem even without any black swan events!
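
The arithmetic behind the budget, plus the reviewer exercise, as a rough sketch. It assumes revenue loss scales linearly with unavailability; the three reviewer estimates are invented for illustration.

```python
from statistics import mean, stdev

annual_revenue_usd = 1_000_000
failure_budget_usd = 10_000

# 10K out of 1M is 1% of revenue, which corresponds to a 99% SLO
# (assuming loss is proportional to unavailability).
slo_target = 1 - failure_budget_usd / annual_revenue_usd

# Hypothetical impact estimates (USD) for ONE incident,
# from three independent reviewers.
reviewer_estimates_usd = [2_000, 5_000, 20_000]

estimate_mean = mean(reviewer_estimates_usd)
estimate_spread = stdev(reviewer_estimates_usd)

# Share of the yearly budget this single incident consumed,
# depending on which reviewer you believe.
budget_used_low = min(reviewer_estimates_usd) / failure_budget_usd
budget_used_high = max(reviewer_estimates_usd) / failure_budget_usd

print(f"SLO target: {slo_target:.2%}")
print(f"mean estimate: {estimate_mean:.0f} USD, spread: {estimate_spread:.0f} USD")
print(f"budget consumed: between {budget_used_low:.0%} and {budget_used_high:.0%}")
```

With estimates an order of magnitude apart, the same incident either barely dents the budget or exhausts it twice over.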

SLO Model Assumes Linearity

...in time and space!

Aggregating upwards hides bad behavior

What if your product never works for a handful of users?
For SLI/SLO, it's the same as if it didn't work a little bit for all users!
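
A small sketch of why the aggregate cannot tell these cases apart. The user counts and request volumes are made up; `aggregate_sli` is an illustrative name.

```python
def aggregate_sli(per_user):
    """Good requests / all requests, summed over every user.

    `per_user` maps a user id to (good_requests, total_requests).
    """
    good = sum(g for g, _ in per_user.values())
    total = sum(t for _, t in per_user.values())
    return good / total


# 100 hypothetical users, 1,000 requests each.
# Scenario A: the product never works for one user; everyone else is perfect.
scenario_a = {f"user{i}": (0 if i == 0 else 1000, 1000) for i in range(100)}
# Scenario B: every user sees 1% of requests fail.
scenario_b = {f"user{i}": (990, 1000) for i in range(100)}

worst_a = min(g / t for g, t in scenario_a.values())
worst_b = min(g / t for g, t in scenario_b.values())

print(aggregate_sli(scenario_a), aggregate_sli(scenario_b))  # identical
print(worst_a, worst_b)  # very different worst-case user experience
```

Looking at the worst-served user (or any per-user distribution) separates the two scenarios; the aggregated SLI does not.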

SLIs assume all requests are equal

They can have different costs, user utility, or revenue
Human-curated SLI grouping helps, e.g. group by API call or location
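
To make the "all requests are equal" assumption concrete, here is a sketch comparing an unweighted SLI, a revenue-weighted one, and a human-curated grouping by API call. The request mix, the revenue figures, and the API call names are hypothetical.

```python
from collections import defaultdict

# Hypothetical request log entries: (api_call, succeeded, revenue_usd).
requests = (
    [("search", True, 0.01)] * 990
    + [("checkout", True, 50.0)] * 5
    + [("checkout", False, 50.0)] * 5
)


def unweighted_sli(reqs):
    """Every request counts the same."""
    return sum(ok for _, ok, _ in reqs) / len(reqs)


def revenue_weighted_sli(reqs):
    """Each request counts in proportion to the revenue it carries."""
    good = sum(rev for _, ok, rev in reqs if ok)
    return good / sum(rev for _, _, rev in reqs)


def sli_by_group(reqs):
    """Human-curated grouping: one SLI per API call."""
    good, total = defaultdict(int), defaultdict(int)
    for call, ok, _ in reqs:
        total[call] += 1
        good[call] += ok
    return {call: good[call] / total[call] for call in total}


print(f"unweighted:       {unweighted_sli(requests):.3f}")
print(f"revenue-weighted: {revenue_weighted_sli(requests):.3f}")
print(f"by API call:      {sli_by_group(requests)}")
```

The unweighted number looks healthy while half of the revenue-carrying calls fail; both the weighting and the grouping expose it, but each needs a human to choose the weights or the groups.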

Human-designed grouping is not always possible

Example: Free-form SQL queries for a database, or arbitrary input video formats for video encoder

SLO model aggregates over time

1000x 1-minute full outages are equal to...
...1x 1000-minute full outage?
...2x 1000-minute half outages?

To your users and your business, this difference may matter
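
The equivalence is easy to verify. A sketch, assuming a 30-day window and counting a half outage as half of the traffic failing; the names and the tuple layout are illustrative.

```python
def bad_minutes(outages):
    """Sum of 'fully bad' minutes.

    `outages` is a list of (count, duration_minutes, fraction_of_traffic_failing).
    """
    return sum(count * minutes * fraction for count, minutes, fraction in outages)


def window_availability(outages, window_minutes=30 * 24 * 60):
    """Availability over the window as a time-aggregated SLO would see it."""
    return 1 - bad_minutes(outages) / window_minutes


many_short = [(1000, 1, 1.0)]   # 1000x 1-minute full outages
one_long = [(1, 1000, 1.0)]     # 1x 1000-minute full outage
two_half = [(2, 1000, 0.5)]     # 2x 1000-minute half outages

for name, pattern in [("many short", many_short),
                      ("one long", one_long),
                      ("two half", two_half)]:
    print(f"{name:10s} bad minutes={bad_minutes(pattern):6.0f} "
          f"availability={window_availability(pattern):.4%}")
```

All three patterns consume exactly the same budget, even though a 16-hour outage and a thousand brief blips are very different experiences.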

Aggregation to a global (or zonal) SLO can hide severe problems!

SLIs Aren't The Best Data

...they are the easiest data

Practical SLIs are often limited to sources which have:

High sample rate
Low cost to sample (and interpret)
Low sampling latency

We should integrate other useful sources:

Complaints on Twitter
Crowdsourced outage reporting
Direct customer feedback

Valuable signals that should not be ignored! Could we integrate them into our day-to-day reliability management?

Know Your Tools

This talk isn't to say SLIs or SLOs are bad - or good. The SLOs ecosystem is only a tool. The only question for a tool is if it fits the need.

You should understand your needs. They might be answering questions with data, or organizational design, or many other things!

This talk is about understanding when your tools apply for answering questions, and building new ones if you need them.

Operationalization

"...defines a fuzzy concept so as to make it clearly distinguishable, measurable, and understandable by empirical observation..."

(Wikipedia, May 2022)

Reliability Measurement Models

...operationalization using three "simple" steps

Identify your key reliability questions
- Some are generic (e.g. need to alert), many aren't
- Be precise, and think of cost of consequent action
Build a model for each question
- This is creative, and hard work
- Consider how hard it is to agree on a model to answer the question "given this data, should we alert someone?"
- ML techniques are tempting, but beware their caveats
Backtest your models against historical data
- For boolean questions, you can get a confusion matrix
- Identify model shortcomings, and iterate
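
Backtesting a boolean model could look like the sketch below. The model (a fixed threshold on the SLI), the historical SLI values, and the "should we have alerted?" labels are all invented; in practice the labels are the hard part.

```python
from collections import Counter


def alert_model(sli_value, threshold=0.99):
    """Toy model for "given this data, should we alert someone?"."""
    return sli_value < threshold


def confusion_matrix(predicted, actual):
    """Count true/false positives/negatives for two equal-length boolean lists."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    counts = Counter(zip(predicted, actual))
    return {
        "tp": counts[(True, True)],    # alerted, and should have
        "fp": counts[(True, False)],   # alerted, but should not have
        "fn": counts[(False, True)],   # stayed silent, but should have alerted
        "tn": counts[(False, False)],  # stayed silent, correctly
    }


# Hypothetical history: SLI readings and hindsight labels from incident reviews.
historical_sli = [0.999, 0.95, 0.985, 0.998, 0.97, 0.992, 0.9995, 0.98]
should_have_alerted = [False, True, False, False, True, True, False, False]

predicted = [alert_model(v) for v in historical_sli]
matrix = confusion_matrix(predicted, should_have_alerted)

precision = matrix["tp"] / (matrix["tp"] + matrix["fp"])
recall = matrix["tp"] / (matrix["tp"] + matrix["fn"])
print(matrix, f"precision={precision:.2f} recall={recall:.2f}")
```

The false positives and false negatives are the shortcomings to iterate on, for example by changing the threshold or the model's inputs and re-running the backtest.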

Good news: We're doing it already!

"SLO alerting" is an example of building a fresh model
- Input is SLI data, output is a boolean answer
- Frequent topic of articles and discussions
- Alerting decision is made frequently
Models for identifying unusual behaviors, such as:
- Anomaly detection in monitoring solutions
- SREcon'21 talk "Beyond Goldilocks Reliability"
- But beware: "unusual" is not automatically "bad"!
- Cost of being wrong drives accuracy requirements
However, models for high-level decisions are hard
- Typically very infrequent decisions
- Not always clear what should've been done, even in hindsight

Conclusion

What got you here won't get you there!

SLI/SLO model is a helpful hammer, but not everything is a nail

Understand what questions you need answered!
Match your tool to that, don't start with a tool

Build models, and backtest them (...and publish them?)

Start with just three reliability questions
Backtesting is sometimes hard: the answer to "what should we have done?" is not always accurate or available
Think of the cost of the answer being wrong, and be ready for it

Include new or external data in your reliability day-to-day practice

Complaints on Twitter as a regularly measured quantity? :-)
Ideally: used as input to your regularly exercised models

Thanks to: Ben Appleton, Kristina Bennett, Brent Bryan, Brendan Gleason, Paul Holden, Jennifer Mace, Jake McGuire, Niall Richard Murphy, Courtney Nash, Alex Rodriguez, Dylan Vener, Salim Virji

For their review and thoughts on this (or preceding) material

Illustrations by: Allyssa Jill Olivan (@kleinebean)
allyssajillolivan.myportfolio.com