Teaching a new way to prevent outages at Google

From a young age, I enjoyed the detective work of diagnosing and fixing a broken system–electronics, in my case. There was something fulfilling about taking a silent radio and getting it playing again, sometimes with only a few dollars' worth of replacement parts. So, it wasn't a stretch to shift from post-failure analysis to pre-failure analysis in my first job out of college, as a microprocessor validation engineer. I ran tests on a simulator to find hardware bugs before the chip went into production and the cost to fix problems increased exponentially. I'll always remember the senior engineer who told me to put on my "evil" hat and try to break the chip by throwing the unexpected at it. But how do you come up with the unexpected? Better still, how do you know where to even start looking for possible issues?

Now I'm at Google, where the system complexity is even greater, and Site Reliability Engineers (SREs) work to prevent outages on a planet-scale computer with billions of users. Thankfully, Google has seen increasing success in finding issues so they can be fixed before they cause an outage. We're using System Theoretic Process Analysis (STPA), a paper-and-pencil method that has been successfully used in many other industries since its creation in the early 2000s, to analyze pure software systems and discover the unknown unknowns (risks of which you are unaware and not actively seeking).

In order to scale STPA at Google, we need more experts trained in applying STPA to Google's software systems. In this article, I'll discuss Google's development of custom, in-house STPA training, as well as what we've learned about STPA education in a pure software environment.

STPA in one paragraph

STPA uses system and control theory to model the control-feedback loops in a complex system. It treats system safety as a control problem, and looks for all the ways that control actions in the system might cause the system to enter an unsafe state. Instead of trying to think of all the discrete actions that would immediately cause an outage, and then trying to prevent them, STPA focuses on defining the unsafe system states that could lead to an outage in worst-case conditions. By working to understand why these unsafe control actions might occur, STPA enables us to discover complex, unintended system interactions, which are often the cause of system outages. Once we understand how the system gets into an unsafe state, we can design and implement controls to prevent it, thus preventing the outages associated with unsafe operation. Or, we can detect the unsafe state and take action to resume safe operation. If automatic correction isn't feasible, we can at least alert the humans who are part of the system to the unsafe situation.

Why does Google need custom STPA training?

Given the success Google has had in running STPA—discovering previously unknown issues, and fixing them before they can cause outages—it's clearly in Google's interest to develop more in-house STPA expertise and scale our efforts. There are plenty of existing STPA educational materials and many external consultants who can provide in-person training, so why does Google need to develop custom training? To answer this question, I need to give a bit of history.

STPA training at Google started in 2021 with an initial class for a group of 40 interested Googlers. The interest and momentum spread, and we decided to start hosting instructor-led training sessions based on existing materials. There are a lot of compelling STPA examples from other industries–stories of disasters and eye-opening lessons learned from applying STPA. However, when we presented these examples of physical systems (such as the Mars Polar Lander crash) to Google audiences, the response we got was, "That's interesting, but I don't see how it applies to my pure software system." So, it was clear that we needed some software examples, and even better, some examples of STPA applied to Google systems.

Early training efforts

Even with examples from your own industry or company, STPA training can seem like a daunting task. Successful application of STPA requires significant effort to learn the theory. Then, you need guidance or mentorship from someone with STPA experience until you've gained experience yourself. So, we decided to start with one part of STPA that can stand on its own–the concept of a control structure, modeling a system with control-feedback loops.

Figure 1: Basic control-feedback loop

The basic control-feedback loop consists of a controller and a controlled process. We want to keep the controlled process from entering an unsafe state. For example, perhaps we want to keep the temperature of a chemical manufacturing process within a safe range. Or, maybe the controlled process is a database of user-generated content, like business reviews, and we want to keep it free of incorrect or abusive content.

The controller issues control actions to change the state of the controlled process. So, for our database, the controller might issue a control action to add, update, or remove content. But how does the controller know what control action to issue? It knows because it has some idea of the current state of the controlled process, and it gets this idea from feedback. Based on this feedback, it updates its model of the controlled process, then applies an algorithm (a software algorithm, in this case) to determine what control action to issue.
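To make the loop concrete, here's a minimal Python sketch of the controller-and-controlled-process relationship described above, using the review-database example. All names here (ReviewDatabaseController, Feedback, and so on) are invented for illustration, not actual Google code:

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    abusive_review_count: int  # observed state of the content database

class ReviewDatabaseController:
    def __init__(self):
        # Process model: the controller's belief about the database state.
        self.believed_abusive_count = 0

    def update_model(self, feedback: Feedback) -> None:
        # Feedback path: refresh the process model from observations.
        self.believed_abusive_count = feedback.abusive_review_count

    def decide_control_action(self) -> str:
        # Control algorithm: choose an action based on the model,
        # not on the (possibly different) true state of the process.
        if self.believed_abusive_count > 0:
            return "remove_abusive_content"
        return "no_action"

controller = ReviewDatabaseController()
controller.update_model(Feedback(abusive_review_count=3))
print(controller.decide_control_action())  # -> remove_abusive_content
```

Note that the controller acts on its *model* of the process, not the process itself: if the feedback path is missing or wrong, the model drifts from reality and the controller can issue unsafe control actions while believing it is behaving correctly.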

A control structure consists of a number of these basic control loops, interconnected by their control actions and feedback paths. We've seen that just the act of building this model causes a shift in design thinking to include system-level feedback. While most software developers do a thoughtful job of designing the control path–even without knowledge of control theory–they spend less time (if any!) designing the feedback path. When introducing STPA to new teams, we frequently identify missing or inadequate feedback in the control structure for their system, which starts a fruitful discussion on how to improve their design.

We designed and delivered our control structure training. However, we discovered that teaching people to create a useful control structure in a limited period of time is very difficult. In particular, teaching the right level of controller abstraction to enable meaningful analysis was a major challenge for a 2-day workshop. Though participants in our workshop got a solid start, it takes time, experience creating control structures, and guidance from experts to become skilled at modeling a system with control-feedback loops. Also, though people found the class interesting, it was less compelling without the other steps of STPA, where the control structure plays a key role in discovering design flaws and unsafe system behavior. Finally, perhaps the biggest challenge was that the seven participants in the initial class built control structures for their seven completely different software systems, and we, as the instructors, had our hands full learning enough about each complex system to offer meaningful feedback.

Growing as educators

After the control structures course, we decided it would be more appealing and effective to teach all the steps of STPA. Our goal was to get Googlers to the point where they could get a strong start on their own, running STPA for their system. With software developers making this initial effort–learning STPA and doing some of the analysis themselves–things would go more smoothly when an STPA expert joined. This strategy would enable faster generation of high quality results–namely, a list of system design issues waiting to cause outages, as well as recommendations to fix these issues. So, we started building one-hour courses for each step of STPA.

As previously mentioned, we had enough feedback about non-Google STPA examples to know they weren't compelling for our audience. So, as we built the new training, we adopted a general approach: use a non-Google example to introduce the concepts and process for each STPA step, then follow it up with a real example of applying that step to a Google system. The real Google examples were well received, but we also learned something else about motivating our audience.

Learning what motivates our audience

As we held training sessions for more Googlers, spoke to more SREs, and led STPA projects for more teams, some powerful themes began to emerge that consistently resonated with our audience. The first, as mentioned before, was the concept of feedback. We started using examples from Google STPAs where a feedback path, or in some cases the lack of a feedback path, caused problems. There were examples of feedback from one software component to another, but also examples of missing or incomplete feedback to humans in the system.

In one particular case at Google, a software controller–acting on bad feedback from another software system–determined that it should issue an unsafe control action. It scheduled this action to happen after 30 days. Even though there were indicators that this unsafe action was going to occur, no software engineers–humans–were actually monitoring the indicators. So, after 30 days, the unsafe control action occurred, resulting in an outage. In this one outage, there were two feedback issues–bad feedback from one piece of software to another, and missing feedback from the software to the engineers. We've used STPA to analyze several outages that involved missing or incorrect feedback to humans. As with this outage, where the unsafe condition existed for 30 days, many of the others could have been easily avoided if humans in the system had been made aware of some unsafe system state. This is one of the core principles of STPA–you have a better chance of preventing an outage if you focus on preventing, or at least detecting, the unsafe system states that lead to an outage.
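The missing detection in this story can be sketched in a few lines: scan scheduled actions and surface any flagged-unsafe action that has not yet fired, so humans learn about the unsafe state during the 30-day window rather than after the outage. The data shapes, field names, and the action itself are invented for illustration:

```python
import datetime

def pending_unsafe_actions(scheduled, now):
    """Return scheduled actions flagged unsafe that have not yet fired."""
    return [a for a in scheduled if a["unsafe"] and a["fires_at"] > now]

# Hypothetical schedule: one unsafe action set to fire in 30 days.
scheduled = [
    {"name": "decommission_cluster", "unsafe": True,
     "fires_at": datetime.datetime(2024, 2, 1)},
]
now = datetime.datetime(2024, 1, 2)

for action in pending_unsafe_actions(scheduled, now):
    # In a real system this would page or email the owning engineers,
    # rather than relying on someone happening to watch a dashboard.
    print(f"ALERT: {action['name']} fires on {action['fires_at']:%Y-%m-%d}")
```

The point is not the specific check but the feedback path it creates: the unsafe system state becomes visible to the humans in the loop while there is still time to act.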

Over and over, we would hear from other Googlers that these examples had caused a shift in how they approached system design. Focusing not just on control paths, but also on feedback paths, led to increased awareness of other pieces of software infrastructure that interacted with their own piece. For example, instead of assuming that a neighboring system would always behave perfectly, people started asking, "What if that system passes me bad or incomplete information, or doesn't get the information to my system at the right time?" The Ariane 5 rocket, which veered off course and self-destructed on its first launch, is a famous example of a software feedback issue. Among the many variables in the code was a floating point value related to the rocket's horizontal velocity. This is quite reasonable, but buried in the code was a handoff from one piece of software to another–feedback–where the receiving software interpreted this value as a 16-bit integer. The designers had confidence in this receiving piece of system software that used integer math, since it had functioned perfectly well in the Ariane 4. System thinking leads designers to ask questions like, "What happens if I get an integer error due to a value passed from an adjacent software system?" In the case of Ariane 5, one conclusion is that this kind of error makes it unsafe to treat the received value as the true state of the rocket, which is what actually happened in the disaster.
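The Ariane-style failure class–a floating point value forced into a 16-bit integer it doesn't fit–is easy to demonstrate. This Python sketch (not the actual Ariane code, which was written in Ada) contrasts an unchecked conversion that silently wraps, as unchecked casts can in some languages, with a defensive one that refuses out-of-range input:

```python
def to_int16_wrapping(value: float) -> int:
    # Unchecked truncation into a signed 16-bit integer: silently
    # wraps on overflow instead of reporting an error.
    n = int(value) & 0xFFFF
    return n - 0x10000 if n >= 0x8000 else n

def to_int16_checked(value: float) -> int:
    # Defensive version: reject values that don't fit in 16 bits.
    if not -32768 <= value <= 32767:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return int(value)

print(to_int16_wrapping(40000.0))  # -> -25536 (silently corrupted)
print(to_int16_checked(100.0))     # -> 100
```

Either failure mode–a wrapped value or a raised error–means the receiving software can no longer trust the value as true feedback about the system's state, which is exactly the question system thinking prompts designers to ask.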

Dataflow diagrams vs. control structures

Given that inadequate or missing feedback came up often in STPA at Google, and that thinking about feedback frequently made an impression on the system designers, we shifted our messaging. We started focusing on STPA's ability to analyze not only the control path, but also the feedback path–a part of software design that, as we've seen, often gets little to no attention.

The focus on feedback happens naturally in STPA because we model the system as a control structure–several control loops connected by control actions and feedback. This type of system model contrasts greatly with another model frequently used in software design: the dataflow diagram. These diagrams show how data moves between different software components, e.g. via remote procedure calls. Unlike control structures, dataflow diagrams do not indicate whether data is control or feedback. They also don't establish the control hierarchy–which pieces of software control the state of other pieces of software. Because control structures show this hierarchy, they are often called hierarchical control structures. For software systems consisting of millions of lines of code, dataflow diagrams can be fiendishly complex.

Figure 2: Example dataflow diagram–where are the flaws?

At the start of this article, I mentioned that it's often not obvious where to start looking for issues in a complex system. If you were asked to analyze the dataflow diagram above–33 boxes with a spider web of arrows connecting them–how would you determine what outages might occur due to complex system interactions? For example, many of the boxes at the bottom and top of the diagram only have arrows going into them, not out. If these software components were to go down, how would it affect the system? Even if none of these components were to go down, how do we know there wouldn't be outages due to unforeseen, complex interactions between the components? If you don't know what these software components do, it's difficult to say. Worse, this example dataflow diagram is relatively simple compared to typical software system complexity, especially at Google.

Figure 3: Example control structure

Using a control structure to model the control-feedback relationships in a system allows for a level of abstraction that makes this analysis–searching for complex system interactions that could lead to an outage–practical. A typical control structure for a Google system has 10–15 boxes, and meaningful analysis can be done with even fewer boxes. The control structure above, with only 4 boxes, is from an actual Google STPA. After working with the system experts to build this control structure, we immediately noticed missing feedback from controller C to controller B–in other words, controller B did not have enough information to support the decisions it needed to make. As we continued running STPA, we also discovered additional system design issues that might lead to outages.

Control structures make it clear what the goal of each software controller is, what other pieces of software it controls, and where it gets feedback. Each arrow is labeled with the control actions or feedback passing between different parts of the system, making it easy to see whether each controller is getting the information it needs to issue the right control actions. Most importantly, we can list the context under which a particular control action would be unsafe, leading to an unsafe system state and possibly an outage.
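One way to see why this labeling helps: a control structure can be represented as a small set of typed edges, which makes questions like "does this controller receive any feedback at all?" and "in what context is this control action unsafe?" mechanically checkable. The sketch below is hypothetical–the component names and the unsafe-context rule are invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    kind: str   # "control" or "feedback"
    label: str  # the control action or feedback on this arrow

# A toy control structure: a scheduler controls a database,
# and the database reports replica health back as feedback.
edges = [
    Edge("Scheduler", "Database", "control", "delete_replica"),
    Edge("Database", "Scheduler", "feedback", "replica_health"),
]

def feedback_into(controller: str) -> list[Edge]:
    # Does this controller receive feedback to update its process model?
    return [e for e in edges if e.target == controller and e.kind == "feedback"]

def unsafe_context(action: str, healthy_replicas: int) -> bool:
    # Context under which a control action is unsafe: deleting a
    # replica is unsafe when it would leave no healthy replica serving.
    return action == "delete_replica" and healthy_replicas <= 1

print([e.label for e in feedback_into("Scheduler")])  # -> ['replica_health']
print(unsafe_context("delete_replica", healthy_replicas=1))  # -> True
```

An empty result from a check like `feedback_into` is exactly the kind of missing-feedback finding described earlier–a controller making decisions with no way to update its model of the process it controls.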

Again, in training and working with many different Googlers to create control structures, we hit on a convincing message: In the process of applying STPA, you are effectively narrowing down the search for issues from millions of lines of code to a few hundred lines. This happens through identifying where in the software the decisions happen that lead to unsafe behavior. We do this by building scenarios that lead to unsafe control actions–the scenarios literally point to the lines of code responsible for a possible outage!

Putting it all together

Thanks to incorporating real Google examples in our training, and finding the right message to engage and motivate our audience, we were getting noticed by more teams at Google. In addition, the feedback continued trending in a positive direction. One participant said the following:

"The class itself is very well structured. I've heard about STPA in past years, but this was the first time I saw it explained with concrete examples. The Google example at the end was also really helpful."

One team even asked for a custom workshop to run a mini-STPA on their system. As we built out our training, a major accomplishment was a new three-day workshop focused not just on control structures, but on all the steps of STPA, built around another real Google example. However, now we ran into some new challenges.

The first challenge was that, although interest was high and registration filled up within a few days of announcing the first workshop, only about half the registrants showed up. This can probably be attributed to the difficulty in budgeting three days of time for training. To address this challenge, we decided on a stepped approach. We developed two STPA tutorials of 30 and 60 minutes, again built on real Google examples, showing impactful results. At the end of each tutorial, we advertised the workshop. The goal was to reach as many people as possible with these tutorials (it's easier to find 30 minutes in your day than to find several hours), then have attendees self-select for the workshop—those who are the most motivated from these initial tutorials will sign up for the workshop and actually attend.

The second challenge was that, although folks who attended the workshop enjoyed it and expressed continuing interest in STPA, for the most part they didn't actually try running STPA on their own system. That brings us to the latest phase of STPA training development at Google: we are building a self-serve internal version of our workshop, with a series of short recordings, including homework assignments. As Googlers watch the videos, they'll fill out templates with additional guidance to start STPA on their own system. Our hope is that working incrementally, doing each part of STPA right after watching the corresponding training video, will be less intimidating. By the time a Googler completes the workshop, they'll have a good start on STPA for their system, and as mentioned earlier, this will allow quick generation of high quality results when an STPA expert joins the analysis. Even better, we can take the self-selection principle one level higher, and hopefully find Google's next cadre of STPA experts–early adopters who will help scale STPA at Google by championing it to their individual teams.

You can do it too!

The more you learn about STPA, and the more you see results from successful application of STPA to your business or industry, the more you realize you can't afford not to use it. It's a convincing tool for tackling the unknown unknowns that are everywhere in today's increasingly complex systems. From a few individuals or a small team running the first STPA for your company, you can build awareness and expertise, and expand the use of STPA, as many others have done. For software companies looking to adopt STPA, there's no substitute for building internal training based on actual software examples–this will enable your software engineers to really see the benefits of STPA and commit to the process of learning it. Yes, there will be challenges along the way, just as Google has faced, but remember–most software companies are great at pivoting to address changing priorities and strategies. If your STPA training falls flat, pivot and try again! Before you know it, you'll be reaping the benefits of improved system reliability and safety.

For more details on SRE's adoption of STPA at Google, the theory behind STPA, and a case study of STPA application at Google, see "The Evolution of SRE at Google."