Andrew Clay Shafer
When I found out people were working on a second SRE book, I reached out and asked if I could write a few words. The principles from the first SRE book align so well with what I always imagined DevOps to be, and the practices are insightful, even when they aren’t 100% applicable outside of Google. After my first reading of the principles in that book (embracing risk in Chapter 3, service level objectives in Chapter 4, and eliminating toil in Chapter 5), I wanted to shout that message from the rooftops. “Embracing risk” resonated so much because I had used similar language many times to help traditional organizations motivate change. “Eliminating toil” was always an implicit DevOps goal, both to allow humans more time for creative higher-order work and to allow them to be more human. But I really fell in love with “service level objectives.” I love that the language and the process create a dispassionate contract between operational considerations and delivering new functionality. The SRE, SWE (software engineer), and business all agree that the service has to be up to be valuable, and the SRE solution quantifies objectives to drive actions and priorities. The solution (make the service level a target, and when you are below the target, prioritize reliability over features) eliminates a classic conflict between operations and developers. This is a simple and elegant reframing that solves problems by not having them. I have given these three chapters as a homework assignment to almost everyone I’ve met since. They are that good. Everyone should know. Tell all your friends. I’ve told all mine.
The last decade of my career has been focused on helping people deliver software with better tools and process. Sometimes people say I contributed to inventing DevOps, but I was just in the position to borrow and steal successful patterns from across many different organizations and projects. I get embarrassed when people say “DevOps” was invented by anyone, but especially by me. I don’t consider myself an expert in anything but being inquisitive. My idealized DevOps always patterned off whatever information I could extract or infer from my friends, and my friends happened to be building the internet. I had the privilege of behind-the-scenes access to people deploying and operating a representative sample of the world’s most incredible infrastructures and applications. DevOps symbolizes aspects of the emergent and existential optimizations required to rapidly deliver highly available software over the internet. The shift from software delivered on physical media to software delivered as a service forced an evolution of tools and processes. This evolution elevated operations’ contribution to the value chain. If the systems are down, the software has no value. The good news is, you don’t have to wait for shipping the next shrink-wrapped box to change the software. For some, this is also the bad news. I simply had the opportunity and perspective to articulate the most successful patterns of the new way to a receptive audience.
In 2008, before we used the word DevOps like we do now, I’d been through the dot-com collapse, grad school, and a couple of venture-funded rollercoaster rides as a developer, searching Google for answers daily the whole time. I was working on Puppet full-time and I was fascinated by the potential for automation to transform IT organizations. Puppet thrust me into solving problems in the operations domain. At this time, Google used Puppet to manage their corporate Linux and OS X workstations at a scale that pushed the capabilities of the Puppet server. We had a great working relationship with Google, but Google kept certain details of their internal operations secret as a matter of policy. I know this because I’m naturally curious and was constantly seeking more information. I always knew Google must have great internal tools and processes, but what these tools and processes were wasn’t always apparent. Eventually, I accepted that asking deep questions about Borg probably meant the current conversation wasn’t going very far. I would have loved to know more about how Google did everything, but this simply wasn’t allowed at the time. 2008 was also the year of the first O’Reilly Velocity conference and the year I met Patrick Debois. “DevOps” wasn’t a thing yet, but it was about to be. The time was right. The world was ready. DevOps symbolized a new way, a better way. If Site Reliability Engineering had been published then, I believe the community that formed would have rallied to fly the “eliminate toil” flag and the term DevOps might have never existed. Counterfactuals notwithstanding, I know the first SRE book personally advanced my understanding of the possible, and I have already helped many others just with the SRE principles.
In the early days of the DevOps movement, we consciously avoided codifying practices because everything was evolving so rapidly and we didn’t want to set limits on what DevOps could become. Plus, we explicitly didn’t want anyone to “own” DevOps. When I wrote about DevOps in 2010, I made three distinct points. First, developers and operations can and should work together. Second, system administration will become more and more like software development. Finally, sharing with a global community of practice accelerates and multiplies our collective capabilities. Around the same time, my friends Damon Edwards and John Willis coined the acronym CAMS for Culture, Automation, Metrics, and Sharing. Jez Humble later expanded this acronym to CALMS by adding Lean continuous improvement. What each of these words might mean in context deserves to be a full book, but I mention them here because Site Reliability Engineering explicitly references Culture, Automation, Metrics, and Sharing alongside anecdotes about Google’s journey to continuously improve. By publishing the first SRE book, Google shared their principles and practices with the global community. Now I define DevOps simply as “optimizing human performance and experience operating software, with software, and with humans.” I don’t want to put words in anyone’s mouth, but that seems like a great way to describe SRE as well.
Ultimately, I know DevOps when I see it, and I see SRE at Google, in theory and practice, as one of the most advanced implementations. Good IT operations has always depended on good engineering, and solving operations problems with software has always been central to DevOps. Site Reliability Engineering makes the engineering aspect even more explicit. I cringe when I hear someone say “SRE versus DevOps.” For me, they are inseparable in time and space, as labels describing the sociotechnical systems that deliver modern infrastructure with software. I consider DevOps a loose generic set of principles and SRE an advanced explicit implementation. A parallel would be the relationship between Agile and Extreme Programming (XP). True, XP is Agile, arguably the best of Agile, but not all organizations are capable of or willing to adopt XP.
Some say “software is eating the world,” and I understand why they do, but “software” alone is not the right framing. Without the ubiquity of computational hardware connected with high-speed networks, much of what we take for granted as “software” would not be possible. This is an undeniable truth. What I think many miss in this conversation about technology are the humans. Technology exists because of humans and hopefully for humans, but if you look a little deeper, you also realize that the software we rely on, and probably take for granted, is largely dependent on humans. We rely on software, but software also relies on us. This is a single interconnected system of imperfect hardware, software, and humans relying on themselves to build the future. Reliability is eating the world. Reliability is not just about technology, though, but also about people. The people and the technology form a single sociotechnical system. One nice feature about having Google share SRE with the rest of the industry is that any excuses about what kind of processes work at scale became invalid. Google set the highest standard for both reliability and scale. There might be valid arguments about why someone can’t adopt Google SRE practices directly, but the guiding principles should still apply. As I look at the landscape of possibilities to build the future and the ambition to transform human experience with software, I see a lot of ambitious projects to quite literally connect everything to the internet. My math says that the successful projects will find themselves ingesting and indexing incredible amounts of data. Few, if any, will surpass the scale of Google today, but some will be the same size Google was when they started SRE and will need to solve the same reliability problems. I contend that in these cases, adopting tools and process that look suspiciously like SRE is not optional but existential, though there is no need to wait for that crisis because SRE principles and practices apply at any scale.
SRE is usually framed as how Google does operations, but that misses the bigger picture: SRE in practice enables software engineering, but also transforms architecture, security, governance, and compliance. When we leverage the SRE focus on providing a platform of services, all these other considerations get to have first-class emphasis, but where and how that happens may be quite different. Just as SRE (and hopefully DevOps) shifted more and more of the burden to software engineering, modern architecture and security practices evolve from slides, checklists, and hope to enabling the right behaviors with running code. Organizations adopting SRE principles and practices without revisiting these other aspects miss a huge opportunity to improve, and will also probably meet with internal resistance if the people who consider themselves responsible for those aspects are not converted into allies.
I always enjoy learning. I read every word of the first SRE book straight through. I loved the language. I loved the anecdotes. I loved understanding more about how Google sees itself. But the question for me is always, “What behavior will I change?” Learning isn’t collecting information. Learning is changing behavior. This is easy to determine or even quantify in certain disciplines. You have learned to play a new song when you can play the song. You are better at chess when you win games against stronger players. Site Reliability Engineering, like DevOps, should not be just a change of titles, but a definitive change in behavior, focused on outcomes and, obviously, on reliability. The Site Reliability Workbook promises to move forward from an enumeration of principles and practices by Google for Google toward more contextual actions and behaviors. Site reliability is for everyone, but reliability doesn’t come from reading books. Here’s to embracing risk and eliminating toil.