A conversation with John Allspaw at SREcon Part 2
[JAVI BELTRAN, "TELEBOT"]
SPEAKER: Welcome to season six of the Prodcast, Google's podcast about site reliability engineering and production software. This season we are continuing our theme of friends and trends. It's all about what's coming up in the SRE space, from new technology to modernizing processes. And of course, the most important part is the friends we made along the way. So happy listening, and may all your incidents be novel.
[JAVI BELTRAN, "TELEBOT"]
FLORIAN: Welcome back to the Prodcast, Google's podcast on site reliability engineering and production software. We're live from the SREcon Americas in Seattle today. And with us, we have a very special guest.
JOHN: Hello. Thank you. My name is John.
MATT: I'm Matt.
FLORIAN: I'm Florian. So John, some of our listeners may know you as someone who cares about resilience and learning from failure.
JOHN: Yeah, accurate.
FLORIAN: Do you want to tell us what your main takeaways were from this year's conference compared to previous years, given that you are also a veteran of this institution?
JOHN: Sure. Yeah. There's always something that is omnipresent and continually present, I think, in all the SREcons that I've been to. I guess I have been to a number of them. It's a practitioner conference.
A long time ago, before SREcon, I was chair of a conference called the Velocity Conference, also a practitioner conference, which meant that-- you all know-- software engineers that have contact and experience, familiarity with production, it's just a different perspective than those who don't. Don't suffer any bullshit. Can I swear?
FLORIAN: Sure.
JOHN: It just-- You see right through it. And this is a community that sees right through bullshit. And that's so amazing. It's so great, because what's real is the messy world. No matter what we want to do to model and scrunch things into-- [INAUDIBLE] said earlier today, to compress them into tiny little boxes and neat, nice, orderly things. There's incidents. And there's these phases. And you do these things.
And doesn't matter what the incident is. They can fit in these boxes, and our thinking can fit in these boxes. And all of the lines are-- and the angles are at 90 degrees and ooh, ooh. But that, of course, our experience in handling things or even when there's not any sort of time pressure, it's a big mess. We should be super curious about why it's working at all. And so it doesn't have much, really. On paper, it shouldn't be working nearly as well as it does.
And that's why I come back around to the thing that's omnipresent, what I always appreciate, is the no-bullshit tolerance is when speakers-- and it resonates with this community-- is when they say out loud, yeah, it's messy. And even the thing that's really wild to me is you're amazing at it even if it's messy and you don't know how you are amazing at it. The resilience is coming wearing pants and shirts. The resilience isn't coming from cables and hard drives.
MATT: So what are some traits of the individuals you see who are proficient in working in these environments? Because I'm sure they're horizontally effective in this way, and that discriminates them from the other people who select themselves out of these communities. Because you come here. You see the same kinds of people doing the same kinds of things, and they speak to each other. And what are these characteristics you see in these people doing these things?
JOHN: Yeah. Characteristics is a better term than traits, I would say. It's really no different than any other specific practitioner crowd or community. It just so happens to be, for the most part, online software. Every now and then, a little bit other. But it's all software. But the same thing is in place, these characteristics that I'll get to.
But there's the same thing in medical and aviation. And there's a reason I have heard that Google's quite interested in STPA, and STAMP, and CAST. There's some talks here. Amazing. It's new, but only to this domain, really. I think Akamai would argue that they were first, easily. But it is new to this community. That's amazing.
But what's great about that example is that didn't start in software. I mean, it did, but I don't think you would say avionics is the same as what we do here. And so what's common in all these practitioner worlds, communities is, shorthand, you could relay a story very quickly that will make absolutely no sense to somebody outside of the community. You will do it very efficiently.
But what if I say this-- Let me just look around the room. Missing where clause. So I can get a reaction out of anybody in this room. And there are people who are in this hotel right now who have no idea why that's funny, weird, or scary. And I had to say three words. That's like a hallmark. This is how you know it is a genuine thing. And I think that there's-- that can scare people away. I think as far as communities go, we're not nearly the scariest, in my experience.
So I think it's more like those characteristics come from expertise. Expertise is exposure to experiences that are really broad, agnostic of the time those experiences take. So we would sometimes think of them as equated, like, oh, it's because you've got experience. As a shorthand, that totally works, colloquially. Expertise is really what most people think, because expertise is what makes us amazing and, ironically, weirdly unable to understand how good we are at it.
MATT: Fair.
JOHN: And so it's real. And so that's what I-- I'm always attracted to what makes this work hard and what makes them good at it. That's it. And that's why I keep coming. I've been going to SREcons since the '70s. I mean, whatever, since-- 2015, I was in Dublin, but so yeah. That's my amazingly long answer.
FLORIAN: So earlier in a session, you also mentioned that these experienced practitioners sometimes need to do heroic acts and that we shouldn't fault them for it, because SRE also has this no-heroes culture.
JOHN: Yeah, yeah. The context there is that-- The comment I made was that we tend to think of heroes as a word that's just not-- it's always negative, universally. And when you have it, it must mean something's terrible. First, what's heroic is context-specific. And the other thing is, well, OK, what if you don't have them? Well, things are going to be worse. Would you rather have a hero or not have a hero?
If it's a shorthand for demonstration and proof that you're not doing enough to letting other people or supporting other people to be involved, then this one person, or an alternative is that this person is really selfish and/or territorial, blah, blah, blah. I just find it, sure, that's possible. But it's not the norm.
I think the norm is that some people are just really intuitive. They have a feel for things that others don't largely, again, because of expertise. And thank God you have them, because, yeah, if you would like for them to be worse off, those incidents to be less well handled, then, great, get rid of them. I don't think your competitors will be doing that.
FLORIAN: You heard it here first. Heroes welcome, but not as strategy.
JOHN: But see, that's a great line. I'm not very good at marketing. That's good.
FLORIAN: All right. Well, that's about it.
JOHN: Cool.
MATT: I think that's a good point to close it. Thank you very much, John. That was great.
JOHN: Thanks for having me.
[JAVI BELTRAN, "TELEBOT"]
SPEAKER: You've been listening to the Prodcast, Google's podcast on site reliability engineering. Visit us on the web at sre.google, where you can find books, papers, workshops, videos, and more about SRE. This season is brought to you by our hosts, Jordan Greenberg, Steve McGhee, Florian Rathgeber, and Matt Siegler, with contributions from many SREs behind the scenes. The podcast is produced by Paul Guglielmino and Salim Virji. The podcast theme is "Telebot" by Javi Beltran and Jordan Greenberg.
[JAVI BELTRAN, "TELEBOT"]