Courtney Nash on Complex Systems

Courtney Nash of The VOID discusses the role of human expertise in managing complex systems, and how SREs continue to bring critical value even as technology and AI evolve.

[MUSIC PLAYING]

SPEAKER: Welcome to season 6 of the Prodcast, Google's podcast about site reliability engineering and production software. This season, we are continuing our theme of friends and trends. It's all about what's coming up in the SRE space, from new technology to modernizing processes. And, of course, the most important part is the friends we made along the way. So happy listening, and may all your incidents be novel.

[MUSIC PLAYING]

FLORIAN RATHGEBER: Welcome back to the Prodcast, Google's podcast on site reliability and production engineering. We are live from SREcon in Seattle today. And with me, I have Courtney.

COURTNEY NASH: Hi.

FLORIAN RATHGEBER: Do you want to introduce yourself and--

COURTNEY NASH: Sure.

FLORIAN RATHGEBER: --tell us your hot take?

COURTNEY NASH: My hot take. OK, so my name is Courtney Nash. I am the cofounder of the VOID, which does incident research. And so I, like most of you, think about incidents a lot, but I tend to think less about the technology of the incidents and more of the people and the systems of the incidents.

And a thing that has been vexing me lately that has come up a bunch here at SREcon-- we've talked about it. We talked about it in a workshop-- is-- well, let me start by asking you a question. Do you think you're an expert in anything?

FLORIAN RATHGEBER: That's a tricky one. Well, one phrase that I like is I have approximate knowledge of many things.

COURTNEY NASH: [LAUGHS] OK. Do you feel like there's something you're pretty damn good at? Don't be shy.

FLORIAN RATHGEBER: Well, I think I have a good attention to detail, but yeah, it's always hard to self-evaluate.

COURTNEY NASH: Yeah, well, it's hard to understand your own performance, right?

FLORIAN RATHGEBER: Exactly. Yeah.

COURTNEY NASH: And maybe we also try to be humble, right? We want to be humble.

FLORIAN RATHGEBER: Yeah.

COURTNEY NASH: Being humble is good. I'm an expert in a couple things. And one of the things I'm an expert in is cognitive science stuff. And the thing that vexes me is that we-- you all-- run these incredibly complex systems. And I'll pose you another question. We talk a lot about failures and a lot about incidents. Why do things go well at Google when they go well?

FLORIAN RATHGEBER: Well, I think because a lot of smart people put a lot of thought into how to do things.

COURTNEY NASH: Right. So experts make things happen. Experts make things work.

FLORIAN RATHGEBER: Yeah.

COURTNEY NASH: So we talk about AI a lot right now. So my hot take is we need experts. The more AI we have, the more experts we need, which sounds like a problem because AI is going to make everything easier? Yes?

FLORIAN RATHGEBER: But yeah.

COURTNEY NASH: Maybe?

FLORIAN RATHGEBER: Who evaluates the evaluators?

COURTNEY NASH: And when things don't work, who fixes things? So I was in this really cool talk about metastable failures. Did you go to that one?

FLORIAN RATHGEBER: No, I did not.

COURTNEY NASH: That was a cool one. I love metastable failures. And he's talking through all of this, and it's deeply technical, deeply complex stuff. And he says, so we sort of, like, stumbled upon a bunch of solutions. Here are the solutions. And I was like, brrt, wait, please. What? Stumbled upon?

And so I put the questions-- and I asked in Slack. They answered the question, how did you stumble upon these solutions, as though some magical ideas came down from above. And the TL;DR of his answer was some really smart people showed up and figured out how to fix it.

And I don't think we talk about that enough. I think we talk about AI and what AI is going to do and all of these things, which I'm not an AI hater, but I think we're focused on the wrong things. So my hot take is the more we have AI, the more we need people, and especially experts.

FLORIAN RATHGEBER: Yeah.

COURTNEY NASH: I don't know if you need more than that for a hot take.

FLORIAN RATHGEBER: Makes sense. So should we then not have a "where we got lucky" in our postmortem?

COURTNEY NASH: No, you should not. So that'll be my other hot take. Did you get lucky? What's luck? What is luck in that system or that situation, yeah?

FLORIAN RATHGEBER: So we count that as things that went well.

COURTNEY NASH: Yeah, well, went well-- why did they go well? Yeah, you might be surprised that something went well, but that's not luck, right? That's just something diverging from what you were expecting to happen. So I would argue instead of how we got lucky in postmortems or in any of these things you write up, I would much rather see what we did that was pretty awesome. You could call it whatever else you want to. You could make it snappy or make an acronym. I don't know.

But so much of why we don't have incidents is experts. And then when we have incidents, the way we get out of them is experts. So therefore, by logical deduction, we need experts.

FLORIAN RATHGEBER: Cool, all right. So we should be a little less humble, trust our expertise--

COURTNEY NASH: And help other people become experts. Share how we know those things. So what was the other cool one that I picked up today in Paul Read's talk on AI, something he heard? So I'm giving third hand. He heard it from somebody who heard it from somebody in a discussion track here. But this is what's so great about us all getting together and talking about these things.

FLORIAN RATHGEBER: Yeah.

COURTNEY NASH: Some company has a TIL channel in Slack, and I think they wrote a Slack bot for this. I think it's actually automated. I like automation, too. Automation is cool. And anytime somebody shares something that they learned, like, oh, the other day I learned that the Kafka thing, it does this, and somebody does the little TIL emoji on that, they automate that. And they pull that all into a channel that anyone can go and look at.

And I was like, that's the cool-- because nerds like me and John Allspaw and people are always talking about disseminating learning through your-- so get the experts to talk about what they know. They're not always good at explaining how they're experts or why they know what they know. But when they tell you something, and somebody goes, whoa, I didn't know that, that's cool, now you're taking something and helping someone else start to become more of an expert in that. So these little things that we can do can help share our expertise, can help develop those instincts in other people.
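
[EDITOR'S NOTE: The TIL channel described above could be sketched roughly as follows. This is a minimal illustration, not the implementation of the company mentioned, whose details are not given in the conversation. The event dictionary shape, the `til` emoji name, and the `TilCollector` class are all hypothetical, and the sketch deliberately avoids calling any real Slack API.]

```python
from dataclasses import dataclass, field

# Hypothetical emoji name; a real workspace would pick its own.
TIL_EMOJI = "til"


@dataclass
class TilCollector:
    """Collects each message that receives a TIL reaction, once per message."""

    seen: set = field(default_factory=set)      # (channel, message_ts) pairs already collected
    digest: list = field(default_factory=list)  # lines destined for the shared TIL channel

    def on_reaction(self, event: dict) -> bool:
        """Handle one reaction event; return True if the message was newly collected.

        The event shape (reaction, channel, message_ts, author, text) is an
        assumption made for this sketch, not a real chat platform payload.
        """
        if event.get("reaction") != TIL_EMOJI:
            return False
        key = (event["channel"], event["message_ts"])
        if key in self.seen:
            # A second TIL reaction on the same message should not duplicate it.
            return False
        self.seen.add(key)
        self.digest.append(
            f'{event["author"]} in #{event["channel"]}: {event["text"]}'
        )
        return True


if __name__ == "__main__":
    collector = TilCollector()
    collector.on_reaction({
        "reaction": "til",
        "channel": "infra",
        "message_ts": "1",
        "author": "sam",
        "text": "the other day I learned that the Kafka thing does this",
    })
    for line in collector.digest:
        print(line)
```

[In this sketch, a real bot would post each new digest line to the shared channel instead of printing it. Deduplicating on the message, rather than on the reaction, keeps a popular message from being reposted every time someone else adds the emoji.]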

And third hot take, I guess I'll say, that will save your ass in an incident. Can I swear on the Google Prodcast? That will save your ass in an incident far more than almost anything we like to talk about, about auto-scaling and capacity planning and, like, mrr, mrr, mrr. That stuff's important, but we don't focus enough on expertise, learning, how we can share that with each other. Those are the things that make the incidents less painful or not happen at all.

FLORIAN RATHGEBER: Awesome. Well, I think that's a good point to close. Thank you very much.

COURTNEY NASH: Thank you for having me.

FLORIAN RATHGEBER: That was great.

SPEAKER: You've been listening to the Prodcast, Google's podcast on site reliability engineering. Visit us on the web at sre.google, where you can find books, papers, workshops, videos, and more about SRE.

This season is brought to you by our hosts, Jordan Greenberg, Steve McGhee, Florian Rathgeber, and Matt Siegler, with contributions from many SREs behind the scenes. The Prodcast is produced by Paul Guglielmino and Salim Virji. The Prodcast theme is "Telebot" by Javi Beltran and Jordan Greenberg.

[MUSIC PLAYING]