Life of An SRE with Tom Cranitch and Megan Yin

How does one become an SRE? And what's the career like? In this episode, Tom and Megan discuss their path to SRE.

MP: Hello, and welcome to the very first episode of season 2 of the Google SRE podcast, or as we affectionately refer to it, the "Prodcast." I'm your host, MP, back from season 1. And in this season, we will be exploring the life of an SRE. And to do so, we will be speaking with individual contributor SREs and members of SRE management from all across Google and all levels of tenure. If you are interested in becoming an SRE, growing your career as an SRE, or just want a little bit of a peek into what makes SRE culture here at Google tick, this is going to be the season for you.

And I do have one thing to touch on before we dive into our first interview. Viv, my co-host from the first season, has decided to jump into the world of responsible AI here at Google and is no longer with us in SRE. We are very grateful for her contributions to the podcast, and we wish her the absolute best of luck in her new role.

In light of that, and in keeping with our theme this season of trying to share as many voices as we can, I will be joined by five other SREs across the season as my co-hosts. So here for episode one is my very first co-host, who I am very excited to introduce. Go ahead.

Pamela: Hello, world. Sorry, I've always wanted to do that. My name is Pamela Vong. I also go by Pam. And currently, I am a senior site reliability engineer in the Geo team.

I have been at Google and been in SRE for almost 1 and 1/2 years now. So I'm still very fresh and new to all things SRE. But prior to joining Google and SRE, I was mostly a software engineer for almost 15 years on a variety of technologies, and feel a little embarrassed about not knowing what site reliability engineering even was two years ago from the time of this recording.

And at that time, when I was learning about it, I wished that there was a podcast for me to listen to to get the gist of what this whole area of engineering is like, and what the people are like, and the culture is like, because I listen to a lot of tech podcasts. And I've been a huge fan. And that's how I usually get my news and information and how I learn about everything that I do.

So I was super excited in the beginning of 2022 when the podcast got released, and I binge-listened to every episode, and was equally excited to hear when MP was looking for volunteers for season 2. So here I am. And thank you for having me.

MP: So grateful to have you as both a listener and co-host. So for our first episode, we are going to be starting with some early career SREs here at Google and to hear that very fresh-eyed experience of SRE. So let's go ahead and have our guests introduce themselves.

Tom: Hi, I'm Tom. I also work with Pam on the Geo team, but over here in Sydney, Australia. Nice and warm at the moment. I've been at Google for about 10 months so far. And I joined as a new grad.

Megan: Hi, I'm Megan, and I, similar to Tom, also am about 10 months into my career at Google. I joined out of college, and I'm working with MP on the Play SRE team as a software engineer.

MP: Thank you both for joining us.

Pamela: Megan, let's start with you. Could you tell us about your experience prior to joining Google?

Megan: Yeah, for sure. So prior to joining Google, I was a student at Cornell University. I had done a couple of software engineering internships prior to joining Google. When I was in the interview rounds with Google, my recruiter did suggest SRE to me as a potential avenue to go down, and I, like you, was not very familiar with SRE at all.

And I had done some reading. I think there's an SRE book out there for public consumption about SRE. And I was pretty interested in what SRE is. It's very different from my experience in software engineering. So I decided to pursue that. And ultimately, this was also how I ended up in the San Francisco office.

Pamela: So, Tom, tell us a bit about your experience prior to joining Google.

Tom: So I did my undergrad in math and computer science up in Brisbane at the University of Queensland. Through uni, I did some work at the Defence Science and Technology Group, which is the research division of the Department of Defence. I was doing formal methods research there. I was looking at program safety.

So it was a bit of a change coming into SRE, I think, coming from a more like, formal research environment. I sort of got to the last year of my undergrad degree. I was going to do an honors year. For folks in America, I guess it's sort of like a master's. It's a transition we have to get into a PhD. And I decided in my last year that I wasn't keen to do another year of study, and so I started applying to places, and I got an offer from Google.

Pamela: Did you know about what SRE was prior to applying to Google?

Tom: Not really. I had a similar experience to both of you guys. I applied to a job at Fitbit, actually. There was a new Fitbit team in Sydney. My recruiter then asked me if I was happy for any job at Google.

The obvious answer to that was yes. So I went through the interview process, and I got to the end, in team match. My recruiter floated the idea of SRE, and it was a similar sort of experience: I looked into it more and I really liked the systems focus of it, the large-scale system design, particularly. I found that really interesting. I got an offer from the SRE team in Sydney, as well as JSRE, and after speaking to both of the managers, I just found the SRE work was more aligned with what I wanted to do in the long term.

Pamela: Megan, how did you feel about SRE when you learned about it? And what made you choose SRE?

Megan: I was pretty keen on choosing SRE because it's very different from software engineering. I really liked that SRE, you're exposed to a large surface area, versus a developer, you probably get really deep knowledge in maybe one binary. But as an SRE, you can kind of see like the whole forest, that is, in my case, the Play Store. Also, as an SRE, you're kind of solving different problems all the time, versus I thought that, as a developer, the work was very repetitive and that you're kind of doing design, build, test, repeat. And that was kind of what was interesting to me about SRE.

MP: Was that design, build, test, repeat the experience you had during your internship?

Megan: Yeah, I did have that experience through my internships. I kind of felt like I was just kind of like, building APIs, and kind of just going through the cycle of testing them. And that was also sort of what I experienced through my software engineering coursework in college.

Pamela: What was it like for you both to transition from academics to your professional career? What's the most challenging thing you've had to come across?

Tom: For me, the most challenging part, I think, was just like, the pure scale. Like uni work and the research work I did, it was not necessarily super well-scoped, but it was work that you could do sort of in the period of weeks or maybe a month or two and you could really get your head around things relatively quickly. A uni course was scoped to a semester, and in that time, you're meant to learn all the content, do the assessment, all that kind of stuff.

And I found, joining Google, one of the first things my manager said to me was, I don't expect anything from you in the first six months. And universally, I think that experience is something that most people get at Google, just like, the pure scope of everything here and how long it takes to get your head around the systems that you work on. I think, particularly, in SRE, there's so many systems. Geo has something like hundreds of binaries, right?

And then, also, the developer tooling and everything. Everything at Google is new, and just the amount of time it took me to get my head around it. I mean, I don't have my head around it still, but to get to the point where I had my head around it enough to be productive, that was really new for me.

Megan: I just want to also echo what Tom said. A lot of it is just being able to thrive in ambiguity, which is like, a huge phrase that we use within SRE. Because SRE has such a large surface area, there is a lot of ambiguity in that, and there is a lot to learn. And just like what Tom said, the first six months is a lot of learning, versus, like, in college, six months, you're probably done with the whole course by then. So, yeah, transitioning to that and just being able to constantly learn while also still producing was a huge transition.

Pamela: You've both had an interesting transition into your careers, also, for another reason. You came into Google during the time Google was returning to office. How did that impact your onboarding?

Megan: I'm actually super grateful that I was able to onboard in the more return to office time. I definitely really appreciated being able to see my coworkers in the office and just being able to kind of like, go up to their desk and be kind of like, I'm running into this issue. Could you help me out with that? Through my internships, those were all pretty much virtual, and it's really great being with coworkers in the office, and especially when I was shadowing on-call, which is a huge part of the SRE role at Google. It was really great to see my coworkers triaging an incident in real-time and just be able to go up to their desk rather than having to set up a meeting to see how they were handling the problem.

Tom: Yeah, I would definitely echo that. I joined a few months before RTO. RTO in Sydney was a little bit later than in the US. And coming into the office initially, it was kind of eerie, right? Like, there was no one around except for my team.

I was quite lucky, I think, that my team was in the office. I really enjoy coming into the office. And just having those in-person interactions, I found so valuable. I think part of it as well is just like overhearing what other people are talking about. You get so much context from the conversations that people are having around you, I think.

And definitely echo that, like, just being able to turn to someone and ask a question instead of having to try and perfectly word this message you're sending to them because you've met this person once online and you don't want them to take your message the wrong way and all that kind of stuff, just being able to like, turn to someone and ask a question. And then, if it turns into a bigger thing, you can just go to a meeting room and have the conversation. But not having to balance all these online interactions with people you don't really know, it's super valuable.

Megan: Yeah, those magical hallway conversations are huge.

Pamela: You've mentioned ramping up to go on-call. Megan, do you want to share what it's been like doing on-call versus the project work you get to see?

Megan: Yeah, for sure. So like I said, being on-call is a huge part of being an SRE at Google. I've seen that we tend to budget a lot of time towards being on-call and following up with on-call.

And I would say it took me about six months to get ramped up for that. I got ramped up for that by doing some practice exercises, and also, just shadowing and reverse shadowing. The difference there is that if I reverse shadow, I'm in the hot seat, and someone who's more experienced is there to support me. So, yeah, I was doing that for about six months, and just getting an understanding for the system and the critical flows.

And part of what was really huge about that is learning which pages should you definitely jump for, for example, if they're revenue impacting, then those are probably things that you should probably escalate. And learning that it's OK to escalate and to involve other people was definitely a huge learning for me.

Tom: Yeah, I had a similar sort of experience. We have a relatively quiet pager, for better or for worse. So for me, my ramp up was a bit faster. I got to the point where I was shadowing, and I was shadowing for a little bit, and we didn't get any pages, and my manager was sort of just like, you can go on-call.

It felt a little bit like being pushed in the deep end. But there's this safety net, which I think is hard to trust, but my manager, obviously being here a bit longer, knows that there's that safety net there, and it's like, it's safe to have someone with not heaps of experience being on-call because there's these escalation paths that are there to save you. But yeah, to ramp up for on-call, we did a lot of ticket work, so slow burn pages go into our ticket queue.

And then, as a team, every week, we sit down and spend about an hour looking through them, and that was really helpful for getting me ready for on-call. Because we don't get a lot of pages, these are the closest things we have, a lot of the time, to real incidents. And they are-- it's a productive exercise as well because these are things that are affecting production. They're real slow burn alerts that need to be fixed by someone anyway. That was definitely super helpful for getting me on call.

MP: For both of you, what was it like adjusting to the real-time aspect of having to be on-call for a production service? Because you think about academic work, or even normal software development work, you're not really in that position where five minutes, 10 minutes, 15 minutes, an hour really matters. But then, you're on-call for these large production systems, and those five minutes, 10 minutes can be really important. And I know, I've been supporting production systems for seven years now, like, I'm kind of desensitized to it. So what's that like for both of you, having to make that adjustment into urgent response?

Tom: Yeah, I actually really enjoy it. We had SRE EDU. I did it, I think, in my third week or so. And there's these breakage scenarios, and there's this fake service, and things break, and you have to try and fix it. And I found it really enjoyable. And I've continued to find pages to be quite enjoyable.

I think, partly, I enjoyed exams, which is the closest thing you have to these time pressure scenarios that you need. And I found that time pressure, I quite enjoy. So that aspect, I actually haven't hated.

One thing I do find a bit stressful is, for us, at least my experience has been that a lot of my pages are manual pages. So something has broken for someone in a service that is related to us, or something has broken in our service and we haven't been alerted, but some other team is experiencing pain because of one of our services. They turn to SRE as sort of like, production experts. That I found really scary.

There's this person who's waiting on you for their service, which is broken. And there's a pressure of someone waiting on you. I actually found that worse than these automatic pages where, even though the revenue impact there may be worse, and the user pain may be worse, the fact that you know that one of your colleagues is waiting on you to reply to them, I found that almost more scary. Particularly as a new grad, like, this person who is far more experienced most of the time is waiting on me with five months experience to sort of get them answers.

Megan: Coming from my perspective, my team, I guess, has a little bit of a heavier on-call. We do get quite a few pages that are more revenue impacting. And I remember the first time I dealt with a revenue-impacting large outage was when I was reverse shadowing, actually, my tech lead.

And that was incredibly nerve-wracking to me, to see so many people get involved, and also, be cognizant of the fact that Google was also losing money for every minute we were going through this. However, my tech lead said something very interesting to me. He said that SRE is for people who tend to run towards fires instead of away from them.

And that is, I do feel, entirely true. And it's really one of the more rewarding aspects of this job, I would say, that real-time aspect. Like you said, MP, it's really interesting to see how these breakages kind of affect people, but also, know how you can triage them and fix them.

Another big learning for me was, like I mentioned before, just learning which pages and incidents are worth getting stressed over. And obviously, the ones that affect our critical flows, are revenue impacting, are definitely ones that you should be a little bit more stressed about, but more one-off pages for tasks that are dying, or something like that, you can maybe take your time with. And that is huge. And also, like I said, knowing that you can always escalate and people are always happy to get involved, especially if money is on the table.

MP: I think one of the things that is fairly well known within SRE, but is not really even spoken about that much, is how much authority is delegated to the on-caller. And there's this really wild dynamic, that I'm sure both of you have started getting used to, that you can have managers and directors looking at you for guidance, for the answer, and you might have just been on the rotation for a few months at that point. And you have to respond to that. You have to move everyone forward and be that point person.

Tom: I remember having this conversation. We had someone senior from the US come and visit us a few months ago. So I had been on-call for a month or two at that point. I remember having a very similar conversation with them.

It's scary. The whole company puts so much faith and trust in you as this person who has been at Google for like, five or six months that, like-- I don't know, Google Maps, right? We have however many users, and like, the whole service is relying-- I guess there's four.

We have four SRE shards. So there's four on-calls for Google Maps in SRE. Basically, the whole service is sort of resting on these four people.

And your portion of that, the services you're on-call for, which are all super important, the whole company is trusting that you will keep them alive. If not, obviously, there's escalation paths and stuff, but I still find that shocking. It's just like, the amount of faith and trust everyone puts in SRE and on-callers in general.

Megan: Yeah. I would say something I like to think about SRE is sort of like the guardians of Google production, is how I like to think about it. And, yeah, it's an incredible honor to have this much responsibility placed on you, but with power, comes great responsibility, or I forgot what the quote was.

MP: Spider-Man quote. That's also the sudoers. First time you type sudo on a system, “with great power comes great responsibility.”

Megan: Yeah. So I do feel that every time, like, somebody sends me a code change to approve, that that code change could definitely come back to page me if I don't put the necessary review into it. But it's also a super forgiving culture. We're always-- things do break, and also, having the great support that the SRE community has. We all know that we have so much responsibility placed on us, but we're also super collaborative, and it's always OK to escalate if you don't know the answer.

Tom: I would definitely second that. Like, as soon as there's an incident in the office, the number of people that surround the computer. You walk past another team, and you can tell if there's an incident going on because there's suddenly eight people surrounding this one monitor, and like everyone's there to help each other.

And as soon as we have a page in our team, everyone's on their computer looking at dashboards and sort of sharing. And it's super helpful for us as well, the person who's not on-call, because you get practice dealing with pages. But this collaborative thing is definitely-- it's been super rewarding.

Pamela: I think that's definitely one of the highlights that drew me into saying yes to SRE, the whole aspect of this larger team that is going to stand behind you and help you, but not blame you if something goes wrong. Going through the SRE book, the whole culture of blameless postmortems, I think, really spoke out and resonated with this empathetic culture that exists in this side of engineering that I don't think I have seen in all the other aspects of my engineering career before.

Tom: Yeah, the blameless postmortem culture is something that sounds great on paper, and then, you see it. I think it's really hard to trust until you see it in practice. And you see it over and over again that this is actually so deeply embedded in SRE culture at Google, I think, across all of SWE culture at Google. I think it does probably come out of SRE. Until you see it reinforced over and over again, I think it takes a while to trust that there is this blameless culture.

Megan: Yeah, exactly. I think the super good approach to it is to always blame the system, because if any one person could have taken down production, then it's the fault of the system, not that individual person. Maybe we need more safeguards in place. Maybe we need to review some of our production policies. But that way, you can make it so that this doesn't happen in the future, rather than placing the blame on any one person.

Tom: I think you start to learn over time as well that it's not very productive blaming anyone. I trust everyone that I work with. I trust that they're not doing anything malicious, and I trust that they made the best decision that they could at the time, in the same way that they're going to trust that I made the best decision I could at the time. And so if you're blaming the people, there's no productive changes that are going to come out of that. Whereas, if you can find faults in the system that we can fix, these are concrete things that we can actually improve.

MP: And even if it is faulty human decision-making, then take the human decision-making out of it.

Tom: Oh, yeah. 100%. Humans are naturally faulty, right? At times, we all make mistakes. And hopefully, the system should be able to pick that up.

MP: I'm curious about that having to-- Tom, you mentioned having to learn to trust in blameless postmortem culture. So I'm wondering if you have any other thoughts on that you want to share.

Tom: Yeah, I think, maybe it's coming out of uni, there's this real culture of individual work and getting grades, and you're getting assessed on everything you do. And so coming into an environment where it's-- project work is also true, but particularly on-call, it's this team effort to keep everything running and no one's there to like give you a grade on how well you're doing. Even if you're the single on-caller, it really is a group effort to keep everything running.

And in a similar way, I think, with escalation paths, it takes a long time, or it took a while for me to trust that no one is going to blame you if you escalate or page someone. Geo actually has a dedicated incident response team that we can escalate to if we feel like we can't handle an incident, and no one is going to blame you if you escalate to them. That's what their rotation is there for. And I think it takes a long time though to trust that there is this like, blameless culture hand-in-hand with the escalation paths where no one is going to blame you if you escalate to them.

Megan: Yeah, exactly. I believe the philosophy is always to not be afraid to escalate, escalate early. And no one's going to blame you if you don't know how to solve something. It's very collaborative.

Tom: I think escalation is also one of the best ways to learn to deal with pages. You have someone that is far more experienced than you that you're then working with, and that is such a great way to learn how they deal with pages and become a better on-caller yourself as well.

MP: Have either of you needed to make any adjustments outside of work to just accommodate the inherent stress of supporting live production systems?

Megan: I would say not really. Google has a really great work-life balance. And I would say I do go out quite a bit, and I would say those days when I'm on-call, my shift is usually 11:00 AM to 11:00 PM, and those days when I'm on-call, I'm not able to go out quite as much, but I've learned to appreciate those days, even when they're on the weekends, because it gives me some time to force me to stay at home, relax, tend to a lot of my household chores and stuff like that.

Tom: Yeah, I've found exercising is something I've picked up a lot more this year, and I found that has really helped. I think a mix of just general work stress. Google has excellent work-life balance, but at times, being on-call is obviously a little bit stressful, and I found exercise has really helped with that.

The forced downtime for being on-call I find helpful as well. Like, I can still go and do things in life and carry the laptop with me, but I just have to be a bit more calm on those days and not do anything too wild. That forced downtime I've found really helpful.

Pamela: So this has been great. What are you both most looking forward to as you grow your SRE careers?

Megan: I've been fortunate enough to work with some super talented and super tenured engineers on my team. It was a little bit daunting at first to learn that so many of my coworkers have been at Google like, 5 plus years, and to have so much ownership and so much knowledge on production. It is true that, as an SRE, there is a ton to learn, and it takes a lot of time to do that.

But I am really looking forward to all the learning that I have in the years ahead as an SRE at Google and just the incredible growth that can happen. I have many role models in my team members, and I look forward to learning from them. It's a great honor. And I look forward to filling their shoes someday.

Tom: I would definitely echo the learning. The amount of learning I've done so far has just been crazy. You would think that you would learn less once you leave uni, but I have not found that to be the case. I'm definitely looking forward to that going forward.

In the nearer future, Geo is undergoing some system redesign stuff at the moment, and SRE is having a really big hand in that. And I'm really excited to work on that stuff, being able to shape the future of how our systems work. That sort of project work stuff I'm really excited for.

MP: Well, thank you both for joining us. This has been great. It's been a pleasure to have both of you. Thank you, Pam.

Pamela: Thank you, MP.

Tom: Yeah, thanks for having me.

Megan: Thank you.

VOICEOVER: "Prodcast," the Google SRE production podcast, is hosted by MP English and produced by Salim Virji. The podcast is edited by Jordan Greenberg. Engineering by Paul Guglielmo and Jordan Greenberg. Javi Beltran composed the musical theme. Special thanks to Steve McGhee and Pamela Vong.