Site Reliability Engineering

SRE Prodcast

Prodcast is Google's podcast about Site Reliability Engineering and production software.

Season 1: SRE Fundamentals
Season 2: Life of an SRE
Season 3: Champions of the Internet
Season 4: Friends and Trends
Season 5: More Friends, More Trends
Season 6: Prodcast Live!

Season 1: SRE Fundamentals

Season 1 discusses concepts from the SRE Book with experts at Google.

Season 1, Episode 9

Postmortems with Ayelet Sachto

Ayelet Sachto offers advice on creating an actionable, transparent, and blameless postmortem culture.

View transcript

Further reading

Season 1, Episode 8

Incident Management with Adrienne Walcer

Adrienne Walcer discusses how to approach and organize incident management efforts throughout the production lifecycle.

View transcript

Further reading

Season 1, Episode 7

On-Call Rotations with Andrew Widdowson (APW)

Andrew Widdowson (APW) shares strategies for successful on-call rotations.

View transcript

Further reading

Season 1, Episode 6

Automation with Pierre Palatin

Pierre Palatin dives into different automation strategies, how to build confidence in your system, and why designing the UI may be your biggest challenge.

View transcript

Further reading

Season 1, Episode 5

Client-Transparent Migrations with Pavan Adharapurapu

Pavan Adharapurapu details how to approach large-scale migrations while optimizing for user experience.

View transcript

Further reading

Season 1, Episode 4

Rethinking SLOs with Narayan Desai

Narayan Desai explains why SLOs can be problematic and proposes alternative methods for monitoring complex, large-scale systems.

View transcript

Further reading

Season 1, Episode 3

Alerting with Amelia Harrison

Amelia Harrison advises on when and how to alert, ideal coverage, and tuning.

View transcript

Further reading

Season 1, Episode 2

Customer-Centric Monitoring with Silvia Esparrachiari

Silvia Esparrachiari talks about the challenges of monitoring and the importance of understanding your users.

View transcript

Further reading

Season 1, Episode 1

SRE Philosophy with Jennifer Mace (Macey)

What is SRE, anyway? Jennifer Mace (Macey) gives us her definition of "site reliability engineer," discusses how to manage risk, and shares key questions to ask developers.

View transcript

Further reading

Season 1, Episode 0

Creating the SRE Prodcast with John Reese (JTR)

Host MP English and former Google SRE John Reese (JTR) chat about the creation of the Prodcast.

View transcript

Further reading

Season 2: Life of an SRE

Season 2, "Life of an SRE," examines the career path and growth of individuals in SRE.

Season 2, Episode 8

Life of An SRE: Beyond Google

Former Google SREs, or “Xooglers”, talk with hosts MP and Steve McGhee about site reliability engineering outside of Google. What’s the difference in scale? What skills are generally valuable? And why can’t you build “SRE in a box” that jump-starts pretty much any organization?

View transcript

Further reading

Season 2, Episode 7

Life of An SRE with Sabrina Farmer

Sabrina Farmer, VP of Engineering at Google, talks about her career journey through Site Reliability Engineering. What does management mean? What’s involved in being an effective manager? And what’s a feasibility study? Hear some great advice on how to get what you expect out of a role, wherever it sits on the ladder.

View transcript

Season 2, Episode 6

Life of An SRE with Dave Reisner

Dave Reisner talks about his path to Staff SRE, from Arch Linux contributor through DevOps to software engineer. This episode emphasizes the value of strong mentoring and manager relationships, and the challenges of work-life balance.

View transcript

Further reading

Season 2, Episode 5

Life of An SRE with Stephen Benjamin

Explore the role and responsibilities of an SRE manager with Stephen Benjamin.

View transcript

Further reading

Season 2, Episode 4

Life of An SRE with Jessica Theodat

Explore the role and responsibilities of a Senior SRE with Jessica Theodat, as she discusses life-work balance, the value of mentoring, and being a Black woman in SRE.

View transcript

Season 2, Episode 3

Life of An SRE with Shannon Brady and Theo Klein

Explore the career paths of SREs Shannon Brady and Theo Klein, as they discuss their paths to Site Reliability Engineering and finding their areas of expertise.

View transcript

Further reading

Season 2, Episode 2

Life of An SRE with Mariuxi Vasconez and Julian Alarcon

In this episode, Mariuxi and Julian discuss their paths to SRE: what drew them initially to SRE, and what motivates them to continue developing skills.

View transcript

Further reading

Season 2, Episode 1

Life of An SRE with Tom Cranitch and Megan Yin

How does one become an SRE? And what’s the career like? In this episode, Tom and Megan discuss their path to SRE.

View transcript

Further reading

Season 3: Champions of the Internet

Season 3, "Champions of the Internet," discusses software systems designed and built by SRE.

Season 3, Episode 14

Special Episode: You Missed a Page from Telebot

This episode features Javi Beltran, a Google engineering lead who created the "Telebot" theme song. With our beloved hosts, Steve McGhee and Jordan Greenberg, Beltran discusses the origins of the song, created in 2012 for Google's paging system. The song was meant to add a touch of levity to what could be a stressful situation for engineers on-call. Beltran also unveils a new, more modern remix of “Telebot” (created in collaboration with our host, Jordan Greenberg!) which will be used as the intro theme for the podcast's next season.

View transcript

Further reading

Season 3, Episode 13

Imperative vs. Declarative Change Workflows with Dominic Hutton & Niccolo' Cascarano

In this episode of the Prodcast, guests Dominic Hutton (Staff SRE, HashiCorp) and Niccolo' Cascarano (Senior Staff SRE at Google) join hosts Steve McGhee and Jordan Greenberg to dive into configurations. They discuss the differences between imperative and declarative configuration, explore the benefits and challenges of each approach, and stress the need for careful consideration when choosing between the two. Ultimately, the goal is to achieve reliable and maintainable systems through effective configuration management.

View transcript

Further reading

Season 3, Episode 12

Human Factors in Complex Systems with Casey Rosenthal and John Allspaw

This episode features Casey Rosenthal (Founder, Cirrusly.ai) and John Allspaw (Founder and Principal, Adaptive Capacity Labs), joining our hosts Steve McGhee and Jordan Greenberg. Together they discuss how resilience appears in Software Engineering and SRE and explore the importance of understanding the human factors involved in adapting to system failures, highlighting the need for a more qualitative and holistic approach to understanding how engineers successfully adapt to system behavior and improve overall reliability.

View transcript

Further reading

Season 3, Episode 11

Embracing Complexity with Christina Schulman & Dr. Laura Maguire

In this episode of the Prodcast, we are joined by guests Christina Schulman (Staff SRE, Google) and Dr. Laura Maguire (Principal Engineer, Trace Cognitive Engineering). They emphasize the human element of SRE and the importance of fostering a culture of collaboration, learning, and resilience in managing complex systems. They touch on the need for diverse perspectives and collaboration in incident response and the necessity of embracing complexity, and they explore concepts such as aerodynamic stability.

View transcript

Further reading

Season 3, Episode 10

Maglev: load balancing at Google with Cody Smith and Trisha Weir

In this episode, Cody Smith (CTO and Co-founder, Camus Energy) and Trisha Weir (SRE Department Lead, Google) join hosts Steve McGhee and Jordan Greenberg to discuss their experience developing Maglev, a highly available, distributed network load balancer (NLB) that is an integral part of the cloud architecture and manages traffic coming into a datacenter. Starting with Maglev’s humble beginnings as a skunkworks effort, Cody and Trisha recount the challenges they faced and emphasize the importance of psychological safety, collaboration, and adaptability in SRE innovation.

View transcript

Further reading

Season 3, Episode 9

Profiling data with Pat Somaru and Narayan Desai

In this episode, guests Narayan Desai (Principal SRE, Google) and Pat Somaru (Senior Production Engineer, Meta) join hosts Steve McGhee and Florian Rathgeber to discuss the challenges of observability and working with profiling data. The discussion covers intriguing topics like noise reduction, workload modeling, and the need for better tools and techniques to handle high-cardinality data.

View transcript

Further reading

Season 3, Episode 8

Google Public DNS (8.8.8.8) with Wilmer van der Gaast and Andy Sykes

This episode features Google engineers Wilmer van der Gaast (Production on-tall) and Andy Sykes (Senior Staff Systems Engineer, SRE), joining hosts Steve McGhee and Jordan Greenberg to discuss the development and maintenance of Google Public DNS (8.8.8.8). They highlight the initial motivations for creating the service, technical challenges like cache poisoning and load balancing, and the collaborative effort between SRE and SWE teams to address these issues. They also reflect on the evolving nature of SRE and offer advice for aspiring SREs.

View transcript

Further reading

Season 3, Episode 7

SRE in the Retail and Gaming Worlds with Jordan Chernev & Scott Bowers

Guests Jordan Chernev (Senior Technology Executive) and Scott Bowers (SRE, Gearbox Software), who hail from the retail and gaming industries, respectively, join hosts Steve McGhee and Jordan Greenberg to discuss the unique challenges of Site Reliability Engineering in their industries. They share the importance of aligning SLOs with user experience, strategies for handling spikes in traffic, communicating with users during outages, and investing in reliability.

View transcript

Season 3, Episode 6

Incident Response with Sarah Butt and Vrai Stacey

Sarah Butt (Principal Engineer, Centralized Incident Response, Salesforce) and Vrai Stacey (Staff Software Engineer, Google) join hosts Steve McGhee and Jordan Greenberg to dive into incident response—particularly tooling and software for reliability incidents. Tune in for an in-depth discussion on topics such as the importance of communication and collaboration during incidents, and the role of tooling in supporting incident response processes. Sarah and Vrai also share personal takeaways from incidents they have experienced.

View transcript

Season 3, Episode 5

Building Reliable Systems with Silvia Botros and Niall Murphy

Silvia Botros (SRE Architect, Twilio | Author of "High Performance MySQL, 4th edition") and Niall Murphy (Co-founder & CEO, Stanza) join hosts Steve McGhee and Jordan Greenberg to discuss cultural shifts in database engineering, rate limiting, load shedding, holistic approaches to reliability, proactive measures to build customer trust, and much more!

View transcript

Further reading

Season 3, Episode 4

Creating Systems that are Safe with Liz Fong-Jones

Liz Fong-Jones (former Google SRE and current Field CTO at honeycomb.io) joins hosts Steve McGhee and Jordan Greenberg for a lively discussion centered around observability, its evolution from monitoring, and its role in modern software development. Tune in for more on: the importance of observability as a spectrum, the evolving role of SREs, and advice to aspiring software engineers.

View transcript

Season 3, Episode 3

Production Problems Are For All! with Ben Treynor Sloss

Ben Treynor Sloss (VP of Engineering, Google) joins hosts Steve McGhee and Dr. Jennifer Petoff (Director of Technical Infrastructure Education, Google) to share the evolution of SRE and its impact on software development, how AI and ML significantly impact SRE practices, and the future of SRE.
Ben coined the term "Site Reliability Engineering" for his team of (now) 4,000 software engineers, engaged in what were traditionally operations functions. Under Ben's leadership, Google SRE wrote two best-selling books on SRE. Since then, the rest of the SaaS industry has come to adopt the SRE name, mission, and practices.

View transcript

Season 3, Episode 2

There Remains a Huge Amount of Work to Do, with Healfdene Goguen

In this episode, Healfdene Goguen (Principal Engineer, Google) joins hosts Steve McGhee and Jordan Greenberg to discuss the vast amount of work to be done by SREs, and the fascinating challenges to tackle with clear real-world implications. It's a truly exciting time to be an SRE at Google!

View transcript

Season 3, Episode 1

SRE, a Basis of Influence with Amy Tobey & Vladyslav Ukis

In this season of Google Prodcast, current and former SREs, both within and outside of Google, chat with hosts Steve McGhee and Jordan Greenberg to discuss software systems designed and built by SREs.
For "episode zero", guests Amy Tobey and Vladyslav Ukis will set the stage for the season with a lively discussion about what Software Engineering means to Site Reliability Engineering.

View transcript

Season 4: Friends and Trends

Season 4 is about SRE "Friends and Trends." We discuss what's coming up in the SRE space, from new technology to modernizing processes and more, as well as the friends we make along the way.

Season 4, Episode 10

The One with Ben Good and Our Kubernetes Friends

In this special episode, hosts Steve McGhee from the Google SRE Prodcast and Kaslin Fields from the Google Kubernetes Podcast welcome Google Cloud Solutions Architect Ben Good to discuss platform engineering. Listeners can look forward to hearing about the role of Kubernetes as a tool for building platforms, how to create "golden paths" for developers, and the importance of observability and self-service in platform design. The conversation also touches on industry trends, the bespoke nature of platforms, and how DORA metrics can be applied to platform engineering practices.

View transcript

Further reading

Season 4, Episode 9

The One With AI Agents, Ramón Llamas, and Swapnil Haria

Google Staff SRE Ramón Llamas and Google Software Engineer Swapnil Haria join our hosts to explore how AI agents are revolutionizing production management, from summarizing alerts and finding hidden errors to proactively preventing outages. Learn about the challenges of evaluating non-deterministic systems and the fascinating interplay between human expertise and emerging AI capabilities in ensuring robust and reliable infrastructure.

View transcript

Further reading

Season 4, Episode 8

The One with Technical Program Managers and Karanveer Anand

This episode features Google Technical Program Manager (TPM) Karanveer Anand, who joins our hosts to discuss the unique role of TPMs in Site Reliability Engineering (SRE). The conversation highlights how SRE TPMs bridge the gap between technical details and business impact, managing complex projects with inter-team dependencies and ensuring system reliability, particularly in the rapidly evolving AI landscape.

View transcript

Further reading

Season 4, Episode 7

The One with STPA, Jeffrey Snover, and Theo Klein

This episode discusses Systems Theoretic Process Analysis (STPA), a method for analyzing complex systems. Theo Klein, a Google SRE, and Jeffrey Snover, a Distinguished Engineer at Google, explain that STPA focuses on identifying how system accidents and losses occur due to a loss of control, rather than component failures. STPA helps identify design flaws early, even before code is written! The discussion highlights that STPA is a human-driven process, prompting critical questions about system goals and potential losses, and that Google is adapting the pure STPA approach for commercial software development to make it more practical and efficient.

View transcript

Further reading

Season 4, Episode 6

The One with Startups and Adam Fletcher

In this episode, hosts Steve McGhee and Matt Siegler are joined by guest Adam Fletcher, CEO and Co-Founder of MarketStreet. They discuss the current state of web development with LLMs, managing technical debt in startups, the evolution of infrastructure and reliability engineering, the role of community in technology, and the future of software engineering with AI.

View transcript

Further reading

Season 4, Episode 5

The One With SLOs and Sal Furino

In this episode, Sal Furino, Customer Reliability Engineer at Bloomberg, discusses all things Service Level Objectives (SLOs) with hosts Steve McGhee and Matt Siegler. Together, they dig into what successful SLOs look like, how they relate to users, and how SLOs provide an effective framework for joint decisions about system reliability across product, engineering, and leadership teams.

View transcript

Further reading

Season 4, Episode 4

The One With the Future of SRE and Matt Zelesko

Matt Zelesko, the head of Site Reliability Engineering at Google, discusses the evolution of SRE, highlighting the shift from traditional operations to a model that balances velocity and reliability to better serve the rapid advancements in AI and ML. He emphasizes that SRE's core mission is to enable partners to move quickly while meeting reliability goals, and that the sheer scale of Google's infrastructure necessitates the SRE model for cross-system problem-solving. Zelesko envisions AI as a crucial assistant for SREs, improving incident detection, mitigation, and postmortem processes, and allowing SREs to focus on more complex engineering challenges and risk management earlier in the development cycle, while still valuing the hands-on experience of operating production infrastructure.

View transcript

Further reading

Season 4, Episode 3

The One With AI and Todd Underwood

In this Google Prodcast episode, Todd Underwood, a reliability expert from Anthropic with experience at Google and OpenAI, joins the hosts to discuss the current state and future of AI and ML in production, particularly for SREs. Topics discussed include the challenges of AI-Ops, limitations of current anomaly detection, the potential for AI in config authoring and troubleshooting, trade-offs between product velocity and reliability, the evolving role of SREs in an AI-driven world, and the optimal timing of book publication.

View transcript

Further reading

Season 4, Episode 2

The One With Data Centers and Peter Pellerzi

This episode features guest Peter Pellerzi (Distinguished Engineer, Google). Peter and the hosts, Matt Siegler and Steve McGhee, focus on the physical infrastructure side of SRE, discussing topics such as the scale of Google's data centers, handling incidents like power outages, testing and preparedness strategies, the use of AI for optimizing cooling plants, and more. Peter also emphasizes the importance of community support, proactive planning, and learning from real-world testing and incidents to ensure high availability and resilience in data center operations.

View transcript

Further reading

Season 4, Episode 1

The One With Security and Jessica Theodat

Jessica Theodat (Senior SRE & Security Tech Lead, Google) joins hosts Jordan Greenberg and Steve McGhee to discuss the intersection of security and site reliability engineering at Google. Jessica touches on risk management, the unique nature of security incident responses, and the shared goals between security and SRE. The crew also delves into the balance between security and SRE, acknowledging the tension and the need for collaboration between teams to achieve business goals and user trust.

View transcript

Season 4, Episode 0

We’re back with Season 4!

In this episode, hosts and producers of Prodcast (including our new co-host, Matt Siegler!) reflect on the previous season and introduce the new season's focus on upcoming trends in Site Reliability Engineering (SRE) and AI, and the friends we make along the way. They also introduce new elements we are bringing in with Season 4, such as video format and a feedback form.

View transcript

Season 5: More Friends, More Trends

Season 5, Episode 8

The One With Damion Yates and Building AI systems

How do you introduce Site Reliability Engineering to an AI research lab, bringing concepts of scale to engineers who are at the leading edge of AI systems?
In the latest episode of The Prodcast, hosts Steve McGhee and Florian Rathgeber chat with Damion Yates, who helped establish the reliability engineering culture at Google DeepMind. Damion shares his journey of bringing scalable infrastructure to DeepMind, supporting massive machine learning experiments.
Discover the unique challenges of supporting AI research, such as managing highly expensive "lockstep" training models where a single machine failure halts the entire process. Damion also explains why he believes "luck is our enemy" in systems engineering, and why protecting a research scientist's time is the ultimate metric for success.

View transcript

Season 5, Episode 7

The One with Carla Geisser and Crisis Engineering

Join us for a discussion with Carla Geisser of Layer Aleph, a company focused on "crisis engineering". Carla distinguishes a crisis from a standard incident by noting that a crisis is novel and lacks a playbook. She outlines five criteria for a true crisis: fundamental surprise, broken critical functions, high visibility, a rigid deadline (unlike internal tech deadlines), and perception breakdown. Crises often arise in organizations that struggle to admit computers control core decisions, leading to complex, glued-together systems. Carla emphasizes that SRE-adjacent skills are essential for connecting the dots and exposing the full system. The key takeaway for SREs is to recognize when a true crisis is happening, as leadership will only be willing to "break rules" and enable substantive change once three of these criteria are met.

View transcript

Season 5, Episode 6

The One with Parker Barnes, Felipe Tiengo Ferreira, and AI

This episode of the Prodcast tackles the challenges of maintaining AI safety and alignment in production. Guests Felipe Tiengo Ferreira and Parker Barnes join hosts Matt Siegler and Steve McGhee to discuss AI model safety, from examining content to emerging security risks. The discussion emphasizes the vital role of SREs in managing safety at scale, detailing multi-layered defenses, including system instructions, LLM classifiers, and Automated Red Teaming (ART). Felipe and Parker dive into the evolving world of AI safety, from core product policies to the groundbreaking Frontier Safety Framework. The guests explore the need for SRE principles like drift detection and context observability. Finally, they raise concerns about the velocity of AI development compressing long-term research, urging the industry to collaborate and share vocabulary to address rapidly emerging risks.

View transcript

Season 5, Episode 5

The One with Shannon Brady and Operating Systems

In this episode of the Prodcast, guest Shannon Brady speaks with hosts Jordan Greenberg and Florian Rathgeber about managing Google’s vast fleet of internal devices. Shannon explains how Google’s Linux platform uses core SRE principles (specifically testing, canarying, and monitoring) for weekly staged rollouts of its Debian-based distribution. Configuration is managed using Puppet to ensure the right setup for a diverse user base. The conversation pivots to "the year of Linux everything," underscoring its widespread adoption. Discussing AI, Shannon identifies its greatest utility for SREs in rapidly analyzing signals and generating complex queries to resolve outages. This episode reinforces that practicing SRE fundamentals is paramount, demonstrating that you can be an SRE at heart, regardless of your official title.

View transcript

Further reading

Season 5, Episode 4

The One with Denia del Cid

Curious about the real impact of AI on Site Reliability Engineering? In this episode of The Prodcast, Google SRE Denia del Cid breaks down how her team is leveraging AI to transform production workflows. Denia details practical applications like early outage detection, incident similarity analysis, and toil reduction. She explains the critical importance of validating against "golden data sets" and keeping humans in the loop to build trust. Discover how SREs are evolving from skepticism to strategic adoption with Gemini. Tune in for a pragmatic, measured look at the future of reliability.

View transcript

Season 5, Episode 3

The One With Heather Adkins

Join us on The Prodcast as we host Heather Adkins, leader of Google’s Office of Cybersecurity Resilience, for a critical look at the future of digital defenses. We explore the intersection of SRE and security, unpacking the "Secure by Design" philosophy and the shared DNA of incident management. Heather candidly discusses the rise of "Agentic AI hackers" and polymorphic malware, revealing how defenders can use AI to stay ahead. From "castle" defense strategies to "nodal biology" theories, this episode is a must-listen for anyone navigating the new era of AI-driven threats.

View transcript

Further reading

Season 5, Episode 2

The One With SLOs

In this episode, we welcome Alex Hidalgo and Brian Singer of Nobl9 to discuss Service Level Objectives (SLOs). Alex and Brian talk about how SLOs can establish a vernacular across industry verticals, leading to constructive conversations and a shared understanding of how to implement SRE practices. Join us for a lively discussion that ranges across SLO topics!

View transcript

Further reading

Season 5, Episode 1

The One With Stephanie Hippo and Observability

In this episode, Steph Hippo, Platform Engineering Director at Honeycomb, joins The Prodcast to discuss AI and SRE. Steph explains how observability helps us understand complex systems from their outputs, and provides a foundation for SRE to respond to system problems. This episode explains how AI and observability build a self-reinforcing loop. We also discuss how AI can detect and respond to certain classes of incidents, leading to self-healing systems and allowing SREs to focus on novel and interesting problems. She advises small businesses adopting AI to learn from others' mistakes (post-mortems) and to commit time and budget to experimentation.

View transcript

Season 6: Prodcast Live!

Season 6, Episode 9

Adam Kramer discusses Incident Response

Google's Tech Incident Response Team (IRT) manages the complex production environment during times of intensive change. In this episode, IRT member Adam Kramer talks about the importance of psychological safety to ensure that engineers can communicate effectively.

View transcript

Further reading

Season 6, Episode 8

Courtney Nash on Complex Systems

Courtney Nash of The VOID discusses the role of human expertise in managing complex systems, and how SREs continue to bring critical value even as technology and AI evolve.

View transcript

Further reading

Season 6, Episode 6-7

A conversation with John Allspaw at SREcon Part 1

John Allspaw of Adaptive Capacity Labs talks with the Prodcast about the challenges of dynamic systems, the value of learning, and more — in two parts, live from SREcon!

View transcript

Further reading

Season 6, Episode 6-7

A conversation with John Allspaw at SREcon Part 2

View transcript

Further reading

Season 6, Episode 5

The SRE At Home

We speak with Ricard Bejarano about being an SRE at home, discussing Home Lab systems.

View transcript

Further reading

Season 6, Episode 4

Matt Zelesko and the Future of SRE

We sit down with Matt Zelesko, VP of SRE at Google, for a candid talk about how AI is changing SRE — and how it’s not.

View transcript

Further reading

Season 6, Episode 3

Handling Burnout with Sam Anderson

Sam Anderson shares his experiences with burnout, and how to support yourself as a reliable system. Sam provides guidance on how to deal with burnout, and some suggestions on how to avoid burnout through understanding yourself and finding the help and support you need.

View transcript

Season 6, Episode 2

Mikey Dickerson and Crisis Engineering

Crisis Engineer Mikey Dickerson joins us to talk about what constitutes a crisis. Mikey draws on his broad experience across industry and the public sector, as well as on work with his team of systems fixers.

View transcript

Further reading

Season 6, Episode 1

This is Fine! with Colette Alexander and Clint Byrum

What’s happening in the world of SRE and Resilience Engineering? Join us as we catch up with fellow podcast hosts Colette Alexander and Clint Byrum of the “This Is Fine!” podcast live at SREcon in Seattle.

View transcript

Further reading

Meet Your Hosts

Jordan Greenberg

Engineering Program Manager, GCP

Seasons 3+

Steve McGhee

Reliability Advocate, SRE

Seasons 2+

Florian Rathgeber

Site Reliability Engineer, GCP

Seasons 3+

Matthew Siegler

Machine Learning Infrastructure SRE

Seasons 4+

MP English

Systems Engineer

Seasons 1 & 2

Meet our Production Team

Paul Guglielmino

Staff Software Engineer (Sound Engineer)

Salim Virji

Site Reliability Engineer | SRE Education Program Manager (Producer)

Foreword

The Google Prodcast Team has gone through quite a few iterations and hiatuses over the years, and many people have had a hand in its existence. For the longest time, a handful of SREs produced the Prodcast for the listening pleasure of the other engineers here at Google.

The credit for a lot of the project really goes to John Reese, known around Google as JTR. The Prodcast was a project he kept alive as other team members came and went. Eventually, JTR decided to explore the world outside of Google and the Prodcast was left in an uncertain state. I had been a part of the team for a while, Viv had just joined the team, and the other member of the team had to step away due to other commitments.

At this point, we decided to make a hard pivot. We decided that we wanted to make a podcast for more than just engineers at Google. We wanted to make something that would be of interest to folks across organizations and technical implementations. In his last act as part of the Prodcast, JTR put us in touch with Jennifer Petoff, Director of SRE Education, in order to have the support of the SRE organization behind us. With that, we turned to one of the most studied resources in SRE: the Google SRE Book.

We didn't want to rehash what the book already discussed in detail; we might as well have just recorded an audiobook if that was our goal. Originally, we were aiming for something in the neighborhood of an update—a revision—to the SRE Book. What we ended up with was a series of conversations with domain experts at Google that often challenged the orthodoxy of the SRE Book, sometimes entirely reframing the topic, as is particularly the case with our episode on SLOs, one of my personal favorites.

I found myself learning new things during every recording session, even though we had already met with our guests to map the episodes out! It was an absolute pleasure chatting with all our guests, to the point that we often continued talking after we finished recording. I am immensely grateful to all our guests for the time they contributed so that we may put all of this together for you. I hope you enjoy listening as much as we enjoyed recording.

To the present and future reliability of you and your services,

— MP English from the Prodcast Team

Acknowledgments

This season is brought to you by hosts Jordan Greenberg, Steve McGhee, Florian Rathgeber, and Matt Siegler, with contributions from many SREs behind the scenes. The Prodcast is produced by Paul Guglielmino and Salim Virji. The Prodcast theme is Telebot, by Javi Beltran and Jordan Greenberg.

We are grateful for the contributions of Sunny Hsiao, Christof Leng, Jennifer Petoff, and John Reese.