What is Site Reliability Engineering (SRE)?
SRE is what you get when you treat operations as if it’s a software problem. Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services — Google Search, Ads, Gmail, Android, YouTube, and App Engine, to name just a few — with an ever-watchful eye on their availability, latency, performance, and capacity.
What we do as SRE
Our job is a combination not found elsewhere in the industry. Like traditional operations groups, we keep important, revenue-critical systems up and running despite hurricanes, bandwidth outages, and configuration errors.
How we SRE at Google
As SRE, we flip between the fine-grained detail of disk driver IO scheduling to the big picture of continental-level service capacity, across a range of systems and a user population measured in billions.