Site Reliability Engineering
Click on a chapter thumbnail to see relevant publications, conference talks, and workshops by Google SREs.
2. The Production Environment at Google, from the Viewpoint of an SRE
3. Embracing Risk
4. Service Level Objectives
5. Eliminating Toil
6. Monitoring Distributed Systems
7. The Evolution of Automation at Google
8. Release Engineering
9. Simplicity
10. Practical Alerting
11. Being On-Call
12. Effective Troubleshooting
13. Emergency Response
14. Managing Incidents
15. Postmortem Culture: Learning from Failure
16. Tracking Outages
17. Testing for Reliability
18. Software Engineering in SRE
19. Load Balancing at the Frontend
20. Load Balancing in the Datacenter
21. Handling Overload
22. Addressing Cascading Failures
23. Managing Critical State: Distributed Consensus for Reliability
24. Distributed Periodic Scheduling with Cron
25. Data Processing Pipelines
26. Data Integrity: What You Read Is What You Wrote
27. Reliable Product Launches at Scale
28. Accelerating SREs to On-Call and Beyond
29. Dealing with Interrupts
30. Embedding an SRE to Recover from Operational Overload
31. Communication and Collaboration in SRE
32. The Evolving SRE Engagement Model
33. Lessons Learned from Other Industries