SRE in the Cloud
Learn how to put SRE principles into practice by leveraging cloud technology. Implement SRE in your organization through tooling, hands-on tutorials, videos, blogs, and other resources.
Simplify your SRE journey with cloud-native tooling
Balance development velocity and reliability
Manage reliability and drive alignment between developers and operators with baked-in SRE best practices. Create service-level indicators (SLIs), set service-level objectives (SLOs), and track errors easily with Service Monitoring. Out-of-the-box metric dashboards are available to help you quickly view and analyze service health.
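To make the relationship between an SLI, an SLO, and the resulting error budget concrete, here is a minimal Python sketch. It is not the Service Monitoring API: the function names and the sample request counts are hypothetical, and it assumes a simple request-based availability SLI (good requests divided by total requests).

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Request-based SLI: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as fully available
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int,
                           slo_target: float) -> float:
    """Fraction of the error budget still unspent in this window.

    1.0 means no budget has been used, 0.0 means it is exactly exhausted,
    and a negative value means the SLO has been missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic, so no budget spent
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures


# Hypothetical window: 1,000,000 requests, 500 failures, 99.9% SLO.
sli = availability_sli(999_500, 1_000_000)
remaining = error_budget_remaining(999_500, 1_000_000, slo_target=0.999)
print(f"SLI = {sli:.4%}, error budget remaining = {remaining:.1%}")
```

In this hypothetical window the SLO allows roughly 1,000 failed requests, so 500 failures consume about half of the budget. A team could use that remaining fraction to decide whether to keep shipping features or to prioritize reliability work.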
Reduce toil through built-in integrations
One integrated view across metrics, uptime monitoring, dashboards, and alerts helps with faster resolution and in-context observability. You also get access to metrics, traces, and logs with zero setup. Connect to tools you love, like PagerDuty, to troubleshoot incidents quickly across hybrid and multicloud environments. Near real-time ingestion latency and a terabyte-per-second ingestion rate ensure you can perform real-time log management and analysis at scale.
Become proactive about observability using open APIs
Leverage open observability tooling to instrument your applications. OpenTelemetry is fully integrated with Cloud Operations, so you can collect and export data from cloud-native applications. Specifically, Cloud Trace allows developers to instrument applications with OpenTelemetry and export trace data for faster incident resolution.
Leverage Google Cloud's operations suite
Monitor, troubleshoot, and improve application performance on your Google Cloud environment.
Gain visibility into the performance, availability, and health of your applications and infrastructure.
Learn by doing with hands-on tutorials, a demo environment, and step-by-step videos
Sandbox - Demo Service
Cloud Operations Sandbox is an open source tool that helps practitioners learn Google Cloud's operations suite through a one-click deployment script, load generator, and sample services following SRE best practices.
SRE Troubleshooting Lab
Hands-on lab to learn how to navigate resource pages of Google Kubernetes Engine (GKE), use the GKE dashboard, create logs-based metrics, create an SLO, and define an alert to notify SRE staff of incidents.
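The lab ends by pairing an SLO with an alert. One common way to decide when such an alert should fire is a burn-rate check: how fast the error budget is being spent compared with the rate that would exactly exhaust it over the SLO period. The Python sketch below only illustrates that arithmetic. It is not the lab's solution or a Cloud Monitoring API, and the function names, the default threshold of 10, and the sample numbers are all assumptions.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows.

    A burn rate of 1.0 spends the budget exactly over the SLO period;
    higher values exhaust it proportionally sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_alert(error_ratio: float, slo_target: float,
                 threshold: float = 10.0) -> bool:
    """Return True when the burn rate meets or exceeds the (assumed) threshold."""
    return burn_rate(error_ratio, slo_target) >= threshold


# Hypothetical measurement: 2% of requests failing against a 99.9% SLO.
print(should_alert(error_ratio=0.02, slo_target=0.999))   # fast burn, pages
print(should_alert(error_ratio=0.001, slo_target=0.999))  # burning at about 1x
```

Alerting on burn rate rather than on raw error counts ties the notification to the SLO itself, so brief blips that barely touch the budget do not page anyone, while sustained failures do.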
Engineering for Reliability
Reliability is a key feature of your service. Join us to learn how to define and defend your SLOs and improve observability of your applications running in Google Cloud. Need to know the difference between an SLI, an SLO, and an SLA, or how to better use Cloud Operations? This series is for you!
SRE practices in the cloud
Learn SRE best practices with resources created by SRE experts
Customer reliability engineering (CRE) life lessons
Learn valuable life lessons from the CRE team at Google.
Google Cloud Blog: DevOps & SRE
Google Cloud blogs written by SRE subject matter experts across various SRE and DevOps topics such as setting SLOs, getting the right culture, product announcements, customer stories, and more.
Increasing business value with better IT operations: A guide to SRE
This paper covers the business benefits of SRE, SRE best practices, what Google Cloud offers for SRE, and how Google's own experience can help customers on their SRE journey.
Learn from real-world case studies
Learn how Google Cloud customers are able to leverage SRE practices.
2021 Accelerate State of DevOps Report
In this year's report, we broadened our inquiry into operations, expanding from an analysis of service availability into the more general category of reliability. This year's survey introduced several items inspired by SRE best practices. In analyzing the results, we found evidence that teams who excel at these modern operational practices are 1.8 times more likely to report better business outcomes. To read more about this, download the report.