Index
- abandoned phase (service lifecycle), SRE engagement in, Phase 6: Abandoned
- Abuse SRE and Common Abuse Tool (CAT) teams, Ares case study, Case Study 1: Ares
- access control policies for data processing pipelines, Adhere to Access Control and Security Policies
- accidents, view in DevOps, Accidents Are Normal
- ACM Queue article on Canary Analysis service, Conclusion
- action items in postmortems, Key action item characteristics missing
- active development phase (service lifecycle), SRE engagement in, Phase 2: Active Development
- ADKAR, Background, The Prosci ADKAR Model
- AdMob and AdSense systems, What we decided to do
- AdWords NALSD example, AdWords Example
- alerts
- alert suppression, Alerts
- alerting on SLOs, Alerting on SLOs
- burn rate for error budgets, 4: Alert on Burn Rate
- considerations in, Alerting Considerations
- increased alert window, 2: Increased Alert Window
- incrementing alert duration, 3: Incrementing Alert Duration
- low-traffic services and error budget alerting, Low-Traffic Services and Error Budget Alerting
- making sure alerting is scalable, Alerting at Scale
- multiple burn rate alerts, 5: Multiple Burn Rate Alerts
- multiwindow, multi-burn-rate alerts, 6: Multiwindow, Multi-Burn-Rate Alerts
- services with extreme availability goals, Extreme Availability Goals
- target error rate ≥ SLO threshold, 1: Target Error Rate ≥ SLO Threshold
- contributors to high pager load, Pager load inputs
- paging alerts, defining thresholds for, Alerting
- provided by monitoring system, Alerts
- separating alerting in monitoring systems, Prefer Loose Coupling
- testing alerting logic in monitoring system, Testing Alerting Logic
- using good alerts and consoles to identify cause of pages, Identification delay
- analysis of postmortems, Postmortem analysis, Results of Postmortem Analysis
- animated (undesirable) language in postmortems, Animated language
- anycast, Google Cloud Load Balancing
- APIs
- application errors in data processing pipeline, Pipeline application or configuration
- Application Readiness Reviews (ARRs), Aligning Goals
- architecture and design phase (service lifecycle), SRE engagement in, Phase 1: Architecture and Design
- architecture and design, SRE team partnering on, Partnering on architecture
- architectures (opaque), troubleshooting for, Troubleshooting for Opaque Architectures
- artificial load generation, Artificial Load Generation
- aspirational SLOs, Improving the Quality of Your SLO
- authoring of postmortems, including all participants in, Include all incident participants in postmortem authoring
- autohealing, Handling Unhealthy Machines
- automated builds, Release Engineering Principles
- automated health checking, New bugs
- automation
- archiving and migration, in filer-backed home directory decommissioning, Archiving and migration automation
- assessing risks of, Assess Risk Within Automation
- automating this year’s job away, Automate This Year’s Job Away
- automating toil response, Automate Toil Response
- increasing uniformity in production environment, Increase Uniformity
- measuring effectiveness of automated tasks, Use Feedback to Improve
- partial automation with human-backed interfaces, Start with Human-Backed Interfaces, Start with human-backed interfaces
- using to reduce toil in datacenter, case study, Case Study 1: Reducing Toil in the Datacenter with Automation
- autoscaling, Autoscaling
- avoiding overloading backends, Avoiding Overloading Backends
- avoiding traffic imbalance, Avoiding Traffic Imbalance
- configuring conservatively, Configuring Conservatively
- handling unhealthy machines, Handling Unhealthy Machines
- implementing for data processing pipelines, Implement Autoscaling and Resource Planning
- in Spotify case study, Capacity planning
- including kill switches and manual overrides, Including Kill Switches and Manual Overrides
- load-based, using with load balancing, Combining Strategies to Manage Load
- load-based, using with load shedding, Combining Strategies to Manage Load
- setting constraints on the autoscaler, Setting Constraints
- using with load balancing and load shedding, precautions with, Lessons learned
- working with stateful systems, Working with Stateful Systems
- availability
- availability SLI, A Worked Example
- backend services, avoiding overloading in autoscaling, Avoiding Overloading Backends
- balance between on-call and project work for SREs, Recap of “Being On-Call” Chapter of First SRE Book
- baseline of a functional team, Reducing Overload and Restoring Team Health
- BeyondCorp network security protocol, Problem Statement
- BGP (Border Gateway Protocol), Anycast
- blameful language, mitigating damage of, Failing to reinforce the culture
- blameless language, Use blameless language
- blameless postmortems, Compare and Contrast, It’s Better to Fix It Yourself; Don’t Blame Someone Else, Postmortem Culture: Learning from Failure
- blue/green deployment, Blue/Green Deployment
- boiling the frog, Vigilance
- Borg, Background
- breaks (long-term) in on-call scheduling, Plan for long-term breaks
- Bridges Transition Model, Emotion-Based Models, How These Theories Apply to SRE
- bucketing, using with SLIs, Grading Interaction Importance
- bugs
- burn rate (error budgets)
- business intelligence, Data Analytics
- business priorities, communicating between SRE and developer teams, Communicating Business and Production Priorities
- business processes producing toil, Business Processes
- calculations support (monitoring systems), Calculations
- CALMS (Culture, Automation, Lean, Measurement, and Sharing), Background on DevOps
- canarying releases, Canarying Releases, Identification delay
- balancing release velocity and reliability, Balancing Release Velocity and Reliability
- canary implementation, Canary Implementation
- canarying, defined, What Is Canarying?
- dependencies and isolation, Dependencies and Isolation
- for data processing pipeline, Canarying
- in noninteractive systems, Canarying in Noninteractive Systems
- related concepts, Related Concepts
- release engineering and canarying, Release Engineering and Canarying
- release engineering principles, Release Engineering Principles
- requirements on monitoring data, Requirements on Monitoring Data
- roll forward deployment vs. simple canary deployment, A Roll Forward Deployment Versus a Simple Canary Deployment
- selecting and evaluating metrics, Selecting and Evaluating Metrics
- separating components that change at different rates, Balancing Release Velocity and Reliability
- with GCLB, GCLB: High Availability
- capacity planning, Cost Engineering and Capacity Planning
- CCN (cyclomatic complexity number), Measuring Complexity
- change control, Organizational Change Management in SRE
- change management
- in DevOps and SRE, Compare and Contrast
- organizational change management in SRE, Organizational Change Management in SRE
- case study, common tooling adoption in SRE, Background
- case study, scaling Waze, Background
- emotion-based models, Emotion-Based Models
- how change management theories apply to SRE, How These Theories Apply to SRE
- Kotter’s eight-step process, Kotter’s Eight-Step Process for Leading Change
- Lewin’s three-stage model, Lewin’s Three-Stage Model
- McKinsey’s 7-S model, McKinsey’s 7-S Model
- Prosci ADKAR model, The Prosci ADKAR Model
- SRE embracing change, SRE Embraces Change
- changes
- checklist for new SRE team training, Training roadmap
- checklist for postmortems, Postmortem checklist
- checkpointing in data processing pipelines, Checkpointing
- clarity in good postmortem example, Clarity
- click-through rate (CTR), AdWords Example
- ClickMap (AdWords example), LogJoiner
- client behavior changes, resulting in bugs, New bugs
- Clos network topology, Background
- cloud environment
- code
- codelabs, Training roadmap
- collaboration in DevOps and SRE, Compare and Contrast
- colocation facilities (colos), racks of proxy/cache machines in, Case Study
- communication
- Communications Lead (CL), Incident
- completeness SLO (Spotify case study), Completeness
- complexity
- complexity toil, Configuration-Induced Toil
- components
- Compute Engine, Breaking Down the SLO Wall Between Customer and Cloud Provider
- conciseness in postmortems, Conciseness
- configuration, Configuration Design and Best Practices
- about, What Is Configuration?
- and reliability, Configuration and Reliability
- configuration-induced toil, Configuration-Induced Toil
- critical properties and pitfalls of configuration systems, Critical Properties and Pitfalls of Configuration Systems
- data processing pipeline, causing failures, Pipeline application or configuration
- effectively operating a configuration system, Effectively Operating a Configuration System
- guarding against abusive configuration, Guarding Against Abusive Configuration
- integrating a configuration language, Integrating a Configuration Language
- integrating custom applications, Integrating Custom Applications (In-House Software)
- integrating existing application, Kubernetes, Integrating an Existing Application: Kubernetes
- mechanics of, Mechanics of Configuration
- monitoring system, treating as code, Treat Your Configuration as Code
- philosophy of, Configuration Philosophy
- separating philosophy and mechanics, Separating Philosophy and Mechanics
- when to evaluate, When to Evaluate Configuration
- configuration syntax, tooling for, Configuration syntax
- connection tracking, Maglev
- consensus algorithms, Multidatacenter
- consistent hashing, Maglev
- contacts, preparing for an incident, Prepare a list of contacts
- context, missing, in bad postmortem, Missing context
- continuous integration and continuous delivery (CI/CD), Canarying Releases, Change Should Be Gradual
- coordination of incident response effort, Incident Command System
- correctness SLI, A Worked Example
- correctness SLOs, Data correctness
- corrupt data (in data processing pipeline), Corrupt data
- cost engineering, Cost Engineering and Capacity Planning
- coverage SLI, A Worked Example
- coverage, SLO lacking, Improving the Quality of Your SLO
- CRE, Engaging with CRE
- CRITICAL requests, Alerting at Scale
- cross-team reviews of postmortems, Conduct cross-team reviews
- CRUD-style APIs, monitoring system configuration, Treat Your Configuration as Code
- Cs of incident management (coordinate, communicate, control), Incident Command System
- CTR, AdWords Example
- culture
- customers
- cyclomatic code complexity, Measuring Complexity
- dashboards
- data analysis pipeline, case study, Case Study 2: Data Analysis Pipeline
- data analytics, Data Analytics
- data collection to monitor causes of on-call load, Data quality
- data correctness, Data correctness
- data formats for configuration information, Separate Configuration and Resulting Data
- data freshness, Data freshness
- data isolation, Data isolation/load balancing
- data processing pipelines, Data Processing Pipelines
- data-only languages (configuration), Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem
- datacenters
- debugging, Encourage Consistency
- delayed data (in data processing pipeline), Delayed data
- Deming Cycle, The Deming Cycle
- dependencies
- canarying releases and, Dependencies and Isolation
- circular dependency, pDNS, Background
- effects on availability, Troubleshooting for Opaque Architectures
- external dependencies and incident response, Incident
- in data processing pipeline, causing failures, Pipeline dependencies
- modeling, Modeling Dependencies
- monitoring system metrics from, Dependencies
- planning for failure in data processing pipelines, Plan for Dependency Failure
- deployments
- deprecation phase (service lifecycle), SRE engagement in, Phase 5: Deprecation
- depth in postmortems, Depth
- design phase (service lifecycle), SRE engagement in, Phase 1: Architecture and Design
- design process (NALSD), Design Process
- destructive testing or fuzzing, Preexisting bugs
- detection time (SLO alerts), Alerting Considerations
- development
- development process in Spotify case study, Development process
- mapping data processing pipeline development lifecycle, Map Your Development Lifecycle
- partnership with SRE, critical importance of, When Can Substitute for Whether
- placing your first SRE in development team, Placing Your First SRE
- setting up relationship between SRE and development team, Setting Up the Relationship
- SRE teams having responsibility for, Self-regulating workload
- SRE-to-developer ratio and support of multiple services by single SRE team, Supporting Multiple Services with a Single SRE Team
- SREs sharing ownership with developers, Share Ownership with Developers
- sustaining an effective ongoing relationship between SRE and development teams, Sustaining an Effective Ongoing Relationship
- DevOps
- about, Background on DevOps
- accidents, view of, Accidents Are Normal
- comparing and contrasting with SRE, Compare and Contrast
- elimination of silos, No More Silos
- gradual change, Change Should Be Gradual
- implementation by SRE, Background on SRE
- interrelation of tooling and culture, Tooling and Culture Are Interrelated
- measurement as crucial, Measurement Is Crucial
- organizational context and fostering successful adoption, Organizational Context and Fostering Successful Adoption
- disaster and recovery testing, practicing with customers, Step 5: Practice, Practice, Practice
- disaster recovery testing (DiRT) at Google, Drills, Plan for Dependency Failure
- diskerase, Case Study
- Display Ads Spiderweb, simplification of (case study), Background
- distributed SRE teams, running, Running Cohesive Distributed SRE Teams
- distributed SREs, Distributed SREs
- documentation
- domain-specific languages (DSLs) for configuration, Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua
- domain-specific optimization (excessive) in configurations, Pitfall 3: Building Too Much Domain-Specific Optimization
- draining requests away from buggy system elements, Mitigation delay
- drills (incident management), Drills
- dropping work, Prioritize and triage within one quarter
- durability SLI, A Worked Example
- duration parameter (alerts), incrementing, 3: Incrementing Alert Duration
- ease of implementation and transparency
- emergency response, practicing, Identification delay
- emotion-based models for change management, Emotion-Based Models
- end-to-end measurement for data processing pipeline SLOs, End-to-end measurement
- ending SRE engagements, Ending the Relationship
- Equal-Cost Multi-Path (ECMP) forwarding, Maglev
- error budgets, Introduction of SLOs: A Journey in Progress
- addressing missed SLO caused by a dependency, Modeling Dependencies
- adjusting priorities according to, Adjusting Priorities According to Your SLOs and Error Budget
- agreed-upon by SRE and development teams for a service, Setting Ground Rules
- alerting on burn rate, 4: Alert on Burn Rate
- calculating, What to Measure: Using SLIs
- dashboard showing error budget consumption, Dashboards and Reports
- decision making using, Decision Making Using SLOs and Error Budgets
- documenting error budget policy, Documenting the SLO and Error Budget Policy
- establishing error budget policy, Establishing an Error Budget Policy
- example policy, Service Overview
- for example game service, Error Budget
- low-traffic services and error budget alerting, Low-Traffic Services and Error Budget Alerting
- minimizing risk to by canarying releases, Minimizing Risk to SLOs and the Error Budget
- prerequisites for adopting as SRE approach, Getting Started
- reliability targets and, Reliability Targets and Error Budgets
- rolling back recent changes and, Mitigation delay
- support tickets per day vs. measured loss in budget, Improving the Quality of Your SLO
- errors
- evangelism, SLOs at Home Depot, Evangelizing SLOs, The SLO Culture Project
- events
- Evernote, SLO case study, Evernote’s SLO Story
- extract, transform, load (ETL) model, Event Processing/Data Transformation to Order or Structure Data
- failures, Accidents Are Normal
- feature isolation, Mitigation delay
- features
- feedback from users, toil management with, Use Feedback to Improve
- finger pointing in postmortems, Counterproductive finger pointing
- follow-up to pager alerts, rigor in, Rigor of follow-up
- follow-up to postmortems, Postmortem follow-up
- forming your first SRE team, Forming
- Four Golden Signals, Dependencies, The SLO Culture Project
- freshness
- data freshness SLOs for data processing pipeline, Data freshness
- of data in monitoring system, Speed
- freshness SLI, A Worked Example
- funding and hiring (SREs), SRE Funding and Hiring
- future protection from overload, Protect yourself in the future
- G Suite Team Drive, What We Decided to Do
- GCE, Incident
- GCLB (Google Cloud Load Balancer), Google Cloud Load Balancing
- general availability phase (service lifecycle), SRE engagement in, Phase 4: General Availability
- generic mitigations, What could have been handled better?
- geographical splits (SRE teams), Geographical Splits
- finance, travel budget, Finance: Travel budget
- leadership, joint ownership of a service, Leadership: Joint ownership of a service
- parity, distributing work and avoiding a night shift, Parity: Distributing Work Between Offices and Avoiding a “Night Shift”
- people and projects, seeding the team, People and projects: Seeding the team
- placement, and having three shifts, Placement: What about having three shifts?
- placement, time zones apart, Placement: How many time zones apart should the teams be?
- timing, should both halves of the team start at the same time, Timing: Should both halves of the team start at the same time?
- Global Service Load Balancer (GSLB), Global Service Load Balancer
- glossary in good postmortem example, Clarity, Good Postmortem
- goals, aligning between SRE and development teams, Aligning Goals
- golden signals, Dependencies
- Google AdWords, AdWords Example
- Google Analytics, Data Analytics
- Google Apps Script, Google’s template
- Google Assistant (version 1.88), bug in, Context
- Google Cloud Load Balancer (GCLB), Google Cloud Load Balancing
- Google Cloud Load Balancing, Google Cloud Load Balancing
- Google Cloud Platform (GCP), Moving our on-prem infrastructure to the cloud, SLO Engineering Case Studies
- Google Compute Engine (GCE), Data collection
- Google Front Ends (GFEs), Google Front End, Maglev
- Google Home software bug incident response case study, Case Study 1: Software Bug—The Lights Are On but No One’s (Google) Home
- Google Kubernetes Engine (GKE), CreateCluster failure case study, Case Study 2: Service Fault—Cache Me If You Can
- Google’s Colossus File System, What We Decided to Do
- Google’s Customer Reliability Engineering (CRE) team, Engaging with CRE, SLO Engineering Case Studies
- Google’s template for postmortems, Google’s template
- GSLB (Global Service Load Balancer), Global Service Load Balancer
- HashiCorp Configuration Language, Generating Config in Specific Formats
- health checking, automated, problems with, New bugs
- help, asking for, Identification delay
- HIGH_FAST requests, Alerting at Scale
- HIGH_SLOW requests, Alerting at Scale
- Home Depot, SLO case study, The Home Depot’s SLO Story
- horizontal projects, Horizontal Projects
- horizontal SRE team, Assembling a horizontal SRE team
- hotspotting, reducing in data processing pipeline, Reduce Hotspotting and Workload Patterns
- HTTP requests
- HTTP servers, availability and latency, API and HTTP server availability and latency
- HTTP status codes, Clarifications and Caveats, Errors, Problem
- human error causing new bugs in production, New bugs
- human processes in pager load, Pager load inputs
- human-backed interfaces, starting with, Start with Human-Backed Interfaces, Start with human-backed interfaces
- Hyrum’s Law, Measuring Complexity
- IC (Incident Commander), Main Roles in Incident Response
- ICS (Incident Command System), Incident Command System
- idempotent mutations, Idempotent and Two-Phase Mutations
- identification delay for causes of pages, Identification delay
- incentives
- incident response, Incident Response
- basic principles, Incident Response
- best practices, using, Putting Best Practices into Practice
- case study, Google Home software bug and failure to communicate, Case Study 1: Software Bug—The Lights Are On but No One’s (Google) Home
- case study, Google Kubernetes Engine (GKE), cluster creation failure, Case Study 2: Service Fault—Cache Me If You Can
- case study, incident response at PagerDuty, Case Study 4: Incident Response at PagerDuty
- case study, lightning strikes at Google datacenter, Case Study 3: Power Outage—Lightning Never Strikes Twice…Until It Does
- incident management at Google, Incident Management at Google
- practicing in SRE with customers, Step 5: Practice, Practice, Practice
- incidents
- infrastructure-centric view (configuration), Configuration Asks Users Questions
- intent-based configuration, monitoring system, Treat Your Configuration as Code
- interpersonal risk taking, Problem Statement
- interrupts, Identifying and Recovering from Overload
- iterations for SLO quality improvement, Improving the Quality of Your SLO
- JavaScript, Generating Config in Specific Formats
- jitter, Future-proofing
- JSON, Generating Config in Specific Formats
- Jsonnet, Integrating Custom Applications (In-House Software), Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua, Separate Configuration and Resulting Data
- converting existing template to, Integrating the Configuration Language
- generating JSON for configuration evaluation, When to Evaluate Configuration
- guarding against abusive configurations in, Guarding Against Abusive Configuration
- library functions for outputting INI and XML, Generating Config in Specific Formats
- quick introduction to, Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua
- using with Kubernetes configurations, Integrating the Configuration Language
- validating JSON with JSONschema, Integrating Custom Applications (In-House Software)
- versioning in, Versioning
- writing tests as Jsonnet files, Testing
- languages, configuration
- latency
- latency SLI, A Worked Example
- Launch Coordination Engineering (LCE) teams, Launch Coordination Engineering Teams
- leadership in geographically split SRE teams, Leadership: Joint ownership of a service
- legacy systems, Use Feedback to Improve
- Lewin’s three-stage model for managing change, Lewin’s Three-Stage Model
- limited audience in bad postmortem example, Limited audience
- limited availability phase (service lifecycle), SRE engagement in, Phase 3: Limited Availability
- line-card repair
- linters, Configuration syntax
- live version, Our Example Setup
- load
- load balancer monitoring, API and HTTP server availability and latency
- load balancing
- load shedding, Combining Strategies to Manage Load, New bugs
- LogJoiner (AdWords example), LogJoiner
- logs, Sources of Monitoring Data
- long-term goals (SRE team), Setting Ground Rules
- LOW (availability) requests, Alerting at Scale
- low-traffic services, error budget alerting on, Low-Traffic Services and Error Budget Alerting
- machine learning, Machine Learning
- Maglev, Anycast
- implementing stabilized anycast with, Stabilized anycast
- packet delivery using consistent hashing and connection tracking, Maglev
- major incident response at PagerDuty, Major incident response at PagerDuty
- manual overrides and kill switches in autoscaling, Including Kill Switches and Manual Overrides
- MapReduce, evaluating for use in AdWords distributed system, MapReduce
- maturity matrix for data processing pipeline, Pipeline maturity matrix
- McKinsey’s 7-S change management model, McKinsey’s 7-S Model
- mean time to repair (MTTR), Move Fast by Reducing the Cost of Failure
- measurements
- decision on what/how to measure for SLOs at Evernote, Introduction of SLOs: A Journey in Progress
- effectiveness of automated tasks, Use Feedback to Improve
- end-to-end measurement for data processing pipeline SLOs, End-to-end measurement
- importance in DevOps, Measurement Is Crucial
- in DevOps and SRE, Compare and Contrast
- in SRE with customers, Step 3: Measure and Renegotiate
- measurability in action items in postmortems, Concrete action items
- measuring the SLIs, Measuring the SLIs
- measuring toil, Identify and Measure Toil, Measuring Toil
- metrics for Home Depot services, The SLO Culture Project
- sources of, for API and HTTP server availability and latency SLI, API and HTTP server availability and latency
- with service level indicators (SLIs), What to Measure: Using SLIs
- meeting in person to resolve issues, Meet in person (or as close to it as possible) to resolve issues
- metrics, Sources of Monitoring Data
- choosing between metrics-based and logs-based monitoring systems, Problem
- collecting in canary deployments, time limits on, Requirements on Monitoring Data
- establishing to evaluate team workload, Protect yourself in the future
- indicating work overload, Recognizing the Symptoms of Overload
- metric-based monitoring system, Sources of Monitoring Data
- quantifiable, in good postmortem example, Clarity
- restructuring at Evernote for cloud datacenters, Restructuring our monitoring and metrics
- selecting and evaluating for canary deployments, Selecting and Evaluating Metrics
- selection for canarying in noninteractive systems, Canarying in Noninteractive Systems
- visible and useful, from monitoring system, Metrics with Purpose
- microservices
- migrations, Migrations
- Mission Control program (SRE teams), Mission Control
- mistakes, handling appropriately, Handling Mistakes Appropriately
- mitigation delays, reducing, Mitigation delay
- mitigation in incident response, What could have been handled better?
- mobility for SREs, SRE Mobility
- Moira portal, Moira Portal
- monitoring, Monitoring
- auditing and building shared dashboards in SRE with customers, Step 2: Audit the Monitoring and Build Shared Dashboards
- calculating SLOs from data, Introduction of SLOs: A Journey in Progress
- choosing between monitoring systems, examples, Examples
- covering in interviews with SREs, Finding Your First SRE
- data collection for causes of on-call paging, Data quality
- decisions on reliability and, Your Users, Not Your Monitoring, Decide Your Reliability
- desirable features of monitoring strategy, Desirable Features of a Monitoring Strategy
- for data processing pipeline, Pipeline maturity matrix
- implementation in common tooling adoption in SRE case study, Implementation: Monitoring
- managing monitoring systems, Managing Your Monitoring System
- of release pipeline, Release Engineering Principles
- or prevention fixes for pager alerts, Rigor of follow-up
- providing visible and useful metrics, Metrics with Purpose
- requirements in canary evaluation, Requirements on Monitoring Data
- restructuring at Evernote for cloud datacenters, Restructuring our monitoring and metrics
- sources of data, Sources of Monitoring Data
- system monitoring in Spotify case study, System monitoring
- testing alerting logic, Testing Alerting Logic
- Moonwalk, Moonwalk
- multidatacenter design (AdWords example), Multidatacenter
- Murphy-Beyer effect, Automate This Year’s Job Away
- mutations, idempotent and two-phase, Idempotent and Two-Phase Mutations
- Omega, Background
- on-call, On-Call
- operational load (or operational workload), Identifying and Recovering from Overload
- operational overload (or work overload), Identifying and Recovering from Overload
- operations, How SRE Relates to DevOps
- Operations or Ops Lead (OL), Main Roles in Incident Response
- organizational change (positive), rewarding, Reward positive organizational change
- organizational change management in SRE, Organizational Change Management in SRE
- outages
- overload, identifying and recovering from, Identifying and Recovering from Overload
- ownership
- pager load, Anatomy of Pager Load
- PagerDuty, incident response case study, Case Study 4: Incident Response at PagerDuty
- part-time work schedules for on-call engineers, Plan for part-time work schedules
- participatory management, What We Decided to Do
- Paxos consensus algorithm, Multidatacenter
- pDNS (Production DNS), eliminating dependency on itself (case study), Background
- peak-end rule, Your Users, Not Your Monitoring, Decide Your Reliability
- perceived overload, From Load to Overload, Identifying and Recovering from Overload
- performance
- performing (SRE team), Performing
- pets vs. cattle approach, Increase Uniformity
- pipeline components, Types of components
- pipelines, data processing, Data Processing Pipelines
- Piper/Git-on-Borg version control system, What We Decided to Do
- Plan-Do-Check-Act (or PDCA) Cycle, The Deming Cycle
- planning by SRE and development teams, Planning and Executing
- platforms
- playbooks, maintaining, Training roadmap
- point fixes for paging alerts, Rigor of follow-up
- Pokémon GO on GCLB, case study, Case Study 1: Pokémon GO on GCLB
- positive behaviors, Be positive
- postmortems, Postmortem Culture: Learning from Failure
- bad postmortem example, Case Study
- blameless, Compare and Contrast, It’s Better to Fix It Yourself; Don’t Blame Someone Else
- case study, rack decommission leading to service latency, Case Study
- conducting joint postmortems with customers, Step 5: Practice, Practice, Practice
- good postmortem example, Good Postmortem
- organizational incentives for, Organizational Incentives
- results of postmortem analysis, Results of Postmortem Analysis
- templates for, Tools and Templates
- tools for, Postmortem Tooling
- precision in SLO alerts, Alerting Considerations
- preparations for incidents, Prepare Beforehand
- preventative actions in postmortems, Concrete action items, Key action item characteristics missing
- priorities
- prioritizing and triaging issues, Prioritize and triage within one quarter
- probers, Generating Artificial Traffic, Introduction of SLOs: A Journey in Progress, What to Measure: Using SLIs
- problem summary (postmortems), Key details omitted
- problems (simulated), practicing with customers, Step 5: Practice, Practice, Practice
- process documentation for data processing pipelines, Process documentation
- production
- automation, covering in interviews with SREs, Finding Your First SRE
- boundaries between application development and, Share Ownership with Developers
- contributors to high pager load, Pager load inputs
- data processing pipeline readiness for, Pipeline Production Readiness
- managing services in DevOps and SRE, Compare and Contrast
- priorities, communication by SRE and developer teams, Communicating Business and Production Priorities
- production excellence reviews, Production Excellence
- wisdom from, Work to Minimize Toil
- production interrupts, Production Interrupts
- Production Readiness Reviews (PRRs), Aligning Goals, Pipeline Production Readiness, SRE Engagement Model
- productionizing a service, Phase 2: Active Development
- programming languages
- project lifecycle complexity, case study, Case Study 2: Project Lifecycle Complexity
- Prometheus monitoring system, Prefer Loose Coupling
- Prosci ADKAR model, Background, How These Theories Apply to SRE, The Prosci ADKAR Model, What We Decided to Do
- mapping implementation phase of change project, Design
- protocol buffers, Integrating Custom Applications (In-House Software), Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem
- psychological safety
- publication of postmortems, Delayed publication, Promptness
- recall (SLO alerts), Alerting Considerations
- recovery efforts (in postmortems), Key details omitted
- region-level outages in data processing pipeline, Region-level outage
- regional managed instance groups (MiGs), autoscalers balancing instances across, Avoiding Traffic Imbalance
- relationships, sustaining between SRE and other teams, Sustaining an Effective Ongoing Relationship
- release candidate version, Our Example Setup
- release engineering, Canarying Releases
- releases
- reliability
- as a partnership if you run a platform, If You Run a Platform, Then Reliability Is a Partnership
- as most important feature, Reliability Is the Most Important Feature
- balancing release velocity and, Balancing Release Velocity and Reliability
- configuration and, Configuration and Reliability
- considering work on as specialized role, Consider Reliability Work as a Specialized Role
- decided by users, not monitoring, Your Users, Not Your Monitoring, Decide Your Reliability
- experimenting with relaxing SLOs, Experimenting with Relaxing Your SLOs
- improvements in, resulting from postmortems, Highlight improved reliability
- reliability hackathon for products at risk, Reassessing When Ground Rules Start to Slip
- reliability targets and error budgets, Reliability Targets and Error Budgets
- rename-and-shame, Conclusion
- repeating incidents, finding causes of, Repeating incidents
- replication toil, Configuration-Induced Toil
- reports on SLO compliance, Dashboards and Reports
- reproducible builds, Release Engineering Principles
- request-driven components, Types of components
- requests
- grouping request types into buckets of similar availability requirements, Alerting at Scale
- specific mix of, bugs manifesting with, New bugs
- Requiem (postmortem storage tool), Postmortem storage
- reset time (SLO alerts), Alerting Considerations
- resources
- response times to pages, Anatomy of Pager Load
- reusing code in data processing pipelines, Reusing code
- revenue coverage, Be Thoughtful and Disciplined
- reviews, designing in SRE with customers, Step 4: Design Reviews and Risk Analysis
- revision control system, storing monitoring system configuration in, Treat Your Configuration as Code
- risk analysis in SRE with customers, Step 4: Design Reviews and Risk Analysis
- risks and mitigations for first SRE team, Risks and mitigations
- risks, identifying by SRE team to developer team, Identifying Risks
- roadmaps, Planning and Executing
- roll forward deployment vs. simple canary deployment, A Roll Forward Deployment Versus a Simple Canary Deployment
- rollback strategy for releases, New bugs
- rolling update strategy, Safe Configuration Change Application
- root causes and trigger (in postmortems), Blamelessness, Key details omitted
- routing
- RPCs (remote procedure calls)
- satellites, Case Study
- scaling, Autoscaling
- calculations for AdWords NALSD example, Calculations
- calculations for LogJoiner, QueryStore, ClickMap, and QueryMap in AdWords example, Calculations
- during limited availability phase of service lifecycle, Phase 3: Limited Availability
- of SRE to larger environments, Scaling SRE to Larger Environments
- scalability of data processing pipeline, Pipeline maturity matrix
- Waze case study, Background
- scheduling on-call engineers, On-Call Flexibility
- scripting languages, general-purpose, using for configuration system, Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua
- security policies for data processing pipeline, adhering to, Adhere to Access Control and Security Policies
- self-healing, New bugs
- self-regulating workload (SRE team), Self-regulating workload
- self-service methods for users, Build self-service interfaces, Provide Self-Service Methods
- semantic validation (configuration), Semantic validation
- service complexity and forming more SRE teams, Where to split
- service level agreements (SLAs)
- service level indicators (SLIs), Implementing SLOs
- abstracting your system into types of components, Types of components
- architecture for example mobile phone game, A Worked Example
- automation of data collection at Home Depot, The SLO Culture Project
- canary metrics and, Metrics Should Indicate Problems
- changing SLI implementation, Improving the Quality of Your SLO
- dashboards showing SLI trends, Dashboards and Reports
- explaining to customers, Step 1: SLOs and SLIs Are How You Speak
- for example game service SLOs, SLIs and SLOs
- implementation, What to Measure: Using SLIs
- measurements with, What to Measure: Using SLIs
- measuring, Measuring the SLIs
- moving from specification to implementation, Moving from SLI Specification to SLI Implementation
- specifications, What to Measure: Using SLIs
- using to calculate starter SLOs, Using the SLIs to Calculate Starter SLOs
- service level objectives (SLOs), Implementing SLOs
- adjusting priorities according to, Adjusting Priorities According to Your SLOs and Error Budget
- alerting on, Alerting on SLOs
- agreed-upon by SRE and development teams for a service, Setting Ground Rules
- continuous improvement of SLO targets, Continuous Improvement of SLO Targets
- decision making using SLOs and error budgets, Decision Making Using SLOs and Error Budgets
- defining and measuring for data processing pipelines, Define and Measure Service Level Objectives
- defining before general availability phase of service lifecycle, Phase 3: Limited Availability
- design and implementation by your first SRE, Bootstrapping Your First SRE
- Evernote case study, Evernote’s SLO Story
- example document for game service, Example SLO Document
- for Spotify event delivery system operation, Event Delivery System Operation
- fundamental importance in SRE, SLO Engineering Case Studies
- getting started with, Getting Started
- Home Depot case study, The Home Depot’s SLO Story
- implementing, worked example, A Worked Example
- in SRE with customers, Step 1: SLOs and SLIs Are How You Speak
- managing by, in SRE, Manage by Service Level Objectives (SLOs)
- minimizing risk to by canarying releases, Minimizing Risk to SLOs and the Error Budget
- refining
- tracking performance over time at Evernote, Tracking our performance over time
- using for SRE practices without SREs, SRE Practices Without SREs
- using to reduce toil, Use SLOs to Reduce Toil
- why SREs need them, Why SREs Need SLOs
- service lifecycle, SRE engagement during, The Service Lifecycle
- service reviews (regular), performing, Performing Regular Service Reviews
- services
- sharded LogJoiner (AdWords example), Calculations
- sharing postmortems openly, Share Postmortems Openly
- shift length for on-call SREs, On-Call Flexibility
- short-term goals (SRE team), Setting Ground Rules
- side effects, interleaving with configuration evaluation, Pitfall 4: Interleaving “Configuration Evaluation” with “Side Effects”
- silos, elimination of, No More Silos
- simplicity, Simplicity Is End-to-End, and SREs Are Good for That
- case study, end-to-end API simplicity, Background
- case study, project lifecycle complexity, Case Study 2: Project Lifecycle Complexity
- escaping in configuration, Escaping Simplicity, Integrating the Configuration Language
- regaining
- case study, eliminating pDNS dependency on itself, Background
- case study, running hundreds of microservices on shared platform, Background
- case study, simplifying Display Ads Spiderweb, Background
- site reliability engineering, How SRE Relates to DevOps
- site reliability engineers, Why SREs Need SLOs
- skewness SLO (Spotify case study), Skewness
- SLAs, Breaking Down the SLO Wall Between Customer and Cloud Provider
- SLIs, Implementing SLOs
- SLOs, Implementing SLOs
- software engineering, Finding Your First SRE
- software-defined networking (SDN), Background
- source control, using with configurations, Source Control
- speed of data retrieval (monitoring system), Speed
- Spotify case study, Case Study: Spotify
- SRE (site reliability engineering)
- SRE engagement model, SRE Engagement Model
- SRE teams
- adjusting structures to changing circumstances, Adapting SRE Team Structures to Changing Circumstances
- cohesive distributed SRE teams, running, Running Cohesive Distributed SRE Teams
- forming new team at Google, Initial scenario
- lifecycles, SRE Team Lifecycles
- on-call engineers in "survive the week" culture, Scenario: A culture of "survive the week"
- single team supporting multiple services, Supporting Multiple Services with a Single SRE Team
- structuring multiple SRE team environment, Structuring a Multiple SRE Team Environment
- work overload when half a team leaves, case study, Background
- SREs (site reliability engineers)
- staging, New bugs
- stakeholders, getting agreement from on SLOs, Getting Stakeholder Agreement
- stateful systems
- statsd metric aggregation daemon, Prefer Loose Coupling
- storage systems, Types of components
- storming (first SRE team), Storming
- stressors (psychosocial), identifying and alleviating, Identify and alleviate psychosocial stressors
- string interpolation, Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem
- sunk cost fallacy, Problem Statement
- superfluous (undesirable) language in postmortems, Animated language
- support for services, deciding on, When Can Substitute for Whether
- "survive the week" culture, Scenario: A culture of "survive the week"
- swaps (short-term) in on-call scheduling, Plan for short-term swaps
- symptoms of overload, Recognizing the Symptoms of Overload
- system architecture, covering in interviews with SREs, Finding Your First SRE
- system diagrams for data processing pipeline, System diagrams
- systemic fixes for pager alerts, Rigor of follow-up
- systems
- TCP, Anycast
- team, existing, converting to SRE team, Converting a team in place, Norming
- templates
- testing
- The Home Depot (THD), The Home Depot’s SLO Story
- three Cs of incident management, Incident Command System
- tickets
- tiered SRE engagement, Outcomes
- time for postmortems, lacking, Lacking time to write postmortems
- time windows for SLOs, Choosing an Appropriate Time Window
- timeliness SLO (Spotify case study), Timeliness
- toil
- toil, eliminating, Eliminating Toil
- case study, decommissioning filer-backed home directories, Case Study 2: Decommissioning Filer-Backed Home Directories
- case study, reducing toil in datacenter with automation, Case Study 1: Reducing Toil in the Datacenter with Automation
- background, Background
- decision on the solution, What We Decided to Do
- design first effort, Saturn line-card repair, Design First Effort: Saturn Line-Card Repair
- design second effort, Saturn vs. Jupiter line-card repair, Implementation
- implementation, Jupiter line-card repair, Implementation
- implementation, Saturn line-card repair with automation, Implementation
- lessons learned, Lessons Learned
- problem statement, Problem Statement
- characteristics and examples of toil, What Is Toil?
- measuring toil, Measuring Toil
- taxonomy of toil, Toil Taxonomy
- toil management strategies, Toil Management Strategies
- assessing risks in automation, Assess Risk Within Automation
- automating toil response, Automate Toil Response
- engineering toil out of the system, Engineer Toil Out of the System
- getting support from management and colleagues, Get Support from Management and Colleagues
- identifying and measuring toil, Identify and Measure Toil
- increasing uniformity, Increase Uniformity
- legacy systems, Use Feedback to Improve
- promoting toil reduction as a feature, Promote Toil Reduction as a Feature
- providing self-service methods, Provide Self-Service Methods
- rejecting toil, Reject the Toil
- starting small, then improving, Start Small and Then Improve
- starting with human-backed interfaces, Start with Human-Backed Interfaces
- using feedback to improve, Use Feedback to Improve
- using open source and third-party tools, Use Open Source and Third-Party Tools
- using SLOs to reduce toil, Use SLOs to Reduce Toil
- tooling
- tools
- traffic teeing, Traffic Teeing
- traffic volume metrics, The SLO Culture Project
- training
- travel for SRE teams, Travel
- triaging interrupts, Lessons Learned
- troubleshooting for opaque architectures, Troubleshooting for Opaque Architectures
- truncated exponential backoff, Future-proofing
- two-phase mutations, Idempotent and Two-Phase Mutations
- ulimit utility, Guarding Against Abusive Configuration
- unhealthy instances, handling, Handling Unhealthy Machines
- uniformity, increasing, Increase Uniformity, Melt snowflakes
- unsupported phase, service lifecycle, Phase 7: Unsupported
- user behaviors causing bugs to manifest, New bugs
- user journeys, modeling, Modeling User Journeys
- user-centric view (configuration), Configuration Asks Users Questions
- users
- utilization