Case Studies in Infrastructure Change Management

Written by: Wendy Look and Mark Dallman

If you’re rolling out a large-scale infrastructure change, you know it can be like swapping out a jet engine while flying. Staying aloft takes coordination and communication with many teams, good processes and documentation, risk identification and management, monitoring, and tracking of the change’s progress, not to mention dealing with the catastrophic challenges that crop up midflight. In this report, technical program managers in Google SRE take you through case studies that demonstrate how infrastructure change projects are managed at Google.

Authors Wendy Look and Mark Dallman offer an overview of two long-term projects at Google: one to migrate all of Google’s systems from Google File System (GFS) to its successor, Colossus, and the other to move from local disk storage to diskless compute nodes for all jobs. You’ll dive into the tools and processes used to manage the changes, see what worked (and what didn’t), and discover lessons learned along the way. Best of all, you’ll get a preflight checklist drawn from these experiences that will help you keep your own projects on course.
