Bibliography
- [Ada15] Bram Adams, Stephany Bellomo, Christian Bird, Tamara Marshall-Keim, Foutse Khomh, and Kim Moir, "The Practice and Future of Release Engineering: A Roundtable with Three Release Engineers", IEEE Software, vol. 32, no. 2 (March/April 2015), pp. 42–49.
- [Agu10] M. K. Aguilera, "Stumbling over Consensus Research: Misunderstandings and Issues", in Replication, Lecture Notes in Computer Science 5959, 2010.
- [All10] J. Allspaw and J. Robbins, Web Operations: Keeping the Data on Time: O’Reilly, 2010.
- [All12] J. Allspaw, "Blameless PostMortems and a Just Culture", blog post, 2012.
- [All15] J. Allspaw, "Trade-Offs Under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages", MSc thesis, Lund University, 2015.
- [Ana07] S. Anantharaju, "Automating web application security testing", blog post, July 2007.
- [Ana13] R. Ananthanarayanan et al., "Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams", in SIGMOD '13, 2013.
- [And05] A. Andrieux, K. Czajkowski, A. Dan, et al., "Web Services Agreement Specification (WS-Agreement)", September 2005.
- [Bai13] P. Bailis and A. Ghodsi, "Eventual Consistency Today: Limitations, Extensions, and Beyond", in ACM Queue, vol. 11, no. 3, 2013.
- [Bai83] L. Bainbridge, "Ironies of Automation", in Automatica, vol. 19, no. 6, November 1983.
- [Bak11] J. Baker et al., "Megastore: Providing Scalable, Highly Available Storage for Interactive Services", in Proceedings of the Conference on Innovative Data Systems Research, 2011.
- [Bar11] L. A. Barroso, "Warehouse-Scale Computing: Entering the Teenage Decade", talk at 38th Annual Symposium on Computer Architecture, video available online, 2011.
- [Bar13] L. A. Barroso, J. Clidaras, and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition, Morgan & Claypool, 2013.
- [Ben12] C. Bennett and A. Tseitlin, "Chaos Monkey Released Into The Wild", blog post, July 2012.
- [Bla14] M. Bland, "Goto Fail, Heartbleed, and Unit Testing Culture", blog post, June 2014.
- [Boc15] L. Bock, Work Rules!, Twelve Books, 2015.
- [Bol11] W. J. Bolosky, D. Bradshaw, R. B. Haagens, N. P. Kusters, and P. Li, "Paxos Replicated State Machines as the Basis of a High-Performance Data Store", in Proc. NSDI 2011, 2011.
- [Boy13] P. G. Boysen, "Just Culture: A Foundation for Balanced Accountability and Patient Safety", in The Ochsner Journal, Fall 2013.
- [Bra15] VM Brasseur, "Failure: Why it happens & How to benefit from it", YAPC 2015.
- [Bre01] E. Brewer, "Lessons From Giant-Scale Services", in IEEE Internet Computing, vol. 5, no. 4, July/August 2001.
- [Bre12] E. Brewer, "CAP Twelve Years Later: How the "Rules" Have Changed", in Computer, vol. 45, no. 2, February 2012.
- [Bro15] M. Brooker, "Exponential Backoff and Jitter", on AWS Architecture Blog, March 2015.
- [Bro95] F. P. Brooks Jr., "No Silver Bullet—Essence and Accidents of Software Engineering", in The Mythical Man-Month, Boston: Addison-Wesley, 1995, pp. 180–186.
- [Bru09] J. Brutlag, "Speed Matters", on Google Research Blog, June 2009.
- [Bul80] G. M. Bull, The Dartmouth Time-sharing System: Ellis Horwood, 1980.
- [Bur99] M. Burgess, Principles of Network and System Administration: Wiley, 1999.
- [Bur06] M. Burrows, "The Chubby Lock Service for Loosely-Coupled Distributed Systems", in OSDI '06: Seventh Symposium on Operating System Design and Implementation, November 2006.
- [Bur16] B. Burns, B. Grant, D. Oppenheimer, E. Brewer, and J. Wilkes, "Borg, Omega, and Kubernetes", in ACM Queue, vol. 14, no. 1, 2016.
- [Cas99] M. Castro and B. Liskov, "Practical Byzantine Fault Tolerance", in Proc. OSDI 1999, 1999.
- [Cha10] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. Henry, R. Bradshaw, and N. Weizenbaum, "FlumeJava: Easy, Efficient Data-Parallel Pipelines", in ACM SIGPLAN Conference on Programming Language Design and Implementation, 2010.
- [Cha96] T. D. Chandra and S. Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems", in J. ACM, 1996.
- [Cha07] T. Chandra, R. Griesemer, and J. Redstone, "Paxos Made Live—An Engineering Perspective", in PODC '07: 26th ACM Symposium on Principles of Distributed Computing, 2007.
- [Cha06] F. Chang et al., "Bigtable: A Distributed Storage System for Structured Data", in OSDI '06: Seventh Symposium on Operating System Design and Implementation, November 2006.
- [Chr09] G. P. Chrousos, "Stress and Disorders of the Stress System", in Nature Reviews Endocrinology, vol. 5, no. 7, 2009.
- [Clos53] C. Clos, "A Study of Non-Blocking Switching Networks", in Bell System Technical Journal, vol. 32, no. 2, 1953.
- [Con15] C. Contavalli, W. van der Gaast, D. Lawrence, and W. Kumari, "Client Subnet in DNS Queries", IETF Internet-Draft, 2015.
- [Con63] M. E. Conway, "Design of a Separable Transition-Diagram Compiler", in Commun. ACM 6, 7 (July 1963), 396–408.
- [Con96] P. Conway, "Preservation in the Digital World", report published by the Council on Library and Information Resources, 1996.
- [Coo00] R. I. Cook, "How Complex Systems Fail", in Web Operations: O’Reilly, 2010.
- [Cor12] J. C. Corbett et al., "Spanner: Google’s Globally-Distributed Database", in OSDI '12: Tenth Symposium on Operating System Design and Implementation, October 2012.
- [Cra10] J. Cranmer, "Visualizing code coverage", blog post, March 2010.
- [Dea13] J. Dean and L. A. Barroso, "The Tail at Scale", in Communications of the ACM, vol. 56, 2013.
- [Dea04] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", in OSDI '04: Sixth Symposium on Operating System Design and Implementation, December 2004.
- [Dea07] J. Dean, "Software Engineering Advice from Building Large-Scale Distributed Systems", Stanford CS297 class lecture, Spring 2007.
- [Dek02] S. Dekker, "Reconstructing human contributions to accidents: the new view on error and performance", in Journal of Safety Research, vol. 33, no. 3, 2002.
- [Dek14] S. Dekker, The Field Guide to Understanding "Human Error", 3rd edition: Ashgate, 2014.
- [Dic14] C. Dickson, "How Embracing Continuous Release Reduced Change Complexity", presentation at USENIX Release Engineering Summit West 2014, video available online.
- [Dur05] J. Durmer and D. Dinges, "Neurocognitive Consequences of Sleep Deprivation", in Seminars in Neurology, vol. 25, no. 1, 2005.
- [Eis16] D. E. Eisenbud et al., "Maglev: A Fast and Reliable Software Network Load Balancer", in NSDI '16: 13th USENIX Symposium on Networked Systems Design and Implementation, March 2016.
- [Ere03] J. R. Erenkrantz, "Release Management Within Open Source Projects", in Proceedings of the 3rd Workshop on Open Source Software Engineering, Portland, Oregon, May 2003.
- [Fis85] M. J. Fischer, N. A. Lynch, and M. S. Paterson, "Impossibility of Distributed Consensus with One Faulty Process", J. ACM, 1985.
- [Fit12] B. W. Fitzpatrick and B. Collins-Sussman, Team Geek: A Software Developer’s Guide to Working Well with Others: O’Reilly, 2012.
- [Flo94] S. Floyd and V. Jacobson, "The Synchronization of Periodic Routing Messages", in IEEE/ACM Transactions on Networking, vol. 2, issue 2, April 1994, pp. 122–136.
- [For10] D. Ford et al., "Availability in Globally Distributed Storage Systems", in Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, 2010.
- [Fox99] A. Fox and E. A. Brewer, "Harvest, Yield, and Scalable Tolerant Systems", in Proceedings of the 7th Workshop on Hot Topics in Operating Systems, Rio Rico, Arizona, March 1999.
- [Fow08] M. Fowler, "GUI Architectures", blog post, 2006.
- [Gal78] J. Gall, SYSTEMANTICS: How Systems Really Work and How They Fail, 1st ed., Pocket, 1977.
- [Gal03] J. Gall, The Systems Bible: The Beginner’s Guide to Systems Large and Small, 3rd ed., General Systemantics Press/Liberty, 2003.
- [Gaw09] A. Gawande, The Checklist Manifesto: How to Get Things Right: Henry Holt and Company, 2009.
- [Ghe03] S. Ghemawat, H. Gobioff, and S-T. Leung, "The Google File System", in 19th ACM Symposium on Operating Systems Principles, October 2003.
- [Gil02] S. Gilbert and N. Lynch, "Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services", in ACM SIGACT News, vol. 33, no. 2, 2002.
- [Gla02] R. Glass, Facts and Fallacies of Software Engineering, Addison-Wesley Professional, 2002.
- [Gol14] W. Golab et al., "Eventually Consistent: Not What You Were Expecting?", in ACM Queue, vol. 12, no. 1, 2014.
- [Gra09] P. Graham, "Maker’s Schedule, Manager’s Schedule", blog post, July 2009.
- [Gup15] A. Gupta and J. Shute, "High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads", in Workshop on Business Intelligence for the Real Time Enterprise, 2015.
- [Ham07] J. Hamilton, "On Designing and Deploying Internet-Scale Services", in Proceedings of the 21st Large Installation System Administration Conference, November 2007.
- [Han94] S. Hanks, T. Li, D. Farinacci, and P. Traina, "Generic Routing Encapsulation over IPv4 networks", IETF Informational RFC, 1994.
- [Hic11] M. Hickins, "Tape Rescues Google in Lost Email Scare", in Digits, Wall Street Journal, 1 March 2011.
- [Hix15a] D. Hixson, "Capacity Planning", in ;login:, vol. 40, no. 1, February 2015.
- [Hix15b] D. Hixson, "The Systems Engineering Side of Site Reliability Engineering", in ;login:, vol. 40, no. 3, June 2015.
- [Hod13] J. Hodges, "Notes on Distributed Systems for Young Bloods", blog post, 14 January 2013.
- [Hol14] L. Holmwood, "Applying Cardiac Alarm Management Techniques to Your On-Call", blog post, 26 August 2014.
- [Hum06] J. Humble, C. Read, and D. North, "The Deployment Production Line", in Proceedings of the IEEE Agile Conference, July 2006.
- [Hum10] J. Humble and D. Farley, Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation: Addison-Wesley, 2010.
- [Hun10] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, "ZooKeeper: Wait-free coordination for Internet-scale systems", in USENIX ATC, 2010.
- [IAEA12] International Atomic Energy Agency, "Safety of Nuclear Power Plants: Design, SSR-2/1", 2012.
- [Jai13] S. Jain et al., "B4: Experience with a Globally-Deployed Software Defined WAN", in SIGCOMM '13.
- [Jon15] C. Jones, T. Underwood, and S. Nukala, "Hiring Site Reliability Engineers", in ;login:, vol. 40, no. 3, June 2015.
- [Jun07] F. Junqueira, Y. Mao, and K. Marzullo, "Classic Paxos vs. Fast Paxos: Caveat Emptor", in Proc. HotDep '07, 2007.
- [Jun11] F. P. Junqueira, B. C. Reed, and M. Serafini, "Zab: High-performance broadcast for primary-backup systems", in 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN), June 2011, pp. 245–256.
- [Kah11] D. Kahneman, Thinking, Fast and Slow: Farrar, Straus and Giroux, 2011.
- [Kar97] D. Karger et al., "Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web", in Proc. STOC '97, 29th annual ACM symposium on theory of computing, 1997.
- [Kem11] C. Kemper, "Build in the Cloud: How the Build System Works", Google Engineering Tools blog post, August 2011.
- [Ken12] S. Kendrick, "What Takes Us Down?", in ;login:, vol. 37, no. 5, October 2012.
- [Kinc09] J. Kincaid, "T-Mobile Sidekick Disaster: Danger’s Servers Crashed, And They Don’t Have A Backup", TechCrunch, 10 October 2009, accessed 20 January 2015, https://techcrunch.com/2009/10/10/t-mobile-sidekick-disaster-microsofts-servers-crashed-and-they-dont-have-a-backup.
- [Kin15] K. Kingsbury, "The trouble with timestamps", blog post, 2013.
- [Kir08] J. Kirsch and Y. Amir, "Paxos for System Builders: An Overview", in Proc. LADIS '08, 2008.
- [Kla12] R. Klau, "How Google Sets Goals: OKRs", blog post, October 2012.
- [Kle06] D. V. Klein, "A Forensic Analysis of a Distributed Two-Stage Web-Based Spam Attack", in Proceedings of the 20th Large Installation System Administration Conference, December 2006.
- [Kle14] D. V. Klein, D. M. Betser, and M. G. Monroe, "Making Push On Green a Reality", in ;login:, vol. 39, no. 5, October 2014.
- [Kra08] T. Krattenmaker, "Make Every Meeting Matter", in Harvard Business Review, February 27, 2008.
- [Kre12] J. Kreps, "Getting Real About Distributed System Reliability", blog post, 19 March 2012.
- [Kri12] K. Krishnan, "Weathering The Unexpected", in Communications of the ACM, vol. 55, no. 11, November 2012.
- [Kum15] A. Kumar et al., "BwE: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing", in SIGCOMM '15.
- [Lam98] L. Lamport, "The Part-Time Parliament", in ACM Transactions on Computer Systems 16, 2, May 1998.
- [Lam01] L. Lamport, "Paxos Made Simple", in ACM SIGACT News 121, December 2001.
- [Lam06] L. Lamport, "Fast Paxos", in Distributed Computing 19.2, October 2006.
- [Lim14] T. A. Limoncelli, S. R. Chalup, and C. J. Hogan, The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, Volume 2: Addison-Wesley, 2014.
- [Loo10] J. Loomis, "How to Make Failure Beautiful: The Art and Science of Postmortems", in Web Operations: O’Reilly, 2010.
- [Lu15] H. Lu et al., "Existential Consistency: Measuring and Understanding Consistency at Facebook", in SOSP '15, 2015.
- [Mao08] Y. Mao, F. P. Junqueira, and K. Marzullo, "Mencius: Building Efficient Replicated State Machines for WANs", in OSDI '08, 2008.
- [Mas43] A. H. Maslow, "A Theory of Human Motivation", in Psychological Review 50(4), 1943.
- [Mau15] B. Maurer, "Fail at Scale", in ACM Queue, vol. 13, no. 12, 2015.
- [May09] M. Mayer, "This site may harm your computer on every search result?!?!", blog post, January 2009.
- [McI86] M. D. McIlroy, "A Research Unix Reader: Annotated Excerpts from the Programmer’s Manual, 1971–1986".
- [McN13] D. McNutt, "Maintaining Consistency in a Massively Parallel Environment", presentation at USENIX Configuration Management Summit 2013, video available online.
- [McN14a] D. McNutt, "Accelerating the Path from Dev to DevOps", in ;login:, vol. 39, no. 2, April 2014.
- [McN14b] D. McNutt, "The 10 Commandments of Release Engineering", presentation at 2nd International Workshop on Release Engineering 2014, April 2014.
- [McN14c] D. McNutt, "Distributing Software in a Massively Parallel Environment", presentation at USENIX LISA 2014, video available online.
- [Mic03] Microsoft TechNet, "What is SNMP?", last modified March 28, 2003, https://technet.microsoft.com/en-us/library/cc776379%28v=ws.10%29.aspx.
- [Mea08] D. Meadows, Thinking in Systems: Chelsea Green, 2008.
- [Men07] P. Menage, "Adding Generic Process Containers to the Linux Kernel", in Proc. of Ottawa Linux Symposium, 2007.
- [Mer11] N. Merchant, "Culture Trumps Strategy, Every Time", in Harvard Business Review, March 22, 2011.
- [Moc87] P. Mockapetris, "Domain Names - Implementation and Specification", IETF Internet Standard, 1987.
- [Mol86] C. Moler, "Matrix Computation on Distributed Memory Multiprocessors", in Hypercube Multiprocessors 1986, 1987.
- [Mor12a] I. Moraru, D. G. Andersen, and M. Kaminsky, "Egalitarian Paxos", Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-12-108, 2012.
- [Mor14] I. Moraru, D. G. Andersen, and M. Kaminsky, "Paxos Quorum Leases: Fast Reads Without Sacrificing Writes", in Proc. SOCC '14, 2014.
- [Mor12b] J. D. Morgenthaler, M. Gridnev, R. Sauciuc, and S. Bhansali, "Searching for Build Debt: Experiences Managing Technical Debt at Google", in Proceedings of the 3rd Int’l Workshop on Managing Technical Debt, 2012.
- [Nar12] C. Narla and D. Salas, "Hermetic Servers", blog post, 2012.
- [Nel14] B. Nelson, "The Data on Diversity", in Communications of the ACM, vol. 57, 2014.
- [Nic12] K. Nichols and V. Jacobson, "Controlling Queue Delay", in ACM Queue, vol. 10, no. 5, 2012.
- [Oco12] P. O’Connor and A. Kleyner, Practical Reliability Engineering, 5th edition: Wiley, 2012.
- [Ohn88] T. Ohno, Toyota Production System: Beyond Large-Scale Production: Productivity Press, 1988.
- [Ong14] D. Ongaro and J. Ousterhout, "In Search of an Understandable Consensus Algorithm (Extended Version)".
- [Pen10] D. Peng and F. Dabek, "Large-scale Incremental Processing Using Distributed Transactions and Notifications", in Proc. of the 9th USENIX Symposium on Operating System Design and Implementation, November 2010.
- [Per99] C. Perrow, Normal Accidents: Living with High-Risk Technologies, Princeton University Press, 1999.
- [Per07] A. R. Perry, "Engineering Reliability into Web Sites: Google SRE", in Proc. of LinuxWorld 2007, 2007.
- [Pik05] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, "Interpreting the Data: Parallel Analysis with Sawzall", in Scientific Programming Journal, vol. 13, no. 4, 2005.
- [Pot16] R. Potvin and J. Levenberg, "The Motivation for a Monolithic Codebase: Why Google stores billions of lines of code in a single repository", in Communications of the ACM, vol. 59, no. 7, 2016. Video available on YouTube.
- [Roo04] J. J. Rooney and L. N. Vanden Heuvel, "Root Cause Analysis for Beginners", in Quality Progress, July 2004.
- [Sai39] A. de Saint-Exupéry, Terre des Hommes (Paris: Le Livre de Poche, 1939), in translation by Lewis Galantière as Wind, Sand and Stars.
- [Sam14] R. R. Sambasivan, R. Fonseca, I. Shafer, and G. R. Ganger, "So, You Want To Trace Your Distributed System? Key Design Insights from Years of Practical Experience", Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-14-102, 2014.
- [San11] N. Santos and A. Schiper, "Tuning Paxos for High-Throughput with Batching and Pipelining", in 13th Int’l Conf. on Distributed Computing and Networking, 2012.
- [Sar97] N. B. Sarter, D. D. Woods, and C. E. Billings, "Automation Surprises", in Handbook of Human Factors & Ergonomics, 2nd edition, G. Salvendy (ed.), Wiley, 1997.
- [Sch14] E. Schmidt, J. Rosenberg, and A. Eagle, How Google Works: Grand Central Publishing, 2014.
- [Sch15] B. Schwartz, "The Factors That Impact Availability, Visualized", blog post, 21 December 2015.
- [Sch90] F. B. Schneider, "Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial", in ACM Computing Surveys, vol. 22, no. 4, 1990.
- [Sec13] Securities and Exchange Commission, "Order In the Matter of Knight Capital Americas LLC", file 3-15570, 2013.
- [Sha00] G. Shao, F. Berman, and R. Wolski, "Master/Slave Computing on the Grid", in Heterogeneous Computing Workshop, 2000.
- [Shu13] J. Shute et al., "F1: A Distributed SQL Database That Scales", in Proc. VLDB 2013, 2013.
- [Sig10] B. H. Sigelman et al., "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure", Google Technical Report, 2010.
- [Sin15] A. Singh et al., "Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network", in SIGCOMM '15.
- [Skel13] M. Skelton, "Operability can Improve if Developers Write a Draft Run Book", blog post, 16 October 2013.
- [Slo11] B. Treynor Sloss, "Gmail back soon for everyone", blog post, 28 February 2011.
- [Tat99] S. Tatham, "How to Report Bugs Effectively", 1999.
- [Ver15] A. Verma, L. Pedrosa, M. R. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, "Large-scale cluster management at Google with Borg", in Proceedings of the European Conference on Computer Systems, 2015.
- [Wal89] D. R. Wallace and R. U. Fujii, "Software Verification and Validation: An Overview", IEEE Software, vol. 6, no. 3 (May 1989), pp. 10–17.
- [War14] R. Ward and B. Beyer, "BeyondCorp: A New Approach to Enterprise Security", in ;login:, vol. 39, no. 6, December 2014.
- [Whi12] J. A. Whittaker, J. Arbon, and J. Carollo, How Google Tests Software: Addison-Wesley, 2012.
- [Woo96] A. Wood, "Predicting Software Reliability", in Computer, vol. 29, no. 11, 1996.
- [Wri12a] H. K. Wright, "Release Engineering Processes, Their Faults and Failures", (section 7.2.2.2) PhD Thesis, University of Texas at Austin, 2012.
- [Wri12b] H. K. Wright and D. E. Perry, "Release Engineering Practices and Pitfalls", in Proceedings of the 34th International Conference on Software Engineering (ICSE '12), (IEEE, 2012), pp. 1281–1284.
- [Wri13] H. K. Wright, D. Jasper, M. Klimek, C. Carruth, and Z. Wan, "Large-Scale Automated Refactoring Using ClangMR", in Proceedings of the 29th International Conference on Software Maintenance (ICSM '13), (IEEE, 2013), pp. 548–551.
- [Yor11] N. York, "Build in the Cloud: Accessing Source Code", Google Engineering Tools blog post, June 2011.
- [Zoo14] ZooKeeper Project (Apache Foundation), "ZooKeeper Recipes and Solutions", in ZooKeeper 3.4 documentation, 2014.