Bibliography
[Ada15] Bram Adams, Stephany Bellomo, Christian Bird, Tamara Marshall-Keim, Foutse Khomh, and Kim Moir, "The Practice and Future of Release Engineering: A Roundtable with Three Release Engineers" , IEEE Software , vol. 32, no. 2 (March/April 2015), pp. 42–49.
[Agu10] M. K. Aguilera, "Stumbling over Consensus Research: Misunderstandings and Issues" , in Replication , Lecture Notes in Computer Science 5959, 2010.
[All10] J. Allspaw and J. Robbins, Web Operations: Keeping the Data on Time : O’Reilly, 2010.
[All12] J. Allspaw, "Blameless PostMortems and a Just Culture" , blog post, 2012.
[All15] J. Allspaw, "Trade-Offs Under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages" , MSc thesis, Lund University, 2015.
[Ana07] S. Anantharaju, "Automating web application security testing" , blog post, July 2007.
[Ana13] R. Ananthanarayanan et al., "Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams" , in SIGMOD '13 , 2013.
[And05] A. Andrieux, K. Czajkowski, A. Dan, et al., "Web Services Agreement Specification (WS-Agreement)" , September 2005.
[Bai13] P. Bailis and A. Ghodsi, "Eventual Consistency Today: Limitations, Extensions, and Beyond" , in ACM Queue , vol. 11, no. 3, 2013.
[Bai83] L. Bainbridge, "Ironies of Automation" , in Automatica , vol. 19, no. 6, November 1983.
[Bak11] J. Baker et al., "Megastore: Providing Scalable, Highly Available Storage for Interactive Services" , in Proceedings of the Conference on Innovative Data System Research , 2011.
[Bar11] L. A. Barroso, "Warehouse-Scale Computing: Entering the Teenage Decade" , talk at 38th Annual Symposium on Computer Architecture, video available online, 2011.
[Bar13] L. A. Barroso, J. Clidaras, and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition , Morgan & Claypool, 2013.
[Ben12] C. Bennett and A. Tseitlin, "Chaos Monkey Released Into The Wild" , blog post, July 2012.
[Bla14] M. Bland, "Goto Fail, Heartbleed, and Unit Testing Culture" , blog post, June 2014.
[Boc15] L. Bock, Work Rules! , Twelve Books, 2015.
[Bol11] W. J. Bolosky, D. Bradshaw, R. B. Haagens, N. P. Kusters, and P. Li, "Paxos Replicated State Machines as the Basis of a High-Performance Data Store" , in Proc. NSDI 2011 , 2011.
[Boy13] P. G. Boysen, "Just Culture: A Foundation for Balanced Accountability and Patient Safety" , in The Ochsner Journal , Fall 2013.
[Bra15] VM Brasseur, "Failure: Why it happens & How to benefit from it" , YAPC 2015.
[Bre01] E. Brewer, "Lessons From Giant-Scale Services" , in IEEE Internet Computing , vol. 5, no. 4, July / August 2001.
[Bre12] E. Brewer, "CAP Twelve Years Later: How the 'Rules' Have Changed" , in Computer , vol. 45, no. 2, February 2012.
[Bro15] M. Brooker, "Exponential Backoff and Jitter" , on AWS Architecture Blog , March 2015.
[Bro95] F. P. Brooks Jr., "No Silver Bullet—Essence and Accidents of Software Engineering", in The Mythical Man-Month , Boston: Addison-Wesley, 1995, pp. 180–186.
[Bru09] J. Brutlag, "Speed Matters" , on Google Research Blog , June 2009.
[Bul80] G. M. Bull, The Dartmouth Time-sharing System : Ellis Horwood, 1980.
[Bur99] M. Burgess, Principles of Network and System Administration : Wiley, 1999.
[Bur06] M. Burrows, "The Chubby Lock Service for Loosely-Coupled Distributed Systems" , in OSDI '06: Seventh Symposium on Operating System Design and Implementation , November 2006.
[Bur16] B. Burns, B. Grant, D. Oppenheimer, E. Brewer, and J. Wilkes, "Borg, Omega, and Kubernetes" , in ACM Queue , vol. 14, no. 1, 2016.
[Cas99] M. Castro and B. Liskov, "Practical Byzantine Fault Tolerance" , in Proc. OSDI 1999 , 1999.
[Cha10] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. Henry, R. Bradshaw, and N. Weizenbaum, "FlumeJava: Easy, Efficient Data-Parallel Pipelines" , in ACM SIGPLAN Conference on Programming Language Design and Implementation , 2010.
[Cha96] T. D. Chandra and S. Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems" , in J. ACM , 1996.
[Cha07] T. Chandra, R. Griesemer, and J. Redstone, "Paxos Made Live—An Engineering Perspective" , in PODC '07: 26th ACM Symposium on Principles of Distributed Computing , 2007.
[Cha06] F. Chang et al., "Bigtable: A Distributed Storage System for Structured Data" , in OSDI '06: Seventh Symposium on Operating System Design and Implementation , November 2006.
[Chr09] G. P. Chrousos, "Stress and Disorders of the Stress System" , in Nature Reviews Endocrinology , vol. 5, no. 7, 2009.
[Clos53] C. Clos, "A Study of Non-Blocking Switching Networks" , in Bell System Technical Journal , vol. 32, no. 2, 1953.
[Con15] C. Contavalli, W. van der Gaast, D. Lawrence, and W. Kumari, "Client Subnet in DNS Queries" , IETF Internet-Draft , 2015.
[Con63] M. E. Conway, "Design of a Separable Transition-Diagram Compiler" , in Commun. ACM 6, 7 (July 1963), 396–408.
[Con96] P. Conway, "Preservation in the Digital World" , report published by the Council on Library and Information Resources, 1996.
[Coo00] R. I. Cook, "How Complex Systems Fail" , in Web Operations : O’Reilly, 2010.
[Cor12] J. C. Corbett et al., "Spanner: Google’s Globally-Distributed Database" , in OSDI '12: Tenth Symposium on Operating System Design and Implementation , October 2012.
[Cra10] J. Cranmer, "Visualizing code coverage" , blog post, March 2010.
[Dea13] J. Dean and L. A. Barroso, "The Tail at Scale" , in Communications of the ACM , vol. 56, 2013.
[Dea04] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters" , in OSDI '04: Sixth Symposium on Operating System Design and Implementation , December 2004.
[Dea07] J. Dean, "Software Engineering Advice from Building Large-Scale Distributed Systems" , Stanford CS297 class lecture, Spring 2007.
[Dek02] S. Dekker, "Reconstructing human contributions to accidents: the new view on error and performance" , in Journal of Safety Research , vol. 33, no. 3, 2002.
[Dek14] S. Dekker, The Field Guide to Understanding "Human Error" , 3rd edition: Ashgate, 2014.
[Dic14] C. Dickson, "How Embracing Continuous Release Reduced Change Complexity" , presentation at USENIX Release Engineering Summit West 2014, video available online.
[Dur05] J. Durmer and D. Dinges, "Neurocognitive Consequences of Sleep Deprivation" , in Seminars in Neurology , vol. 25, no. 1, 2005.
[Eis16] D. E. Eisenbud et al., "Maglev: A Fast and Reliable Software Network Load Balancer" , in NSDI '16: 13th USENIX Symposium on Networked Systems Design and Implementation, March 2016.
[Ere03] J. R. Erenkrantz, "Release Management Within Open Source Projects" , in Proceedings of the 3rd Workshop on Open Source Software Engineering , Portland, Oregon, May 2003.
[Fis85] M. J. Fischer, N. A. Lynch, and M. S. Paterson, "Impossibility of Distributed Consensus with One Faulty Process" , J. ACM , 1985.
[Fit12] B. W. Fitzpatrick and B. Collins-Sussman, Team Geek: A Software Developer’s Guide to Working Well with Others : O’Reilly, 2012.
[Flo94] S. Floyd and V. Jacobson, "The Synchronization of Periodic Routing Messages" , in IEEE/ACM Transactions on Networking, vol. 2, issue 2, April 1994, pp. 122–136.
[For10] D. Ford et al., "Availability in Globally Distributed Storage Systems" , in Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation , 2010.
[Fox99] A. Fox and E. A. Brewer, "Harvest, Yield, and Scalable Tolerant Systems" , in Proceedings of the 7th Workshop on Hot Topics in Operating Systems , Rio Rico, Arizona, March 1999.
[Fow08] M. Fowler, "GUI Architectures" , blog post, 2006.
[Gal78] J. Gall, SYSTEMANTICS: How Systems Really Work and How They Fail , 1st ed., Pocket, 1977.
[Gal03] J. Gall, The Systems Bible: The Beginner’s Guide to Systems Large and Small , 3rd ed., General Systemantics Press/Liberty, 2003.
[Gaw09] A. Gawande, The Checklist Manifesto: How to Get Things Right : Henry Holt and Company, 2009.
[Ghe03] S. Ghemawat, H. Gobioff, and S-T. Leung, "The Google File System" , in 19th ACM Symposium on Operating Systems Principles , October 2003.
[Gil02] S. Gilbert and N. Lynch, "Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" , in ACM SIGACT News , vol. 33, no. 2, 2002.
[Gla02] R. Glass, Facts and Fallacies of Software Engineering , Addison-Wesley Professional, 2002.
[Gol14] W. Golab et al., "Eventually Consistent: Not What You Were Expecting?" , in ACM Queue , vol. 12, no. 1, 2014.
[Gra09] P. Graham, "Maker’s Schedule, Manager’s Schedule" , blog post, July 2009.
[Gup15] A. Gupta and J. Shute, "High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads" , in Workshop on Business Intelligence for the Real Time Enterprise , 2015.
[Ham07] J. Hamilton, "On Designing and Deploying Internet-Scale Services" , in Proceedings of the 21st Large Installation System Administration Conference , November 2007.
[Han94] S. Hanks, T. Li, D. Farinacci, and P. Traina, "Generic Routing Encapsulation over IPv4 networks" , IETF Informational RFC , 1994.
[Hic11] M. Hickins, "Tape Rescues Google in Lost Email Scare" , in Digits , Wall Street Journal , 1 March 2011.
[Hix15a] D. Hixson, "Capacity Planning" , in ;login: , vol. 40, no. 1, February 2015.
[Hix15b] D. Hixson, "The Systems Engineering Side of Site Reliability Engineering" , in ;login: , vol. 40, no. 3, June 2015.
[Hod13] J. Hodges, "Notes on Distributed Systems for Young Bloods" , blog post, 14 January 2013.
[Hol14] L. Holmwood, "Applying Cardiac Alarm Management Techniques to Your On-Call" , blog post, 26 August 2014.
[Hum06] J. Humble, C. Read, and D. North, "The Deployment Production Line", in Proceedings of the IEEE Agile Conference , July 2006.
[Hum10] J. Humble and D. Farley, Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation : Addison-Wesley, 2010.
[Hun10] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, "ZooKeeper: Wait-free coordination for Internet-scale systems" , in USENIX ATC , 2010.
[IAEA12] International Atomic Energy Agency, "Safety of Nuclear Power Plants: Design, SSR-2/1" , 2012.
[Jai13] S. Jain et al., "B4: Experience with a Globally-Deployed Software Defined WAN" , in SIGCOMM '13 .
[Jon15] C. Jones, T. Underwood, and S. Nukala, "Hiring Site Reliability Engineers" , in ;login: , vol. 40, no. 3, June 2015.
[Jun07] F. Junqueira, Y. Mao, and K. Marzullo, "Classic Paxos vs. Fast Paxos: Caveat Emptor" , in Proc. HotDep '07 , 2007.
[Jun11] F. P. Junqueira, B. C. Reed, and M. Serafini, "Zab: High-performance broadcast for primary-backup systems" , in Dependable Systems & Networks (DSN), 2011 IEEE/IFIP 41st International Conference on 27 Jun 2011: 245–256.
[Kah11] D. Kahneman, Thinking, Fast and Slow : Farrar, Straus and Giroux, 2011.
[Kar97] D. Karger et al., "Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web" , in Proc. STOC '97 , 29th annual ACM symposium on theory of computing, 1997.
[Kem11] C. Kemper, "Build in the Cloud: How the Build System Works" , Google Engineering Tools blog post, August 2011.
[Ken12] S. Kendrick, "What Takes Us Down?" , in ;login: , vol. 37, no. 5, October 2012.
[Kinc09] J. Kincaid, "T-Mobile Sidekick Disaster: Danger’s Servers Crashed, And They Don’t Have A Backup" , in TechCrunch , 10 October 2009, accessed 20 January 2015, https://techcrunch.com/2009/10/10/t-mobile-sidekick-disaster-microsofts-servers-crashed-and-they-dont-have-a-backup
[Kin15] K. Kingsbury, "The trouble with timestamps" , blog post, 2013.
[Kir08] J. Kirsch and Y. Amir, "Paxos for System Builders: An Overview" , in Proc. LADIS '08 , 2008.
[Kla12] R. Klau, "How Google Sets Goals: OKRs" , blog post, October 2012.
[Kle06] D. V. Klein, "A Forensic Analysis of a Distributed Two-Stage Web-Based Spam Attack" , in Proceedings of the 20th Large Installation System Administration Conference , December 2006.
[Kle14] D. V. Klein, D. M. Betser, and M. G. Monroe, "Making Push On Green a Reality" , in ;login: , vol. 39, no. 5, October 2014.
[Kra08] T. Krattenmaker, "Make Every Meeting Matter" , in Harvard Business Review , February 27, 2008.
[Kre12] J. Kreps, "Getting Real About Distributed System Reliability" , blog post, 19 March 2012.
[Kri12] K. Krishnan, "Weathering The Unexpected" , in Communications of the ACM , vol. 55, no. 11, November 2012.
[Kum15] A. Kumar et al., "BwE: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing" , in SIGCOMM '15 .
[Lam98] L. Lamport, "The Part-Time Parliament" , in ACM Transactions on Computer Systems 16, 2 , May 1998.
[Lam01] L. Lamport, "Paxos Made Simple" , in ACM SIGACT News 121, December 2001.
[Lam06] L. Lamport, "Fast Paxos" , in Distributed Computing 19.2, October 2006.
[Lim14] T. A. Limoncelli, S. R. Chalup, and C. J. Hogan, The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, Volume 2 : Addison-Wesley, 2014.
[Loo10] J. Loomis, "How to Make Failure Beautiful: The Art and Science of Postmortems", in Web Operations : O’Reilly, 2010.
[Lu15] H. Lu et al., "Existential Consistency: Measuring and Understanding Consistency at Facebook" , in SOSP '15, 2015.
[Mao08] Y. Mao, F. P. Junqueira, and K. Marzullo, "Mencius: Building Efficient Replicated State Machines for WANs" , in OSDI '08 , 2008.
[Mas43] A. H. Maslow, "A Theory of Human Motivation", in Psychological Review 50(4), 1943.
[Mau15] B. Maurer, "Fail at Scale" , in ACM Queue , vol. 13, no. 12, 2015.
[May09] M. Mayer, "This site may harm your computer on every search result?!?!" , blog post, January 2009.
[McI86] M. D. McIlroy, "A Research Unix Reader: Annotated Excerpts from the Programmer’s Manual, 1971–1986" .
[McN13] D. McNutt, "Maintaining Consistency in a Massively Parallel Environment" , presentation at USENIX Configuration Management Summit 2013, video available online.
[McN14a] D. McNutt, "Accelerating the Path from Dev to DevOps" , in ;login: , vol. 39, no. 2, April 2014.
[McN14b] D. McNutt, "The 10 Commandments of Release Engineering" , presentation at 2nd International Workshop on Release Engineering 2014, April 2014.
[McN14c] D. McNutt, "Distributing Software in a Massively Parallel Environment" , presentation at USENIX LISA 2014, video available online.
[Mic03] Microsoft TechNet, "What is SNMP?", last modified March 28, 2003, https://technet.microsoft.com/en-us/library/cc776379%28v=ws.10%29.aspx .
[Mea08] D. Meadows, Thinking in Systems : Chelsea Green, 2008.
[Men07] P. Menage, "Adding Generic Process Containers to the Linux Kernel" , in Proc. of Ottawa Linux Symposium , 2007.
[Mer11] N. Merchant, "Culture Trumps Strategy, Every Time" , in Harvard Business Review , March 22, 2011.
[Moc87] P. Mockapetris, "Domain Names - Implementation and Specification" , IETF Internet Standard , 1987.
[Mol86] C. Moler, "Matrix Computation on Distributed Memory Multiprocessors", in Hypercube Multiprocessors 1986 , 1987.
[Mor12a] I. Moraru, D. G. Andersen, and M. Kaminsky, "Egalitarian Paxos" , Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-12-108 , 2012.
[Mor14] I. Moraru, D. G. Andersen, and M. Kaminsky, "Paxos Quorum Leases: Fast Reads Without Sacrificing Writes" , in Proc. SOCC '14 , 2014.
[Mor12b] J. D. Morgenthaler, M. Gridnev, R. Sauciuc, and S. Bhansali, "Searching for Build Debt: Experiences Managing Technical Debt at Google" , in Proceedings of the 3rd Int’l Workshop on Managing Technical Debt , 2012.
[Nar12] C. Narla and D. Salas, "Hermetic Servers" , blog post, 2012.
[Nel14] B. Nelson, "The Data on Diversity" , in Communications of the ACM , vol. 57, 2014.
[Nic12] K. Nichols and V. Jacobson, "Controlling Queue Delay" , in ACM Queue , vol. 10, no. 5, 2012.
[Oco12] P. O’Connor and A. Kleyner, Practical Reliability Engineering , 5th edition: Wiley, 2012.
[Ohn88] T. Ohno, Toyota Production System: Beyond Large-Scale Production : Productivity Press, 1988.
[Ong14] D. Ongaro and J. Ousterhout, "In Search of an Understandable Consensus Algorithm (Extended Version)" .
[Pen10] D. Peng and F. Dabek, "Large-scale Incremental Processing Using Distributed Transactions and Notifications" , in Proc. of the 9th USENIX Symposium on Operating System Design and Implementation , November 2010.
[Per99] C. Perrow, Normal Accidents: Living with High-Risk Technologies , Princeton University Press, 1999.
[Per07] A. R. Perry, "Engineering Reliability into Web Sites: Google SRE" , in Proc. of LinuxWorld 2007 , 2007.
[Pik05] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, "Interpreting the Data: Parallel Analysis with Sawzall" , in Scientific Programming Journal vol. 13, no. 4, 2005.
[Pot16] R. Potvin and J. Levenberg, "The Motivation for a Monolithic Codebase: Why Google stores billions of lines of code in a single repository" , in Communications of the ACM , vol. 59, no. 7, 2016. Video available on YouTube .
[Roo04] J. J. Rooney and L. N. Vanden Heuvel, "Root Cause Analysis for Beginners" , in Quality Progress , July 2004.
[Sai39] A. de Saint Exupéry, Terre des Hommes (Paris: Le Livre de Poche, 1939), in translation by Lewis Galantière as Wind, Sand and Stars .
[Sam14] R. R. Sambasivan, R. Fonseca, I. Shafer, and G. R. Ganger, "So, You Want To Trace Your Distributed System? Key Design Insights from Years of Practical Experience" , Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-14-102, 2014.
[San11] N. Santos and A. Schiper, "Tuning Paxos for High-Throughput with Batching and Pipelining" , in 13th Int’l Conf. on Distributed Computing and Networking , 2012.
[Sar97] N. B. Sarter, D. D. Woods, and C. E. Billings, "Automation Surprises", in Handbook of Human Factors & Ergonomics , 2nd edition, G. Salvendy (ed.), Wiley, 1997.
[Sch14] E. Schmidt, J. Rosenberg, and A. Eagle, How Google Works : Grand Central Publishing, 2014.
[Sch15] B. Schwartz, "The Factors That Impact Availability, Visualized" , blog post, 21 December 2015.
[Sch90] F. B. Schneider, "Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial" , in ACM Computing Surveys , vol. 22, no. 4, 1990.
[Sec13] Securities and Exchange Commission, "Order In the Matter of Knight Capital Americas LLC" , file 3-15570, 2013.
[Sha00] G. Shao, F. Berman, and R. Wolski, "Master/Slave Computing on the Grid" , in Heterogeneous Computing Workshop , 2000.
[Shu13] J. Shute et al., "F1: A Distributed SQL Database That Scales" , in Proc. VLDB 2013 , 2013.
[Sig10] B. H. Sigelman et al., "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" , Google Technical Report, 2010.
[Sin15] A. Singh et al., "Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network" , in SIGCOMM '15 .
[Skel13] M. Skelton, "Operability can Improve if Developers Write a Draft Run Book" , blog post, 16 October 2013.
[Slo11] B. Treynor Sloss, "Gmail back soon for everyone" , blog post, 28 February 2011.
[Tat99] S. Tatham, "How to Report Bugs Effectively" , 1999.
[Ver15] A. Verma, L. Pedrosa, M. R. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, "Large-scale cluster management at Google with Borg" , in Proceedings of the European Conference on Computer Systems , 2015.
[Wal89] D. R. Wallace and R. U. Fujii, "Software Verification and Validation: An Overview" , IEEE Software , vol. 6, no. 3 (May 1989), pp. 10–17.
[War14] R. Ward and B. Beyer, "BeyondCorp: A New Approach to Enterprise Security" , in ;login: , vol. 39, no. 6, December 2014.
[Whi12] J. A. Whittaker, J. Arbon, and J. Carollo, How Google Tests Software : Addison-Wesley, 2012.
[Woo96] A. Wood, "Predicting Software Reliability" , in Computer , vol. 29, no. 11, 1996.
[Wri12a] H. K. Wright, "Release Engineering Processes, Their Faults and Failures" , (section 7.2.2.2) PhD Thesis, University of Texas at Austin, 2012.
[Wri12b] H. K. Wright and D. E. Perry, "Release Engineering Practices and Pitfalls" , in Proceedings of the 34th International Conference on Software Engineering (ICSE '12) , (IEEE, 2012), pp. 1281–1284.
[Wri13] H. K. Wright, D. Jasper, M. Klimek, C. Carruth, and Z. Wan, "Large-Scale Automated Refactoring Using ClangMR" , in Proceedings of the 29th International Conference on Software Maintenance (ICSM '13) , (IEEE, 2013), pp. 548–551.
[Yor11] N. York, "Build in the Cloud: Accessing Source Code" , Google Engineering Tools blog post, June 2011.
[Zoo14] ZooKeeper Project (Apache Foundation), "ZooKeeper Recipes and Solutions" , in ZooKeeper 3.4 documentation, 2014.