June 6, 2020
Volume 18, issue 2

Debugging Incidents in Google's Distributed Systems

How experts debug production issues in complex distributed systems

Charisma Chan and Beth Cooper

Google has published two books about SRE (Site Reliability Engineering) principles, best practices, and practical applications.1,2 In the heat of the moment when handling a production incident, however, a team's act