Resilience Engineering: Learning to Embrace Failure
ACM Queue by Tom Limoncelli · Article
"You can't choose whether or not you're going to have failures — they are going to happen no matter what — but you can choose in many cases when you're going to learn the lessons."
Jesse Robbins (Amazon), Kripa Krishnan (Google), and John Allspaw (Etsy) discuss how they built organizations that deliberately trigger failure to get stronger — powering off data centers, running 96-hour disaster simulations, and transforming blame cultures into learning cultures.
This ACM Queue roundtable brings together Jesse Robbins (Amazon), John Allspaw (Etsy), Kripa Krishnan (Google), and moderator Tom Limoncelli (Google) to discuss the principles of resilience engineering: deliberately triggering failure to build stronger systems.
Robbins describes how Amazon’s GameDay exercises powered off data centers without warning to expose latent defects, and how his training as a firefighter shaped the design of realistic disaster simulations. Krishnan details Google’s equivalent: 72-to-96-hour exercises in which hundreds of engineers work around the clock, testing scenarios that range from full data-center destruction to rendering entire teams incommunicado. Allspaw argues for blameless postmortems and introduces the “substitution test”, a way of showing that almost any engineer would have made the same decision in the same context.
The discussion traces a cultural transformation: from organizations where failure meant blame and firing, to ones where deliberately triggering failure became the path to building truly resilient systems. As Robbins puts it: “You can’t choose whether or not you’re going to have failures — they are going to happen no matter what — but you can choose in many cases when you’re going to learn the lessons.”