Resilience Engineering: Learning to Embrace Failure

ACM Queue · by Tom Limoncelli · Article

"You can't choose whether or not you're going to have failures — they are going to happen no matter what — but you can choose in many cases when you're going to learn the lessons."

— Jesse Robbins

Jesse Robbins (Amazon), Kripa Krishnan (Google), and John Allspaw (Etsy) discuss how they built organizations that deliberately trigger failure to get stronger — powering off data centers, running 96-hour disaster simulations, and transforming blame cultures into learning cultures.

In this ACM Queue roundtable, Jesse Robbins (Amazon), John Allspaw (Etsy), and Kripa Krishnan (Google) join moderator Tom Limoncelli (Google) to discuss the principles of resilience engineering: deliberately triggering failure to build stronger systems.

Robbins describes how Amazon’s GameDay exercises powered off data centers without warning to expose latent defects, and how he drew on his training as a firefighter to design realistic disaster simulations. Krishnan details Google’s equivalent: 72-to-96-hour exercises with hundreds of engineers working around the clock, testing everything from full data-center destruction to rendering entire teams incommunicado. Allspaw argues for blameless postmortems and introduces the “substitution test,” which shows that almost any engineer would have made the same decision in context.

The discussion traces a cultural transformation: from organizations where failure meant blame and firing, to ones where deliberately triggering failure became the path to building truly resilient systems. As Robbins puts it: “You can’t choose whether or not you’re going to have failures — they are going to happen no matter what — but you can choose in many cases when you’re going to learn the lessons.”