Resilience Engineering: Learning to Embrace Failure

ACM Queue by Tom Limoncelli · September 12, 2012 · Article

"You can't choose whether or not you're going to have failures — they are going to happen no matter what — but you can choose in many cases when you're going to learn the lessons."

Jesse Robbins (Amazon), Kripa Krishnan (Google), and John Allspaw (Etsy) discuss how they built organizations that deliberately trigger failure to get stronger: powering off data centers, running 96-hour disaster simulations, and transforming blame cultures into learning cultures.

In this ACM Queue roundtable, Jesse Robbins (Amazon), John Allspaw (Etsy), and Kripa Krishnan (Google) join moderator Tom Limoncelli (Google) to discuss the principles of resilience engineering: deliberately triggering failure to build stronger systems.

Robbins describes how Amazon’s GameDay exercises would literally power off data centers without warning to expose latent defects, drawing on his training as a firefighter to design realistic disaster simulations. Krishnan details Google’s equivalent: 72-to-96-hour exercises with hundreds of engineers working around the clock, testing everything from full data-center destruction to rendering entire teams incommunicado. Allspaw argues for blameless postmortems and introduces the “substitution test,” proving that almost any engineer would have made the same decision in context.

The discussion traces a cultural transformation: from organizations where failure meant blame and firing, to ones where deliberately triggering failure became the path to building truly resilient systems. As Robbins puts it: “You can’t choose whether or not you’re going to have failures, they are going to happen no matter what, but you can choose in many cases when you’re going to learn the lessons.”

More Mentions

GeekWire

Q&A: Ex-Amazon 'Master of Disaster' Jesse Robbins on the Power of 'Relentless Optimism' in Startups

October 27, 2012

GeekWire profiles Jesse Robbins, CEO of Opscode, whose motto 'don't fight stupid, make more awesome' drives a company culture founded on relentless positivity, and traces his path from firefighting to Amazon to building enterprise software.

“When you're trying to change the way big organizations work, a lot of people say no a lot. Rather than try to fight them, you've got to find a way to make them say yes. Being a force for awesome in the world is finding ways to say yes.”

— Jesse Robbins

Thoughtworks

Jesse Robbins Discusses DevOps and Cloud Computing

July 20, 2012 · Video

Jez Humble interviews Jesse Robbins on DevOps, continuous delivery, measuring operations maturity, and infrastructure as code with Chef. Part of a Thoughtworks series with Eric Ries, Elizabeth Hendrickson, and John Allspaw.
