Resilience Engineering: Learning to Embrace Failure

Tom Limoncelli

ACM Queue by Tom Limoncelli · Article

"You can't choose whether or not you're going to have failures — they are going to happen no matter what — but you can choose in many cases when you're going to learn the lessons."

— Jesse Robbins

Jesse Robbins (Amazon), Kripa Krishnan (Google), and John Allspaw (Etsy) discuss how they built organizations that deliberately trigger failure to get stronger: powering off data centers, running 96-hour disaster simulations, and transforming blame cultures into learning cultures.

I'm a contributor here, not a coauthor. Tom Limoncelli moderated and wrote it. Kripa, John, and I described what we'd built. Worth saying clearly because people sometimes describe me as a coauthor and I'm not.

What I told Tom about GameDay at Amazon. What Kripa told him about Google's DiRT exercises. What John told him about blameless postmortems at Etsy. Three teams, same problem, different attacks. None of us had read each other's work first. That's the part that mattered.

The line of mine that travels: "You can't choose whether or not you're going to have failures. They are going to happen no matter what. You can choose in many cases when you're going to learn the lessons."

Three teams. Three companies. Same answer arrived at independently. That is the part of this article that still matters.

Tom Limoncelli moderated the discussion and wrote it up. Tom is extraordinarily accomplished, and his books shaped our profession as it evolved. He pulled three of us into the same room: Kripa Krishnan from Google, John Allspaw from Etsy, and me from Amazon. None of us had read each other's work first. We had built versions of the same discipline because the problem was the same.

Kripa Krishnan ran Google’s program, which they called DiRT. By 2012 she had been doing it for about six years. Her exercises ran 72 to 96 hours, with hundreds of engineers working around the clock and war rooms staffed by about fifty rotating volunteers. The details I have never forgotten are the failures that surfaced. Google brought down a network in São Paulo and watched the links die in Mexico, because nobody knew the dependency was there. In another exercise, the machines in a data center refused to come back online because they had run out of DHCP leases. Kripa had Ben Treynor as her executive sponsor. Ben is also the founder of Site Reliability Engineering (SRE) at Google. Without that air cover, none of this happens.

John Allspaw was running technical operations at Etsy after stops at Salon.com, Friendster, and Flickr, where he was engineering manager. John brought the academic frame to the conversation. He was the one who put Erik Hollnagel’s four cornerstones of resilience on the table: anticipation, monitoring, response, learning. He was the one who said the thing the industry took years to absorb. He had announced publicly that he would not fire an engineer for taking down a site he was responsible for. He described the substitution test Etsy used in postmortems. Pull in an uninvolved engineer, give them the same context, ask what they would have done. Almost every time, the answer is the same command. The problem is not the person. The Brooklyn Bridge line is his too. You do not shut down the whole bridge because one lane is out.

What I described was GameDay at Amazon, which I had started across 2003 and 2004, when horizontal scalability across unreliable hardware was not yet a settled idea. We were feeling our way. The outages were not simulated. We powered facilities off without notice and let the systems fail naturally. In one exercise I used my fire service training to script a fire scenario down to the minute, with operators posing as facilities staff calling operations with updates. I had executive support I never took for granted. Werner Vogels was my exec sponsor. Jeff Bezos thought the idea was interesting. The Amazon ops team was the village that actually did the work. Tim O’Reilly later gave this community a place to find each other in public, the Velocity Conference. I cofounded Velocity with Steve Souders after I left Amazon, and later passed the torch to John Allspaw.

The discipline framework did not come from tech. I trained as a volunteer firefighter with the Seattle Fire Department before I stepped away from tech. Fire service teaches you that you do not get good at incident response by having an opinion about it. You get good by drilling, and then drilling more, and then drilling again under conditions you do not control. Build the muscle continuously, because the only way to find the failures hiding inside a complex system is to trigger them on purpose, on your terms, before they trigger themselves on theirs.
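
As a rough illustration of what "trigger them on purpose" can look like in code, here is a minimal Python sketch of a drill that takes one component down deliberately, records what else breaks, and then restores it. This is not how GameDay or DiRT were implemented. The `Service` class, the `run_drill` function, and the three-node topology are hypothetical names invented for this example, loosely echoing the São Paulo and Mexico anecdote above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Service:
    """Toy service: healthy only if it is up and every dependency is healthy."""
    name: str
    depends_on: list[Service] = field(default_factory=list)
    up: bool = True

    def healthy(self) -> bool:
        return self.up and all(dep.healthy() for dep in self.depends_on)


def run_drill(target: Service, observed: list[Service]) -> list[str]:
    """Take `target` down on purpose, list which observed services break, then restore it."""
    target.up = False
    try:
        return [svc.name for svc in observed if not svc.healthy()]
    finally:
        # Always bring the target back, even if a health check raises.
        target.up = True


# Hypothetical topology: nobody wrote down that mexico-links rides on sao-paulo-net.
sao_paulo = Service("sao-paulo-net")
mexico = Service("mexico-links", depends_on=[sao_paulo])
us_east = Service("us-east-links")

broken = run_drill(sao_paulo, [sao_paulo, mexico, us_east])
print(broken)  # ['sao-paulo-net', 'mexico-links']
```

The drill surfaces the undocumented dependency at a time of your choosing. A real exercise adds what this toy leaves out: people, communication, and recovery under time pressure.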

That is the line I gave Tom that has traveled the furthest. You do not get to choose whether you have failures. You get to choose, in many cases, when you learn the lessons.

The convergence is what still holds up. Three organizations, no shared playbook, all arriving at the same answer. Stop trying to prevent failure. Build the muscle to absorb it. Never blame the human at the end of the chain for the system around them. What happened after this article ran is the next part of the story.

Since this came out…

  1. AWS shipped Fault Injection Service. The same patterns I ran by hand at Amazon, now a managed service anyone can call.
  2. Google shipped the SRE book. Kripa's piece of this roundtable became a chapter in how Google runs production.
  3. Casey Rosenthal and team at Netflix published the Principles of Chaos Engineering. The discipline got a name, a manifesto, and a community.
  4. Ex-Amazon engineers at Netflix turned Chaos Monkey into the Simian Army. They automated what we did by hand.
