"You don't choose the moment, the moment chooses you. You only choose how prepared you are when it does."
In this USENIX LISA'11 talk, Jesse Robbins explains GameDay: deliberately injecting failures into production systems to build organizational resilience before real outages happen.
This is the talk where I laid out GameDay for the first time to a room full of sysadmins. USENIX LISA 2011 in Boston. I had been running these exercises at Amazon for years, and I had been telling the story at Velocity and in smaller rooms, but this was the first time I walked through the full methodology from beginning to end on a conference stage.
I opened with the story my fire chief told me on my first day of firefighter academy. He said when you go home tonight and tell your neighbor you are becoming a firefighter, something changes. At 2:00 in the morning when their kid starts choking, they will not dial 911. They will pound on your door with a slumped over kid, looking to you to do something. “Welcome to the other side of 911.” And then he said the thing that set the arc of my entire career: “You don’t choose the moment. The moment chooses you. You only choose how prepared you are when it does.”
That is operations. We are the ones they call. We are the ones people look to when things are broken.
The core argument of the talk is simple. Resilience is not a property of your software. It is a property of your entire system, and the system includes people, culture, processes, applications, infrastructure, and hardware. People and culture are the most important part. That is weird for a room full of us who tend to shy away from human interaction, but it is the truth. GameDay is about changing people to be resilient.
I created GameDay at Amazon, which I joined in 2001 while testing for the Seattle Fire Department. My title was “Master of Disaster.” I owned website availability for every property that bore the Amazon name. I realized the way we were running operations was not going to scale, and I began adapting fire service incident management processes, training, drilling, and fire prevention concepts to make Amazon operate like a fire department. I substituted the word “management” for “command” and created an incident management system that was a word-for-word adaptation of the fire department’s incident command system. Werner Vogels was my internal executive sponsor.
The methodology has three stages. First, preparation. You identify and mitigate risks. You walk through the system and find the stupid stuff. The oil barrels next to the core database servers. The single points of failure nobody talks about. This alone reduces failure frequency and improves recovery time.
Second, you run the drill. This is where most organizations fail. They do a lightweight tabletop exercise, declare victory, and never go further. If you never go all the way to a full-scale live failure exercise, you do not get the confidence that comes from responding to a stressful situation at full speed. As a firefighter, I have been through countless live fire drills. The fire will kill you. People die in training. I can only be an effective firefighter having fought fire. You also discover that systems and processes you thought would work do not. Cell phone systems have single points of failure. Nobody has a printout of everyone’s phone number. The most basic things only surface under real stress.
Third, you expose latent defects. These are the impossible failures. The ones that cannot happen until they do. They sit underneath the waterline of your systems, and you cannot discover them any other way. They are immediately recognizable in hindsight. “Oh, we totally should have known there was a dependency on that developer’s desktop.” You do not get to choose whether you have latent defects. You do get to choose when you discover some of them. Run the exercise off-peak rather than finding out on your most important traffic day.
The progression matters. You start small. You work with the smallest group of developers who are receptive, and you break something that is a little scary but not too scary. Think of the kid with the fire hose. These hoses produce 80 to 100 pounds of back pressure. You do not hand a recruit a high-pressure fire line on day one. You give them a garden hose, a small pan fire, and you let them succeed. Then they tell everyone how awesome it was. You build on those successes.
Then you move up to the full-scale live fire exercise. You pick the worst survivable scenario. I recommend a full data center power-down. It will terrify everyone. Everyone will hate you during the planning phase. It is going to be good for them. You give them a couple of months' notice. You tell them the date, the time, the facility. They have months to remediate. And then the week comes, and people ask if you are really going to power it down. Yes. You are really going to power it down. You can slip a date, but never cancel. Otherwise no one will ever believe you again. I always power it down.
The first time will be a disaster. You will learn more from that one exercise than from years of tabletop reviews. Probably you will learn that your database masters do not come back up after an EPO reset. You will be glad you chose when to learn that.
The reason all of this works is the OODA loop, the observe-orient-decide-act cycle that fighter pilot John Boyd described. In any crisis, you follow a predictable response. You observe what is going on. You orient yourself based on your training, your experience, whether you have been exposed to situations like this before. If you have never been through a full-scale outage, you will lock up. There are predictable failure modes. I can tell you what you will do. The only reason I am different is that I have been through it a lot.
GameDay became an internal competitive advantage at Amazon. One team would say, “Go ahead, rip it out. We can power stuff off all day.” The other teams wanted to get there. That is the currency for change that John Allspaw talks about. Not the crisis-driven kind where you have a terrible outage and a short window to make sweeping change. The better kind. The kind where people say, “We are better because this happened.”
Every large-scale web operation has since learned some version of this or perished. Google adopted it. Engineers who had been at Amazon brought the practice to Netflix and built Chaos Monkey, which randomly terminated instances in production. They expanded it into the Simian Army and later Chaos Kong for regional failover testing. Facebook, Yahoo, and dozens of others built their own programs.
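If you want to see the shape of that idea in code, here is a minimal sketch of a Chaos Monkey-style terminator. This is not Netflix's implementation; the boto3 calls are standard EC2 APIs, but the "chaos-opt-in" tag, the region, and the dry-run default are assumptions made purely for illustration.

```python
# Illustrative sketch of a Chaos Monkey-style instance terminator.
# Not Netflix's implementation: the "chaos-opt-in" tag and dry-run default
# are assumptions for this example.
import random

import boto3
from botocore.exceptions import ClientError


def terminate_random_instance(region="us-east-1", dry_run=True):
    ec2 = boto3.client("ec2", region_name=region)
    # Only consider running instances that have explicitly opted in to the exercise.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-opt-in", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        print("No opted-in instances found; nothing to terminate.")
        return None
    victim = random.choice(instances)
    print(f"Terminating {victim} (dry_run={dry_run})")
    try:
        # DryRun=True checks permissions without actually killing the instance.
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
    except ClientError as err:
        # A successful dry run raises DryRunOperation rather than succeeding.
        if err.response["Error"]["Code"] != "DryRunOperation":
            raise
    return victim


if __name__ == "__main__":
    terminate_random_instance(dry_run=True)
```

The opt-in tag and dry-run default reflect the same progression the talk describes: you make it safe to play before you make it scary.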
AWS turned GameDay into a core operational practice and eventually a customer-facing program. The Well-Architected Framework now defines “game day” as a formal reliability concept. AWS Fault Injection Service is a managed chaos engineering service that automates the kind of fault injection I was doing by hand. The FIS team runs their own game days using the service they built. That is the right kind of recursion.
The practices I described in this talk (controlled fault injection, pre-announced failure exercises, progressive escalation, and blameless post-incident review) became core patterns in what the industry later formalized as chaos engineering.
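To make "controlled fault injection" concrete, here is a minimal sketch of the pattern under assumptions of my own: a bounded fault window, a health check that acts as the stop condition, and a guaranteed revert. The inject_fault, revert_fault, and system_healthy hooks, the replica.service unit, and the /healthz endpoint are all hypothetical, not anything from the talk.

```python
# Minimal sketch of a controlled fault-injection window with a stop condition.
# The hooks and names below (inject_fault, revert_fault, system_healthy,
# replica.service, /healthz) are hypothetical examples.
import time


def run_gameday(inject_fault, revert_fault, system_healthy,
                duration_s=300, check_interval_s=10):
    """Hold a fault open for a bounded window, aborting early if the agreed
    stop condition (a failing health check) is hit, and always reverting."""
    inject_fault()
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline:
            if not system_healthy():
                print("Stop condition hit; ending the exercise early.")
                break
            time.sleep(check_interval_s)
    finally:
        # Restore the system even if the exercise itself throws.
        revert_fault()


if __name__ == "__main__":
    import subprocess
    import urllib.request

    def inject_fault():
        # Stop one replica for the duration of the window.
        subprocess.run(["systemctl", "stop", "replica.service"], check=True)

    def revert_fault():
        subprocess.run(["systemctl", "start", "replica.service"], check=True)

    def system_healthy():
        try:
            resp = urllib.request.urlopen("http://localhost:8080/healthz", timeout=5)
            return resp.status == 200
        except OSError:
            return False

    run_gameday(inject_fault, revert_fault, system_healthy, duration_s=300)
```

The important part is the shape, not the specific fault: a pre-agreed blast radius, a way to know when to stop, and a revert path that runs no matter what.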
While I'm a founder of Opscode, I'm not going to talk about automation today. I'm going to talk about a thing called GameDay, a program I created to increase availability using large-scale fault injection. It's certainly not new to the world, but it is new to web operations.
To begin, I want to tell a story about my very first day as a firefighter in firefighter academy. I took a brief break from being in tech to go become a firefighter, and I ended up getting pulled back into tech. On the very first day of firefighter academy, my chief told me something that has changed my entire life and set the arc of my career development. He said, "When you go home tonight and you tell your neighbor that you're becoming a firefighter, all they will hear is you're going to be a firefighter, and something changes for you in the world. At 2:00 in the morning when their kid starts choking, instead of dialing 911, what they will do is come and pound on your door with a slumped over kid, looking to you to do something. You need to do something." And he said, "Welcome to the other side of 911."
Then he said, "For the rest of your life, remember that you don't choose the moment. The moment chooses you. You only get to choose how prepared you are when it does."
That is operations to me. That is what we do. We're the ones they call. We're the ones that people look to to have the answers when things are broken. And we're the ones who at least initially make a difference in how we're prepared inside of an organization for all sorts of things.
I believe very deeply that our work in operations is work that matters. It is a noble calling. And it is important to remember, when we think about what we do organizationally, that more and more of the world depends on the web, and so more and more of the world depends on us. That means we have to think differently about how we approach our jobs as the world's infrastructure shifts. People now need us in a way they didn't when we were just maintaining hosts. Now we're maintaining infrastructures, and those infrastructures are the fabric of society.
I'm going to talk to you today about how to make that infrastructure more reliable through something we call GameDay. Very simply, GameDay is an exercise designed to increase resilience through the injection of large-scale, catastrophic-level faults into the critical systems that we depend on. It is part of a larger discipline called resilience engineering, which has not historically been directly applied to the large-scale systems administration space, although there are a number of people, including John Allspaw and Artur Bergman and Paul Hammond and a number of other folks, that are beginning to map these concepts. It is not new. It's just new to us.
When we talk about resilience, what we're talking about is the ability of a system to adapt to changes, failures, and disturbances. When we say system, it's not just a host. I won't really talk about hosts at all in this talk. I'm talking about a system that is both the technology and the networks, servers, applications, processes, and people. Most importantly, people. So this will be on the test. When you go away and you say, "What did I learn from this?" Resilience is a function of people and culture predominantly, which is weird for a group of us that tend to shy away from lots of human interaction. This is mostly about how you change people to be resilient.
In 2001, I joined Amazon. I was testing for the Seattle Fire Department. I was not planning on going back into tech. My title was Master of Disaster, and I owned website availability for every property that bore the Amazon name. I came to own it. My first job at Amazon was running the backup systems. Who here loves running backup systems?
So, like everyone in this room, I hated running backup systems too, but knew how to do it. And it got me a job that paid pretty well. I was volunteering, doing a bunch of stuff with the fire service, which you have to do in order to get hired. It's a long process and a whole other story. I got hired on August 20th of 2001. And on September 11th, I woke up in the hospital having had emergency surgery the night before. And as I was watching what was unfolding, I realized that there were thousands and thousands of Amazon employees at the time who would depend in the coming days on the infrastructure we provided. I realized that if we had come under attack, I would have needed to put in place the systems and processes that I had just been creating and updating. It was the first time I realized my fire service background might have an application in the tech world.
That led to me working on a lot of the availability stuff, keeping site reliability up. At larger and larger scale, complexity results in lots and lots of failures. In the past decade, as an engineering culture, we've learned that scaling up, meaning vertical scaling, has not given us increased availability. Scaling out is what has. The reason is that failure happens in the environments we build. As we increase scale and complexity, we see an increasing number of failures. They can't be fully eliminated. Lots of vendors like to promise that they can be; those vendors are either foolish or lying. Systems, as they increase in size, connectedness, and complexity, have an increased failure rate.
The realities of this mirror a bunch of other industries. If you want to learn everything that I started off learning, read this book called Normal Accidents. In that book, Charles Perrow said, "Multiple unexpected interactions of failures are inevitable." You can reduce them, but you're always going to have failures. There's no amount of engineering that you can apply to a problem to eliminate failures.
What we found, and really the evolution of engineering in our industry has followed, is that mean time to recovery is way more awesome than mean time between failure. It's much harder to get to a single system that never fails than to have lots of systems that can tolerate individual failures of components.
As I was making my transition at Amazon, I realized that there was potential to apply fire service discipline, rigor, and safety techniques to Amazon's operating models, and basically turn Amazon into a fire department. I began adapting fire service incident management processes, training, drilling, and fire prevention concepts, and doing things to make individuals within the company think like firefighters. That included putting developers on call for their services. In exchange for being able to deploy really fast without a lot of process and constraint, you have to run your own services. That meant they had to be able to respond the way our ops teams could respond, within certain windows. That required training and drilling.
As the organization became more complicated, with larger scale and more interconnected systems, it also required us to use incident management techniques. I used a system called incident command. I substituted the word "management" for "command" and created an incident management system which was a word-for-word adaptation of the fire department's incident command system, a modular, adaptive system for dealing with incidents.
As we did this, I got some great support from executives. Werner was my internal executive sponsor, and Bezos certainly thought it was interesting. I realized that part of what we needed culturally and systemically was a way of training and drilling people so that they would improve both the availability of their own sites and their effectiveness in working together. We also understood that there are certain types of failures that many people have a hard time conceptualizing. One of those is data center level failures. I have lived in a world for 15 years where data center failures are a regular occurrence. In Charles Perrow's Normal Accidents world, it's a normal accident. I have a whole slide deck of my favorite data center failures. Particularly ones where somebody promised a bulletproof data center. If you ever say you have a bulletproof data center, the next day you will have a bunch of taser-wielding thieves breaking in and stealing stuff out of your facility, or it'll be on fire, or a tornado will come and knock it over. That's how you invite that. Just FYI. So never say that.
There were many failure types that could only be exposed using an operating model where we broke things intentionally or waited for real disasters to occur.
So I created this program called GameDay. There's a preparation phase. Most of you have done this when you're doing those pointless DR tabletop exercises. How many people have participated in a worthless DR exercise? Awesome. How many of you then burned the building down?
In the preparation phase, you're identifying and mitigating risks. You're saying, "Look, we're going to think about this, walk through, and find the stupid things." Like, we really probably should remove those oil barrels from next to our core database servers, or we really should rearchitect this particular system for increased availability. This has two properties. One, it reduces the frequency of failure. Two, it reduces the duration of recovery, although change tends to make things more complex.
Then you run the drill. This is where most organizations fail. Most organizations end up doing a very lightweight version of an exercise. It's not bad to do that. It's a starting place. But if you never go all the way through to a full-scale destruction, a live failure exercise, you don't get the benefit of a couple of things. The first is the confidence that you get from responding to a stressful situation and recovering from an actual failure at full speed. As a firefighter, I have been through countless live fire drills: we go to a place, we put out a fire, we're wearing gear. The fire will kill you. People die in training. It's an actual thing that happens. I can only be an effective firefighter having fought fire.
The second is that when you do the actual exercise, you see whether there is an individual and cultural ability to think about what actually happens in an incident. One of the most common things, when you compare the tabletop exercise to a real disaster, is that people find all kinds of systems and processes they thought would work do not. Oh, it turns out that our cell phone systems do have a single point of failure. We don't have a printout of everyone's phone number. Even the most basic things are things you never capture in a tabletop drill.
The last thing when you actually run these exercises is being able to trigger and expose latent defects. Latent defects are those gotchas, those "impossible" failures. How many of you have had an impossible failure? It can never happen. Latent defects are sitting there underneath the waterline of the systems that we build. You can't discover them any other way. They're immediately recognizable in hindsight, like, "Oh, we totally should have known that there was a dependency on that guy's desktop." We never turn off the office network, and so we would never find this problem.
You don't get to choose whether or not you have latent defects. You're always going to have them. You do get to choose when you discover some of them. And so you get to choose how much it's going to cost you. Finding that developer desktop that has that DNS dependency that he knows about but he's on vacation, that's going to cost you money in an outage. So the reason you run these exercises is, let's run an exercise off-peak rather than finding out about that on your most important traffic day.
The way that I've done this, both at that large retailer and elsewhere, is we start small. Understand that you're changing culture and that lots of people are going to resist you initially. You want to make the exercises achievable. Think of that kid with the fire hose. Those hoses when they're actually flowing water produce about 80 to 100 pounds of back pressure. That kid can go flying really high in the air. What you don't want to do when you're getting that kid excited about their future in the fire department is hurt them or scare them to death.
What I do is go through the system and find out what's the scariest thing in the organization. Take a survey. Find what's the one thing that if I turned it off, disaster, chaos, and pain would erupt. Then I make that the second exercise. The first one is achievable. We're going to power off this data center for 5 minutes and then power it back on, or we're going to shut off this one network at 2:00 in the morning for 10 minutes and see what breaks. We don't know what will break. There'll be all these people that will tell you why you can't do it. You got to start small.
You want this kid to leave and tell everyone, "Oh my god, it was so cool. There was a little fire that they made in a pan and I got to spray the hose and I was like a firefighter. It was awesome." You do that with your developers. You take them through and you say, "Look, we're going to power off this web server." You give them a little bit of a taste. Then you help them tell that story. You increase awareness within the organization.
Over time, what happens is you start building confidence. This is crucial. I said that this is about humans and culture. Making change requires what John Allspaw calls the currency for change. There's two kinds. One is, "Oh crap, we've had this terrible thing happen," and for a very short window we can make sweeping systemic change, because right now we recognize things are so bad. There's another type, which is actually the much more powerful but slower type: "Oh my goodness, this thing happened. It's so much more awesome, and we learned something. We're better because this happened." That's the way I recommend unless you have had a series of really bad outages.
You build confidence, and what happens is it becomes an internal competitive advantage. One team says, "Yeah, we can power stuff off all day long. Go ahead, rip it out." And the other one's like, "Well, we'll get there." You don't want to point fingers. What you do want to do is, "Look how badass this team is." You get there by making it really safe to play and ensuring that people have realistic, achievable challenges.
Then you move up to the full-scale live fire exercise. You've gone from working at small scale, where you took one system, one service, maybe a couple of nodes, and broke them in some way that you find a little bit scary. Now you're going to do a full data center power-down. You have to be multi-data-center resilient to do that without experiencing an outage. However, even if you're not, you have a lot of assumptions about what a restart in that facility is going to be like anyway.
For the first full-scale exercise, pick the worst survivable scenario, which I recommend a full-scale data center power-down. It will terrify everyone, and you will learn who are the crazy people and who are the awesome people. Everyone will hate you for a period of time when you're planning this exercise. It's going to be good for them. I promise.
You plan it out. You give them a couple months and you say, "On this date, at this time, we're going to power the thing down. You're going to know what facility, what time, shut it off." And so you have months to remediate all those little gotchas.
Then the day comes and you meet all these people who tell you, "You're not really going to power it down, right?" Yes, you are really going to power it down. The reason you have to do it is just like live fire. You can't learn any other way. You can slip a date, but never stop doing it. Otherwise, no one will believe that you are actually going to do it again. I always power it down.
The first time will be a disaster, and you will learn so many things. Probably mostly about your own ability to recover. Sysadmins are terrible about maintaining our own tools the right way. Even if you have something awesome like Chef that does all of your automation and deployment, there's going to be that ops master box that, oops, was the first machine powered down, and suddenly chaos ensues.
What happens from these exercises is that you start increasing awareness. You identify problems in a way that could not be possible any other way. You identify things that become safety standards and building codes. Every one of the safety features in this room that would save your life in the event of a fire is there because at some point somebody died. We don't actually improve building codes unless there is loss of life. It's a sad fact. The panic hardware on those doors is there because of hundreds and hundreds of people dying. So the reason you do these exercises is so that you identify the stuff that has to become part of your culture, and you get the currency to make those changes.
The OODA loop. Observe, orient, decide, and act. John Boyd, a fighter pilot, described the Boyd loop. In any crisis situation, you follow a predictable response. You observe what's going on. You have to orient yourself to that. When the crisis occurs, you don't get to take in anything new. There are personal factors: your training, your experience, your philosophical beliefs, whether you've been exposed to situations like this before. These determine why and how you act. This is about building ops culture. You want to take people through crisis situations so they have seen this, they've observed this type of situation before. They're able to orient to it. They're able to say, "Yep, I know what's going on here. The data center is on fire. What do we do?" If that's the first time you've ever experienced an outage of that scale, and you've never been through a game day, you're screwed. You're going to lock up. There are predictable failure modes. I can tell you what you'll do. The only reason why I'm different is I've been through it a lot.
As we evolve as an ops discipline, this becomes crucial. The orientation stage allows you to move into the decide stage to plan your course of action and then actually do things. The reason you run these drills, this exposure and training, is to build that core base of competence so that people react appropriately in a crisis and think about future crises and how they can be avoided. This is part of a longer-term topic called resilience engineering. There are great studies going on across many different disciplines. I highly recommend that everybody read John's postmortem paper and also the Normal Accidents book.