GameDay and chaos engineering

I built the GameDay practice at Amazon, deliberately breaking production systems so teams could practice before real failures hit. The best way to fix major failures was to create them.

The approach came directly from firefighting. Fire departments drill scenarios until the response is muscle memory. I brought the same discipline to distributed systems at Amazon. You cannot test resilience in theory.

GameDay exposed weaknesses that no amount of code review or load testing could find. The structured, high-stakes drills built confidence and revealed exactly where systems would break. If you do the upfront work right, a failure is an incident and an emergency, but not a disaster.

GameDay was the most visible piece of a connected body of work I built by adapting the Incident Command System: modern Incident Management, and what we now call Site Reliability Engineering and Chaos Engineering. Engineers who had been at Amazon brought the practice to Netflix and built Chaos Monkey, which randomly terminated instances in production. Netflix named their adapted version Chaos Engineering, and it is a better name. Google, Facebook, Yahoo, and dozens of others built their own programs. AWS turned GameDay into a core operational practice. The Well-Architected Framework now defines “game day” as a formal reliability concept, and AWS Fault Injection Service automates the kind of fault injection I was doing by hand. The discipline of learning from controlled failure became foundational to resilience engineering.

Learn a dollar of lesson for every one you spend in failure.

About Jesse Robbins

Jesse Robbins cofounded Chef and the DevOps movement. He invests at the seed stage in AI developer tools and infrastructure.