GameDay and chaos engineering
The best way to fix major failures was to create them. That was the idea behind GameDay, a practice I created at Amazon in which teams deliberately inject major failures into production systems under controlled conditions.
The approach came directly from firefighting. Fire departments do not wait for a building to catch fire to find out whether their teams can handle it. They drill. They run scenarios. They practice until the response is muscle memory. I applied the same principle to distributed systems at Internet scale: you cannot test resilience in theory. You have to test it in practice, under conditions that simulate real failure.
GameDay exposed weaknesses that no amount of code review or load testing could find. Teams practiced incident response before real outages forced them to learn under pressure. The structured, high-stakes drills built confidence and revealed exactly where systems would break. If you do the upfront work right, a failure is an incident, even an emergency, but not a disaster.
GameDay pioneered what the industry now calls chaos engineering, incident management, and site reliability engineering. Engineers who had been at Amazon brought the practice to Netflix and built Chaos Monkey, which randomly terminated instances in production. Google, Facebook, Yahoo, and dozens of others built their own programs. AWS turned GameDay into a core operational practice. The Well-Architected Framework now defines “game day” as a formal reliability concept, and AWS Fault Injection Service automates the kind of fault injection I was doing by hand. The philosophy of learning from controlled failure became one of the foundational ideas in resilience engineering.
Try to learn a dollar’s worth of lesson for every dollar spent in failure.
Further reading
- GameDay: Creating Resiliency Through Destruction — USENIX, 2011
- Resilience Engineering: Learning to Embrace Failure — ACM Queue, 2012
- The DevOps Origin Story — how GameDay fit into the broader movement
- Five Whys — Venture Hacks
- AWS Well-Architected: Conduct Game Days Regularly — GameDay as an AWS reliability best practice
- Chaos Engineering on AWS — AWS Prescriptive Guidance