What I built at Amazon
I joined Amazon in 2001 as “Master of Disaster.” I just typed it into a form one day, and then it stuck. I was responsible for the availability of every property bearing the Amazon brand.
I’d been building ISPs and running servers since high school. When I joined Amazon, I was simultaneously testing for the fire department. Tech was the day job while I pursued emergency services. That dual path turned out to be the thing that made everything else possible. I brought the discipline of incident command to a company that desperately needed it: clear roles, practiced procedures, calm under pressure.
I created Amazon’s Incident Management program and GameDay, the practice of deliberately breaking production systems so teams could rehearse before real failures hit. The best way to fix major failures was to create them. GameDay pioneered what the industry now calls chaos engineering, and the incident management framework I built became a template for site reliability engineering as a discipline.
The turning point was watching a junior engineer physically shaking during an outage, terrified of being blamed. That moment convinced me that the punitive culture around failure was the real reliability problem. I shifted Amazon’s approach from blame to learning, making it safe to experiment, safe to fail, and safe to report problems honestly. You only get to do really big, great things when you are able to take great risks safely.
Further reading
- Ex-Amazon ‘Master of Disaster’ Animates Server Chef — The Register, 2009
- The Origins of Amazon Cloud Computing — GigaOM, 2011
- GameDay: Creating Resiliency Through Destruction — USENIX, 2011
- Fireside Chat with Kolton Andrus — Failover Conf, 2020
- AWS Well-Architected: Game Day — GameDay is now a formal AWS reliability concept
- AWS Fault Injection Service — the managed service that automates the fault injection I created at Amazon