---
title: What Jesse Robbins built at Amazon
description: "As Amazon's 'Master of Disaster,' Jesse Robbins was responsible for the availability of every property bearing the Amazon brand. He helped define Amazon's always-on architecture. Adapting the Incident Command System he learned as a volunteer firefighter, he built three connected practices as one body of work: modern Incident Management, and what we now call Site Reliability Engineering and Chaos Engineering."
doc_version: "1.0"
last_updated: 2026-05-31
slug: amazon
question: What did Jesse Robbins build at Amazon?
excerpt: "As Amazon's 'Master of Disaster,' Jesse Robbins was responsible for the availability of every property bearing the Amazon brand. He helped define Amazon's always-on architecture. Adapting the Incident Command System he learned as a volunteer firefighter, he built three connected practices as one body of work: modern Incident Management, and what we now call Site Reliability Engineering and Chaos Engineering."
tags:
  - Amazon
  - incident management
  - site reliability
  - GameDay
  - chaos engineering
expertiseDomain: infrastructure
---

I joined Amazon in 2001 and earned the title "Master of Disaster." I was responsible for the availability of every property bearing the Amazon brand. That scope took me across most of Amazon's teams and systems.

I brought the power of the Incident Command System to Amazon when we desperately needed it to scale: clear roles, practiced procedures, and calm under pressure.

I built the [GameDay](/about/gameday-chaos-engineering/) practice of deliberately breaking production systems so teams could practice before real failures hit. The best way to fix major failures was to create them. From that work I built three connected practices as one body of work: modern Incident Management, and what we now call Site Reliability Engineering and Chaos Engineering. GameDay is the most visible piece. Netflix later adapted it and named their version Chaos Engineering, which is a better name.

I also helped define and deliver Amazon's architectural shift to always-on. Traditional disaster recovery meant staging cold or warm standby systems and rehearsing failover. We did the opposite, running services in active production across multiple data centers at all times. That changed the problem from recovering from a disaster to managing capacity and risk. The goal was a system that could absorb the loss of an entire data center without a recovery event. It became the foundation everything else was built on.
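
To make the capacity framing concrete, here is a minimal Python sketch. It is an illustration only, not Amazon's tooling: the function names, site names, and numbers are all hypothetical, and it assumes load can be redistributed freely across the surviving sites.

```python
def survives_single_site_loss(site_capacity, peak_load):
    """Return True if peak load still fits after the largest site is lost.

    site_capacity: dict mapping a (hypothetical) site name to the load it can serve.
    peak_load: total load the service must carry, in the same units.
    """
    if len(site_capacity) < 2:
        # A single site has nowhere to shift traffic.
        return False
    # Worst case for a single failure is losing the biggest site.
    remaining = sum(site_capacity.values()) - max(site_capacity.values())
    return remaining >= peak_load


def max_safe_utilization(site_count):
    """Highest per-site utilization that leaves room to absorb one lost site.

    Assumes equal-sized sites: the survivors must carry everything,
    so each site can run at most (n - 1) / n of its capacity.
    """
    if site_count < 2:
        raise ValueError("need at least two active sites")
    return (site_count - 1) / site_count


# Made-up example: three equal sites, each able to serve 100 units.
sites = {"site-a": 100, "site-b": 100, "site-c": 100}
print(survives_single_site_loss(sites, peak_load=180))
print(max_safe_utilization(len(sites)))
```

Under these assumptions, running more active sites raises the utilization each one can safely sustain, which is one way to see why the question shifts from failover rehearsal to capacity and risk management.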

The turning point was watching a junior engineer physically shaking after they triggered an outage, terrified of being blamed. That moment convinced me that the punitive culture around failure was the real reliability problem. I shifted Amazon's approach from blame to learning, making it safe to experiment, safe to fail, and safe to report problems honestly. You only get to do really big, great things when you are able to take great risks safely.

## Further reading

- [Ex-Amazon 'Master of Disaster' Animates Server Chef](/mentions/ex-amazon-master-of-disaster-animates-server-chef-register/) — The Register, 2009
- [The Origins of Amazon's Cloud Computing](/mentions/origins-amazon-cloud-computing-gigaom/) — GigaOM, 2010
- [GameDay: Creating Resiliency Through Destruction](/mentions/gameday-creating-resiliency-through-destruction-usenix/) — USENIX, 2011
- [Fireside Chat with Kolton Andrus](/mentions/fireside-chat-jesse-robbins-kolton-andrus-failover-conf/) — Failover Conf, 2021
- [AWS Well-Architected: Game Day](https://wa.aws.amazon.com/wat.concept.gameday.en.html) — GameDay is now a formal AWS reliability concept
- [AWS Fault Injection Service](https://aws.amazon.com/fis/) — the managed service that automates the fault injection I created at Amazon

