I joined Amazon in 2001 and earned the title “Master of Disaster.” I was responsible for the availability of every property bearing the Amazon brand. That scope took me across most of Amazon’s teams and systems.

I brought the power of the Incident Command System to Amazon when we desperately needed it to scale. Clear roles, practiced procedures, calm under pressure.

I built the GameDay practice of deliberately breaking production systems so teams could rehearse before real failures hit. The best way to fix major failures was to create them. From that work I developed three connected practices as a single body of work: modern Incident Management, plus what we now call Site Reliability Engineering and Chaos Engineering. GameDay is the most visible piece. Netflix later adapted it and named their version Chaos Engineering, which is a better name.

I also helped define and deliver Amazon’s architectural shift to always-on. Traditional disaster recovery meant staging cold or warm standby systems and rehearsing failover. We did the opposite, running services in active production across multiple data centers at all times. That changed the problem from recovering from a disaster to managing capacity and risk. The goal was a system that could absorb the loss of an entire data center without a recovery event. It became the foundation everything else was built on.

The turning point was watching a junior engineer physically shaking after they triggered an outage, terrified of being blamed. That moment convinced me that the punitive culture around failure was the real reliability problem. I shifted Amazon’s approach from blame to learning, making it safe to experiment, safe to fail, and safe to report problems honestly. You only get to do really big, great things when you are able to take great risks safely.

What I built at Amazon

About Jesse Robbins