---
title: What is GameDay and chaos engineering?
description: "GameDay is the practice Jesse Robbins built at Amazon: deliberately breaking production systems so teams could practice before real failures hit. It was the most visible piece of one body of work he built, which Netflix later adapted and named Chaos Engineering."
doc_version: "1.0"
last_updated: 2026-05-31
slug: gameday-chaos-engineering
question: What is GameDay and chaos engineering?
excerpt: "GameDay is the practice Jesse Robbins built at Amazon: deliberately breaking production systems so teams could practice before real failures hit. It was the most visible piece of one body of work he built, which Netflix later adapted and named Chaos Engineering."
tags:
  - chaos engineering
  - GameDay
  - Amazon
  - resilience engineering
  - site reliability
expertiseDomain: infrastructure
---

I built the GameDay practice at [Amazon](/about/amazon/), deliberately breaking production systems so teams could practice before real failures hit. The best way to fix major failures was to create them.

The approach came directly from [firefighting](/about/emergency-services/). Fire departments drill scenarios until the response is muscle memory. I brought the same discipline to distributed systems at Amazon. You cannot test resilience in theory.

GameDay exposed weaknesses that no amount of code review or load testing could find. The structured, high-stakes drills built confidence and revealed exactly where systems would break. If you do the upfront work right, a failure is an incident and an emergency, but not a disaster.

GameDay was the most visible piece of a connected body of work I built by adapting the Incident Command System. That work also includes modern Incident Management and what we now call Site Reliability Engineering and Chaos Engineering. Engineers who had been at Amazon brought the practice to Netflix and built [Chaos Monkey](https://netflix.github.io/chaosmonkey/), which randomly terminated instances in production. Netflix named its adapted version Chaos Engineering, which is a better name. Google, Facebook, Yahoo, and dozens of others built their own programs. AWS turned GameDay into a core operational practice. The [Well-Architected Framework](https://wa.aws.amazon.com/wat.concept.gameday.en.html) now defines "game day" as a formal reliability concept, and [AWS Fault Injection Service](https://aws.amazon.com/fis/) automates the kind of fault injection I was doing by hand. The discipline of learning from controlled failure became foundational to resilience engineering.
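To make the random-termination idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Chaos Monkey's implementation or the original GameDay tooling: `Fleet`, `terminate_random_instance`, and `run_drill` are hypothetical names, the fleet is an in-memory set rather than real infrastructure, and "serving" is reduced to a simple instance count.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Fleet:
    """Hypothetical in-memory stand-in for a group of service instances."""
    min_healthy: int
    instances: set = field(default_factory=set)

    def is_serving(self) -> bool:
        # Assumption: the service counts as up while enough instances remain.
        return len(self.instances) >= self.min_healthy


def terminate_random_instance(fleet: Fleet, rng: random.Random) -> str:
    """Inject one fault: remove a randomly chosen instance and return its name."""
    victim = rng.choice(sorted(fleet.instances))
    fleet.instances.remove(victim)
    return victim


def run_drill(fleet: Fleet, faults: int, rng: random.Random) -> list:
    """Inject up to `faults` terminations, stopping at the first outage.

    Returns a log of (terminated_instance, still_serving) pairs.
    """
    log = []
    for _ in range(faults):
        if not fleet.instances:
            break
        victim = terminate_random_instance(fleet, rng)
        serving = fleet.is_serving()
        log.append((victim, serving))
        if not serving:
            break  # the drill found the breaking point
    return log


fleet = Fleet(min_healthy=2, instances={"web-1", "web-2", "web-3", "web-4"})
drill_log = run_drill(fleet, faults=4, rng=random.Random(0))
for victim, serving in drill_log:
    print(f"terminated {victim}: {'still serving' if serving else 'OUTAGE'}")
```

In this toy fleet the drill shows that the service tolerates two losses and fails on the third. A real exercise would replace the instance count with actual health checks and add the safeguards (scope limits, an abort switch, people on call) that make a controlled failure controlled.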

Learn a dollar's worth of lessons for every dollar you spend in failure.

## Further reading

- [GameDay: Creating Resiliency Through Destruction](/mentions/gameday-creating-resiliency-through-destruction-usenix/) — USENIX, 2011
- [Resilience Engineering: Learning to Embrace Failure](/mentions/resilience-engineering-learning-embrace-failure-acm-queue/) — ACM Queue, 2012
- [The DevOps Origin Story](/about/devops-origin-story/) — how GameDay fit into the broader movement
- [Five Whys](/mentions/five-whys-jesse-robbins-quote-venturehacks/) — Venture Hacks
- [AWS Well-Architected: Conduct Game Days Regularly](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_testing_resiliency_game_days_resiliency.html) — GameDay as an AWS reliability best practice
- [Chaos Engineering on AWS](https://docs.aws.amazon.com/prescriptive-guidance/latest/chaos-engineering-on-aws/overview.html) — AWS Prescriptive Guidance
