"Resilience Engineering"

Read first

What is GameDay and chaos engineering?
GameDay is the practice Jesse Robbins built at Amazon: deliberately breaking production systems so teams could practice before real failures hit. It was the most visible part of a larger body of work, which Netflix later adapted and named Chaos Engineering.

Articles and mentions

Investing in Vibrant Labs: AI Agent Simulation Infrastructure

December 3, 2025

Announcing my investment in Vibrant Labs, which builds production-grade simulation and verifier-driven evaluation for long-horizon AI agents.

“Vibrant Labs opens a new frontier in AI infrastructure: production-grade, RL-ready simulation and verifier-driven evaluation built for long-horizon agents.”

— Jesse Robbins

Heavybit

Generative AI in DevOps and Incident Response: What the Experts Actually Think

October 12, 2023

I interviewed Nora Jones, Jeremy Edberg, Mandi Walls, and Brent Chapman on what generative AI actually does in incident response, and where humans have to stay in the loop.

“GenAI is good at confidently delivering text that is pleasant to read, but not always complete, or correct.”

Gremlin

Fireside Chat with Jesse Robbins and Kolton Andrus • Failover Conf 2021

April 29, 2021 · Video · 28:23

At Gremlin's Failover Conf 2021, Kolton Andrus and I covered GameDay origins at Amazon, the evolution of chaos engineering, and where reliability practices were headed.

▶ YouTube

O'Reilly Media

Incident Management for Operations (foreword by Jesse Robbins)

July 1, 2017 · Other

I wrote the foreword to Schnepp, Vidal, and Hawley's O'Reilly book bringing fire-service incident command into IT operations. The lineage runs from my work at Amazon as Master of Disaster through the first Web Ops/Fire Ops summit I convened in 2012.

“This groundbreaking book is the foundation to building an effective operations culture for organizations of any size, with systems of any complexity, and failures of any severity.”

— Jesse Robbins, from the foreword

ACM Queue

Resilience Engineering: Learning to Embrace Failure

September 12, 2012

Jesse Robbins (Amazon), Kripa Krishnan (Google), and John Allspaw (Etsy) discuss how they built organizations that deliberately trigger failure to get stronger: powering off data centers, running 96-hour disaster simulations, and transforming blame cultures into learning cultures.

“You can't choose whether or not you're going to have failures — they are going to happen no matter what — but you can choose in many cases when you're going to learn the lessons.”

— Jesse Robbins

USENIX

GameDay: Creating Resiliency Through Destruction

December 20, 2011 · Talk · 52:50

My USENIX LISA'11 talk on GameDay: deliberately injecting failures into production to build organizational resilience before real outages happen. I had been running these exercises at Amazon since 2003.

“You don't choose the moment, the moment chooses you. You only choose how prepared you are when it does.”

— Jesse Robbins

▶ YouTube

MIT Technology Review

MIT Technology Review TR35: Innovators Under 35

January 1, 2011

The MIT Technology Review TR35 listing for 2011, citing my work on web operations, cloud, and resilience engineering at Amazon and Opscode.

Venture Hacks

Five Whys: Try to Learn a Dollar's Worth of Lesson for Every One You Spend in Failure

November 17, 2008 · Quote

Eric Ries quoted me in his Venture Hacks guide to Five Whys: try to learn a dollar's worth of lesson for every dollar spent in failure. The line came from Amazon GameDay practice.

“Try to learn a dollar's worth of lesson for every one you spend in failure.”

— Jesse Robbins