"Chaos Engineering"

Read first

  • What is Jesse Robbins known for?

    Jesse Robbins is an early-stage investor in AI developer tools and infrastructure who has invested in and advised over sixty companies including PagerDuty, Fastly, and Tailscale. He cofounded Chef, created chaos engineering at Amazon, and cofounded the DevOps movement.

  • What is GameDay and chaos engineering?

    GameDay is the practice Jesse Robbins built at Amazon: deliberately breaking production systems so teams could practice before real failures hit. It was the most visible piece of one body of work he built, which Netflix later adapted and named Chaos Engineering.

  • What did Jesse Robbins build at Amazon?

    As Amazon's 'Master of Disaster,' Jesse Robbins was responsible for the availability of every property bearing the Amazon brand. He helped define Amazon's always-on architecture. Adapting the Incident Command System he learned as a volunteer firefighter, he built three connected practices as one body of work: modern Incident Management, and what we now call Site Reliability Engineering and Chaos Engineering.

Articles and mentions

Gremlin

Fireside Chat with Jesse Robbins and Kolton Andrus • Failover Conf 2021

Video · 28:23

At Gremlin's Failover Conf 2021, Kolton Andrus and I covered GameDay origins at Amazon, the evolution of chaos engineering, and where reliability practices were headed.

▶ YouTube

Protocol

An oral history of #hugops: How tech's first responders built a culture of empathy

Protocol's oral history of #hugops.

“I've got to change the way that I approach this entirely and make it safe to experiment.”

— Jesse Robbins

O'Reilly Radar

Tim O'Reilly on Why We Started the Velocity Conference

Tim O'Reilly's 2013 retrospective on how the Velocity Conference began. I co-founded it with Steve Souders and chaired the program.

InfoQ

Jesse Robbins on the Rise of DevOps (InfoQ Interview)

InfoQ interviewed me on how DevOps started, why infrastructure as code changed operations, and what it actually takes to get developers and ops working together.

ACM Queue

Resilience Engineering: Learning to Embrace Failure

Jesse Robbins (Amazon), Kripa Krishnan (Google), and John Allspaw (Etsy) discuss how they built organizations that deliberately trigger failure to get stronger: powering off data centers, running 96-hour disaster simulations, and transforming blame cultures into learning cultures.

“You can't choose whether or not you're going to have failures — they are going to happen no matter what — but you can choose in many cases when you're going to learn the lessons.”

— Jesse Robbins

O'Reilly Velocity Conference

Changing Culture & Being a Force for Awesome

Video · 34:28

My 2012 Velocity talk on changing engineering culture from the inside: start small, build champions, use metrics to create confidence, and exploit compelling events.

“Don't fight stupid. Focus on where you can make more awesome.”

— Jesse Robbins

▶ YouTube

USENIX

GameDay: Creating Resiliency Through Destruction

Talk · 52:50

My USENIX LISA'11 talk on GameDay: deliberately injecting failures into production to build organizational resilience before real outages happen. I had been running these exercises at Amazon since 2003.

“You don't choose the moment, the moment chooses you. You only choose how prepared you are when it does.”

— Jesse Robbins

▶ YouTube

The Register

Ex-Amazon 'Master of Disaster' Animates Server Chef

The Register profiled my move from Amazon's Master of Disaster role to co-founding Opscode and launching Chef, tracing the line from reliability engineering to infrastructure as code.