"Incident Response"

Jesse Robbins convenes a panel of incident management veterans to examine where generative AI genuinely helps in SRE and DevOps — and where humans must stay in the loop.

“GenAI is good at confidently delivering text that is pleasant to read, but not always complete, or correct.”

Heavybit

What to Know About the Modern Incident Response Lifecycle

November 11, 2022

Heavybit's guide to modern incident management quotes Jesse Robbins on why teams only master incident response when they embrace the whole process — detection, response, and learning.

“Teams only get good at this when they embrace the whole process and each of its steps.”

— Jesse Robbins

ACM Queue

Resilience Engineering: Learning to Embrace Failure

September 12, 2012

Jesse Robbins (Amazon), Kripa Krishnan (Google), and John Allspaw (Etsy) discuss how they built organizations that deliberately trigger failure to get stronger: powering off data centers, running 96-hour disaster simulations, and transforming blame cultures into learning cultures.

“You can't choose whether or not you're going to have failures — they are going to happen no matter what — but you can choose in many cases when you're going to learn the lessons.”

— Jesse Robbins

USENIX

GameDay: Creating Resiliency Through Destruction

December 20, 2011 · Talk · 52:50

In this USENIX LISA'11 talk, Jesse Robbins explains GameDay: deliberately injecting failures into production systems to build organizational resilience before real outages happen.

“You don't choose the moment, the moment chooses you. You only choose how prepared you are when it does.”

— Jesse Robbins

▶ YouTube

O'Reilly Radar

Understanding Operations Culture (Part 1)

June 14, 2008

Jesse Robbins draws on his firefighting background to define web operations culture — the mindset, habits, and discipline that separate teams who handle incidents well from those who don't.

“You don't choose the moment, the moment chooses you. You only choose how prepared you are when it does.”

— Fire Chief Mike Burtch