"Incident Response"
Articles and mentions
Generative AI in DevOps and Incident Response: What the Experts Actually Think
I interviewed Nora Jones, Jeremy Edberg, Mandi Walls, and Brent Chapman on what generative AI actually does in incident response, and where humans have to stay in the loop.
“GenAI is good at confidently delivering text that is pleasant to read, but not always complete, or correct.”
What to Know About the Modern Incident Response Lifecycle
Heavybit's incident management guide quotes me on why teams only get good at incident response when they treat the whole lifecycle as one discipline.
“Teams only get good at this when they embrace the whole process and each of its steps.”
Fireside Chat with Jesse Robbins and Kolton Andrus • Failover Conf 2021
At Gremlin's Failover Conf 2021, Kolton Andrus and I covered GameDay origins at Amazon, the evolution of chaos engineering, and where reliability practices were headed.
Incident Management for Operations (foreword by Jesse Robbins)
I wrote the foreword to Schnepp, Vidal, and Hawley's O'Reilly book, which brings fire-service incident command into IT operations. Its lineage runs from my work at Amazon as Master of Disaster through the first Web Ops/Fire Ops summit I convened in 2012.
“This groundbreaking book is the foundation to building an effective operations culture for organizations of any size, with systems of any complexity, and failures of any severity.”
Resilience Engineering: Learning to Embrace Failure
Kripa Krishnan (Google), John Allspaw (Etsy), and I (Amazon) discuss how we built organizations that deliberately trigger failure to get stronger: powering off data centers, running 96-hour disaster simulations, and transforming blame cultures into learning cultures.
“You can't choose whether or not you're going to have failures — they are going to happen no matter what — but you can choose in many cases when you're going to learn the lessons.”
GameDay: Creating Resiliency Through Destruction
My USENIX LISA'11 talk on GameDay, the practice of deliberately injecting failures into production to build organizational resilience before real outages happen. I had been running these exercises at Amazon since 2003.
“You don't choose the moment, the moment chooses you. You only choose how prepared you are when it does.”
Understanding Operations Culture (Part 1)
I wrote this in 2008 to define web operations culture using what I had learned from the fire service: the habits that separate teams who handle incidents well from teams who don't.
“You don't choose the moment, the moment chooses you. You only choose how prepared you are when it does.”