Modern Incident Response Lifecycle | Jesse Robbins quoted, Heavybit

What to Know About the Modern Incident Response Lifecycle

Heavybit by Andrew Park · November 11, 2022 · Article

"Teams only get good at this when they embrace the whole process and each of its steps."

Heavybit's incident management guide quotes me on why teams only get good at incident response when they treat the whole lifecycle as one discipline.

Andrew Park asked me how teams actually get good at incident response. I told him the same thing I told the operations engineers I trained at Amazon and the founders I have worked with since: you do not get to pick the parts of the lifecycle you like. Detection, response, resolution, retrospective. They are one thing.

Teams that try to skip the retrospective never learn. Teams that treat detection as someone else's problem stay surprised. Teams that wall off resolution from the people who built the system spend years rebuilding trust they did not need to break. The discipline is the cycle, not any single step.

This is the same argument I made at Amazon as Master of Disaster and have kept making since. It is the argument behind the foreword I wrote for Incident Management for Operations in 2017. Andrew's piece is a working summary of where the practice landed inside operations-mature teams a decade after we started naming it.

Since this came out

The pattern keeps showing up. AWS turned a lot of what we ran by hand into Fault Injection Service. SRE and chaos engineering both standardized retrospective practice. The investors and operators I work with treat incident response as a function of how the company is run, not a tool category. The lifecycle frame is the part worth remembering. Tools change. The discipline does not.

Andrew Park’s guide for Heavybit on modern incident management quotes me on why teams only get good at this when they treat the full lifecycle as a single discipline. The piece walks readers through the practical implications: normalize incidents by talking about them often, be honest about the state of the infrastructure, and treat the practice itself as the source of mastery. The line he pulled from our conversation is the one I have been saying since Amazon: skip any step in the cycle and you never fully develop the muscle for any of them.

Since this came out…

March 15, 2021

AWS shipped Fault Injection Service. The patterns I ran by hand at Amazon, now a managed service anyone can call.

July 1, 2017

I wrote the foreword to Incident Management for Operations by Schnepp, Vidal, and Hawley (O'Reilly). The book is the long-form version of the lifecycle argument I made in this piece.

DevOps is dead? Nope, it is maturing ft. Jesse Robbins

April 7, 2023 · Video · 37:57

DevOps is not dead. It's maturing. Platform engineering is the next layer of the same idea, not a replacement for it. My conversation with Rob Zuber on what's actually changing and what isn't.

“Organizations evolve like cities. You start with a few shacks in the woods. Eventually you have enough at stake that you need building codes, fire codes, a fire department, and someone who actually tests the sprinklers.”

— Jesse Robbins


Gremlin

Fireside Chat with Jesse Robbins and Kolton Andrus • Failover Conf 2021

April 29, 2021 · Video · 28:23

At Gremlin's Failover Conf 2021, Kolton Andrus and I covered GameDay origins at Amazon, the evolution of chaos engineering, and where reliability practices were headed.

