What to Know About the Modern Incident Response Lifecycle

Heavybit by Andrew Park · Article

"Teams only get good at this when they embrace the whole process and each of its steps."

— Jesse Robbins

Heavybit's incident management guide quotes me on why teams only get good at incident response when they treat the whole lifecycle as one discipline.

Andrew Park asked me how teams actually get good at incident response. I told him the same thing I told the operations engineers I trained at Amazon and the founders I have worked with since: you do not get to pick the parts of the lifecycle you like. Detection, response, resolution, retrospective. They are one thing.

Teams that try to skip the retrospective never learn. Teams that treat detection as someone else's problem stay surprised. Teams that wall off resolution from the people who built the system spend years rebuilding trust they did not need to break. The discipline is the cycle, not any single step.

This is the same argument I made at Amazon as Master of Disaster and have kept making since. It is the argument behind the foreword I wrote for Incident Management for Operations in 2017. Andrew's piece is a working summary of where the practice landed inside operations-mature teams a decade after we started naming it.

Since this came out

The pattern keeps showing up. AWS turned a lot of what we ran by hand into Fault Injection Service. SRE and chaos engineering both standardized retrospective practice. The investors and operators I work with treat incident response as a function of how the company is run, not a tool category. The lifecycle frame is the part worth remembering. Tools change. The discipline does not.

Heavybit’s guide to modern incident management quotes Jesse Robbins on why teams only develop operational excellence when they treat the full lifecycle as a single discipline. The article walks readers through the practical implications: normalize incidents by talking about them often, be honest about the state of the infrastructure, and treat the practice itself as the source of mastery. Jesse’s quoted framing is that teams who skip any step in the cycle never fully develop the muscle for any of them.

Since this came out…

  1. AWS shipped Fault Injection Service. The patterns I ran by hand at Amazon, now a managed service anyone can call.
  2. I wrote the foreword to Incident Management for Operations by Schnepp, Vidal, and Hawley (O'Reilly). The book is the long-form version of the lifecycle argument I made in this piece.