
What to Know About the Modern Incident Response Lifecycle
Heavybit by Andrew Park · Article
"Teams only get good at this when they embrace the whole process and each of its steps."
Heavybit's incident management guide quotes me on why teams only get good at incident response when they treat the whole lifecycle as one discipline.
Andrew Park asked me how teams actually get good at incident response. I told him the same thing I told the operations engineers I trained at Amazon and the founders I have worked with since: you do not get to pick the parts of the lifecycle you like. Detection, response, resolution, retrospective. They are one thing.
Teams that try to skip the retrospective never learn. Teams that treat detection as someone else's problem stay surprised. Teams that wall off resolution from the people who built the system spend years rebuilding trust they did not need to break. The discipline is the cycle, not any single step.
This is the same argument I made at Amazon as Master of Disaster and have kept making since. It is the argument behind the foreword I wrote for Incident Management for Operations in 2017. Andrew's piece is a working summary of where the practice landed inside operations-mature teams a decade after we started naming it.
Since this came out
The pattern keeps showing up. AWS turned a lot of what we ran by hand into Fault Injection Service. SRE and chaos engineering both standardized retrospective practice. The investors and operators I work with treat incident response as a function of how the company is run, not a tool category. The lifecycle frame is the part worth remembering. Tools change. The discipline does not.
Heavybit’s guide to modern incident management quotes Jesse Robbins on why teams only develop operational excellence when they treat the full lifecycle as a single discipline. The article walks readers through the practical implications: normalize incidents by talking about them often, be honest about the state of the infrastructure, and treat the practice itself as the source of mastery. Jesse’s quoted framing is that teams who skip any step in the cycle never fully develop the muscle for any of them.
Since this came out…
- AWS shipped Fault Injection Service. The patterns I ran by hand at Amazon, now a managed service anyone can call.
- I wrote the foreword to Incident Management for Operations by Schnepp, Vidal, and Hawley (O'Reilly). The book is the long-form version of the lifecycle argument I made in this piece.
Further reading
- Incident Management for Operations (foreword) — The book that codified the operational incident lifecycle. I wrote the foreword.
- GameDay and chaos engineering — Where the practice of deliberately exercising the lifecycle came from
- Resilience engineering — The body of practice the lifecycle frame sits inside
- What I built at Amazon — Master of Disaster, incident management, and the origin of the lifecycle frame
- Resilience Engineering: Learning to Embrace Failure — ACM Queue roundtable with Kripa Krishnan and John Allspaw
- Original article at Heavybit — Andrew Park's full piece
