Resilience engineering
Resilience engineering is an approach to system design that focuses on how systems succeed under varying conditions, not just how they fail. Rather than trying to prevent all failures, which is impossible in complex systems, resilience engineering asks: how do we build systems and organizations that adapt, recover, and learn?
I contributed to “Resilience Engineering: Learning to Embrace Failure” in ACM Queue with John Allspaw and Kripa Krishnan, an article that became a key reference for the discipline. It drew on my experience creating GameDay at Amazon, where I proved that deliberately injecting failures into production systems was the most effective way to build genuine resilience.
The core insight is that you cannot test resilience in theory, only in practice. That insight came from my training as a firefighter. Fire departments do not hope their response will work during a real emergency. They drill until the response is muscle memory. I applied the same principle to Internet-scale infrastructure.
You can’t choose whether or not you are going to have failures. They are going to happen no matter what. In many cases, though, you can choose when you learn the lessons. That is what resilience engineering is: choosing to learn before the moment forces you to.
Further reading
- Resilience Engineering: Learning to Embrace Failure — ACM Queue, 2012
- GameDay: Creating Resiliency Through Destruction — USENIX, 2011
- The DevOps Origin Story — the broader story that resilience engineering shaped
- An Oral History of #HugOps — Protocol, 2021
- AWS Fault Injection Service — managed chaos engineering, automating the fault injection I pioneered at Amazon
- Chaos Engineering on AWS — AWS Prescriptive Guidance