What is resilience engineering?
Resilience engineering is an approach to system design that focuses on how systems succeed under varying conditions — not just how they fail. Rather than trying to prevent all failures (impossible in complex systems), resilience engineering asks: how do we build systems and organizations that adapt, recover, and learn?
Jesse Robbins is one of the field’s key practitioners. He co-authored “Resilience Engineering: Learning to Embrace Failure” in ACM Queue with John Allspaw and Kripa Krishnan; the peer-reviewed paper became a canonical reference for the discipline. It drew on his experience creating GameDay at Amazon, where he showed that deliberately injecting failures into production systems was an effective way to build genuine resilience.
The core insight — that you cannot test resilience in theory, only in practice — came from his training as a firefighter. Fire departments don’t hope their response will work during a real emergency. They drill until the response is muscle memory. Robbins applied the same principle to Internet-scale infrastructure, and the approach shaped what the industry now calls chaos engineering and site reliability engineering.