"Incident Response"
Incident Response and DevOps in the Age of Generative AI
Jesse Robbins convenes a panel of incident management veterans to examine where generative AI genuinely helps in SRE and DevOps — and where humans must stay in the loop.
“GenAI is good at confidently delivering text that is pleasant to read, but not always complete, or correct.”
What to Know About the Modern Incident Response Lifecycle
Heavybit's guide to modern incident management quotes Jesse Robbins on why teams master incident response only when they embrace the whole process: detection, response, and learning.
“Teams only get good at this when they embrace the whole process and each of its steps.”
Resilience Engineering: Learning to Embrace Failure
Jesse Robbins (Amazon), Kripa Krishnan (Google), and John Allspaw (Etsy) discuss how they built organizations that deliberately trigger failure to get stronger: powering off data centers, running 96-hour disaster simulations, and transforming blame cultures into learning cultures.
“You can't choose whether or not you're going to have failures — they are going to happen no matter what — but you can choose in many cases when you're going to learn the lessons.”
GameDay: Creating Resiliency Through Destruction
In this USENIX LISA'11 talk, Jesse Robbins explains GameDay: deliberately injecting failures into production systems to build organizational resilience before real outages happen.
“You don't choose the moment, the moment chooses you. You only choose how prepared you are when it does.”
Understanding Operations Culture (Part 1)
Jesse Robbins draws on his firefighting background to define web operations culture — the mindset, habits, and discipline that separate teams who handle incidents well from those who don't.
“You don't choose the moment, the moment chooses you. You only choose how prepared you are when it does.”