"Site Reliability Engineering"

Articles and mentions

Heavybit

Generative AI in DevOps and Incident Response: What the Experts Actually Think

I interviewed Nora Jones, Jeremy Edberg, Mandi Walls, and Brent Chapman on what generative AI actually does in incident response, and where humans have to stay in the loop.

“GenAI is good at confidently delivering text that is pleasant to read, but not always complete, or correct.”

The Confident Commit

DevOps is dead? Nope, it is maturing ft. Jesse Robbins

Podcast · 37:57

Rob Zuber hosted me on CircleCI's Confident Commit to push back on the 'DevOps is dead' narrative. The hard problems now are organizational, not technical.

♪ Apple Podcasts

Heavybit

What to Know About the Modern Incident Response Lifecycle

Heavybit's incident management guide quotes me on why teams only get good at incident response when they treat the whole lifecycle as one discipline.

“Teams only get good at this when they embrace the whole process and each of its steps.”

— Jesse Robbins

Gremlin

Fireside Chat with Jesse Robbins and Kolton Andrus • Failover Conf 2021

Video · 28:23

At Gremlin's Failover Conf 2021, Kolton Andrus and I covered GameDay origins at Amazon, the evolution of chaos engineering, and where reliability practices were headed.

▶ YouTube

Protocol

An oral history of #hugops: How tech's first responders built a culture of empathy

Protocol's oral history of #hugops.

“I've got to change the way that I approach this entirely and make it safe to experiment.”

— Jesse Robbins

O'Reilly Radar

Tim O'Reilly on Why We Started the Velocity Conference

Tim O'Reilly's 2013 retrospective on the origins of Velocity, where I was co-founder and conference chair. It tells the story of the gathering place we built for our community.

ACM Queue

Resilience Engineering: Learning to Embrace Failure

Jesse Robbins (Amazon), Kripa Krishnan (Google), and John Allspaw (Etsy) discuss how they built organizations that deliberately trigger failure to get stronger: powering off data centers, running 96-hour disaster simulations, and transforming blame cultures into learning cultures.

“You can't choose whether or not you're going to have failures — they are going to happen no matter what — but you can choose in many cases when you're going to learn the lessons.”

— Jesse Robbins

USENIX

GameDay: Creating Resiliency Through Destruction

Talk · 52:50

In this USENIX LISA'11 talk, Jesse Robbins explains GameDay: deliberately injecting failures into production systems to build organizational resilience before real outages happen.

“You don't choose the moment, the moment chooses you. You only choose how prepared you are when it does.”

— Jesse Robbins
▶ YouTube