
Web Operations: Keeping the Data on Time

O'Reilly Media, by John Allspaw and Jesse Robbins

Web Operations: Keeping the Data on Time. O'Reilly, June 2010. Front and back covers. (O'Reilly Media)

"The Web is changing the way we live and touches every person alive. As more and more people depend on the Web, they depend on us. Web Operations is work that matters."

— Jesse Robbins, from the foreword

John Allspaw and I co-edited the O'Reilly Web Operations book that defined the discipline. It collects essays from practitioners at Amazon, Google, and the companies that set the stage for DevOps.

John and I co-edited this book to give web operations a canonical text. The contributors were the people doing the work: practitioners from Amazon, Google, Flickr, and the companies that were inventing modern operations under load.

The book is a snapshot of where the discipline stood right before it got renamed DevOps. Reading it now, it is striking how much of what became standard practice was already in the room. We were just trying to write it down before it got lost.

John Allspaw and I co-edited Web Operations: Keeping the Data on Time, and O’Reilly published it in June 2010. It is a book of essays from people who were actually doing this work at the time, written for the people who would do it next.

The book grew out of the Velocity Conference community I cofounded at O’Reilly in 2008. Velocity gave the people running the largest sites on the web a place to compare notes in public for the first time. Web Operations is what those notes looked like in book form.

The contributors are the people who built the discipline. Theo Schlossnagle on web operations as a career. John Allspaw and Matt Massie on infrastructure and application metrics. Eric Ries on continuous deployment. Adam Jacob on infrastructure as code. Patrick Debois on monitoring. Brian Moon on dealing with unexpected traffic spikes. Paul Hammond on collaboration between development and operations. Baron Schwartz on relational databases at scale. Andrew Clay Shafer on agile infrastructure and the human side of operations, the cultural shift that would soon get the name DevOps. Alistair Croll, Heather Champ, Richard Cook, Mike Christian, Eric Florenzano, Justin Huff, Jake Loomis, Anoop Nagwani, and Sean Power on the rest of what it took to keep a real site running. Every contributor was practicing what they wrote about.

The thesis was that operating large systems is its own engineering discipline, not a chore tacked onto development. That position was contested at the time. It is now the consensus, and the lineage from this book runs through DevOps, the SRE books from Google, and the platform engineering work the CNCF formalized a decade later.

I am proud of how it came together and prouder of who I got to do it with.

My foreword to the book

It’s been over a decade since the first websites reached real scale. We were there then, in those early days, watching our sites growing faster than anyone had seen before or knew how to manage. It was up to us to figure out how to keep everything running, to make things happen, to get things done.

While everyone else was at the launch party, we were deep in the bowels of the datacenter racking and stacking the last servers. Then we sat at our desks late into the night, our faces lit with the glow of logfiles and graphs streaming by.

Our experiences were universal. Our software crashed or couldn’t scale. The databases crashed and data was corrupted, while every server, disk, and switch failed in ways the manufacturer absolutely, positively said it wouldn’t. Hackers attacked, first for fun and then for profit. And just when we got things working again, a new feature would be pushed out, traffic would spike, and everything would break all over again.

In the early days, we used what we could find because we had no budget. Then we grew from mismatched, scavenged machines hidden in closets to megawatt-scale datacenters spanning the globe filled with the cheapest machines we could find.

As we got to scale, we had to deal with the real world and its many dangers. Our datacenters caught fire, flooded, or were ripped apart by hurricanes. Our power failed. Generators didn’t kick in, or started and then ran out of fuel, or were taken down when someone hit the Emergency Power Off. Cooling failed. Sprinklers leaked. Fiber was cut by backhoes and squirrels and strange creatures crawling along the seafloor. Man, machine, and Mother Nature challenged us in every way imaginable and then surprised us in ways we never expected.

We worked from the instant our pagers woke us up or when a friend innocently inquired, “is the site down?” or when the CEO called scared and furious. We were always the first ones to know it was down and the last to leave when it was back up again.

Always.

Every day we got a little smarter, a little wiser, and learned a few more tricks. The scripts we wrote a decade ago have matured into tools and languages of their own, and whole industries have emerged around what we do. The knowledge, experiences, tools, and processes are growing into an art we call Web Operations.

We say that Web Operations is an art, not a science, for a reason. There are no standards, certifications, or formal schooling (at least not yet). What we do takes a long time to learn and longer to master, and everyone at every skill level must find his or her own style. There’s no “right way,” only what works (for now) and a commitment to doing it even better next time.

The web is changing the way we live and touches every person alive. As more and more people depend on the web, they depend on us.

Web Operations is work that matters.

— Jesse Robbins

The chapters

From John Allspaw’s preface, “How This Book Is Organized”:

  • Chapter 1, Web Operations: The Career by Theo Schlossnagle. What this field actually encompasses, and why the skills needed are gained by experience more than by formal education.
  • Chapter 2, How Picnik Uses Cloud Computing: Lessons Learned by Justin Huff. How Picnik.com deployed and sustained its infrastructure on a mix of on-premise hardware and cloud services.
  • Chapter 3, Infrastructure and Application Metrics by Matt Massie and John Allspaw. The importance of gathering metrics from both your application and your infrastructure, and considerations on how to gather them.
  • Chapter 4, Continuous Deployment by Eric Ries. The advantages of deploying code to production in small batches, frequently.
  • Chapter 5, Infrastructure as Code by Adam Jacob. An overview of the theory and approaches for configuration and deployment management.
  • Chapter 6, Monitoring by Patrick Debois. The various considerations when designing a monitoring system.
  • Chapter 7, How Complex Systems Fail by Dr. Richard Cook. His whitepaper on systems failure and the nature of complexity often found in web architectures, with web operations-specific notes added to the original.
  • Chapter 8, Community Management and Web Operations. John Allspaw’s interview with Heather Champ on how outages and degradations should be handled on the human side of things.
  • Chapter 9, Dealing with Unexpected Traffic Spikes by Brian Moon. Experiences with huge traffic deluges at Dealnews.com and what they did to mitigate disaster.
  • Chapter 10, Dev and Ops Collaboration and Cooperation by Paul Hammond. Places where development and operations can come together to enable the business, both technically and culturally.
  • Chapter 11, How Your Visitors Feel: User-Facing Metrics by Alistair Croll and Sean Power. Metrics that can be used to illustrate what the real experience of your site is.
  • Chapter 12, Relational Database Strategy and Tactics for the Web by Baron Schwartz. Common approaches to database architectures and some pitfalls that come with increasing scale.
  • Chapter 13, How to Make Failure Beautiful: The Art and Science of Postmortems by Jake Loomis. What makes or breaks a good postmortem and root cause analysis process.
  • Chapter 14, Storage by Anoop Nagwani. The gamut of approaches and considerations when designing and maintaining storage for a growing web application.
  • Chapter 15, Nonrelational Databases by Eric Florenzano. Considerations and advantages of using a growing number of “nonrelational” database technologies.
  • Chapter 16, Agile Infrastructure by Andrew Clay Shafer. The human and process sides of operations, and how agile philosophy and methods map (or not) to the operational space.
  • Chapter 17, Things That Go Bump in the Night (and How to Sleep Through Them) by Mike Christian. The various levels of availability and Business Continuity Planning (BCP) approaches and dangers.

People

John Allspaw
Former Etsy CTO, co-editor of Web Operations
Theo Schlossnagle
Founder of OmniTI (1997) and Circonus (2010); web operations veteran and longtime Velocity Conference chair
Baron Schwartz
MySQL performance expert; co-author of High Performance MySQL (O'Reilly); founder of VividCortex; former VP of Consulting at Percona
Eric Ries
Author of The Lean Startup
Andrew Clay Shafer
Co-founder of Puppet Labs; longtime DevOps community organizer and one of the originators of the DevOps movement
Patrick Debois
Founder of DevOpsDays, coined the term DevOps
Adam Jacob
Co-founder and former CTO of Chef
Alistair Croll
Co-author of Lean Analytics; chair of O'Reilly Strata, Cloud Connect, Startupfest, and FWD50; co-founder of Coradiant
Heather Champ
Former Director of Community at Flickr (2005-2010); co-author of Flickr's community guidelines; co-founder of JPG Magazine and Fertile Medium
Paul Hammond
Engineering leader at Flickr, Typekit, Adobe, and Slack; co-presenter of the seminal "10+ Deploys Per Day" Velocity talk with John Allspaw (2009)
Richard Cook
Physician, anesthesiologist, and system safety researcher (1953-2022); author of "How Complex Systems Fail" (1998); co-founder of Adaptive Capacity Labs with John Allspaw and David Woods
Mike Christian
Eric Florenzano
Django community contributor and instructor; co-founder of Convore (Y Combinator W11)
Justin Huff
Early engineer at Picnik who built the backend infrastructure; later a Google engineer after the Picnik acquisition
Jake Loomis
Former VP of Production Engineering at Yahoo!; contributing author to Web Operations (O'Reilly, 2010)
Matt Massie
Open-sourced Ganglia in 2000 while at UC Berkeley; co-author of "Monitoring with Ganglia" (O'Reilly) and contributor to Web Operations (O'Reilly, 2010)
Brian Moon
Principal Software Architect at DealNews; contributing author to Web Operations (O'Reilly, 2010)
Anoop Nagwani
Storage engineering practitioner; contributing author of the Storage chapter in Web Operations (O'Reilly, 2010)
Sean Power
Co-founder of Musical AI
O'Reilly Media
Technology media company and publisher

Since this came out…

  1. The CNCF Platforms White Paper from the App Delivery TAG names Platform Engineering as a discipline, extending the operations lineage this book started.
  2. Google publishes The Site Reliability Workbook, the practitioner companion to the SRE book.
  3. Incident Management for Operations by Rob Schnepp, Ron Vidal, and Chris Hawley is published by O'Reilly with a foreword by Jesse Robbins.
  4. Google publishes the Site Reliability Engineering book, codifying SRE as a named discipline at scale.
