John and I co-edited this book to give web operations a canonical text. The contributors were the people doing the work: practitioners from Amazon, Google, Flickr, and the other companies that were inventing modern operations under load.

The book is a snapshot of where the discipline stood right before it got renamed DevOps. Reading it now, it is striking how much of what became standard practice was already in the room. We were just trying to write it down before it got lost.

John Allspaw and I co-edited Web Operations: Keeping the Data on Time, and O’Reilly published it in June 2010. It is a book of essays from people who were actually doing this work at the time, written for the people who would do it next.

The book grew out of the Velocity Conference community I cofounded at O’Reilly in 2008. Velocity gave the people running the largest sites on the web a place to compare notes in public for the first time. Web Operations is what those notes looked like in book form.

The contributors are the people who built the discipline. John Allspaw on capacity planning and the relationship between development and operations. Theo Schlossnagle on web operations as a career. Brian Moon on dealing with unexpected traffic. Baron Schwartz on databases under load. Eric Ries on continuous deployment. Andrew Clay Shafer and Patrick Debois on the cultural shift that would soon get the name DevOps. Adam Jacob on infrastructure as code. Alistair Croll, Heather Champ, Paul Hammond, Richard Cook, Mike Christian, Eric Florenzano, Justin Huff, Jake Loomis, Matt Massie, Anoop Nagwani, and Sean Power on the rest of what it took to keep a real site running. Every contributor was practicing what they wrote about.

The thesis was that operating large systems is its own engineering discipline, not a chore tacked onto development. That position was contested at the time. It is now the consensus, and the lineage from this book runs through DevOps, the SRE books from Google, and the platform engineering work the CNCF formalized a decade later.

I am proud of how it came together and prouder of who I got to do it with.

My foreword to the book

It’s been over a decade since the first websites reached real scale. We were there then, in those early days, watching our sites growing faster than anyone had seen before or knew how to manage. It was up to us to figure out how to keep everything running, to make things happen, to get things done.

While everyone else was at the launch party, we were deep in the bowels of the datacenter racking and stacking the last servers. Then we sat at our desks late into the night, our faces lit with the glow of logfiles and graphs streaming by.

Our experiences were universal. Our software crashed or couldn’t scale. The databases crashed and data was corrupted, while every server, disk, and switch failed in ways the manufacturer absolutely, positively said it wouldn’t. Hackers attacked, first for fun and then for profit. And just when we got things working again, a new feature would be pushed out, traffic would spike, and everything would break all over again.

In the early days, we used what we could find because we had no budget. Then we grew from mismatched, scavenged machines hidden in closets to megawatt-scale datacenters spanning the globe filled with the cheapest machines we could find.

As we got to scale, we had to deal with the real world and its many dangers. Our datacenters caught fire, flooded, or were ripped apart by hurricanes. Our power failed. Generators didn’t kick in, or started and then ran out of fuel, or were taken down when someone hit the Emergency Power Off. Cooling failed. Sprinklers leaked. Fiber was cut by backhoes and squirrels and strange creatures crawling along the seafloor. Man, machine, and Mother Nature challenged us in every way imaginable and then surprised us in ways we never expected.

We worked from the instant our pagers woke us up or when a friend innocently inquired, “is the site down?” or when the CEO called scared and furious. We were always the first ones to know it was down and the last to leave when it was back up again.

Always.

Every day we got a little smarter, a little wiser, and learned a few more tricks. The scripts we wrote a decade ago have matured into tools and languages of their own, and whole industries have emerged around what we do. The knowledge, experiences, tools, and processes are growing into an art we call Web Operations.

We say that Web Operations is an art, not a science, for a reason. There are no standards, certifications, or formal schooling (at least not yet). What we do takes a long time to learn and longer to master, and everyone at every skill level must find his or her own style. There’s no “right way,” only what works (for now) and a commitment to doing it even better next time.

The web is changing the way we live and touches every person alive. As more and more people depend on the web, they depend on us.

Web Operations is work that matters.

— Jesse Robbins

The chapters

From John Allspaw’s preface, “How This Book Is Organized”:

Chapter 1, Web Operations: The Career by Theo Schlossnagle. What this field actually encompasses, and why the skills needed are gained by experience more than by formal education.
Chapter 2, How Picnik Uses Cloud Computing: Lessons Learned by Justin Huff. How Picnik.com deployed and sustained its infrastructure on a mix of on-premise hardware and cloud services.
Chapter 3, Infrastructure and Application Metrics by Matt Massie and John Allspaw. The importance of gathering metrics from both your application and your infrastructure, and considerations on how to gather them.
Chapter 4, Continuous Deployment by Eric Ries. The advantages of deploying code to production in small batches, frequently.
Chapter 5, Infrastructure as Code by Adam Jacob. An overview of the theory and approaches for configuration and deployment management.
Chapter 6, Monitoring by Patrick Debois. The various considerations when designing a monitoring system.
Chapter 7, How Complex Systems Fail by Dr. Richard Cook. His whitepaper on systems failure and the nature of complexity often found in web architectures, with web operations-specific notes added to the original.
Chapter 8, Community Management and Web Operations. John Allspaw’s interview with Heather Champ on how outages and degradations should be handled on the human side of things.
Chapter 9, Dealing with Unexpected Traffic Spikes by Brian Moon. Experiences with huge traffic deluges at Dealnews.com and what they did to mitigate disaster.
Chapter 10, Dev and Ops Collaboration and Cooperation by Paul Hammond. Places where development and operations can come together to enable the business, both technically and culturally.
Chapter 11, How Your Visitors Feel: User-Facing Metrics by Alistair Croll and Sean Power. Metrics that can be used to illustrate what the real experience of your site is.
Chapter 12, Relational Database Strategy and Tactics for the Web by Baron Schwartz. Common approaches to database architectures and some pitfalls that come with increasing scale.
Chapter 13, How to Make Failure Beautiful: The Art and Science of Postmortems by Jake Loomis. What makes or breaks a good postmortem and root cause analysis process.
Chapter 14, Storage by Anoop Nagwani. The gamut of approaches and considerations when designing and maintaining storage for a growing web application.
Chapter 15, Nonrelational Databases by Eric Florenzano. Considerations and advantages of using a growing number of “nonrelational” database technologies.
Chapter 16, Agile Infrastructure by Andrew Clay Shafer. The human and process sides of operations, and how agile philosophy and methods map (or not) to the operational space.
Chapter 17, Things That Go Bump in the Night (and How to Sleep Through Them) by Mike Christian. The various levels of availability and Business Continuity Planning (BCP) approaches and dangers.