Operations Is a Competitive Advantage | Jesse Robbins, O'Reilly Radar (2007)

A note from Jesse

This post started it all. I wrote it for O’Reilly Radar in October 2007 after a summit where I kept making the same argument over lunch: operations is not a cost center. It is a competitive advantage and occasionally a strategic weapon. The original URL redirects now, so I am preserving it here.

I had no idea what this post would set in motion. John Willis read it and later credited it as one of the three threads that created DevOps. Luke Kanies showed up in the comments and introduced me to Adam Jacob. That introduction led to Chef. The comment thread reads like a who’s who of the people who built the modern infrastructure stack: John Allspaw, Steve Loughran, Peter van Hardenberg.

The tools mattered. The movements they created mattered more.

My lunchtime conversations at the Summit centered around operations as a competitive advantage (and occasionally a “strategic weapon”). This advantage is the ability to consistently create and deploy reliable software to an unreliable platform that scales horizontally.

Many people think of operations as “a bunch of boring work… which I’m hoping someone else is doing.” It often takes less time to set up a development environment than the tools and infrastructure needed to test, deploy, monitor, and scale new software. The survival of most projects depends on working software, at least initially, and so if there is money or time many people will spend it on development. Unfortunately, people say they will “figure that ops stuff out soon”, but what they mean is “when we’re totally screwed!!!” It doesn’t have to be that way.

Side-by-side chart comparing Traditional Operations versus Secret Sauce Operations over 12 weeks, showing dramatic reduction in manual hours when infrastructure is automated early

Consider two Web 2.0 startups scaling to 20 systems during their first three months. The first team starts writing software and installing systems as they go, waiting to deal with the “ops stuff” until they have an “ops person”. The second team dedicates someone to infrastructure for the first few weeks and ramps up from there. They won’t need to hire an “ops person” for a long time and can focus on building great technology.

In my experience it takes about 80 hours to bootstrap a startup. This generally means installing and configuring an automated infrastructure management system (Puppet), version control system (Subversion), continuous build and test (frequently CruiseControl.rb), software deployment (Capistrano), and monitoring. Once this is done the “install time” is reduced to nearly zero and requires no specialized knowledge. This is the first ingredient in “Operations Secret Sauce”.

This kind of scalability becomes really interesting when you find yourself suddenly popular, as iLike did when it launched its Facebook app and had to scale up fast. From their blog:

In our first 20 hours of opening doors we had 50,000 users sign up, and it is only accelerating. (10,000 users joined in the first 12 hrs. 10,000 more users in the next 3 hrs. 30,000 more users in the next 5 hrs!!)

We started the system not knowing what to expect, with only 2 servers, but ready with backup. Facebook’s rabid userbase chewed up our 2 servers almost instantly. We doubled our capacity to catch up. And then we doubled it again. And again. And again. Oh crap, we ran out of servers!!

We just emailed everybody we know across over a dozen Bay Area startups, corporations, and venture firms in a desperate plea to find spare servers so we can triple our capacity for the continued onslaught. Tomorrow we are picking up over 100 servers from different companies to have them installed just to handle the weekend’s traffic.

Not being able to acquire hardware fast enough is by far a better problem than not being able to install it.

Are any VCs out there including effective operations in their due diligence? Are startups incorporating this in their pitch?

Update: Luke Kanies points out Adam Jacob’s post about implementing Puppet for iLike. (Disclosure: I’m discussing collaboration with Adam’s company, HJK Solutions.) Adam wrote:

Puppet enables us to get a huge jump-start on building automated, scaleable, easy to manage infrastructures for our clients. Using Puppet, we:

Automate as much of the routine systems administration tasks as possible.

Get 10 minute unattended build times from bare metal. Puppet takes it the rest of the way. It’s down to two and a half minutes for Xen.

Bootstrap our clients’ production environments while building their development environment. Because we are expressing the infrastructure at a higher level, when it comes time to deploy your production systems, it’s really a non-event.

Cross-pollinate between clients with similar architectures. When we solve a problem for one client, we’ve effectively solved it for the others.

Sounds good to me.

Update #2: John Allspaw of Yahoo/Flickr fame has great commentary on procurement and capacity management challenges for successful startups.

From the original comment thread

Several of these people would go on to shape the DevOps movement.

John Willis wrote:

There are some really interesting companies working around S3/EC2. Now you add a little Puppet with elastic computing cloud with a touch of services, now you got yourself a VC play.

Luke Kanies, creator of Puppet, wrote:

This article isn’t clear on it, but iLike actually used Puppet to scale with all of those new machines. (I’m the author of Puppet, but iLike isn’t a client of mine and I’m not the author of that post.)

John Allspaw, Yahoo/Flickr, wrote:

IMHO, bringing up ops considerations early in the process of product design should be a priority. Having at least some awareness of operational constraints can pay off significantly later on.

I might even add that while not being able to acquire hardware fast enough is by far a better problem, streamlining your procurement process should also be considered part of the whole system. We’ve had to learn a bit of that at Flickr, just like iLike. :)

As usual: great post, Jesse.

Steve Loughran, SmartFrog team, wrote:

Deployment should be part of every project, be it startup or in-house. If you can bring up a system and deploy to it during the build, then you get to do functional testing on it. If your app’s diagnostics are in a form that the ops team can use and understand, then everyone benefits.

I think where startups are special is there is less of a barrier between ops and dev; ideally: none.

Peter van Hardenberg, founder of 3Tera, wrote:

A software startup should be able to focus on the main value they provide and how to sell it better.

Note to VCs: how long are you going to have the same 1MM or so spent in each startup for the same thing, namely operations? Find a way to not reinvent the wheel for each company.

Operations Is a Competitive Advantage (Secret Sauce for Startups!)

From the original comment thread

Further Reading

Also Mentioned

More Mentions

Understanding Operations Culture (Part 1)