Jesse Robbins on DevOps as Business Alignment

Thoughtworks by Jez Humble · August 17, 2012 · Video · 33:39

"The role of operations is the role of enabling as much awesome as you can."

— Jesse Robbins

Jez Humble interviewed me at Thoughtworks on DevOps as business alignment: developers, operations, and the company shipping faster without giving up reliability.

Watch at Thoughtworks

Jez Humble interviewed me for Thoughtworks in 2012 as part of his continuous delivery video series, at the moment DevOps was being widely adopted but was still misunderstood as a collaboration practice between two teams. My argument was that it is a business alignment problem first and a collaboration problem second. The work of operations is to enable as much of the company's ambition as the systems can hold.

DevOps as business alignment

The starting point is the incentive problem. Developers create value when code reaches production. Operations teams were traditionally rewarded for keeping production unchanged. Most companies treated both groups as support functions instead of seeing delivery and operations as part of the same value stream. DevOps is the shift where operations becomes a way to help the business move faster without giving up reliability.

“Code that is written and not deployed is worthless.”

Operations as enablement

I walked Jez through my own pivot from gatekeeper to platform builder. At Amazon, I blocked a launch because I was worried about availability. Neil Roseman overrode me. The site went down for a couple of days. The stock price and order rates went up. That was when I realized the better role for operations was to make launches safer and more repeatable, not harder to do.

“The role of operations is the role of enabling as much awesome as you can.”

Chef and infrastructure as code

Chef sits inside the larger move toward programmable infrastructure. EC2, Rackspace, OpenStack, VMware, and private-cloud APIs made it possible to provision infrastructure through software. Chef was the glue that let teams describe the desired state of infrastructure and applications together. Infrastructure as code is the moment infrastructure becomes part of the application: configurable, repeatable, testable, close enough to the product that teams can move without waiting on manual provisioning.
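The "desired state" idea can be illustrated with a minimal sketch. This is plain Python, not Chef's Ruby DSL or API: the `Node` class and `converge` function are hypothetical names invented for illustration, and real tools manage actual packages and services rather than in-memory sets. The point is only that the tool compares declared state with current state and acts on the difference, so repeated runs are safe.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a server: installed packages and running services."""
    packages: set = field(default_factory=set)
    services: set = field(default_factory=set)


def converge(node, desired_packages, desired_services):
    """Bring the node to the desired state and return the actions taken.

    Only differences are acted on, so a second run with the same
    inputs does nothing (idempotence).
    """
    actions = []
    for pkg in sorted(desired_packages - node.packages):
        node.packages.add(pkg)
        actions.append(f"install {pkg}")
    for svc in sorted(desired_services - node.services):
        node.services.add(svc)
        actions.append(f"start {svc}")
    return actions


node = Node(packages={"git"})
first_run = converge(node, {"git", "mysql"}, {"mysql"})
second_run = converge(node, {"git", "mysql"}, {"mysql"})
print(first_run)   # ['install mysql', 'start mysql']
print(second_run)  # []
```

The second run returns an empty list because the node already matches the declaration, which is what makes this style of provisioning repeatable.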

EC2 and the cloud operating model

I told Jez the story of EC2’s origins inside Amazon, including my own first reaction: I tried to block it. Chris Pinkham and Christopher Brown built the project away from Seattle, in Cape Town, and I gave them a hard time about exposing what I considered my operational perimeter to the public internet. EC2 changed who could ask for infrastructure, how fast they could get it, and what operations teams had to become.

Simple services, fast feedback

The cloud architecture advice from the back half still reads cleanly. Build small services. Keep APIs simple. Push complexity up the stack. Avoid monoliths that force every scaling problem into one place. Prefer rough consensus with running code over elaborate first designs. If the system is split into clear services, the teams can own, operate, and improve those services. Architecture is a way of making responsibility visible.
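As a rough illustration of "small services, simple APIs," here is a sketch of one tiny service that returns JSON and keeps its older API version stable while adding a newer one, in line with the transcript's point that API versioning matters more than software versioning. The widget data, paths, and function names are hypothetical, and the routing is a plain function rather than a real HTTP server.

```python
import json

# Hypothetical in-memory data for a single small service.
WIDGETS = {"foo": {"name": "foo", "size": 3}}


def widget_v1(widget):
    """v1 response shape: kept stable for existing clients."""
    return {"name": widget["name"]}


def widget_v2(widget):
    """v2 adds a field without changing what v1 clients receive."""
    return {"name": widget["name"], "size": widget["size"]}


HANDLERS = {"v1": widget_v1, "v2": widget_v2}


def handle(path):
    """Route '/<version>/widgets/<name>' and return (status, JSON body)."""
    parts = path.strip("/").split("/")
    if len(parts) != 3 or parts[1] != "widgets" or parts[0] not in HANDLERS:
        return 404, json.dumps({"error": "not found"})
    widget = WIDGETS.get(parts[2])
    if widget is None:
        return 404, json.dumps({"error": "not found"})
    return 200, json.dumps(HANDLERS[parts[0]](widget))


print(handle("/v1/widgets/foo"))
print(handle("/v2/widgets/foo"))
```

Clients pinned to `/v1/` keep getting the same answer after `/v2/` ships, so the service can change internally without coordinating every consumer.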

Full Transcript (AI-generated)

DevOps is a big deal because it is a cultural revolution aligning three groups that have historically had pretty big barriers between them: developers, operations teams, and the business. Developers write code that allegedly creates value for the business by being deployed to production. Historically, operations, including my own history, has been about keeping a website up, or keeping what is working currently from changing. That is a very intense conflict between the two.

What we are seeing now is that you can deploy from development to production in a minute, and you can do that all day long. That means businesses are getting value faster from the new stuff development is creating. Infrastructure is scaling now because of infrastructure as a service like Amazon EC2, Rackspace, and private cloud things that are getting built out. This revolution is being driven by the fact that you can build organizations where operations' job is to help drive the business faster. That means helping developers write code that gets into production and creates new value for the business as fast as they possibly can.

Code that is written and not deployed is worthless. It is just wasting time. That is DevOps. That is essentially the force driving this change. It is creating a competitive advantage for businesses that are able to have those teams working together and deploy to production really quickly, get value out of that, and maintain availability and the other things required to be successful. This is a new capability where speed is an advantage. It is enabled by tools, process, and technology that were not there before, but were required to succeed on the web.

I will start by telling you when I had this awakening. There is an executive named Neil Roseman. I called him the VP of Awesome. I still call him the VP of Awesome, although I think he is the GM of Zynga now. Neil and I had a fight. I blocked a launch of his because I was afraid we had not gone through all the checks. I was concerned about performance, and it was likely to take our website out. Neil and I had a conversation in which he said, look, shipping the code is more important and will result in a better share price and more business benefit, taking that risk, than not shipping the code. You are never going to be ready. It is never going to be quite good enough. You are never going to be perfect. Ship it and figure out how to make it work. What he actually said was: website down, stock price up, customers happy quickly.

That was my religious moment. That was the moment I realized I was on the wrong side of the problem. What I needed to do was enable people like Neil and others to ship faster and more repeatably without posing as big a risk, and without having these big projects that were dangerous. For me, that was the shift when I went from being operations, the preventer of what we would classically have called those developers who do not care about availability, to someone who saw it as my job to make a very careful platform that they are able to stand on and do great things.

In every company, there needs to be a compelling event for this kind of change. It is not going to happen if there is not a business driver. I get asked a lot by Opscode enterprise customers, how do we drive this kind of agility to our legacy systems or other environments? They have a lot of reasons why they cannot do this. What I tend to say is, you have to have some reason why you can, and why it is worth doing. When I talk about this with larger organizations, I focus on the places where they have a compelling reason to make a cultural change. If you have a large organization that is used to doing things a certain way, and it is working pretty well, and the business is working just fine, and the software is working just fine, there is not a lot of reason for change.

I have come to have a personal mantra: do not fight stupid. Focus on making more awesome instead. A friend of mine said, yeah, because stupid fights itself. Once you understand that there is tremendous transformative potential available, and that there is a better way to work together to achieve objectives that are important to all of us, it is really hard to be stuck in a large organization that does not want to change. It is hard to be that change agent. I have done that all my life.

Do not fight it. Instead, look for those little tiny places where you are able to grab on and say, we have permission to try something new. Once you do it and it works, tell that story. Compared to another group, people realize that these teams are doing something different, and it is something better. It is not that they are better people. It is that the way they approached the problem is working better. That is how big organizations begin to change. Find the place. Do not fight from the top down. Find a bottom-up approach that will work. Make sure you capture metrics about how it was different, and make a compelling story you can repeat again and again, and more importantly, others can repeat again and again. Tie that back to things important to the business, and suddenly it starts to spread from the base up.

I tend to focus on developer productivity as the bar. The metrics I would use are measures of how quickly you are able to write something new and push it into production, and what the success of that rate of change is. It gives you a simple number that tells a lot of stories. We wrote a new project and it took three weeks, or three months, to get it from written to deployed. There is a whole bunch of stuff to work back from there. Within an organization you will be able to tell that powerfully when one or two groups can do it in a week, a day, or a minute. The differences between those will tend to show where the bottlenecks are. I would start with the time between when software is ready and when software is deployed to production. I do not have a benchmark against which to compare. Instead, look for places where you are seeing something obviously different.

On the operations side, there are some interesting metrics. How long does it take to get a server from not existing to deployed and ready for production? If you are looking at EC2 or Rackspace, you are looking at minutes. If you have an effective virtualization model within your company and a big virtualization fabric, it should be minutes again. If you are having to deploy new hardware reactively, buy it, and rack it, it is going to be weeks if you are great at it, and probably months. That is an organization-shifting question: how long does it take to make the resource we need in order to deliver new business value? If it is months, you have a problem. It does not matter how great everything else is going to work. There is a big bottleneck, and you are not going to move faster than that bottleneck. If it is minutes, great. You are going to have another problem, which is how to make sure it happens in a reasonably predictable and appropriately controlled way.

Chef is Opscode's open-source infrastructure automation framework. We created it in 2008. Opscode makes its money by providing a hosted Chef service and a private version of the hosted Chef service. We are open source. That is how we make our money.

In order to make these processes faster, in order to make DevOps work, it requires a new class of tools and a new kind of resources. One requirement is a data-center-level API. We get better at that every day. EC2 is awesome. Rackspace is awesome. OpenStack is awesome. VMware has stuff that is awesome. The point is that you need to make it possible to programmatically provision and manage infrastructure resources without lots of manual intervention. Once you have those core APIs, you have an opportunity to provide a new kind of glue that manages them. Chef is one of the tools that can do that. It starts to blur the line between the application and the infrastructure.

For me, that is what infrastructure as code is really about. There are lots of attributes you want: scalability, repeatability, manageability. What really matters is the moment when you are able to focus not just on what a server does, but what your infrastructure, inclusive of an application, does. With tools like Chef, we enable sharing common components. We have a cookbooks library with hundreds of things ranging from MySQL to Git to application stacks like Java, Tomcat, and Ruby on Rails. That matters because we want to lower the amount of time people spend thinking about how to provision things, and increase the time they spend building applications, deploying those applications, and realizing the benefit.

Infrastructure as code, at its core, is turning the infrastructure into just another aspect of your application. It should just be another library. Ultimately, people will wrap Chef into their libraries at the beginning and say, here is how we want to configure things, here are the services we want to connect to, and here is the state we expect things to be in. That results in a profound organizational shift.

One customer, Cycle Computing, enabled large bio companies to do protein folding for drug discovery and cancer research. They wrote software that manages protein folding systems. It is very computationally intensive, and they use Chef to manage their scaling process. They were able to spin up 10,000-core protein-folding arrays on EC2 in minutes and solve the problem they were trying to solve. To me, that is the profound component of infrastructure as code. The app is an app that must consume thousands and thousands of systems in order to solve the problem it was written to solve. You should not expect a scientist to think about how to spin up and spin down 10,000 cores. You want them to focus on the drug discovery part. When you are developing that application to serve your target user, the infrastructure follows you along automatically. Assuming you are able to get those resources from Amazon, Rackspace, or an internal cloud environment, you can build an application-specific supercomputer for a couple of hours, run a job, find the answer to an important question, spin it all down, and do something else with it. That requires you to manage every aspect of your infrastructure programmatically.

Every decade there is a shift back and forth from mainframe computing to some distributed version. This does sound like mainframe computing because in many cases it reflects what people going back to the 1960s were talking about in terms of utility computing. I do think it looks a little like that, although with the flexibility of internet scale. Mainframe computing was designed with an assumption of a closed system. You had so many compute resources and so much disk space available, and you engineered to tightly optimize for the resources that would be available over a very long term. In the coming environment, people assume much more frequently that they have far greater resources at their disposal. Those resources are not infinite by any means. I think it is funny when people call them infinite, because they are not. What matters is that the system is constantly expanding. It is an open system that gets bigger and more powerful every day and is able to span other environments as well. When we think about the way the mainframe worked, that was a closed model. The problems you could solve were the problems solvable with resources that were not going to get bigger. Today, we assume the system is getting bigger and richer constantly.

The role of operations is the role of enabling as much awesome as you can. Operations has always been the place where the expertise around everything that happens in the system lived. My hiring interview at Amazon was: tell me in as much detail as you can everything that happens after you type www.amazon.com in a browser and press return. That is an answer that should take an hour, and you still will not have covered everything. That is the operations engineer of the future and present in many cases: a person who understands the whole stack, how the components connect together, and who provides a core service to the development organization that enables them. They manage deployment fabrics, manage compute resources, understand the glue, and provide clear escalation when developers get into trouble and need to know how performance systems work, how internet routing works, how DNS works, why something is not working, or how it can be built better for higher performance or higher availability.

Ops is about to get really awesome for people. Traditionally our job in operations was, I got 167 pages in a day, and I went days without sleep. That seemed important to my value. Then it became: how many pages did I reduce? How many times did I enable a developer or a whole team to be successful? How much more efficiently did I help my organization scale? The shift in operations results in people doing a whole lot less of the busy work they hated and a whole lot more of a new class of work, which is much more important and much more complicated. It takes more time to master, but it makes a much bigger difference. You understand the glue, and you understand how all the pieces connect together. That skill is going to be crucial as we go into the next decade.

The backstory for EC2 was that a group of principal engineers wrote a series of papers internally about primitive services that could be created at Amazon. One was something like a utility computing service. It looked more like an internal platform as a service, and toward the end there was text saying one of the things we could do is potentially vend these services out to the world. Over the next couple of months there was further discussion of it. Chris Pinkham, who was the VP of infrastructure at the time, wanted to go home to South Africa. He got an offer: why don't you take this one project and build that in South Africa? It was a remote development center, cut off from the mother ship. From everyone's perspective, Chris Pinkham and Opscode's now CTO Christopher Brown went to South Africa and came back essentially a year later with what would be EC2. They had taken this initial idea, which was very brief, and come up with a far more interesting solution.

EC2 was a skunk works project, if anything. It was built outside of company control, so much so that when Christopher Brown asked me for resources from my ops team, not only did I deny him resources, I actively tried to kill EC2 because I was that ops guy. There are many people who claim to have been part of Amazon's early success. I can only claim to have tried to kill one of Amazon's early successes.

They created EC2. They demoed it internally, but it was conceived as something for customer use from day one. There are many myths about it, some promoted by Amazon and some not. Amazon announced at Velocity last week that the Amazon websites, the Amazon retail business, had finally moved to EC2 as of November 10 of last year. EC2 has always been a standalone entity. It has grown substantially and they have innovated tremendously, but the seed that allowed it to be created was a wacky idea thrown around in an early proposal, and a desire to keep a key engineer from leaving the company. It would not have happened if they had been trying to do it inside the walls of Amazon in Seattle. They really had to go very far away to pave the way for it to work. At that time, Amazon was working very hard on profitability and infrastructure efficiency, so there were many contradictory priorities. They had to get special resources that were out of band because people like me, running availability programs and needing additional capacity, certainly did not want an infrastructure budget with this wacky beta project.

The success of EC2 and S3 transformed aspects of Amazon's operating model. Jeff Bezos and the whole company are very comfortable operating businesses that have extraordinarily low margins, businesses that require extreme scale in order to be effective. That is why EC2 is successful at Amazon. Culturally, the company knows how to do that. The ability to innovate in web services the way it did was not accidental, because it fits into the Jeff Bezos model that everyone should experiment and compete internally. It certainly was not a major strategic initiative in the way you would think of a different kind of company launching it. It was an experiment that proved successful, and then a lot of resources were put behind it.

Chris Pinkham is a founder and CEO of a company called Nimbula. Christopher Brown was one of our first employees at Opscode and is our CTO. What they set out to do, I think they achieved at Amazon, and they have since gone on to build new things that hopefully provide a similar level of crucial internet innovation.

One of the first practices for architecting cloud-based systems is service-oriented architecture. It is the reality of successful scalable systems, particularly systems that focus on providing the simplest possible thing at each level of the stack. You want to push complexity up the stack. It is far better to build a distributed system that scales horizontally than to build applications that become monolithic and are very hard to scale. After you have gotten the core pieces like automation right, the job of the architect is to figure out how to split the teams organizationally so that when your widget server that does widget foo and your widget server that does widget bar breaks, one of them broke and you know which one. You are able to service those components and work on them individually.

Service-oriented architecture was a huge buzzword over the past decade. What we have ended up building now does not use SOAP or complicated XML ontologies. We focus on very simple systems: RESTful interfaces, JSON as a data format, and composable primitive services where you push complexity higher into the application. That is the basic pattern we see for successful systems. You do not want large-scale monolithic databases that have all your data. You would rather break those up into smaller databases so you can scale the ones you need to, and put appropriate resources around the other ones. There is a lot of excitement about tools like Hadoop and NoSQL data stores, but the reality is that MySQL gets you a really long way in many cases. Many people focus on trying to use the newest tool, which is a poor fit. It is better to break things into small pieces that can scale horizontally, or scale vertically if they have to, as a first step. It is better to make a lot of little things a little bit bigger, or a couple of little things bigger, than to have to do that across the board.

With a real service-oriented architecture, you get reuse of components. You have to focus on API versioning. Software versioning matters less than API versioning. You want to provide consistent answers over time. One of the big challenges is load balancing, and once you start building these new kinds of architectures, you need a coherent answer. The other thing that is crucial to get right is your messaging strategy internally: how you pass messages around.

Simple is better than very, very complex. There has been a lot of focus, particularly for enterprise architects, on building complicated diagrams, processes, and event flows. I am sure that was all time and energy well spent, and also a lot of the successful projects have focused on simplicity and the ability to iterate rapidly. Do not try to get everything into the V1 spec of a thing. Take more of a rough consensus and running code approach. That is a winning pattern when you are building internet-scale systems, when you are building enterprise-scale systems, and it is a good place to start.

Also Mentioned

Jez Humble

Co-author of Continuous Delivery and Accelerate

Martin Fowler

Chief Scientist at Thoughtworks; author of Refactoring and Patterns of Enterprise Application Architecture; signatory of the Agile Manifesto

Eric Ries

Author of The Lean Startup

Elisabeth Hendrickson

Director of Quality Engineering for Cloud Foundry at Pivotal; author of "Explore It!"; Gordon Pask Award recipient from the Agile Alliance

John Allspaw

Former Etsy CTO, co-editor of Web Operations

Thoughtworks

Pearson

Black Girls Code

Opscode

Creator of Chef (renamed to Chef Software in 2013)

Chef

Infrastructure automation platform

Amazon Web Services

Amazon EC2

Amazon's Elastic Compute Cloud, launched in 2006

Rackspace

OpenStack

Topics

Jesse Robbins · Thoughtworks · Jez Humble · DevOps · Continuous Delivery · Chef · Infrastructure as Code · Amazon Web Services · Operations

More Mentions

Resilience Engineering: Learning to Embrace Failure

September 12, 2012

Jesse Robbins (Amazon), Kripa Krishnan (Google), and John Allspaw (Etsy) discuss how they built organizations that deliberately trigger failure to get stronger: powering off data centers, running 96-hour disaster simulations, and transforming blame cultures into learning cultures.

“You can't choose whether or not you're going to have failures — they are going to happen no matter what — but you can choose in many cases when you're going to learn the lessons.”

— Jesse Robbins

O'Reilly Velocity Conference

Changing Culture & Being a Force for Awesome

June 28, 2012 · Video · 34:28

My 2012 Velocity talk on changing engineering culture from the inside. Start small, build champions, use metrics to create confidence, exploit compelling events.

“Don't fight stupid. Focus on where you can make more awesome.”

— Jesse Robbins