DevOps Culture Hacks: Infecting your Boss & your Business with Awesome

Jesse Robbins

"Don't fight stupid, make more awesome."

— Jesse Robbins

The original DevOps culture hacks talk at DevOpsDays Boston 2011. Jesse Robbins shares the formula for changing engineering culture from the inside, drawn from his years as Amazon's Master of Disaster.

This is the first time I presented the full culture hacks framework to a room. DevOpsDays Boston, March 2011. No slides. No video. Just me talking through the pattern I had figured out the hard way at Amazon.

We did not call it DevOps at the time. We did not have a word for it. I had been the stereotypical evil, nasty, mean ops guy. I took every outage personally. I had a record of 167 pages for heightened severity incidents in a single 24-hour period. I had multiple stretches of working 72 hours or more recovering from outages. People were afraid of me. I was proud of that, which tells you something about where my head was.

The story that made the room understand the problem was about Neil Roseman. I called Neil the VP of Awesome. Every cool project at Amazon seemed to be under Neil. Kindle. Search Inside the Book. A bunch of others. We had a battle over a deploy that I knew would take the site down. I did what every good ops person would do. I said absolutely not. Neil overrode me. He said, “The website may go down, but the stock price will go up.” The site went live, and a few seconds later it crashed. Two days of chaos. And the stock and order rates went up.

At year end, I was penalized for the outage and for getting in the way of development. The dev teams were rewarded for shipping. That is the fundamental disconnect. Penalized for something out of my control. Rewarded for deploying and creating value. That misalignment of incentives is the root cause of most of the stupid things organizations do.

I became so famous for saying “No” that I would sign the Amazon launch posters with a big “No” and a scribble. It was the only value I could create. That was not progress.

So I figured out the formula. Five steps, each building on the last.

Start small. Find the smallest group of people who are already excited. Call it an experiment. Do not trigger the organizational immune system. I learned this in the fire service. I always tell people I will take 100 percent of the blame for whatever goes wrong as long as we make space to try.

Create champions. Get your boss on board first. Give everyone else the credit. You can accomplish anything you want so long as you do not require credit or compensation. At Amazon, I created the Call Leader Program to train senior people to run high-severity incidents. It became a high-status thing. Managers wanted in.

Use metrics to build confidence. Find a number that supports your change and use it ruthlessly. Tell the story with data. Have your champions evangelize on your behalf.
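The metrics step can be made concrete with a small sketch. Assuming a hypothetical incident log of (detected, resolved) timestamps pulled from your paging or ticketing system, this computes MTTR and MTBF, the two numbers the talk contrasts later; the data here is invented for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected, resolved) timestamp pairs.
# In practice these would come from your paging/ticketing system.
incidents = [
    (datetime(2011, 1, 3, 2, 15), datetime(2011, 1, 3, 2, 45)),
    (datetime(2011, 1, 9, 14, 0), datetime(2011, 1, 9, 16, 30)),
    (datetime(2011, 2, 1, 8, 5),  datetime(2011, 2, 1, 8, 20)),
]

def mttr(incidents):
    """Mean time to recover: average duration from detection to resolution."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)

def mtbf(incidents):
    """Mean time between failures: average gap between successive incident starts."""
    starts = sorted(start for start, _ in incidents)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return sum(gaps, timedelta()) / len(gaps)

print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents))
```

A number like this, tracked over time, is exactly the kind of story-with-data your champions can evangelize on your behalf.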

Celebrate successes. Create moments in time where people recognize that a change has occurred and that change is good.

Exploit compelling events. Big outages create cultural permission to make sweeping changes. And sometimes you create the compelling event yourself. That is what GameDay was. I got executive sponsorship, created a program where we broke critical parts of the infrastructure, and suddenly everyone needed something from me. Compelling event, manufactured.

I refined this talk at Velocity in 2012, but this was the first time I said it all out loud to a room of practitioners. The pattern has not changed. Don’t fight stupid. Make more awesome.

Full Transcript (AI-generated)
Good morning, everyone. To begin, I just want to get a show of hands. How many people have been to a DevOpsDays event before? Awesome. Almost no one. How about been to Velocity? OK. How about Surge? OK. Great. So how many of you know what DevOps is? [laughter] Notice I didn't raise my hand, which is an interesting kind of thing.

First, let me introduce myself and how I got here. My background: I am the co-founder and CEO of Opscode. We make Chef and the Opscode platform. I'm not going to talk about that at all today. I'm the co-chair of the Velocity conference, which, if you like this event, you should go to. The Surge conference is also pretty awesome, although I'm not a co-chair of Surge. Prior to Opscode, I was Master of Disaster at Amazon, responsible for website availability for every website that bore the Amazon.com name. And before that, I was actually a firefighter. Technically I still am a firefighter. You never really stop being one, although my certs have started to lapse. So you know, it's always a little exciting.

Today we're going to talk about making change happen in organizations. DevOps is really about making change in organizations. There are a lot of people who try to talk about it as being a set of tools or processes or best practices, but really and fundamentally, DevOps is about culture and about enabling the business. My assumption is that you're here because you want to make changes in the way that your organizations work. Over the next two days, we are going to hear, and discuss, and participate in... And by the way, open space is an amazing format for discussion and learning new things, so I'm particularly excited to see it proliferating. One thing that's universal for open space style events is that you leave energized and ready to export really good ideas into other organizations.
One of the things that I've found, actually, being an open space facilitator and having run Velocity and other things, is that usually when you walk out of the building and take something into your new environment, there's a really ugly mismatch between your passion and excitement for making change and the organization that you walk into, which is like, "Whoa, whoa. Hey, let's slow down there, hippie." The thing that I'm going to talk about today is actually my own experiences making... We didn't call it DevOps at the time, but making that happen in organizations. I'm going to share a fair amount of pain.

One other quick question: how many of you would classify yourselves as being more on the ops side of the house? OK. And dev? Wow, that's pretty awesome, actually. That's really DevOps. It's pretty good. How many would classify yourselves as "others"? Executives, PMs? OK, I was worried it was just you in the back.

I'm an ops guy. During my time with my former employer, I really took my firefighter identity and tried to bring it into my role in operations. I had a record of 167 pages for heightened severity incidents in a single 24-hour period. I had multiple instances of working for 72 hours or more recovering from outages. In that process, I became the stereotypical evil, nasty, mean ops guy. It was really a source of some pride for me. People were actually afraid of me for extended periods of time. One of the things that I did pretty early on was I started taking every outage personally. I became what I called at the time the mascot for availability. I would run these very process-intensive things called website availability reviews, which was where I'd basically get anyone that made a change that broke the site together in a room and yell at them for two hours.

We had an interesting problem, which was that at the time, deploying our very large application was difficult. This is in a 2002-2003 timeframe.
The reason was we had to deploy this big monolithic binary. Operations took the code from development after it went through a build process. We would deploy it. The site would break. We'd roll back. It's the sort of sad story that many people had to deal with. It made it really hard for people to ship things safely. Part of my job was keeping the website up and protecting the company from new problems created by bad deploys. This put me into fights with VPs and managers of teams who wanted to actually get new stuff deployed to the site.

One of my favorites was Neil Roseman -- it probably still is Neil Roseman -- who I called the VP of Awesome. Basically every cool project at Amazon seemed to be under Neil, like the Kindle, and Search Inside the Book, and a bunch of other ones. I can't actually say which feature it was in this case, but we had a battle. We knew that when Neil's team deployed this software it would take the site out. I did what every good ops person would do. I said, "Absolutely no way are we deploying this. The moment it goes up, the website is going down. It's going to be horrible." There was a whole bunch of fighting that went on. We actually had a no-go. Well, Neil overrode me. And the way that he did it was he said to me, "The website may go down, but the stock price will go up."

The site went live, and a few seconds later it crashed. There were two days of chaos. And indeed, the stock and the order rates went up. Let me say that again. Website down; stock price and order rates up during the brief periods of time that it was up. There were two days of pain and suffering, and long con calls, and yelling, and frustration, but at the end of the two days it was all there and everyone seemed happy except for, of course, us in ops, who had been paying the price. What was interesting was the celebration that occurred for the dev team that deployed the software to production. They shipped. It didn't really matter that the ops team said no.
It exposed an interesting problem, which was that at year end, I was penalized for the outage, because my job was keeping the website up and available. So we were held accountable for availability. I was also penalized for getting in the way of development. The other teams were rewarded because they deployed new stuff. They created new value for the company. This is the fundamental disconnect that we have all been suffering. Penalized for something that was out of my control versus rewarded for deploying and creating value for the company.

The interesting piece was I actually became so famous for saying "No" to things that I would write down "No." We had these big launch posters. I would write "No" as my thing, as my contribution to the little slogans on the launch poster. So if you ever get to see the launch posters inside their buildings, you'll see a couple of them with a big "No" and my signature, which is like a little scribble. That's me. It was a source of pride. It was the only kind of value that I could create.

As the company grew in size and the website grew in complexity and additional features, we did the only thing we could to contain the problem, which is we added more process. We added more process, more change control reviews. More points where I could yell at people. We implemented deployment freezes that lasted months. You couldn't deploy software during certain times of the year where there was expected growth in certain ways. We created standards and control boards, and we got really, really brutally good. I particularly got brutally good at figuring out root cause analysis and finding out where to put blame so that I could put a dollar cost to the screw-ups that people were making. This seemed like progress. This seemed like progress, and it wasn't. It was not progress in the least.
It got worse and worse and worse, and every time we had something that we missed, we had to then deploy a new piece of process to catch the process miss that occurred before. It was awful. It was unsustainable. It's also what caused the series of epiphanies that led me to be standing here today, and more importantly, it became a pattern that I believe is pretty universal.

Well, the good news is we've learned at least the beginnings of what to do about it. "DevOps" is a word for that thing. It doesn't actually matter what words you use. So we made a series of shifts, and part of that was we needed to realign operations so that operations became an advantage for the company instead of just a cost center. The way that you need to start talking about DevOps, or simply change, is very simple. It's turning our function into a competitive advantage. Being able to deploy better, being able to work better, having increased organizational agility. And that is why DevOps is so important. I decided to come up with a businessy definition for what DevOps is. DevOps is aligning an organization around common goals, functions, and incentives while reducing friction and constraints and continuously improving. OK, is everyone a little sick in their mouth right now? Just a little bit?

The environment that I was in was a big organization. About 4,000 software developers supported by a team of about 50-some-odd of what you would call SysAdmins, network engineers, and SysOps. The very first thing that I did to try to break us out of the misaligned incentives and fix the series of problems was create additional processes that were predominantly punitive in nature, and so satisfying. John Allspaw has a great presentation, which I steal liberally from, in which he describes the problem of operations. Which is: "But you never tell me what's going on." "Well, that's because you always say no." "Well, that's because the site breaks when you do things." "But you never tell me what's going on."
And it's a nice little loop. The first thing I tried was a series of processes and review boards designed to control using punitive measures. The beatings will continue until availability improves. That doesn't work. Well, it does work, actually, but it works for a very short period of time. You make a lot of enemies. But I finally came up with a solution: Don't fight stupid, make more awesome.

One of the things we found very important was putting devs on call. The primary frustration everyone experienced was this: what they wanted to do was deploy faster, and what we wanted to do was not have the site break. So we said, "All right, well, you can deploy to production from your desktop. Desktop to production without operations in the way. But in order to do that, you have to be on call for the services that you support." The role of operations very quickly shifted from being the people that just dealt with all the problems to providing a set of expert resources around the common platform. We were escalation, basically tier two for developers. We defined on-call procedures, we did training. And then ultimately my team and I distributed about 4,000 pagers. Which made me particularly popular.

Through trial and error, through top-down fiat, through audits, through every carrot-and-stick approach, I came up with what I think is the formula for creating this class of organizational change. Begin small and build on trust and safety. Create champions. Use metrics to build confidence. Celebrate successes. Exploit compelling events.

One thing, another sort of Jesse axiom, is that you can accomplish anything you want so long as you don't require credit or compensation. There is an excellent book about this. The book is called "Crucial Conversations." It sounds super woo-woo; it is not. It is probably the single most important book that I've read in business and in interpersonal relationships. Another kind of champion that you need to create pretty quickly is your boss.
I've actually always been pretty bad at this, and it has caused a fair amount of friction in my life. If they don't support you: don't fight stupid, make more awesome. Find a place where you can actually go and do that.

Google actually one-upped the industry: site reliability engineers. You can tell who they are because they wear these leather bomber jackets that have SRE patches on both sides. They get special pay. They get, basically, hazard pay. They get special icons in their corporate directory. They have special parties. They are considered to be of elevated status within the Google organization. At Amazon, I did this thing called the Call Leader Program, which was a training program that we built to train senior people to run high-severity incidents. It became a very high-status thing, where there actually became organizational pressure for managers and senior managers to join the program.

John Allspaw has a great talk on MTBF, mean time between failure, and MTTR, mean time to recover. By the way, MTTR is better. I used to have t-shirts that said, "I heart MTTR." It's much easier to recover quickly than to build systems that don't break.

The last thing in this chain is exploit compelling events. One trick that I use is I will create them. As part of the availability program, I created something called GameDay. I created a series of events where we broke critical parts of the infrastructure in order to train people on how to think about, respond to, and recover from incidents of large scale, including full-scale data center failures. The trick was I got executive sponsorship in order to implement the program.

That's how to create cultural change in organizations. Start small and build safety and trust. Create champions. Use metrics to build confidence. Celebrate successes. Exploit and occasionally create compelling events. Those are the secrets that I have. They probably work for any kind of change that you want to make.