Changing Culture & Being a Force for Awesome

Jesse Robbins

O'Reilly Velocity Conference by Jesse Robbins · Video · 34:28

"Don't fight stupid. Focus on where you can make more awesome."

— Jesse Robbins

Jesse Robbins on how to change engineering culture from the inside. Start small, build champions, use metrics to create confidence, and exploit compelling events. The biggest barrier to operational improvement is not technology. It is organizational resistance.

At the O’Reilly Velocity Conference in 2012, Jesse Robbins, co-founder of the conference itself and of Chef (Opscode), delivers a practitioner’s guide to changing engineering culture from the inside. The talk distills years of hard-won lessons from Amazon, Chef, and the broader DevOps community into a repeatable five-step framework.

The Framework

Jesse’s model for culture change has five steps, each building on the last:

  1. Start small. Pick the smallest possible project with receptive people. Call it an experiment. Don’t trigger the organizational immune system.
  2. Create champions. Get your boss on board first. Then spread credit as widely as possible. Let others feel ownership of the change.
  3. Use metrics to build confidence. Find a number that supports your change (time from commit to deploy, cost of an outage) and use it ruthlessly to build the business case.
  4. Celebrate successes. Tell the story with data. Be positive about people. Leave room for resistors to come around without losing face.
  5. Exploit compelling events. When the site goes down or a compliance mandate lands, use that moment to push for the change you’ve been building toward.
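
Step 3 names time from commit to deploy as one number worth tracking. As a minimal sketch of how such a number might be computed, the Python below takes pairs of commit and deploy timestamps and reports the median lead time in hours. The sample data, the function names, and the choice of the median are all assumptions made for illustration; the talk does not prescribe a particular calculation.

```python
from datetime import datetime
from statistics import median

# Hypothetical sample data: (commit time, deploy time) pairs in ISO 8601 format.
# Real values would come from your version control and deploy logs.
SAMPLE_CHANGES = [
    ("2012-06-01T09:00:00", "2012-06-01T15:00:00"),
    ("2012-06-02T10:00:00", "2012-06-04T10:00:00"),
    ("2012-06-03T08:30:00", "2012-06-03T09:00:00"),
]


def lead_time_hours(committed_at, deployed_at):
    """Return the hours elapsed between a commit and its deployment."""
    committed = datetime.fromisoformat(committed_at)
    deployed = datetime.fromisoformat(deployed_at)
    return (deployed - committed).total_seconds() / 3600.0


def median_lead_time_hours(changes):
    """Return the median commit-to-deploy time, in hours, across all changes."""
    return median(lead_time_hours(c, d) for c, d in changes)


if __name__ == "__main__":
    hours = median_lead_time_hours(SAMPLE_CHANGES)
    print(f"Median commit-to-deploy time: {hours:.1f} hours")
```

The median is used here rather than the mean so that one unusually slow change does not dominate the number. Either would serve the purpose the talk describes, which is to give people a figure they can compare before and after the change.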

The Origin: Amazon and GameDay

Jesse illustrates the framework with its first major application: Amazon’s availability program. There he created GameDay, a series of exercises in which teams deliberately injected large-scale faults into critical infrastructure, up to and including powering off data centers. The trick was starting small: he began with the smallest groups of receptive developers and achievable exercises, building trust and competency before scaling up to full-scale disaster simulations.

“Basically GameDay is an exercise where we injected large-scale faults into critical components of the infrastructure, in some cases pressing the big red button.”

Permission and the Katrina Lesson

Jesse closes with a story from his deployment as a task force leader during Hurricane Katrina. A volunteer kitchen staffed by anarchists was feeding thousands of people a day, but FEMA kept trying to shut them down because no one was “in charge.” The solution: make every volunteer a “site director.” When FEMA asked who was in charge, someone would answer “I’m a site director,” and FEMA would deliver supplies.

The lesson: “Most of the time when people are saying no, what they’re really saying is, I don’t know how to say yes.” Jesse applied this at Amazon by typing “Master of Disaster” into a form as his job title, and it stuck.

Jesse’s Rule

The talk keeps returning to a single principle Jesse calls his “rule of happiness and survival”:

“Don’t fight stupid. Focus on where you can make more awesome.”

Full Transcript (AI-generated)
So this is the first time in four years that I have had stage fright, and it's at my own conference. It's actually really terrifying. So I'm really glad to be here today. I'm a co-founder of Opscode and obviously I helped get this conference started with a lot of other people, and this is going to be a talk about culture and culture hacking. Opscode and Velocity are probably my two biggest culture hacks, but I'm going to focus a lot on what you guys need in order to be productive hacking the cultures that you're going to go back to when this conference is over. Just a couple of quick things. So hopefully one of these three slides will sound familiar. Either you're this girl: "Tests work fine, it's ops' problem now." Or you're this sysadmin. I was that sysadmin. Or you're this dev or business person who is making Steve very sad. So the interesting thing about being here together is this is an event where when we talk about the various components that make up performance and operations, we talk a lot about technology, but like Allspaw and I will both say, really probably to an absurd level, it's really culture. And so you're going to spend like, or you've already spent a day and you'll have another half day of like getting this injected into you. This is the original DevOps love poster that they put up. You will have learned all of Steve's 28 rules. How many know all of Steve's 28 rules? Raise your hands. What about the four secret ones that he can't ever tell you? He's sworn to secrecy. Ooh, you have to ask him about that. You will definitely have automated all of the things. I hope you use Chef but you might use something else to do it. How many here believe in metrics at this point? Raise your hands. Okay, put your hands down. Now who doesn't believe in metrics? Who thinks that that's a stupid idea, you shouldn't measure stuff? Any? Okay. You.
So you will, you'll go back and you'll be like, oh man, our metrics totally suck right now but we can use Ganglia, we use Theo's stuff with Circonus, it is going to be incredible. Some of you are going to begin continuously deploying code before it is even written. Like it is 10 deploys a day? No, continuous, meaning it just comes from the future. You will deploy all Allspaw's code to Etsy before they even think of it. That's how cool you're going to be. If you've ever heard anyone of my people talk, you're going to think about, you know what, we're not even afraid of a GameDay. We, I will press that button. I will receive the candy. So that candy, that thing does not dispense candy by the way. It's actually an emergency power off button. Some people are confused by that. It's weird, I don't know. The most important thing that you should take away from Velocity is an idea, which is we've kind of vectored towards the right shape, the right culture for what is effective to survive and thrive on the web. It's sort of a function of organizational constraints, in many ways in the same way that a bird's wing is optimized for flight, a bat's wing is optimized for flight. We kind of know generally what the operations culture should be, and you're going to have this kind of deeply inside you, you're going to be super pumped, and then you are going to go back to the office, which will suck. Unless you work for Opscode, in which case it's awesome. But so you're going to go back and you are going to know stuff that is going to change your life and change everybody else's life, and the very first thing that you're going to want to do is completely shake everyone up. You know, be like, oh, we're doing it all wrong and you know we've got to immediately make all these changes. And then a little while later you're going to send me this note. 
I get about 50 of these every conference cycle, and it's basically the, you know, oh, I talk to people and they said there's absolutely no way this would work. Or, I tried to implement and now I'm like on some kind of personal improvement plan or something. There's a big outage, I don't know what happened, the compliance people got involved, and... So the kind of sucky part is that changing culture actually takes time, and I've gotten good at hacking culture mostly because I've made some really really stupid mistakes, which I'm going to tell you about. And the biggest being a belief that, you know, hey, we're engineers, we're operators, we're people that care about infrastructure, and the desire to just rip it all out is super fun. Like you don't want to be stuck with the crud. So, well, actually, sorry, anyone, do you like being stuck with crud? Mr. No No Metrics over there, I got my eye on you. So I, in my career over the past decade, have had a history of choosing battles extremely poorly. It's like almost weapons grade. I was the guy that always said no to the cool new stuff, and then once I got excited about actually saying yes to things, I pretty consistently would fight over the stupidest things imaginable. One of my favorite things that I tried to do was kill EC2 in its infancy, because I was an ops guy and it was, you know, a waste of resources and a security threat is how I perceived it. So I've been the Dr. No guy, but I've also fought every single one of the stupid large organizational battles you can and lost almost all of them. And one day I realized Jesse's rule of happiness and survival, which is: do not fight stupid, focus on where you can make more awesome. And when I say that and I think about you guys going back, for those of you that are in organizations that are making change quickly right now, it's great. For those of you that aren't and you're unhappy, the job boards are overflowing. 
And the interesting thing to know is that you don't need to be stuck somewhere where you're fighting stupid. There is plenty of room for more awesome for every single one of you. So just keep that in mind when you're thinking about that. Don't all go back and quit your jobs, but the interesting thing is that we are in the middle of this massive change which makes all of our lives better. So just remember that. Here is how you actually change culture effectively. So the first, and again this irritates the crap out of me, is you start small. You start at the least common, most likely to succeed denominator. And I'm going to go through this as a list, I'm going to give you some examples, and then I'm going to give you the hacks for them. The second thing, this is particularly hard for people who are sort of harder-core engineers who do not socialize well. You need to create champions. And by this, when you're pushing these changes, like if you're trying to talk about how awesome what Etsy's doing is, or you know, any of the new things that you've learned about that you want to import into your environments, it's going to need to come from more than you. You don't want to be that one person who's trying to kind of be the mascot for that. I was the mascot for availability in one of my jobs. It was a terrible thing to do to myself. And the thing that you gain power from doing is getting a lot of people excited about what you're seeing. And that means getting them to see the world with them using whatever the new cool thing you want to implement is, having their life be better, them feeling better, being more popular, and you know, getting raises and all kinds of other stuff. Using metrics to build confidence. So one of the, really one of the things we did early on with Velocity was we made sure that lots of the large companies published useful data that you can use and take back to your executives to build a case for why you should be able to do something.
So if you go back through, you'll see like, this is the cost of an outage, this is the cost of one millisecond of latency, and you can use that in order to build cases. The Shopzilla example that they used showed like, you know, a huge improvement in revenue as a result of improving front-side performance. So you're going to need to build a language of business metrics, and Mandy Walls is going to talk a little bit about this later today in some detail. She's an MBA, I'm a firefighter. So you know, your mileage may vary. But you want to use metrics in order to prove your case and more importantly allow others to subscribe to it. You want to celebrate successes, and I'll talk a little bit about this. And you want to do this one thing which people get a little weird when I say this: you want to exploit compelling events. So when the site goes down and everything is broken for a really long time and everyone's yelling at each other, you have this unique opportunity to change your organization. And the good news is that those kind of things happen all the time, and so you have so many chances to say, you know what, what we really need is a new incident management program, or better metrics, or I think we should try Ganglia out, or whatever it is. And so exploiting compelling events is a super trick which we'll go into. For me, where I applied this first was during my time at Amazon, where I worked on this project called the availability program, and I created something called GameDay. How many of you have heard of GameDay as just the phrase or word? Okay, some. So it's been spreading out more and more. The other, like Netflix has a version of this, Chaos Monkey, it's pretty cool. Basically GameDay is an exercise where we injected large-scale faults into critical components of the infrastructure, in some cases pressing the big red button, which is pretty fun. How many of you have had a major data center failure, by the way? I love this question. Raise your hands. Yeah. 
Okay. So you know, everyone basically. So it's part of a larger discipline. It's not new to us, this type of work, but it's definitely something that scares the living crap out of every single person you talk about. So you say, hey guys, we know that we want to be resilient to a single data center failure or multiple data center failures, and to get there, right now we're going to fail one just a little, just a little bit. We're going to light a little fire and we're going to see what happens, we're going to see how people work and perform. So if you had started with the full-scale exercise, if this little kid had been pushing against a full-scale fire hose on his first day, they push back with about 90 pounds of force, he'd be flying around and it'd be a big disaster and it'd make the news and everything else and there's just no one would be happy about that. So you got to start small, start something achievable. In my case, with that program, starting with the smallest groups of developers who were receptive to the ideas and who probably weren't going to destroy everything when we ran the exercise. As you get some early successes, you want to build on trust and safety. And this little girl is such a badass. Like, that person is going to be a firefighter someday. So when you make these little structured exercises, you start to build a competency and you're able to demonstrate your value to people. It doesn't matter what the program is. Again, if it's, if you're going to continuous integration, continuous deployment, if you're actually finally implementing source control on your infrastructure environments, if you're doing that big JavaScript refactor that you said that you were going to do four years ago and now you're finally getting to it, or if you know, Stubbornella is yelling at you because your CSS actually makes her cry. Whatever that project is, you want to start small. 
You build up, you get some early successes, and then you begin creating these champions, as I described. So people like being smart, people like being in the know, people like having special knowledge, people like kicking ass and getting things done. Generally what I found is that when you can get people excited about what's going on and show that what you're doing makes a measurable impact, you can start to spread that out virally by having them be pretty excited. These kids, you know, go back to the kindergarten and they say, you know what, I played with the fire engine and I did all the things. They're evangelizing how cool firefighting is to other people in the same exact way that I know every single one of the developers that I worked with went back and said, you know what, I really love is availability engineering and I love doing these resiliency... they did not do this, by the way. This is a lie. But they to some extent they did. And then you move up to a little bit more training. You increase the bar. And so you say, you know what, we've gotten you to this place where you kind of know what you're doing, and so now we're going to run full-scale exercises. We're actually going to burn a house down. We're going to take a data center down. We're going to push something into production slightly faster than we have before, and we're going to begin measuring that and seeing the impacts on a team level, but where you're able to compare one team to another. Over time you're able to see the deltas between performance. So this is a heat map on a city showing where fire engines did not meet their response time obligations. It's pretty clear where you don't want to live and it's pretty clear where there's problems. You can do this in the same way when you're presenting a case inside of a business about what you want to do. 
You say, look, when we have people that went through this program, we had way faster response times, much shorter MTTR, and you know what, people were happier because they could deploy stuff on a regular interval. You want to celebrate those successes. So this is me, you can see me there. This is after a shift where we put a fire out. It's pretty fun. And what's interesting is that positivity ends up being a viral adoption tool. Colin Powell said that optimism is a force multiplier, and it absolutely is. When people find, we're seeing this with cloud adoption right now, oh, you mean I can just type in, get an instance, and deploy it? That makes people's lives so much better so quickly that it's impossible to repress, and suddenly there's a big buzz spreading around it. And the reason that it feels so good is because they had that early success and they keep on getting more power. And then finally you can exploit those compelling events to do the hard work. So most of you probably don't know this but the way we got fire sprinklers is lots of people died, and there was all kinds of resistance to putting sprinklers into buildings. But finally it got bad enough that we got a national standards body put together and people were willing to spend the money, willing to spend the time, willing to do all the building and everything else in order to make things safer. Now they couldn't do it prior to that terrible compelling event, but they were willing to do it afterward. And so when you're looking to make larger changes or you're looking for those moments, this is how you do that. This is how you do a big scary program like powering off data centers. This is what gives you the cultural currency to make a big change. Which I covered this, so just to review: start small, create champions, use metrics, celebrate your successes, and exploit compelling events. Here are the hacks. Starting small. 
So the reason that starting small works, the reason that doing a very small project works, is because it isn't a threat to the establishment within your organization. It's easy to ignore. It's easy to pass off. It's under the Somebody Else's Problem field, if you're a Douglas Adams fan. And when you're first building this and you're super excited and you encounter that first person that kind of wants to do battle with you and they're like, no, we totally can't ever do that because of compliance, or, you know, we've got this weird Sarbanes-Oxley requirement, or PCI DSS, or our security needs are so unique that we could never use JavaScript in that way, or whatever it is, right? So the trick is to just call it an experiment. Minimize. Now that's not what you actually want to do because you know you're going to be running everything in production 100% within months. But just say, no no no, it's just a short-term experiment. Don't tell them the truth, because it is an experiment. You know, it might not work. It'll work. So that's how you minimize the risk to the people that are going to do battle with you. Creating champions. This is an area that I kind of suck at. And one of the first things that I am terrible at, and I imagine most of you are, is you like getting heads down and you do a bad job of getting your boss on board with what you're doing. What you really want, your first champion, you want is to get your boss on board. You want to say, hey, you know what, we're going to fix this thing, it's been broken forever, it's going to be great, here's what I need to do it. I'm going to take a little bit of risk. And you want to get them on board so that they can represent and be that first champion for you as you begin delivering. At Amazon I was lucky. I had Werner as my executive sponsor for one of my projects, and he was awesome because he would come in, he's a giant, and basically say, you have to do what Jesse says. And I loved that. 
And I try to do that now as an executive at my own company when I'm trying to support people in the projects that they're trying to do. But the interesting thing here is it's easy to forget this and it's easy to have that weird antagonistic relationship where your boss is like, well, what's it going to do? You got to flip them. And if you're not able to flip them, don't fight stupid. Make more awesome. Go somewhere else. This one's a little trickier. Give everyone else the credit at this stage. Like if you get a developer up and running with continuous integration and deployment, or you implement some vastly improved frontend library, the best thing you possibly can do is spread the love as far away from you as possible so that as many people are like, you know what, I totally did that, it was awesome, and I want to do it again and again. At Opscode we do this really clearly with our community. It's one of the best ways of activating people. You know, we're always talking about what everyone else is doing. We make our contributions. We try to be quite humble about it. This has been a huge force multiplier for us, and it will work for you every time. The last thing is, while you're giving everyone the credit, give out special status. So Google totally nailed this early on with the SREs. They gave out bomber jackets, they did like special other kinds of coats. They really, I guess, like coats. Patches. Like they had velcro patches. Anything you can do to make people stand out because they're a part of your program. And it's funny how little effort it requires to make them part of your tribe, on your team, and advocating for you consistently. It's a super hack. And make sure that people with special status brag about it, but also maintain an air of exclusivity. So it's not a program that's open to everyone. So you know, it's only like, this is a pilot, blah blah blah, and so create a little scarcity early on. 
It's the best marketing you'll ever do for whatever it is you're trying to do inside of your organization. Metrics. So let me tell you the first thing not to do. I have a terrible history of, I love emailing metrics decks out to people without context, because I'm like, look, I saw a thing, it's super awesome. And that is the way to lose your champions right away, particularly when it's like the, you know, impending doom deck, which I would title things like. I still do that from time to time. But the thing to understand is that humans really need numbers to glom on to, to compare with other numbers. It makes people feel really safe. So find a number that makes sense. Mean time to deployment. How, you know, uptime, if that makes sense, it probably doesn't at this stage. Time taken for a developer to go from typing in commit to deployment. Because that's lost money right there, right? Like, code that's written and not deployed is wasted money. Find a number that supports your change and then use it ruthlessly. So you're going to first, you're going to show value. You're going to say, look, this thing that we did, this cultural change or this technical change we made, has incredible value, here's what it does. And then later it's going to end up being used as a weapon. So you should anticipate this. So you'll have one half of the organization who's using the new thing and you know, deploying software in 6 minutes, or sub-4-second or sub-2-second or sub-1-second, you know, first page load times. And then you'll have this moment where they're the superstars, and then anybody that doesn't do that is a jackass. And so just anticipate that that's going to come. So be prepared to use it ruthlessly. The last thing is tell your story with data. So I'm a big fan of Hans Rosling. You have to narrate the data so people can understand it. 
So when you shave off a couple of milliseconds off of a load time or do some other really powerful transformative thing for user experience, don't just say we shaved 7 seconds off. Take a narration of a couple of sessions. Show lots of different graphs. Make it a nice printed artifact that you can hand around. Again, this is how you get that currency where people will believe in you because they can look at it and they could go, wow, that's amazing, how did you do that? And the answer is, well, one thing is that we're actually able to deploy the code that we write. Or we're actually able to make changes, which we weren't able to do for 6 months. So you know, you can start working on this in six months, or maybe you'll make this change. But a story told with data provides a truly compelling way to force people to see the light, and it gives them something to hold on to that makes them feel safe. And that is the most important thing you're going to need as this sort of thing spreads out and you run into the resistors who show up and they're like, no way, we can never do that. Well, here's the story. What's your story? "No." Is that your story? Like, is that how you want to be remembered for your contributions here, as "you said no a lot"? I hope not. So, funny story. So at Amazon I said no so much that I signed the launch posters with "no" and then a little squiggle, which is my signature. So if you're ever over there interviewing or you work at Amazon and you see like an older launch poster and you see a big "no," that's me. Don't be that guy. The celebrating successes thing. So this is telling that powerful story. You really want to get people in, you want to pull them in, you want to say, I can't believe how much better it was now that we've used this particular technology. At Opscode we do a lot of this with our customer case studies, and we do that predominantly so that we can help people see the value of what we're doing. 
In your case, like, if you need help, I'm, you know, email me, I'm happy to help you craft that internally. But what we do as a community right now is we're doing really nothing but a lot of storytelling. We're, you know, talking about what's happening. A lot of the tools that are coming out are great but it's all culture first. So be prepared to tell a powerful story and use other ones, all the videos that are available and everything else. Always be positive about people and how they overcame the problem. So people are always good even if they're terrible. Do not attack people individually, as tempting as it is, because what happens is you alienate the people that would be able to come and help you. So it should not be about the people who created the problem. They had a reason why they created the problem. There was something in that system, some constraint that they were trying to live with, just like you're now trying to overcome a constraint. But feel free to attack the problem itself and the underlying constraint, or at least ask why you care about that at all. The last thing on celebrating successes is, and this is frustrating when you've been a person who's been pushing an agenda for a long time and trying to move a large body of people over to your side. You have to leave room for people to come around. So I found early on in my career, sort of trying to make big changes, that I'd be pretty mad when, you know, you'd be arguing with someone for a year about why you should do something different, and at the end of it there was so much weirdness between you that it became hard for them to actually admit that you were right. The best possible outcome when you're taking the things that you're learning here and you're exporting them to the larger world is this: it is that they can flip to your side without even knowing that it happened. It's just the simplest, most obvious, clear thing. It is the right thing by default. And that is truly winning. 
There's a great book called Crucial Conversations and Crucial Confrontations, which, it's woo-woo in some ways, but not. I recommend it. And it talks a lot about how to discuss this sort of conflict with people. But don't fight stupid. And don't create it by, you know, needing to be right all the time. Give that credit away early. So the last thing is compelling events. So I said before, just wait and you will have an opportunity to exploit a compelling event, that... well, I'll get there in a second. You beat me to it, man. All right, surprise is ruined. So stuff breaks all the time and you can use it. But more importantly, there are all these things that are used against us all the time to say no, to block things, to do really wasteful work. And the best possible use of an internal compliance mandate that you're all going to suffer through is a way of being like, hey, you know what, we could do that while we do continuous integration and deployment. We could do that while we improve the front-end user experience. And I know a lot of people who have been extremely successful in subverting what should be a terrible process and turning it into a great opportunity for change. Cloud has provided this. So every, how many like CEOs and CIOs are like "cloud now"? Raise your hands. Oh, wow, blessed few of you, I love it. So the, okay, anyway. These types of migrations, you know, seasonal scaling, whatever it is, these provide these great opportunities to be like, you know what, let's do it a little bit differently this time. And you should use those to the best of your ability. When it comes, it's not "I told you so." Like, when the dev pipeline finally breaks down and you're like, you know what, you should have been using Git instead of that other horrible thing, don't do that. I mean, you can do that like with your friends and be like, I told them so. But honestly the most powerful thing is just asking, what do we do now? 
And again, leaving them that room to come around to you, because you'll find that people just glom on and say, you know what, I love that, I want things to be better too. Almost nobody actually wants things to be as shitty as they can be inside large organizations. Almost nobody. There are some people that totally love it. I don't understand those people. I do encounter them from time to time. The last thing to understand on the compelling event thing is, so this is one of Jessica Hagy's graphs. So opportunity increases with level of upheaval. So the bigger the outage, the bigger the thing that broke, the better chance you have of making sweeping changes inside of the organization, because people are suddenly receptive to them in a way that they would never have been before. I'm going to talk about one last thing: permission. I'm a "get forgiveness, not permission" kind of guy, and most of you, if you're here, are probably somewhat in the same boat. I had a profound lesson about this which really shaped the way that I approached organizational systems and dynamics. During Hurricane Katrina I was deployed as a task force leader. It's another very long story, but there was a fascinating case study called the New Waveland Cafe. And the New Waveland Cafe was staffed by anarchists, and FEMA desperately wanted... so FEMA, we'll call them the Enterprise, and the anarchists being maybe the Velocity community. FEMA was like, well, who is in charge here, right? And the anarchists would literally yell at the same time, either "nobody" or "no one's in charge." And so then FEMA would be like, you have to leave, and we've got to set up a shelter, like ignoring completely that there were thousands of people being served hot meals right there in front of them. It's one of the craziest things ever, but their process and their permission system and everything else said that they had to be within the system, conforming. 
And the anarchists refused to be a part of the system that gave any single individual more authority than the other. It was one of the craziest things I've ever seen. And like it was really frustrating, and there was this guy with one of the major emergency management agencies who simply said, great, let's make them all site directors. And so then when FEMA would come and say, who's in charge here, someone would say, I'm a site director. And then they would give them supplies and materials. And it worked incredibly well. They ended up getting to about 20,000 people a day that they were serving. It was amazing. It's a unique story, but there's a lesson here. This is a picture, a very tiny fragment of it. So those are FEMA-provided medical supplies and a supermarket that they built that was free because they didn't believe in money. So it was a little weird there, but as long as FEMA kept on bringing in stuff, it worked great. It was really cool. The interesting thing about this as a lesson for me was: most of the time when people are saying no, what they really are saying is, I don't know how to say yes. And you find that when people have, you know, some reason or a belief where they're trying to do the right thing, oftentimes you can hack it just by finding a way to do something slightly different. One of these examples, and then I'm out of time, is: most companies have a wiki or an internal documentation tool. I find that simply documenting your authority in that tool is a great way of granting yourself authority over a particular project. And so I recommend using titles like czar, or even Master of Disaster, which was what my business card read at Amazon, and it was appropriate to me and it gave me a lot of leeway. But I just typed it into a form one day and then it stuck. 
But the point here is that the permission that you guys need in order to go out and actually do crazy awesome stuff in your organizations and overcome a lot of the stupid really is just going to happen because you will it to be so. And then the occasional obstructions that you run into are usually overcomable with, well, you know, a little creative engineering and maybe a badge that says site director. And that's what it means to not be fighting stupid and to make more awesome. Thank you very much. It's been a pleasure to be here.