#17 Lessons from SRE’s Wild West Days (with Rick Boone) – Boost software reliability

Episode 17 [SREpath Podcast]

Ash Patel interviews Rick Boone who is a pioneering engineer in the field of infrastructure and reliability engineering.

Rick worked at Facebook in production engineering when it was still called AppOps. He is also well known as being Uber’s first SRE hire. He shares amazing stories from those pioneering days.

Rick also draws from his experience to share his insights on how to build stronger SRE teams, as well as support effective career progression for individual contributor SREs.

Episode Transcript

Ash Patel: Great to have you on here, Rick. I’ve been following you for some time because you were the first SRE hire at LinkedIn. Is that right?

Rick Boone: At Uber.

Ash Patel: At Uber. I mixed it up already. Do you know what? I’m, I’m looking at LinkedIn at the same time. Sh**

Rick Boone: No worries.

Ash Patel: I’m looking at a LinkedIn tab right now. I should actually close all my tabs.

So, I’d love to hear about your history because right now you’re exploring new opportunities and that’s exciting. I want to hear about that in a second, but I want to hear about your history as an SRE, how you got into the space, what you did at various companies over that history. It’s so exciting.

Rick Boone: For sure.

Thank you, by the way. I’m very flattered with your compliments and that you’ve been following my career for a bit. I’ve been in computing literally since I was three years old, pretty much, my mom used to take me to her job, she worked with big VAXs and mainframes, and then I went to university for computer science.

When I left, I didn’t know what I wanted to do, but I knew I wanted to do something, obviously in computing, but I didn’t know what it would be like, back end, systems, coding, front end, and so I took a position at a web hosting company, which don’t really exist anymore, but 20 years ago, they were all the rage, like Geocities, things like GoDaddy, et cetera, et cetera.

So, I took a position there because I knew I’d be exposed to all different types of roles. Even when I was in university, every comp sci student had access to command line. That’s how we would just either write our code or submit our code homework. And I always found it more fun messing around on the Unix command line and just, like, playing around and jumping into various systems and things like that.

And I sort of transferred that over to my first job at a web hosting company. Well, I was doing support, but I started just messing around a lot on the systems on command line. And so that got me into being a pure systems administrator. I basically decided this is the pathway. This is where I’m most excited by.

So did systems admin and systems engineering work for a while. That was about 2004. So did that around Los Angeles up until about 2011. And this is a time where it was what we would call nowadays SRE work, but that term didn’t exist. It was just systems engineering and this was just keeping like all the servers up and it really was broad.

It was everything from, like, I remember running to downtown Los Angeles to unpack Cat 5 cables, unpack the server, cable the server, rack it up, load the Linux image, and then, like, install the software, and then building and rolling out all the scripts. All of that was very, very manual.

This is obviously pre cloud, and, you know, literally hands on server in the data center in downtown Los Angeles. And it was just called Systems Engineering. And then in 2011, I went to work at Facebook. And this was, even for Facebook, this was early days in terms of like reliability and realizing how do we keep all these services and systems running.

When I joined, Facebook had just decided to call this domain Application Operations Engineering or AppOps. They were organically realizing we need a place to put people that are really good at just keeping applications operating, literally just keeping them running. And again, this was all just figuring this out.

I was there for about three years. And during that time, Facebook really solidified the domain, really locked down what it meant, and it eventually became production engineering.

It was a super exciting time to be in that space. Obviously, Facebook and Mark had a really strong philosophy about infrastructure and always staying up.

He famously said it around this time, build everything to be 2x. And at the time, like, that was mind blowing. For someone to say, we always need to have 2x the capacity, 2x the availability, it was mind blowing at the time. No one was thinking like that, and so Facebook was driving this “infra and reliability are everything” mindset.

Then I left there, I had been embedded, an embedded SRE, or embedded production engineer, the entire time, and moved over to Uber. At the time, Uber was just starting to realize the pain of massive hyper growth. And when you grow that quickly, and especially in the way that Uber was structured, it was structured as, at the time, four really independent domains.

Mapping, business infrastructure, data, and marketplace. Each of those were like, basically separate companies internally, and that was intentional to allow for rapid productivity and growth. There was no way to build infrastructure for those four areas from the centralized way it had been done before.

So Will Larson hired me in from Facebook to sort of bring over the knowledge of how do you do this embedded model of SRE of reliability for these bespoke but large technical domains. So I started at Uber in 2014 and along with someone named Eamon, who was hired to lead all of SRE by Will as well, basically from scratch, like it was full Wild, Wild West.

Basically from scratch, initiated everything from how do we do interviews? How do we think about services? How do we think about processes? How do we think about on call? How do we think about deploying so rapidly? Uber was very exciting to me at the time. It was very wild, wild west.

Seven to 10 deploys a day. 700,000 lines of code. This is like 2014, and it was just, let’s just go. And just trying to get reliability guardrails on that while still building this thing in 2014 was just completely unheard of.

Ash Patel: Just a quick aside Rick, I was gonna say for anyone who’s listening and they may have been in high school in 2014, I know I wasn’t, they would be thinking, ah, 700,000 lines of code?

That’s, like, nothing. But having to deal with that back then was not an easy task. It’s just so different now, right? It’s like pioneers like you have really refined the space to the point where people don’t really notice large volumes of this kind of code or this kind of work.

It’s amazing what you guys did to establish the foundations for this space.

Rick Boone: Yeah, it was, it was, as you were pointing out, it was completely different and like I was saying when I first started in 2004, I was somewhat like rolling around scripts and, and literally racking the servers. We weren’t racking the servers ourselves anymore, but it was still pretty sort of hands on.

You know, we were still dealing with full servers. There was no containerization yet. I mean, you just put the code on the server, and so we just have thousands of raw servers. Just the scale of managing these things directly, like there weren’t any cloud tools. We just built, you had to build everything yourself.

You wrote your own scripts, and you know, it’s getting increasingly, increasingly complex. And so, to your point, things that now we just take for granted, which is awesome, like, it’s great that we are able to work at this scale and this pace and this sophistication of tooling. Back then, you just built out of necessity, because you get to a point where you’re like, what else do we do, you know, like, we have to get this out.

London needs cars to be running, and Paris needs cars, and people gotta get picked up in Seattle, what do we do? And so, yeah, it was very fun, very intense, but to this day, the most exciting, like, in terms of, like, everyday thrilling work that I’ve ever done. It was a really intense but fun time.

We were building something, like, Uber really had no prior art around this level of, like, reliability, and so really, we were, like, in real time building the culture, figuring it out while keeping all this up. Yeah, and it was it was very exciting. And then around 2016, I moved out of SRE proper and moved to, I was still in infrastructure.

But I moved to software engineering, doing capacity engineering. That was really the end of me being a proper, official SRE engineer.

Ash Patel: Once an SRE, always an SRE.

Rick Boone: Yes, 100%. I still think, I mean, it’s funny because I still see the world through that lens. When I left capacity engineering, I went onto the executive track, but I still always look at…

So one thing about SRE is that… You have to be aware of infinite things that are interacting in infinite ways and you have to have extremely strong systems thinking. You have to know if we change this, that other thing over there is going to change and how is that going to affect each other? You have to think very broadly and strategically and also very rapidly assess a situation.

That skill set works very well with executives. Also, you have to work with limited information because if there’s an outage, especially in a really complex environment, as you’re going through information and trying to root cause, you never have the full picture. But you have hypotheses, you have various signals that are coming in.

It’s a very chaotic moment. Every minute is critical. And you have to just sort of say, we have 40 percent of the information. Let’s try this. That didn’t work. Let’s try that. Okay, and we know these two things are not it. It’s not this, so we can turn that data center back on. And it’s this sort of grounding and trying to be really smart across a very broad and dynamic domain with limited information and making decisions as best as you can.

And that is essentially executive work. As we said, once an SRE, always an SRE.

Ash Patel: Essentially, doing SRE work is preparing you for executive track.

Rick Boone: Yes, I 100 percent think that it is. It’s also like, you have to learn to always be calm. You have to learn to always be sort of seeking out more information, realizing that things are going to fail.

How do we either avoid outright failure or just prepare for a world in which things are not going to go perfectly, but prevent total catastrophe. I think there’s a definite correlation between really strong leaders and executives and people that have been in the SRE world.

Ash Patel: You’re definitely now at an executive level, but let’s go back to around 2014, around the time when Will Larson hired you as the first SRE at Uber.

And I got that right this time.

Rick Boone: Yes.

Ash Patel: What was that conversation like? Because I’m curious how he would have pitched the idea of reliability engineering to you.

Rick Boone: So this is a, it’s really funny story. I left Facebook in the summer of 2014 and I went on what I call my like eat, pray, love trip. I went to Thailand for a month and then I come back from Thailand and I was going back out to Europe.

I came back from Thailand for my buddy, great Buddy Sunbury’s wedding, and I was going back out to Europe and I was packing and I got this call about 4:30 from one of the recruiters at Uber, Danny Michelle. And he said, Hey, total cold email. Like he, I hadn’t updated my LinkedIn, Facebook and he’s just like, Hey, I see, you know, like just reaching out, like, you know, this is what we’re building over here at Uber.

We’d love to talk to you. And I was like, you know, funny enough, I actually know of your company. I know of Uber. Your company is the only company that I’ve actually been thinking about after I left Facebook. They had this great blog post that Uber Engineering had put out, excuse me, Uber Data Science had put out, about, even at this point, they were doing really cool stuff with their data, and they had realized that you could figure out where all the vices were happening in the city, based on Uber data, because on paydays, people would travel.

And they were like, oh, you can see where various places of ill repute are. It was… Just so fascinating to me. I was like, that type of thing reminds me of the earlier days at Facebook, just sort of this very, I don’t know, I just thought it was like, just the coolest thing that they had, that I had seen with technology, and so I was aware of Uber, and I was like, your company’s the only one that I’m actually excited about right now after leaving Facebook.

But I’m in the middle of traveling, I’m literally flying to Europe tomorrow at noon, and this is 4:30 the day before. And I was like, if you guys can, like, get on the phone with me before I fly out, I’d love to talk to you, but otherwise, like, I’m leaving for three weeks. And he’s like, oh, okay, and so he gets Will Larson to call me. Will called me within, I think, about 20 to 30 minutes, and Will chatted with me for, I’d say, maybe 30 minutes.

I didn’t really say much. Will was a fantastic salesman and really explained, like, what they were doing, what they were building, what the situation was that they were in. The hyper growth, the need to really get focused, reliability expertise in these, like I said, these four domains. The fact that they couldn’t scale out as much as they wanted, that there needed to be really like a basic, a layer of people that were between the direct infrastructure and the people using it, you know, just sort of explain like where they were going and what their needs were.

Also just the rate of growth, what they were doing and like how exciting it was. And I let him talk for basically 25, 30 minutes. And then at the end I was like, Cool, you really didn’t have to say sh** like that. I have, I’ve been aware of the company. You guys just, this is fully coincidence. You all just happen to email me.

But I am, I’ve been very aware. I’ve been watching you all. So, I’d love to talk to you. But I’m literally flying out to Europe tomorrow. But I promise when I come back, I’ll let you all know. We can go from there. And then as soon as I came back, I can’t remember exact days. I would say I came back on a Wednesday.

And we had an interview set up for Friday. I mean, Uber back in those days moved hyper fast in everything they did. And so I had an interview, like, that week I came back. It was a great pitch. It was an exciting pitch. I think it was very much, there is a sense of, we know where we need to be, but we need to get people in here right now that can just sort of, just go, just figure this problem out, which seemed really exciting to me.

Ash Patel: It’s interesting that you mentioned that it was supposed to be a role that was wedged between, I don’t mean to say wedged in a really strange or weird way, it was a role that was between Infra people and the developers. I think I’ve seen, actually I know I’ve seen Will wrote an article and I visualized it into a trunk and branches model where the trunk is the platform team essentially, what we would call a platform team now, and the branches are the specialized teams that work on various aspects to help improve the overall infrastructure’s efficacy.

So was SRE part of the branches or was it also in the platform as well, in the trunk?

Rick Boone: I’ll say at that time, and I don’t want to speak to Will’s vision at that time. Rather, I guess what I’ll say is, I suspect that Will came up with that model after. Because that team at the time, we were called Infra-embed.

There was no SRE at the time. The name was Infra-embed and as a matter of fact, the concept of even platform hadn’t been formed yet. It was more, we have servers, we have infra, we are exploding. And we need to just get a handle on this hypergrowth and what was happening is there were clear schisms being created between different languages, different data models, like I said, mapping and market, well, it was called real time at the time, which was what people interact with on the phone, like, you know, that real time connection and matching.

And business infrastructure. And each of them, for technical reasons, had different language choices, different database choices, etc. So, for instance, you can imagine that the real time domain is more about caching, less about data durability, because it’s literally like, it’s real time. Mapping was driven by Java, because Java at the time was really fast, and was really slanted towards, there were other mapping applications that were built in Java.

The business infrastructure team was very, very heavily driven by, of course, data sanctity, data durability, scale at massive, massive scale, because all services are essentially terminated within business infrastructure for figuring out, is this credit card valid, what’s the rating, when the rating has to get stored, what’s this person’s name, things like that.

And so because of that, we were embedded in each of them where we had to create this sort of local expertise in just that domain. And so we needed to have a way to go from the centralized infrastructure of all these servers and all these components and all these systems, these sort of base infrastructure systems, to how do we start to specialize these?

And it was too much of a task for the central team to do both, like running all of the infrastructure and also knowing that, hey, business infrastructure really needs to have Postgres, like, locked down. The Realtime team really has to have their caches set up in this way. And the Mapping team really needs to have their Java libraries, like, you know, managed in this way.

And so, that’s where the Infra-embed vision came in, which was like, we need these sort of middle people to really specialize between the requests that are coming in. The other thing is that the Infra team was getting overwhelming amounts of requests and they couldn’t figure out how to differentiate and prioritize them.

Ash Patel: Interesting you mentioned about infra embeds. I have had this conversation with a few other people who were at hyperscale companies at that same time. They all have different terminologies, different lingo for essentially what we now call SRE. It almost seems like companies came about this in their own way.

They discovered it and they found that this was the solution. And I think Google gets a lot of attribution for a lot of these practices. Do you find a lot of similarities between how Google has promoted site reliability engineering and what Uber was doing at the time, and what other companies you’ve worked at were doing?

Rick Boone: Yeah, I would say I definitely do, especially at that time. I know, like, a really strong similarity was the requirement and drive towards software engineering as a key component of all this SRE work. And I know, like, I felt that once I got into Uber, what Facebook was doing, which, as you point out, they arrived at, you know, pretty rapidly, and it was after evolution, but they were like, this seems to be the best way to solve this problem.

And then Uber had arrived at it. They were very similar in terms of their culture, in terms of just the fact that both companies have this approach to move very quickly and, like, don’t let other teams limit your growth or your ability to produce.

I don’t want to say there wasn’t a centralized understanding of it, but there was this idea of you can be much faster if you, like, distribute the reliability needs. And I do think that, like, that being where people land, I think, is a very natural evolution. I’ve seen it in other companies as well, where companies still will arrive at this solution.

Now, whether it’s the best solution or not. I think it depends on the company. I’ve seen companies try this model and then they like back out of it for whatever reason. But I do think that many things about SRE end up becoming the sort of natural, organic evolution stuff. Every company ends up here. And there’s, I think there’s other things like this too.

Like how do you evaluate performance for SREs? How do you structure various SRE teams? Do you keep a core team building tools versus, you know, we’re talking about these embedded teams versus specialized teams like just observability, do you pull observability out and have it as its own team, or do you keep it as sort of a diffuse domain within your SRE or production engineering team?

Ash Patel: That’s the interesting thing. You were starting off early in this SRE game, well before most other companies, so you worked in a time when hyper growth was the norm. I would say hyper growth is still possible at a lot of companies, but it’s just kind of, it’s there. Back then, it was such a new thing. How did it feel to be in that?

It would have been exciting, but at the same time, what was it like for everybody to just be working at 100 miles an hour all the time? Were there any feelings about that?

Rick Boone: Terrifying and thrilling. And I still miss it. It was both at the same time. I’ll give you an example. The last place I was working at before I went to Facebook, we had, we had about two, two racks?

No, excuse me, maybe 10 racks or something like that. I thought that was a lot. And I was just like, oh my god, like this is, you know, this is such an amazing amount of servers, and learning about top of the rack switches and blah, blah, blah, and internet communication. And I went to Facebook and I remember in boot camp, I was writing scripts and working with things where, with one hit of a button, I would work on things that were 10x that 10 racks.

And that was just a Tuesday morning. That was just like, oh yeah, like, before I’ve had my coffee, let me just sort of, like, run the script. Three data centers around the country, and just moving data and performing work on hundreds and thousands of servers. And it was terrifying. It was just sort of like, how do you prepare for failure here?

How do you understand? How do you model this? I remember one of the things that I remember when I got to Facebook, I just was like, how does anyone understand this? Like, how does anyone have the model in their head? Like, how could you see all this and be able to reason about it? It was, like I said, terrifying.

It was so thrilling because you realize you’re working with some of the smartest people around. Just the solutions that you come up with, you get to this place of realizing a lot, that everything is possible, that you just keep moving along and, like, better tools, better insights, continuing on like that, sort of 1 percent or 5 percent every day, and just getting better and better.

But I never, like, took for granted the fact that what we were doing was completely wild. That was Facebook, and then you get to Uber, and I just remember, again, terrifying and thrilling, like, being in the office Friday night, because, of course, you know, Friday nights and weekends were the huge time at Uber, so it’s inverted.

Friday night, Saturday night are the times with the most traffic. We’d be in the office Friday night, really late, and just hoping that, like, please God, don’t let Paris crash. Oh my God, please don’t let Paris crash. And just this, in real time, marshalling resources, like setting up new Celery queues, like setting up new brokers, oh, you know, oh let’s get a new, we have to get a new Postgres replica online.

And just realizing that no one’s ever done stuff like this and just, it’s so thrilling and you know, you’re working with people that are just so also like you, so excited to be there and creating solutions in real time and realizing the scale of what you’re doing and matching that demand and matching those needs, but also sort of realizing like, Oh my God, like we’re still, we’re figuring this out in real time. I’ll also say this: you never realize in the moment, because you’re just trying to, like, you’re just trying to keep Paris up, you’re just trying to get the next thing out, you’re trying to get the 10th commit out, the 10th deploy out of the day.

But I look back on it now and I’m like, those were some of the most, I learned so much in those times. There are some, like, 6 month periods during that time where I’ve never learned as much in such a condensed period of time, and it’s very sink-or-swim, and not on an individual basis; on a, like, company… it’s an existential thing at that point, because it’s like, this is all us. Like you mentioned, nowadays it just happens, you know, hyper growth just works.

It just happens. It’s just like, oh, AWS has it, they got it. It’s cool. We can focus on the products and the features and just getting things on AWS, or getting Kubernetes set up correctly, or getting, like, the deploys right.

Again, it’s awesome. I love that we’ve gone in that evolutionary path, but back then these were like existential threats. If one of us screwed up a deploy or screwed up an architectural decision, or like some reliability failover didn’t work, it was just, this really could be the company. And it’s this sort of sense of responsibility and heart palpitation inducing situation that, like, really keeps you on your toes.

And that’s also some stuff that I miss as well. Like it sounds weird, but I do miss that sense of urgency and that sense of, Oh my God, like this really is, this is, this is the company.

Ash Patel: And it’s serious stuff. Hundreds of thousands of people are relying on you to get their ride.

Rick Boone: Yeah.

Yeah. And that was something that also made it really important for me. I didn’t realize it at the time when I first joined. When I joined, it was just, oh, I love what they’re doing. It seems like a really innovative technology company. I don’t drive, so I love the idea of using the full resources of cars, and I grew up in New York City, so I think like cars can be a waste, and so like, I loved working at Uber,

Ash Patel: Downtown Toronto, so yeah, agreed.

Rick Boone: After I got in, it’s funny, I’ll never forget this, when I was working at Facebook, I had another fellow engineer colleague, who was actually also in production engineering, he was legally blind, and one of the best engineers to this day that I’ve ever known in my life, like, I mean, just brilliant, brilliant person.

But he was legally blind, and maybe like six months after I was working at Uber, he, like, wrote me. He was like, I want you to know how much Uber has changed my life. My wife doesn’t have to drive me around for my daily errands anymore. It’s sort of, like, liberated me and it’s really changed my life.

And I had never thought about, I always thought of Uber as like, Oh, like I’m going to the market or I’m going to the bar and just, I was living in San Francisco.

So I had a different concept of how it was being used. And it wasn’t until I got in and you start hearing the stories of people and you realize that someone’s grandmother is getting to the hospital, to the doctor’s office where before like it was a big hardship or like my friend who was legally blind and now his days could be totally different.

And that’s when you would learn about that and that made it, this really is people’s lives. Also the drivers. And also what I think people forget about the history of Uber is that it started and came about as the world was coming out of the financial crisis. There was a massive recession, GDP was depressed, and I remember when I joined they were like, Every city we start in we become like one of the quickest, like we contribute to jobs.

And like drivers are now getting jobs that are like, this is their job. They have this outlet that it didn’t exist before. And it’s rare to just sort of have a moment in history, like we’re seeing one now, maybe with AI, and I’m not comparing Uber to like the advent of AI, but it’s rare to see a moment where like jobs are just created out of the blue.

And so that was another thing where I was like, this is someone’s paycheck. We can’t screw this up. This person can’t make money today if we go down. So that was another element that informed how we handled the reliability and how we sort of approached the real time aspect of this has to stay up.

Ash Patel: I think it’s important because I have friends and some family who do rely on it as a big part of supplementing their income. And as things get more expensive, they’re going to rely on it a lot more. So yeah, it’s definitely important.

Going back before Uber you were at Facebook and that was what you would consider production engineering, that’s what they call, it’s broadly known as SRE, they call it production engineering.

Were there any nuances there or was your time there too short or too early? It was like 2011 so that was quite early days.

Rick Boone: So it wasn’t called production engineering until, I can’t be sure of the dates, but it was App Ops engineering. Until, I want to say maybe 2012 ish, and then I think it became production engineering.

As far as I’m concerned, and it’s funny, we were talking earlier about like how so many companies arrive at so many of the dilemmas and sort of realities of SRE independently. Another one that I’ve seen every company arrive at is, what do we call ourselves? Are we SRE? Are we reliability engineering? Are we production engineering?

Are we, you know, are we DevOps? The DevOps one I think has come a bit later in that conversation, in that timeline. From my perspective, I don’t see any substantive or material difference between production engineering and SRE. Now, to be fair, I left Facebook in 2014, and they might have more strongly delineated as time went on, but from when I left, and from what I know SRE to be from going on to work further at Uber, to me, they were functionally identical.

I think they’re functionally the same. But, again, just a caveat, I could be wrong on that just because I left so early.

Ash Patel: That’s fair enough. A lot has changed in the last 10 years, obviously. In computing terms, 10 years is like half a century. Yeah. In terms of how things evolve. Because you’ve seen it for a whole decade, do you see any anti patterns that have developed in this field, this broader field? Like we’ve got SRE, DevOps, platform engineering, anything that really sticks out to you?

Rick Boone: The main one that I see a lot, and that always gives me pause and immediately triggers, like, uh, okay, let’s dig more into this, is a focus on specific tools, technologies, products, companies, keywords, buzzwords, etc., as opposed to concepts and abstractions. And so what I mean by that is I’ll see someone being interviewed, or I’ll get into a conversation with someone in an SRE team, and there will be this sort of overabundance of focus on, just, make it be Kubernetes on AWS. And just sort of really driving into Kubernetes, Kubernetes, KSQL, like, and the sort of, like, specifics of Kubernetes, or how it runs on AWS, and this happens across multiple domains, like CI/CD, you know, it happens in a lot of different places.

I always try to pull back from that. It’s like, what are the concepts, what are the concepts of, like, containerization or resource allocation or orchestration that Kubernetes represents? Because it’s great that Kubernetes has basically gone, you know, like, it’s all over and it’s sort of, like, standardized what we do, but that’s not what SRE to me is about.

Like that’s just an implementation, but SRE is about understanding the fundamentals of how distributed systems and complex environments work together to provide a product. And how do you keep those things available, reliable, et cetera. I think when you rabbit hole too much into the implementation, you can lose a lot of the ability to really have that broad minded expertise and insight into what’s really going on here.

And so I’ve seen it manifest in a number of different ways. I’ve seen interviews where people reject someone because they didn’t see the right word. They didn’t, like, mention the right product. They didn’t see the right cloud name or tool name. And I’m just like, that doesn’t matter, because I see these things as vice president, and I’m like, next week I could come in (I probably won’t do this) and say, we’re moving from AWS to Google Cloud.

I wouldn’t do it, but over-indexing on one tool is not going to make us a really excellent team. And also, to me, so much of SRE is about the idea of understanding really complex things and boiling them down to fundamentals, and understanding how in this complex environment all this stuff works without hyper-fixating on one particular thing.

You have to see the whole chessboard. And I think it’s really hard to see a whole chessboard if you’re focusing on, like, one square of that and ascribing all this sort of power to, like, this particular implementation of this tool is what we have to work with.

The other thing that I see related to this is that I’ve seen teams start to change their requirements or their designs or their models of how do we build, how do we solve a problem, based on the limitations of whatever tool they’re familiar with. And so, instead of thinking from first principles of, hey, this is what we gotta do to make our product absolutely stable, and reliable, and available, and have strong uptime, what they’ll think is, well, we can do these things that this company that has nothing to do with us told us we can do.

So how do we make do with this set of potentially limited tools? And I’m always like, that’s gonna get us only as good as that company allows us to be. We have to build for our product and what we need to have to deliver for our customers. Now, if we can do it with that tool, great, but the first principles should be grounded with us.

And when I see people start to give this power and they start to derive requirements from what their tool has told them they can do, then I start to be like, ah, we got to back away from this because we’re not really doing the right thing anymore. We’re doing what someone else has told us.

Ash Patel: Many organizations are guilty of doing this.

What do you think can change their perspective on it? It’s a lot of, oh, it’s hard, isn’t it? It’s really hard to make that change.

Rick Boone: You know, I’ll be, I’ll be totally honest here. I mean, I’ve been honest the whole time we’ve been talking.

Ash Patel: I’m sure you have been, yes.

Rick Boone: I will say I noticed it maybe five years ago.

And I noticed it, it caught me by surprise, because I came out of Uber.

Uber rolled their own containerization and orchestration software. They just built it from scratch, because at the time, Mesos and Aurora were not able to scale to what they needed, and Kubernetes was also not fully where it needed to be. And that was, to me, you know, this sort of, what are the concepts that we have to think through here, and then whatever we come up with, that is what is best for us. It was this very conceptual and abstraction-driven conversation.

When I left, I had noticed that there was a lot of, oh, if you nail these tools down, you can get the job, you can get the position, you can get into the world of SRE. I don’t want to say that, I think I was lucky to be at a place where it was like, we were building our own thing.

There were no tools that could work at our scale, so we built everything from conce… we started with concepts and then built our own tools. Whereas, I think, in many places, as you said, you get hypergrowth for free. It was just, oh, if you use this tool, you get this; if you use that tool, you get this, all of which builds up to hypergrowth, or just being able to scale very easily. And so the way I approached it was just to explain to people, I mean, I did it by example.

I would be in interviews, I would be in conversations, and I would always sort of cross out a name or just say, like, what is the concept here? Like, what are we trying to accomplish with this? What if this wasn’t here next week, next month, next year? How do we start to understand that we have to build for ourselves?

I don’t know if I have a super silver bullet answer here.

Ash Patel: Can I give it a shot?

Rick Boone: Yeah, yeah, yeah, please. I’m probably gonna take some notes on what you said.

Ash Patel: I’m not sure if you can take notes on it, but… I think it’s just taking the rationale that what you know right now is not what you’re going to need in a year’s time, so you’re going to have to start learning design patterns, start following people who are thinking five, ten years ahead.

Stuff that Martin Fowler was saying ten years ago is what people are starting to kind of look at now. People were talking about blue-green deployments two, three years ago, and I was like, I’m pretty sure someone wrote about that a very long time ago. I just vaguely remember. So it’s kind of staying up to date, like actually understanding what you actually need to be knowing.

Yeah. Or at least having some idea. Kind of having a crystal ball of, okay, in a year’s time, this is kind of becoming important. I don’t need to know it right now, but I kind of need to start learning elements of it and apply it using, like you said, first principles. And I think a few other elements include having a design thinking mindset, being able to actually put things out.

And actually draw them out. I’m not talking about being an artist. I’m talking about thinking like an engineer, someone who actually can make a blueprint of what is happening here and how can we expand on it. And that requires another skill, lateral thinking to be able to actually, you actually need to not be thinking, okay, this is what I know, so this is how I’m going to apply this.

You actually have to think, okay, this is a tricky problem we’re having with our infrastructure. How we’re actually going to drive this change, how we’re actually going to make it happen with a different perspective on things. Maybe I need to ask people who are outside of my space, take inspiration from another space.

And I speak with a lot of people from a lot of areas of life who have succeeded in their space and they always say, I explore, I don’t pigeonhole myself into what I’m just doing.

Rick Boone: It’s funny because I, when you said lateral thinking, I took that and I was earlier thinking about nonlinear thinking, which like is, it’s essentially, it’s another term for, for lateral thinking.

I was exactly thinking of that same thing where to me, SRE is often, it’s the domain of lateral thinking or nonlinear thinking, like when you see a problem, especially if you’re like dealing with like a real time outage or something that’s never been seen before. Theoretically, every outage should be a new outage.

Now, we know it doesn’t happen, and, like, some things repeat two or three times before you really nail it down. Like, okay, now we can do a true post-mortem and fix this. But, theoretically, every outage is a brand new outage. Just like every plane crash: it’s usually completely unique, it has never happened before.

And so, like, to sort of understand what’s happening, you have to be able to sort of think in this lateral way of, we’ve never seen this before, but what could this be like? How can we, like, model this? And, like, how can we sort of have this sense of, what are the, like, boiled down atomic units of what’s happening here?

And then sort of build back up from there. I think really great SRE engineers, but really great engineers in general, have that ability to look at something. And some of the people that I think are the most brilliant can look at something and simplify it to what it is, take away all the other stuff, and be like, this is what is happening here.

And as you were speaking, you remind me of something that I tell people often, especially people who have never been through computer science at university. Many people think that coding is computer science, is getting a computer science degree. And I say, no, you can actually get, you can get a four year computer science degree and never write a line of code.

It’s about the concepts and thinking in an algorithmic way. And it’s about, like, how do you approach a problem in these very fundamental ways? The coding is just the implementation. I’m saying that the coding is the tool or the technology, but the real value in that degree is, how do you think about this problem?

Like, how do you model it? And you just reminded me of that when you were going through this. I was like, oh, what he’s describing is very much that same concept.

Ash Patel: That’s what I tell people who want to progress. A lot of people ask me, how do I go from kind of a junior level to a mid-level to a more senior level? And I tell them, don’t bother doing more certifications. You already have six AWS certifications; I don’t think that’s going to help. I think more useful than that… and don’t even, yeah, don’t even jump into the Kubernetes thing. Know enough of it.

But there’s a lot of AI that’s coming out. There are a lot of tools coming out that help you manage this now. So you don’t need to be memorizing what to do in Kubernetes now. You need to be able to understand, like you said, a lot of the principles that are taught in computer science, in a good computer science degree. We still have to delineate good computer science degrees from, essentially, a degree that’s called computer science but is actually an IT degree, giving you a very baseline view of the entire realm of computing. Very different things.

Rick Boone: I guess in the past maybe five years or so I’ve gotten into, like, the concept of multidisciplinary thinking and mental models, and the idea that creativity and intelligence in some ways is really nothing more than just applying ideas and models and solutions from one domain into another domain that they’re not typically seen in. There’s a few books out there written about this. It’s one of the reasons, actually, that I got into the executive track: I started to realize that I love so many things in this world, and I end up applying so many of them to engineering problems, like military history and economics and behavioral science and psychology.

Even when I was a pure engineer, you know, I loved behavioral science and psychology, for instance. I’d have this project that I wanted to get off the ground, and I’d go talk to someone about it and start to realize, oh, I’m just incentivizing someone behaviorally. Once I could understand what that person’s drivers were and what their incentives would be, then I could model the project based on that. That idea of having a broad grasp of topics.

And whether that’s topics that are only in engineering, only in software engineering, or whether it’s across many domains, as you’re saying (and I love that you’re saying it, because it’s a big thing with me as well), the ability to, first of all, consume lots of different domains of knowledge, but then also recognize when you’re seeing a similar problem and then apply it, as you’re saying, is one of the primary things that levels people up.

And whenever, exactly like you’re saying, I talk about my day as an executive, as a leader, I’m usually applying things from, like, military history, or world history. Foreign policy ends up being such a prominent one; I read so much foreign policy.

Because it ends up being such a resource that I go to all the time for, like, how do you understand different parts of a company that are doing different things, but I have to, you know, negotiate about resources or things like that. Economics, too. Anyway, I can go on and on about that topic.

But I love that you said that.

Ash Patel: There are so many parallels between what’s happening in the world, in different areas, and how you can apply it. But the warning would be, don’t copy what Rick is doing, because you have to find your own way. Don’t read, don’t load up foreign policy documents this afternoon, thinking, wow, I’m gonna get better at SRE.

That’s not how it works.

Rick Boone: No, it’s, it’s, it’s, I’m so happy you even pointed that out as the caveat. It’s that I love these things. And then I was like, oh man, I love reading about whatever it was, and then I would like, see it modeled in, in my engineering process, be like, oh man, I can use this. I think there are painters, people that love painting.

I do talk about this when I give other talks about career progression. And I’m always like, if you love painting, you should like, keep painting, but think about how are there ways that you can bring your perception of art and the mixing of different shades or anything like that or like, how you sketch something out first and sometimes you erase it and then you just whatever it is, like, how do you bring your love of art into your engineering work?

And that’s one of the main things I always tell people. Almost identical to what you said, I’ve told people, like, look, coding, when you really get down to it, at a certain level, coding is not the X factor. It’s almost not even, like, the fifth or sixth factor sometimes.

If you get really high, that’s not, like, everyone can code, especially now. Robots are coding. So, like, not to say everyone can code well, but, like, it becomes commoditized, and the thing that really is the leverage is your ability to bring your other aspects of yourself and your insight into a problem space.

Ash Patel: ChatGPT can pump out some amazing code. I mean, reasonable code. It’s amazing that it can pump out reasonable code. That’s what I’m trying to say.

I’m relearning coding, but I’m learning it from the perspective of actually having known a lot about how the architectural patterns work and how to actually solve problems, rather than, hey, I’m going to memorize all this syntax. Very subtle difference.

Now, going back to way back when we started this conversation, I’m sure people are wondering, what is Rick doing now? What is he passionate about? I think you spoke about AI a little bit, but what are you exploring right now?

What are you, what is the future of Rick Boone looking like?

Rick Boone: I resigned from my previous position in January, and ever since, in the technical realm, I’ve been taking a bunch of AI classes. I’ve loved machine learning and AI and data science since my days at Facebook, and I’ve always been linking in and out of it.

For instance, when I was at Uber, I was on a machine learning team. It was one of the best times of my life, doing capacity engineering through machine learning to predict, like, how much capacity Uber would need. So, taking classes again, I actually built an investment portfolio manager and analysis tool.

As a lay person investor, there’s a surprising lack of tooling, like really complex tooling out there to allow you to do things like hypothesis modeling or like understand various performance of your sectors in sort of a real time way. Not real time way, but rather your requests will be real time like, Oh, I think I want to see it in this way.

And, like, they’re a little bit limited. Unless you work at, like, a hedge fund or an investment bank, those tools aren’t available for people. So I went out and built my own. And then I’ve just been traveling, swimming, doing gymnastics, just sort of enjoying my life.

But I’m actually starting a new position next week.

Ash Patel: Oh, exciting.

Rick Boone: It’s vice president and head of engineering for a financial and wealth advisory firm.

Ash Patel: So in New York City as well. If you say finance, it’s going to be in New York City or London.

Rick Boone: Yep. They’re based in New York. Yep. Yep.

Ash Patel: That’s great. I hope you can add a lot of impact to the space and have a lot of amazing learnings. It’s been great to have you on here, Rick, and…

Rick Boone: Thank you.

This has been awesome.

Author

Ash Patel

Reliability Nut at SREpath

Ash has an unhealthy obsession with software reliability. Maybe it’s got to do with the trauma of working at a few companies where software kept slowing or went down while he worked to turn it around. His ma hopes that he can one day turn this passion into a respectable job or business. Still waiting…
