#18 Winning at SRE in Banking and Telecom (with Troy Koss) – Boost software reliability

Episode 18 [SREpath Podcast]

Ash Patel talks with Troy Koss, Director of SRE at Capital One, an early adopter of DevOps and SRE in the banking sector.

He shares insights on working in regulated industries like banking and telecom, drawing on his early work experience at Verizon, a US telecom.

Troy also shares his experiences on choosing the right kind of SRE individual contributors and developing their abilities. He strongly emphasizes the importance of education as pivotal to ongoing reliability success.

Episode Transcript

Ash Patel: Great to have you on here, Troy. I am very curious about what you do with regard to SRE at Capital One.

Troy Koss: Yeah, thanks for having me. I’ve been listening into a lot of the sessions you’ve been having and excited to chat more about it. There’s never a day that I don’t enjoy talking about this stuff, so excited.

Ash Patel: Can you give me a brief overview of what you’ve been doing in the SRE space for the last couple of years?

Troy Koss: It’s been pretty inconsistent, but I think that’s part of the flavor of reliability engineering. It changes all the time, depending on the needs, right? I would say my past couple years, I was at Verizon for a period of time, getting reliability engineering started and championed as a field, or rather as a practice, within the company, and spent a lot of time evangelizing and conceptualizing, like, what does it mean to be reliable for Verizon?

We set up a lot of different reliability teams across the different tech divisions and we also spent a lot of time, you know, figuring out how to measure reliability and what does that look like and identifying patterns of toil and automation opportunities and resilience. And then I was there for a bit, leading that, before hopping over to Capital One, and here it’s been a lot of the same, but with a unique flavor. I like to say that everybody has their own version of SRE, because, you know, you can’t just copy and paste what’s in the Google book.

So we had to kind of define, like, what is reliability engineering for Capital One, and that means a lot of different things than it did when I was at Verizon. So a lot of emphasis on resilience and patterns and staying up, especially being a bank. And we focus on all the same similar things though, like reliability measurement and SLOs and good things like that as well.

It’s been a mix. It really depends on what the flavor is for the week, but largely defining what is it, what do we need for our company? That’s what I’ve been spending a lot of time on.

Ash Patel: You’d have some very different applications for what you were doing at Verizon, which is a telecommunications company serving most of the United States, and what you would be doing at Capital One, which is essentially a financial institution.

Did you find any nuances in how you approached them? Like you said, you’re not copying what Google was doing, so were there nuances between how you approached the issues at each organization?

Troy Koss: Oh, yeah, definitely. I think the common pattern between all of them was that they were highly regulated businesses.

Right? I mean, I must be a glutton for punishment at this point for sticking with these huge giant companies that are highly regulated. So there was a lot of that common pattern where there was a lot of extra layers of security, a lot of extra layers of regulation and compliance. But I think the difference, too, is also looking at the maturity levels of the different companies at different states, too.

I found that, you know, my time at Capital One has been great in that we’re all in on the cloud. We’re moving forward. We’ve got a lot of modern applications. We have a pretty simplified footprint in terms of that. Some of the challenges that I met at Verizon were that we weren’t 100 percent there. We had different clouds.

We had on prem. We had a lot of different things to manage and how does that whole ecosystem work? So the challenge became different. It was about simplification, right? And trying to distill a lot of the patterns across the clouds at Capital, er, Verizon, rather. That’s what we spent a lot of time doing.

And I would think that, you know, the, the net, net of it though, is the software was still kind of the same. Sure. It’s telecom. Sure it’s banking, but it’s still software right at the end of the day. So the patterns and stuff that we’re doing with different failover techniques and resilience patterns, et cetera, were all largely the same though.

So different scales, different environments, pretty much are the big things you have to really navigate around.

Ash Patel: How did you get into reliability engineering in the first place? You’ve spent some years at each of these companies. How did you get into this?

Troy Koss: That’s a funny story.

So going way back, when I started my career, I was, you know, out of school. It was computer engineering, software development, and then had an opportunity to join the leadership program, the development training program at Verizon. And they had a track for software development, then they had a track for network.

Ironically enough, I heard some rumor that I ate a sandwich in a weird way during the break session of the interview, and the person in the IT software development space wasn’t interested, so I went in the network track, actually, which is, like, completely not software engineering. I spent a lot of time in network learning the bread and butter of what Verizon had done with the networks, and did traffic capacity planning and site buildouts for cell sites and cell towers. I spent a lot of the early part of my career doing that work, looking at the system performance of the network, which really modernized into software over time. This is 2013, 14, 15, 16; things started evolving into software defined networks and network function virtualization.

And as we were doing that, I was able to then start using my software to like solve problems. And then, all of a sudden, I found myself in the software side, the IT system side. I was like, oh, this is normal to me, monitoring systems and looking at traffic and following patterns. And I was like, this is great.

And I get to also use software to solve that problem or help people that are building software solve these problems and observe and monitor and a lot of that. So I was doing SRE the whole time. I just didn’t know it. And I was just helping support what is one of the most reliable networks on the planet.

Then lo and behold, I’m in Site Reliability Engineering, just a different name for what I was doing the whole time.

Ash Patel: So you’ve had a gradual progression in your career. Like you said, you were doing a lot of things at Verizon. You were a software developer, network traffic engineer, software solutions architect.

And then you handled 5G for Boston.

Troy Koss: Yeah. Yeah, it was, it was a little bit of everything when I was there. I was bouncing around quite a bit, learning the business, learning what was going on there and really doing all the different facets, right? Capacity planning. I was having to do some of that. The monitoring I mentioned with traffic engineering.

When I was doing a lot of the stuff around our solutions architecture work, I was trying to help us mitigate risk. These are all like parts of SRE, I just didn’t know it at the time.

Ash Patel: A lot of people think that to grow into leadership, to get into technical or people leadership, you need to be following a linear path.

You spent seven years at Verizon. You got a broader understanding of how everything worked. Do you think that contributed to your progression into being a director of SRE now, that it allowed you to understand the system more broadly?

Troy Koss: I totally would agree to that, and I’ve been talking to some of the folks on my team, and you know, we always joke, you know, SRE, is it, is it really just a buzzword?

Is it really just a catch all? Is it DevOps? What is it really? And it largely is systems engineering, right? You need to understand all the different inner workings of how the system fits together and operates. You have to have observability, so understanding how the outputs of that are telling you what’s going on inside of it.

And I think without my, like, being able to have done all these different parts to kind of map, like, look at all the different angles of a business, of a, of a network, I don’t think I really would have been able to be as effective as I am in my current role. There’s a lot less unknowns that I have to deal with.

Everything you see that’s new, it has its own flavor perhaps, but it really honestly is just a spin of what you’ve already seen in a different way. So you’re kind of prepared. You feel like you’re ready to go when you hit, hit an edge case or something’s not working or you’re having an incident in production and go, okay, like how can I, how can I break this down and find and start narrowing down where the problem is and start fixing it?

And it really allows you to approach these problems pretty calmly and naturally.

Ash Patel: The place you’re currently at right now is somewhat of a darling for the DevOps movement. Capital One gets listed a lot amongst the enterprise people who I meet, especially the DevOps Enterprise Summit. They’ve obviously been doing a few things right.

And you must be part of that equation. So I want to hear what you guys are doing that gets everyone so excited every time I meet people at a conference.

Troy Koss: Yeah, well, I can’t give you all the secret sauce, right? But I think a lot of it starts with the culture that we have that’s enabling and allowing for that.

I’ve never been at a place where we have this rigor, we have this regulation, this compliance, this high bar that we have to meet. Yet, we’re not afraid to push the envelope on that and try what the boundaries are and figure out what they are. Every year we’re doing these huge exercises to really test our resilience and make sure that the whole company is on board with these like large sweeping patterns and exercises.

And I think that it’s really a testament to like the behavior of the folks that they, the talent that we try to attract. And the leadership that we have that kind of allows for that and enables that there’s not a lot of control of like, this is what we must do. It’s a lot of grassroots of like the engineers saying, okay, we need to try this new thing, or we’re going to need to try to push the envelope here and not afraid to really take those chances.

But knowing how to do that with control and making sure we can still be reliable at the day for our customers is a big part of it. We really aren’t afraid to take those chances. Our CEO, we call him Uncle Rich. He’s fantastic and really pushes the whole enterprise to do that every year.

We have these big strategy sessions. He’s really trying to force us to be on the leading edge of things. Like you said, our cloud movement DevOps, we’re always one step ahead because we’re not afraid. And I think that’s a real big piece to our success.

Ash Patel: SRE is such an important part of any financial institution, so I’m sure a lot of people are looking up to Capital One and what you guys are doing, especially in the SRE space, the DevOps space of that organization. Obviously people are trying and testing a few things and they’re not as successful as what your team is doing.

Have you seen any anti patterns in the broader SRE, DevOps, platform engineering field that make you have pause and think, Hmm, why?

Troy Koss: That’s a good one. You know, let me think about that for a second. There’s a lot of anti patterns. I mean, even we still have them. Even myself. Sometimes I jump the gun on something or don’t take my time to actually make a good responsible decision.

You know, everyone’s human. But I think a lot of these sweeping, broad brush definitions of, like, what does it take to be reliable, that chalk it up as a checkbox activity, are the things that are really anti patterns. It’s not easy. It takes a lot of time and it’s never over. And when I say it, I’m talking about building reliability.

And because that’s the case, a lot of times you see, we can solve everything with this software, or do resilience and that’s it, right? That’s like one part of eight things that really encompass reliability engineering. It’s not a one thing and you’re done. I think a lot of the marketing and a lot of the product space that’s out there kind of has this allure and attraction of, oh, well, you know, monitor everything, know everything that’s happening in a simple click: the single pane of glass. “Single glass of pain” is what I think a lot of us SRE guys call it. Someone had taught me that this year and now I use it all the time.

There’s all this, like, you can just be done with it. And I think everyone’s in a rush to that, and that’s probably the biggest anti pattern: that it’s gonna happen fast, that you can solve it right away. And just in talking through with you here, it’s kind of how I would chalk it up.

But it’s unfortunate because it’s not, it’s something that takes time. It’s painful and you have to grow into it. There’s always that human aspect. You’re not gonna solve it with a tool. You’re not going to solve it with 10 tools. It might get worse with 10 tools. There’s always that human element.

And I don’t know if it’s an anti pattern, but ultimately what I’m seeing is we’re not really investing enough in the educational piece of this. It’s something that I’m passionate about doing. I love working with junior developers and engineers to teach and explain some of these patterns and things.

Because you don’t learn it in school. There’s no software engineering for reliability. There’s not a lot of classes that are going to teach you how to monitor systems that are really at scale or to build resilience patterns and failing over regions and dealing with catastrophe or capacity planning of your application.

That’s not a thing. And there’s an anti pattern of not really investing in education. Like, we don’t hear that. We don’t hear pushes for curriculum changes. We don’t do a lot of that. And sorry, I’m awfully passionate about this, as you probably could tell, but man, I don’t think we’re ever going to get past the point that we’re at if we don’t continue to invest in people and instead keep trying to use tools to solve the problem.

And I would imagine with the advent of, like, Generative AI and a lot of those things, it’s just going to keep on getting worse before it gets better. So I’m hoping we could really push the envelope and try to get education to the forefront of this. And, like, how do we teach a team to look at a dashboard of an SLO over a rolling 30 days, where performance is degrading over time?

And it’s a slow burn. There’s no major incidents. There’s nothing crazy happening, but we see the slowly trickling down effect of lowered reliability or lowered availability for our customer. How do we investigate that? How do you teach that? How do we make that a natural behavior?

People are intrigued by that. They see that and they go, ooh, we should fix that. But how? And it’s intimidating. It’s an intimidating barrier. It’s like, I don’t even know where to start solving that. Right. So how do we produce that education to teach engineers how to manage their systems?
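The slow-burn case described here, a rolling 30-day SLO that drifts down without any single major incident, can be illustrated with a small calculation. This is only a minimal sketch under stated assumptions: the daily good/total request counts, the `DayStats` type, the function names, and the 99.9% target are all hypothetical and are not taken from anything Capital One or Verizon actually runs.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DayStats:
    good: int   # requests that met the SLI that day (hypothetical counts)
    total: int  # all requests that day


def rolling_availability(days, window=30):
    """Availability over the trailing `window` days, one value per day."""
    results = []
    buf = deque()
    good = total = 0
    for day in days:
        buf.append(day)
        good += day.good
        total += day.total
        if len(buf) > window:
            # Drop the day that just fell out of the window.
            old = buf.popleft()
            good -= old.good
            total -= old.total
        results.append(good / total if total else 1.0)
    return results


def budget_remaining(availability, slo_target=0.999):
    """Fraction of the error budget left; negative means it is overspent."""
    allowed = 1.0 - slo_target
    spent = 1.0 - availability
    return 1.0 - spent / allowed


def drift(series, lookback=7):
    """Change in rolling availability over the last `lookback` days."""
    if len(series) <= lookback:
        return 0.0
    return series[-1] - series[-1 - lookback]
```

A team could plot `rolling_availability` and watch `drift`: a persistently negative value with no paging incidents is the quiet degradation worth investigating, and `budget_remaining` says how much room is left before the target is missed.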

Ash Patel: It’s one of those things you generally have to get people started well before they even face a problem, to go, “hey, these are some of the first principles, these are some of the ideas adjacent to what you might face”. That’s what I find. I’m just putting in my two cents at the same time as interviewing you on this because I’m also very staunchly passionate about people getting educated in the right way. It’s not just about getting certifications, the AWS certification, the CKAD. It’s just the foundation. There’s so much more to it.

Troy Koss: And I’m wondering too, you know, I don’t know if I want to flip the questions to you, but you interview a lot of people and talk to them. What have you seen in this space in terms of the education? Have you seen a lot of emphasis on it or, like, really still an absence of it largely?

Ash Patel: I’ve still noticed an absence of it. Very often, people are hiring engineers just on the basis of, “okay, you have these certifications, you have X number of years working with Linux, you have X number of years working with Kubernetes. You’re hired”. It’s not trying to really understand how deeply you understand how a system can behave, or how you solve this problem.

Problem solving skills are not really given that much precedence. Being able to understand problems from a systems perspective, and not just talking in terms of computer systems, from a system design perspective. It’s still lacking, but overall, even forgetting all these abstract ideas of learning system design and architecture and patterns and all, whatever we may have, I have definitely seen that people are not, and I love how you flipped this question because I’m having to think about it, they’re not equipped with the right capabilities to solve the problems that they face that can really impact the resilience, performance, security, whatever your priorities are of a system.

And I think we can, as organizations, definitely put in more effort into that piece.

Troy Koss: Yeah, totally. And then it becomes a how do you prioritize that right in an enterprise and that’s a lot of what we’ve been thinking about lately is, okay, how do we then prioritize this behavioral change, this cultural change?

I think a lot of it, you have to show ROI, you have to show value, show benefit before we invest something as a company, we have to show how much more, you know, we’re going to save or how much we’re going to be better because of it? It’s a hard one, right? How do I quantify these qualitative changes in behavior for people?

One of the things we can do is certainly prepare for the worst. And under the preparation for the worst, I think we can learn while also giving assurance to those stakeholders that we’re investing in saving us money, time, whatever it may be.

So what I mean by that more tactically is investing in things like pre mortems, being able to, okay, we’re going to do a huge release. What are all the things that could go wrong? Let’s roleplay through this before we’re all freaking out when it actually doesn’t go well. Okay, let’s do some mock incidents. Let’s simulate the incident. What happens? Let’s replay the past 10 incidents that we’ve had.

Let’s try to, in a controlled environment, step through this. Like, what does good look like? And show some examples to engineers and teams. I think we need more blueprints. We need more examples. We need people to see what is right. We need to call out the anti patterns, too, so they know what is not right about what they’re doing.

And really use that, like I said, under the guise of, okay, we’re going to do this to prevent the next incident, or we’re going to do this to make sure we learn from the last incident, right? And those are all the things that I believe will, like, be tools to really start the conversation.

And once we start seeing the value of that, I believe truly that we’ll be able to invest more in it with more education, more learning and development, more enablement of the engineers.

Ash Patel: Absolutely. I think that’s something that we all need to keep talking about.

Let’s move on to something a little different. Okay, you’ve had challenges like that. Education’s definitely one challenge that you would face: how do you get your engineers up to speed with your systems?

How do you get them understanding a lot of concepts that are relevant to your work? How do you deal with the organizational challenges of bringing reliability into teams that may not necessarily care about it from the outset?

Troy Koss: I will start by saying my job is easy in dealing with these, in that there is a huge appetite for it.

Right. And I think it plays back into what we said earlier about our culture here and how great things have been. We get a lot of teams that reach out that are ops teams that are like, I don’t want to do this anymore. I need, there’s gotta be a better way. We need to help. We need to, we need to, what are we doing wrong here?

Like, how do we get into like really reliability engineering and not just keeping the lights on for this app or these systems. So we’ve got a lot of that. I would say that’s a majority of what we see. And then there’s pockets though, even within those teams, there are some pockets and, and others that are really focused on the now, right?

Like, what does this solve for me today? I’m trying to solve this problem because I told you one of the biggest things is it takes time and it takes a lot of time to evolve and change your teams to think differently. Because it takes time, it’s a hard sales pitch. It’s like, I have a problem right today.

I think a lot of it is, how can I help them now start seeing what would be the fruits of the labor of investing in changing our behavior, changing our culture to start practicing these reliability things, right? Because look, everyone’s flat out all the time. Like no matter what, no matter what happens, no matter how many people you hire, you’re always going to be flat out.

So I think we start by a lot of it is data driven. All we do is we try to, let’s just, let’s look at data and assess where you’re at today as we meet with a team, and then show them the 10 opportunities that they have in front of them. Let’s provide them advice to say these things could help you right now, like right now, there’s immediately something you could do.

And we look at what are called, like, derivative metrics: how many postmortems are you doing for your low severity incidents? Alright, what is the rate of that? If it’s zero, which it happens to be, we need to start with just that. Let’s just look at your past ten and just roleplay what happened and we’ll get 20, to go resolve problems and have that timeline to reflect back on in the future.

Let’s look at your alert health, and by alert health, I mean, are all of your alerts actionable? I still don’t really believe in non actionable alerting, but let’s look at that. What’s your ratio of alerts that are acknowledged or snoozed? Let’s examine those things, because you go into a team that gets 25,000 alerts a month that get pushed to a Slack channel that nobody’s really using. Well, let’s get rid of that noise. Let’s stop. Yeah, let’s just delete these alerts and start over. And that gives them the now, right? That gives them the now and sets them up, and they’re like, okay, maybe there’s something here. We should maybe listen to these people.

Right? There’s a lot to, lots to do and there’s a lot of low hanging fruit that I find oftentimes helps sell it and brings it in.
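The alert health check described above, asking what share of alerts are acted on versus merely acknowledged or snoozed, comes down to a couple of ratios. The following is a rough sketch, assuming alert history can be exported as records with a name and an outcome; the field names, outcome labels, and thresholds are hypothetical and chosen only for illustration.

```python
from collections import Counter


def outcome_ratios(alerts):
    """Share of alerts per outcome, e.g. actioned vs. acknowledged vs. snoozed.

    `alerts` is an iterable of dicts with hypothetical keys 'name' and 'outcome'.
    """
    counts = Counter(a["outcome"] for a in alerts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {outcome: n / total for outcome, n in counts.items()}


def noisy_alerts(alerts, min_fired=20, max_actioned_ratio=0.1):
    """Names of alerts that fire often but are almost never acted on."""
    fired = Counter()
    actioned = Counter()
    for a in alerts:
        fired[a["name"]] += 1
        if a["outcome"] == "actioned":
            actioned[a["name"]] += 1
    return sorted(
        name
        for name, n in fired.items()
        if n >= min_fired and actioned[name] / n <= max_actioned_ratio
    )
```

Anything returned by `noisy_alerts` is a candidate for the "delete these alerts and start over" conversation; the thresholds are a judgment call for each team rather than fixed rules.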

The larger changes, you know, measuring reliability with SLOs. That’s a hard one. We’ve been on that one for a while. And it’s really hard to sell the value of you don’t have visibility.

Sorry, like you need it. Like you need to know how well things are performing in between your incidents. You need to know an answer to that. And those ones take a lot of time organizationally, I think. You gotta show examples, you gotta do it a hundred times with ten other apps and keep bringing them back.

It’s the sales part of the job, unfortunately, but those are hard to overcome. And organizationally, you have to know who you’re talking to. I think I forgot who it was, it wasn’t Charity, it was somebody had given a talk, I think it was Abby gave a talk on like the SRE journey at SREcon last year, you can look it up.

And it’s a really good presentation, a good talk, and a lot of what an SRE’s job is like navigating the organization and understanding, okay, who am I talking to? What does this person care about? Okay, how do I not spin it in a corrupt way, but put it on in the light that they understand?

If I’m talking to a senior leader, the visibility conversation actually becomes valuable. It’s like, do you know how well your applications are performing? And if their answer is well, we’ve had less incidents. Okay. I’m sure you’ve had less incidents, but like, how well are your applications doing? How fast are they? How available are they?

It gets them thinking, and they’re like, man, I probably need to know that, or I should probably know that, right? If you talk to an engineer, it’s, like, about the alerts, right? Like, do you know, when something’s broken, how to fix it? No. Well, what if we attached runbooks to your alerts, and then all of a sudden, when the alert pages out and you don’t know what to do, you now have a guide that tells you what to do?

Wow, that would be actually pretty great. Yeah. So, it’s these different conversations at different levels to make sure you’re showing the benefit, showing the value to each of them. That’s kind of a technique we’ve been using.
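The runbook idea mentioned for engineers can be checked mechanically: list the alerts that have no guide attached, and include the link in the page itself. This is a minimal sketch, assuming alert rules are available as dictionaries; the `annotations` and `runbook_url` field names and the example.com URL are hypothetical placeholders, not a description of any particular alerting tool.

```python
def runbook_for(rule):
    """Return the rule's runbook link, or an empty string if none is attached."""
    return rule.get("annotations", {}).get("runbook_url", "").strip()


def alerts_missing_runbooks(alert_rules):
    """Names of alert rules that have no runbook attached, sorted."""
    return sorted(rule["name"] for rule in alert_rules if not runbook_for(rule))


def page_message(rule):
    """Text sent to the on-call engineer, with the guide included when present."""
    guide = runbook_for(rule) or "no runbook attached yet"
    return f"[{rule['name']}] {rule['summary']} | runbook: {guide}"


# Hypothetical rules used only to show the shape of the data.
rules = [
    {
        "name": "HighLatency",
        "summary": "p99 latency above target",
        "annotations": {"runbook_url": "https://runbooks.example.com/high-latency"},
    },
    {"name": "DiskFull", "summary": "disk above 90%", "annotations": {}},
]
```

With these sample rules, `alerts_missing_runbooks(rules)` flags `DiskFull` as the alert that still needs a guide written for it.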

Ash Patel: So, there is a bit of selling when it comes to your job.

Troy Koss: A tad. A little bit.

A little bit. For sure. Yeah. Yeah. And making people’s lives easy. Yeah. Yeah. Yeah. It’s hard because you’re like, I can make your life that much better. And they’re like, okay, where’s the snake oil. Right? Like, what do you got there? It’s hard to trust you, but that’s true. Here we go again.

Yeah. Yeah. Here’s another one. DevOps. Here we are. You’re going to save my day with DevOps and Like those are all. Well, I mean, that’s, that’s probably what we talked about a little bit already, but that’s some of the industry problems, right? It’s everything’s this, I got the solution for you.

I got it. I got it, man. Trust me, I got this thing. And there isn’t a thing, it’s like, we have a lot of things, like I just described to you. There’s 10 different things we can work on and solve for for each person.

Ash Patel: Everything you can see on that single glass of pain. Yeah.

Troy Koss: Yeah. It’s definitely that.

There’s actually a really good, I think it was Stephen Townsend or somebody recently just posted an article on that topic and it was phenomenal. It was a great read. I think it breaks it down what we’re actually after. I highly recommend it.

Ash Patel: I’m fairly sure that was Stephen. He definitely likes talking about it. It’s his space. Yeah. Being in the observability space and dashboarding.

It sounds like something he would write. It is. It is. That’s a spot on. It’s a good one. It’s a good one. Yeah. People should definitely follow him for that.

So, you’re actually several layers up from what we would consider an SRE manager, being a director of SRE. You would still have some choice advice for people who are new to the SRE space in terms of people leadership, technical leadership. Have you got any advice for people who are facing these kinds of roles?

Troy Koss: Lots of things not to do, I can tell you that much. There isn’t a secret path, but depending on what layer you’re at and you’re kind of working towards, I think being well equipped to do the job is gonna make your life a lot easier. I’m not discouraging those to like embark on it.

I don’t think it’s a really great role for the mid level engineer to get into. I think it can be great for a new engineer, believe it or not. I’ve talked about this with some of my peers. I think new engineers that are fresh, that haven’t molded their opinions yet, are great because they can see a little bit of everything and really sink their teeth in and understand how it all works and get a good worldview out of the gate.

I think it can become difficult for those that already have formed opinions in the middle. And that it’s more of a role I find for senior people that have just been through the trenches. They got the bruises, cuts, scrapes, you know, broken limbs to be able to handle it. And it’s not an elitist thing. I’m not trying to disqualify anybody for it.

I think I would encourage anybody. It’s probably the most rewarding job there is because you’re seeing how it all fits together. And making sure it’s actually working.

Everyone puts all this into it and it has to actually work. I mentioned at the beginning, when we get a new, unknown challenge, or an unknown resilience gap, or an unknown monitoring caveat about how OTEL works versus your other instrumentation.

Like, you have to have seen something similar to it, or it’s a new thing you have to embark on in the moment. And you see different flavors of it, and it’s easy for you to approach these problems, and you feel comfortable. So, I think really having a lot of well rounded experiences is something you’re gonna need. At a management level, it gets kind of weird because you still have to be involved, I guess is probably the best way to put it.

You can’t not be hands on in this role. And depending on what level you’re at, you have to be hands on at least in the sense that you have to really know what’s going on to be able to advise. I believe there are natural born leaders out there that can hop into a software engineering role and lead a team as a director of software engineering.

But if you want to lead a reliability engineering team, you really have to be that counsel, that advice board, to say, hey, have you looked at this, this, this, and this? Oh, no. Those are great ideas. Let me take a look.

You have to still be technical, right?

I think largely it’s not something that you just manage your way through. You have to really honestly know the systems inner workings and patterns to look out for and let the team, you know, kind of be a little step ahead of the team in terms of what they need to be prepared and ready for what it is. My role in particular is a little different in that our group is the enterprise SRE team and that we’re working on establishing what does Capital One need to be reliable and then I think a little bit more strategically around like, okay, we need to do this to change the way we measure reliability. Okay, we need to do this to advance the resilience pattern. We need to do this to do continuous chaos engineering. So that way we actually are sure of the changes we’re making consistently over time. And we’re not supporting an app or supporting a set of software.

So we’re a little bit of a different lens than others are.

Ash Patel: Right at the very beginning, I learned something new from you. A new way of looking at who should you hire as an SRE because most of the people who I talk with in general, and I think even if you go on Reddit, people who are like, if you’re a junior developer, don’t even try and apply for SRE jobs…

and you’re actually saying, hey, it’s actually the midway people who are a little bit too dogmatic about how they do things. They’re the ones you should be very careful with. The junior people, I feel, yeah, I’m with you. If they have the right attitude, you can mold them into exactly what you need out of an engineer who can understand systems well.

If you have an engineer who’s been working in a very compartmentalized way for five to ten years, it’s very difficult to get them out of that. And I’ve been there. I have tried to get people out of that compartmentalized thinking. It’s not easy. It’s tough.

Troy Koss: Yeah, it’s brutal.

I actually picked up on the junior thing from somebody: Jim, a guy I work with.

He’s the man. He observed it over a period of time, right? He’s one of our divisional partners we work with all the time. And he’s like, yeah, some of our best engineers are the ones fresh out of college that are like software developers that want to see how the whole system works together.

There’s always that youthful vigor that comes with being fresh. You haven’t been defeated so many times yet. So you’re like, wow, this is exciting. I can learn all these 300 different things and I’m ready to do it, right? You’re not tired yet. You’re not exhausted, going, oh, there’s another thing I have to learn.

Oh man, the networking side of things. Oh man, they’re logging into an operating system. Oh man, you keep on figuring out these new things that have to happen. But yeah, it was a brilliant observation by him, in that it’s a really good population of people to bring into it.

If your organization has the capacity to invest there, I will say, you are going to be investing. Don’t expect that engineer to go and solve world hunger the first week on the job. They’re going to take some time to learn, but it does pay dividends if you’re investing in the long game, which I think enterprises should do. Investing in them early is great, because you might not get something from them the first six months, eight months, a year, even.

But their objective lens, having never seen anything, is so invaluable. You can’t put a price tag on it.

If they have a good attitude, they’re curious, they want to fail smart, and you put some boundaries around them, some bumpers (you know, a bowling reference there), it’ll help them and you’ll get a lot of reward at the end of it.

Some of the folks on my team who are junior, I value just as much as the most senior person in our organization. They’re really helping us push the envelope.

Ash Patel: I’m really glad I dug into that because you gave some great tips on, yeah, bring in the juniors, but definitely put the bumpers on the lanes because you’re going to need to guide them through this.

They’re going to need some help.

Troy Koss: I need bumpers. I still need bumpers. Terrible.

Ash Patel: Me too. Oh, I shouldn’t have given that secret away.

Troy Koss: Yeah, your friends are going to go and be like, Oh, hey, let’s go bowling now, and just try to embarrass you.

Ash Patel: Hmm. Hmm. I’m, I’m busy that night. I am busy. I’ve got other things on.

Oh, that’s funny. You’ve had a few people who are new to SRE, completely new to systems engineering, come in and perform really well. Do you have any tips for any people who are still looking to get into this space? It’s a mixed bag right now with people getting a little jaded, so let’s bring some positivity in.

Troy Koss: Yeah, I think one of the best things you can do to prepare yourself, for those that are interested in getting into it: it’s hard to find the opportunity, but you can create it for yourself. I think building your own system and supporting it is probably the best way to really get hands-on and learn that.

I’ve done some consulting work and other stuff on the side over the years, and trying to maintain that as a side-of-desk, nighttime endeavor has been one of the most beneficial things for me. I’m working my nine to five, and then I have another job on the side maintaining the software for a company, a real company that makes real money and does real things, and I need to support that. I can’t leave in the middle of the day and go work on it, because I have my commitment to my actual job. So I’ve learned through doing some of that why you need good alerts, why you need to have your quick fail-safe triggers, and why you need to have your monitoring and your logging set up the right way and managed well.

I’ve learned about cost management, right? Setting up cloud functions to talk to each other and waking up to a 2,400-and-something-dollar bill. Don’t ask me why I know that. You know, maintaining your own thing and being the only engineer really forces you to have that appetite to keep things alive, and you really learn all the angles. When you have to do it all, you really learn it.

Now, it doesn’t have to be a business. You can build an app or a simple website, scaffold that out, put it on the internet, and still practice on all the different angles, right? You don’t even have to publish it.

It could be local on your environment, but you can still do almost everything. You can still build a product and understand how it works and instrument it and practice and download the agent for whatever monitoring software your company uses and play around with it. Like understand its inner workings. Explore with it.

I still do that to this day. I still do that a lot to try to learn how it’s all working and how to instrument it.

Whatever tool we’re using, I put that on a sample app and I play around and try to learn. But I think for those who are really getting into it, you gotta practice, right? You gotta move.

You gotta experience some of it. I think that’s it for someone who’s getting into it.

Ash Patel: I think that’s great advice. People should really understand that this line of work is practical. You actually have to do things, and it’s not just about taking the test, getting the certification, and then you’re sweet.

The job is going to be very tough for you if that’s all you’ve done so far. It’s a great idea to get in a lot of practice. Find out if you even actually like doing this. That’s a good test of whether you like doing it. There are so many other things you could be doing. You could be a hairdresser. You could be a crypto bro.

You can do all kinds of things if you don’t like putting in the practice.

Troy Koss: That’s great advice. What you just said is great advice beyond the scope of this. For anybody that’s doing anything they don’t like to do: stop doing that, if you can. I don’t want to be ignorant of the fact that some people have to due to circumstances, but man, yeah.

Yeah. I love what I’m doing. There is not a day that I wake up and I’m like, ugh, I hate this. I mean, I should be clear: there are days I’m like, oh no, another X incident, there are already 20 we have to review, how do we get through this? There are days like that for sure, but I really enjoy this stuff. Some people say to me, wow, you’re always on, and I’m like, yeah, I wouldn’t be if I didn’t like this.

So, don’t force yourself to do it. I see a lot of people that are in the SRE space that don’t like it. I’m glad you brought that up. Don’t do it to yourself. Like build products, go try software development and build a feature. Yeah, there’s lots of fun stuff you could do. And then you have that knowledge that you can bring.

You’re probably way more advanced than the typical developer anyways. So go have fun. If you can.

Ash Patel: In summary, there is a shortage of good SRE talent, but do it for the right reasons.

Troy Koss: Please. For all of us.

Author

Ash Patel

Reliability Nut at SREpath

Ash has an unhealthy obsession with software reliability. Maybe it’s got to do with the trauma of working at a few companies where software kept slowing or went down while he worked to turn it around. His ma hopes that he can one day turn this passion into a respectable job or business. Still waiting…
