#15 Growing Reliability Engineering Across 5+ Companies (with Nash Seshan) – Boost software reliability

Episode 15 [SREpath Podcast]

Ash Patel talks with Nash Seshan, who has supported reliability work in over 5 organizations, including Cisco, eBay, Dropbox, Lyft, Netflix, and Wayfair. He shares his learnings from reliability work at these big brands.

Nash also draws from his experience as co-founder of a Y Combinator-funded startup on effective engineering leadership. He also gives his take on issues with ill-conceived automation.

Episode Transcript

Ash Patel: Great to have you on, Nash.

I am excited to learn about what you’ve been doing in this space. You have been working in the software field since 2009, and you started off at Cisco, but then it’s like logos for days. I mean, I’m looking at eBay, Netflix, Dropbox, Lyft, and that’s just to start off. How are you doing?

Nash Seshan: I’m good, and thanks for the kind words, Ash, and thanks for having me on.

Ash Patel: I want to hear more about what you have to say about this history you’ve packed in. It’s still a relatively short period of time for you to get all of this experience in.

Nash Seshan: Yeah. When I graduated college back in 2009, I graduated with like a major in network engineering.

It was a subject that really fascinated me. And at the time, Cisco was one of these dream companies that I wanted to work at. Little did I know that that was going to be the first job I was going to take right out of college. So I got into Cisco, had a really, really fun ride working there, and learned a ton of stuff.

And around the time I was leaving Cisco, I also realized that the industry was making a push towards software engineering in general, across all verticals. So I realized that I would be left behind if I didn’t catch up, if I didn’t really teach myself how to code.

Towards the end of my time at Cisco, I taught myself how to write basic scripts in Python and Bash, just to automate tiny bits of the work I was doing. And then I felt like I needed formal training in software engineering and some of the core concepts around it.

So I got myself a master’s degree in computer science. Ever since I have been in the industry, networking has always been, and always will be, my passion. But I decided to step into the software engineering space, where I started out as a network software developer at eBay, automating deployments and writing tooling for eBay’s data center network. Then, as the industry evolved into infrastructure engineering and SRE, I was kind of already in that space without really having the title. So as I evolved through eBay, as you mentioned in your intro, I was already in SRE before SRE was public and out there.

And then eventually the SRE terminology just latched itself onto me. I’m fortunate to have worked at some of these logos. Even though it’s been a short period of time, it’s been a ton of experience and a ton of learning along the way.

Ash Patel: We’re definitely going to deep dive into some of that experience in a moment, but right now you’re at Wayfair, which is an e-commerce company for furniture.

Is that correct?

Nash Seshan: Yeah, furniture and home goods. They’ve been around for about 20-odd years now as a company, which was surprising to me. When I joined, I wasn’t aware that Wayfair had been around for that long. But yeah, if you want to think of it that way, they’re the Amazon for home goods and furniture.

Ash Patel: And that’s an engineering manager role that you hold.

Nash Seshan: So I lead the database reliability engineering team here at Wayfair. And my team is responsible for all the production databases that power the Wayfair platform.

Ash Patel: So you had to switch from a technical leadership track.

You were a technical lead at Lyft until 2021, and then you had a switch over to people leadership around that time.

Nash Seshan: Yeah. So at Lyft, I think I was part tech leader, part people leader as well. And then around 2021, I also did my own startup. So I had the experience of building from the ground up, hiring a team, leading and managing them.

Ever since my startup didn’t work out, I’ve moved on to people leadership, because that’s something where I feel I have a lot to give. It’s a new field and a new challenge for me, but at the same time it doesn’t feel very new, because it’s something I’ve always been doing in some capacity, either as a technical leader or as a mentor of some sort.

People leadership, while fairly new to me in terms of years of experience, doesn’t feel new at all. It feels like something I’ve been doing for a long time.

Ash Patel: That’s interesting. What do you think are some of the things that are very similar between these two different tracks in tech work?

Nash Seshan: I think it really depends on which particular track within the industry that you’re in. For example, in my case I was always an SRE or always a network engineer of sorts. And I always found ways to tinker around and experiment and see how I could improve certain parts of what I was doing.

Which I think is very ingrained into my work ethic and generally very ingrained into what I do or how I operate. So when I got into people leadership, it was pretty much the same. Like, how do you work with people? How do you tinker around to see what makes them tick? What doesn’t make them tick?

What motivates someone versus someone else on your team? Everyone is different. The only difference that I see between the two is that people leadership is really a lesson in psychology over and above having technical expertise. But if you’re a technical leader and you’re able to teach or coach other people on your team and help them grow, people leadership is very much the same thing.

It’s doing that in an official capacity as opposed to being a technical leader, which may or may not be an official capacity. The only nuance there between people leadership and technical leadership is that you have this human psychology aspect that you need to understand.

Ash Patel: I wanted to get your two cents on that because people say people leadership, technical leadership, very different things.

It’s like two forks in a road, but like you said, there’s still a lot of involvement from technical leaders in coaching people, helping them become a better engineer. And people leadership just adds that psychology element of how do we get people to be the best they can be, whereas a technical leader would focus on, Hey, this is the technology that we’re wanting to focus on.

Let’s see how we can help you get there. It’s a nuance, but still a similarity between both those traditionally seen as very different tracks. Completely agree. We’ve spoken about your interesting SRE journey. And I said earlier that I wanted a deep dive into some of your earlier experience. You mentioned your experience with Cisco, where you were a network consulting engineer, and you learned that actually having skills to code up infrastructure and automation was going to be useful.

And you were right. How did you find going to an organization like Netflix? People love following the Netflix story, and I bet people reach out to you all the time and say, Hey, what was that like?

Nash Seshan: Yeah. So Netflix was actually one of the most gratifying experiences of my career.

It was pretty short-lived. I was only there for about a year, but while I was there it was probably the best company culture I’ve ever worked in. I say this with a huge caveat and a huge asterisk because the particular culture I’m about to refer to isn’t for everyone. For example, you might be able to work through a problem really fast, or you might work through a problem really slowly, but are you able to work autonomously?

Do you need help? Do you keep seeking help? Do you need to be told exactly what to do? Or are you a self-starter? Netflix is all about self-starters. On every single project I worked on at Netflix, I was literally given a one-line problem statement.

There was no context around it. And the one-line problem statement was intentional, because they wanted me and other people at the company to go figure out what the constraints would be, how to solve the problem, and how to limit the scope of the problem.

So it inherently teaches you all these skills about being an owner: literally going and figuring out who the stakeholders in this problem are, what the current impact is, and what you really need to solve. Because you’d get a business direction, but you wouldn’t really have a technical solution.

Determining the business problem, the stakeholders, the impact of the problem, and then eventually the best solution, and whether you can phase the solution or need to build everything on day one: these are all skills you implicitly learn when you work at Netflix. This was a huge opportunity for me, and I thrive in environments where I get ownership.

The more ownership you throw at me, the better I perform and the better I am as an engineer. And that was the environment that just worked perfectly for me.

It was like the dream job come true. But that said, it’s not for everyone because I know that a lot of people at Netflix find the level of autonomy that’s given to you very, very intimidating. It was great for me, but it may not be great for other people.

Ash Patel: So in summary, you have to be a self-starter to be in a place like that.

Nash Seshan: Pretty much. The managers and the leadership there entrust and empower you with so much to do the best job you can that you’re responsible for every decision you make. You are actually part of the decision-making process. You own the decisions you make.

So, if they’re good, we celebrate the wins. If they’re not so good, well, it’s a learning opportunity for everyone. We all take the hit.

Ash Patel: Which is a good culture to have. I know it’s a tough culture for people who want to be told, hey, this is the job, do this. But that tough culture is also a fairer culture than a place where they pretend it’s, here’s a job, take it, but then there’s a lot of politics, and that happens in a lot of places.

Nash Seshan: That’s true. With Netflix, at least in my experience when I was there (I’m not sure how the company has changed and evolved as they’ve grown since), this was very much the culture. They were a small, tight little team of about 2,000 employees worldwide at the time, and they intentionally wanted to keep the size of the company small because they wanted teams to be small, tight-knit, and with enough ownership of whatever they were doing that they would feel comfortable making those decisions within their purview.

I’m pretty sure that that’s changed, as the company has also changed, but my experience was great.

Ash Patel: So the next place you were at after Netflix was Dropbox, and you had the title of Network Reliability Engineer. I’m deep diving into all of this because I haven’t met many people who have had so many logos. And I’m not saying this is in any way a bad thing. I’m saying it’s a good thing. You’ve had a variety of experience, and that’s why I’m so curious what you’ve learned at each place and what each place has given you and helped you build into your current abilities. So at Dropbox you had the title of Network Reliability Engineer. Is that kind of a subset of SRE, or a specialized version?

Nash Seshan: It was very much SRE for networking. This is kind of the first role I took on where the SRE title kind of came somewhat close to what I was doing.

I think Dropbox wanted to have a different spin on it because the company was structured in a way where every software team or application team had a sister SRE team partnering with them. In my case, we had a network engineering team that was responsible for our data centers.

And they essentially needed an SRE team that would write automation for them. Hence the birth of the network reliability engineering team, which was focused on observability, monitoring, writing automation, writing tooling, and helping keep the network and all aspects of the network reliable enough.

That experience was also very gratifying and very fruitful because I got to work at a very large scale company that was dealing in block storage and also had experience working with a hybrid cloud environment where we were partly in a public cloud, partly on premise. There were some workloads that were running in the cloud, some workloads that were running in our own private data centers.

So how do you then structure for that kind of environment? Netflix was also a hybrid cloud environment, but it was a very different setup because Netflix is a content delivery network. That is a whole different paradigm of network engineering compared to like traditional data center networking.

Ash Patel: You probably needed that because of the complexity of the data that Dropbox in particular was dealing with.

Nash Seshan: Yeah, because Dropbox deals with all its customers’ private data, right? All of Dropbox’s users entrust them with their private files. Now, these could be audio files, these could be video files, it could be media, it could be confidential documents. It could be stuff that people want to store away for archival purposes, for later use.

You’re entrusting the service with a lot of your private data. One of the good things about Dropbox was that privacy and security were a top priority across the company. I know I’m digressing a little bit, but it’s one of the companies where I’ve seen security and privacy implemented so well that every other place I’ve been at, I’ve always compared it back to Dropbox.

I’m like, that was great. And this is not quite as great. It’s worth a call out.

Ash Patel: I definitely think that’s worth a call out because DevSecOps is now, I would say it’s considered the norm. A lot of places are still trying to figure out AppSec and now we’re talking DevSecOps. Yeah, things are getting complicated.

So with your background, actually knowing how important security is and the practices that go along with it, and integrating it with DevOps and infrastructure, that’s something I think we need to teach people a bit more about. I think we’re going to do an aside on this. For sure.

So now we’re going on to your experience at Lyft. I love Lyft. It’s sometimes 10 percent to 15 percent cheaper in Toronto, and that helps a lot, especially if you’re going a long way.

Nash Seshan: Yeah, it does. So at Lyft, I was within a vertical that was kind of the R&D division of Lyft. It wasn’t the traditional ride-hailing service.

I worked within the autonomous vehicle division of Lyft, which was called Level 5 at the time. The mission for that division was to bring robotaxis to the market. So we had a fleet of vehicles that we were running autonomous driving software on. It was about load testing and stress testing that code on the actual roads in Palo Alto to determine if the code is actually getting better, if the safety driver is engaging with the car as much as they need to, or if the code is able to operate pretty autonomously. That’s what the entire division was about. My job on that team was to build infrastructure automation and observability, and to ensure that our pipelines and our infrastructure for shipping code to the car and gathering data from the car after a trip were reliable enough, because shipping code to the car had been a big problem.

When I joined the company, a lot of folks were actually shipping code on flash drives: they would build the code on their workstations, walk down to the garage, plug the drive in, and essentially upload the code.

So we built some infrastructure to make that process better. And then as the car was on the road, you wanted to be able to gather telemetry from the car, so we built some intelligent solutions around that too. It was a fun project.

Ash Patel: I have so many questions, just stemming from that, because I’m one of those nerds really interested in self driving technology and it’s kind of gone quiet recently.

One thing I will ask you, and this is more of a computing-related question: I was always thinking there’s a lot of need for edge computing capability if you’re going to do self-driving at scale. I know they’re dealing with other issues right now, but let’s just say they figured all those issues out. With the current skill set that most companies have, would we be capable of doing something that utilizes edge computing and all those types of scenarios to bring it out at scale?

Nash Seshan: So that’s a great question, and I think I would answer it in a slightly diplomatic way. The reason is that my perception of self-driving is that you need to have enough control over the actual hardware that’s on the car, and enough control over the software that you’re running on the car as well.

I feel like very few companies actually have control of the entire stack. Tesla, for one, is probably the pioneer in this space because they control the car hardware and they control the software. They’re also working with chip makers to build custom chips that run on the actual car.

When it comes to other companies, like say the Ubers, the Lyfts, and the Cruises of the world, the car is leased out to them by a car manufacturer. The software is their own and the chips are probably whatever is available in the market. So when you don’t really have a lot of control over the entire stack, you can only go so far.

I’m not saying that you have to be able to build your own chips or build your own hardware, but I’m saying if you control what you need from the hardware, from the software, and from the actual chips that run on the car, you’ll have more chances to make this a reality. I feel at this point, Tesla is probably in the best position in the market to make that a reality.

There are videos on YouTube of folks doing the FSD trials and posting their experiences with full self-driving, and Tesla has kind of proved that they’re capable of doing what is called level four autonomy, which is essentially where you let the car do its thing autonomously, right from the start to the end, but on a predictable route and not in unfavorable conditions like snow or hail or whatever.

It’s a bright sunny day, temperatures are okay, your battery’s fully charged. It’s not an unforeseen circumstance. Level 5 takes into account any and all unforeseen circumstances. Will we ever get to true L5 or Level 5 autonomy? Hard to say. Maybe many years down the road. But I think level four autonomy is what we can hope for, at least to begin with.

And I feel like Tesla and maybe now Polestar or some of the other companies that are doing this are probably in the best position to make that a reality for our customers.

Ash Patel: I’m personally betting on Lucid but you know,

Nash Seshan: Yeah, Lucid is another one. Yeah.

Ash Patel: As in not to win the self-driving L4 thing, but you know, I like their cars, but Lucid,

Nash Seshan: It’s a really good car.

Ash Patel: So you’ve had a variety of experience in so many not-so-adjacent organizations, because Netflix does something very different to Dropbox, which does something very different to Lyft. And then you co-founded a YC startup.

Nash Seshan: It was something that a friend of mine and I did. We were university roommates. He was in the biotechnology and life sciences space. And we wanted to build a platform that was going to reduce the amount of time it would take to bring newer drugs to market. We had identified an opportunity in the space, where typical drug development takes anywhere between 10 and 15 years before it comes to market for consumers like you and me.

And we identified along that pipeline, after talking to a bunch of users and people we knew in the industry, that there was enough opportunity to automate a lot of these workflows. A lot of what scientists typically do in drug development is data analytics, data collection, reporting of data, running a bunch of experiments, maintaining inventory.

There’s a lot of these auxiliary and primary workflows, they’re just screaming for automation, just screaming for something to come and make their life better. That was the opportunity we wanted to tap into and build a platform that would start with low hanging fruit and then build on top of that.

We had some success in terms of the idea: we got accepted into Y Combinator and had a few customers who were willing to pay us. The only thing that we struggled with, and I think we knew this getting into the entire idea of the company, is that a lot of the scientists who are currently doing cutting-edge research in that space are 55 and above.

So they’ve kind of been doing research for like 20, 30 years now without the aid of any tech. It’s just been that way. All of a sudden we have ourselves coming along and kind of saying, Hey, you now have this platform that’s going to automate X percent of your work. All you need to do is plug it into your systems and give us your data.

And that’s a level of trust that they were refusing to give us. Well, not, I wouldn’t say refusing to give us, but it was much harder to break through that barrier. That was the first problem. The second problem was the fact that once we started getting some customers to actually trust us and, you know, give us their data and use our platform, we noticed very quickly that their standardization of the way they did things or the standardization of data they collected was very, very different.

There was literally no standardization. So some of the scientists collected data on paper napkins, some of them collected on notebooks, some of them used Google sheets, some of them used other mechanisms. Some of them took pictures of whatever they wrote.

Ash Patel: Wait, you’re not kidding, right? They literally used, like, napkins and they took pictures of data?

Nash Seshan: Yeah, they did. Yeah. Well, it was like they were running an experiment in the lab and they had nothing else to kind of note down data on, so they would use whatever they would get. At times it would be a paper napkin, at times it would be a picture of a paper napkin or another piece of paper. But that’s kind of what we had to work with, right?

Ash Patel: As a quick aside, my university training was actually as a research scientist in biomedical science, so I’m shocked. A lot of people don’t know this, but I also have a qualification in biomedical science. I worked with PhD, like, friends who became PhDs and professors and all that, but I am shocked to hear this.

Nash Seshan: Yeah, and it’s funny, because my wife is a biotechnologist herself. So when we were running into these problems, I would chat with her and say, Hey, is what we’re seeing abnormal, or is this kind of the norm that you’re seeing in the industry based on your experience?

And she was like, yeah, I mean, we’ve done this ourselves. Like we haven’t had tooling or proper mechanisms to do it. And every scientist who comes along does it their own way. And then we also took this back and mapped it to an article we read when we were founding the company. This article was from the very renowned Nature journal, which kind of said that 70 percent of all biotech and life science research is practically not reproducible by another human being. And that just made sense because the data is not standardized. So if you publish a paper today somewhere and I’m supposed to go ahead and reproduce that research, there’s a 70 percent chance I will not be able to do it or get the same results that you did.

Because my data collection mechanisms or my analytics will be different from yours. That was a problem we were rushing to get a solution to. But then, again, we ran into the first barrier of having an age demographic of scientists who are north of 55. There were challenges on multiple fronts, but it was an enriching experience.

I mean, I think founding a company is no small feat and getting it to customers who want to pay for it is kind of the feather in the cap that I would take for, I mean, for life.

Ash Patel: I have two questions related to that. And for people who are wondering, wait, I thought I was listening to a conversation about SRE, what’s this talk about biomed and scientists taking data? No, no, no. This actually has relevance, because the question I’m going to ask you, Nash, is: what parallels do you see between our industry as a whole and biomedical science, where people were finding their own way to do things while you were trying to standardize it, and the challenges you faced?

Are you finding any parallels to those challenges, or are you finding that DevOps and SRE people are actually pretty good in comparison? I personally think they would be. But do you see any anti-patterns? That’s one of my favorite things to ask.

Nash Seshan: I do actually. There’s quite a few anti patterns that I’ve noticed.

I’ll start by answering your first question about similarities between my founder experience and what we see in the SRE world. I think the lack of standardization shows up as a lack of runbooks and playbooks in the SRE world, the documents that teach someone how to troubleshoot or diagnose a problem in a certain way.

I think a lot of seasoned SRE folks, again, I’m not trying to like stereotype or throw a blanket statement here, but I’ve noticed that a lot of folks who are much more seasoned, they have a lot of this in their head. And when a problem arises or when an issue arises, they’re the first ones to go and troubleshoot and navigate through the problem.

But they have it in their head and it’s not often documented, which means junior engineers or folks who are trying to learn the space are going to struggle to determine how to do it and to learn from what works or what doesn’t. So in terms of SRE standardization, that’s one of the areas where I see a lack of it in our industry.

The other thing that I also see, and this is getting better for sure, so I’m not going to say this in a pessimistic tone, is a lack of standard monitoring practices. So when someone is, let’s say, running a piece of infrastructure, or running compute on Kubernetes or a containerized system or whatever, people monitor things in different ways.

They choose to monitor what they think is important. But it would help to have some industry-wide guidelines or RFCs or whatever that say, Hey, if you look for these few things, or monitor these few metrics, you get these kinds of results, and you should always ensure that you have these monitoring bits enabled, whatever you’re doing.

I know it’s not a very strong case to push for because people can feel free to choose whatever they want to, but that is another area where standardization would certainly help. Going back to your anti patterns question, I feel like one of the biggest anti patterns in the SRE world is that folks try to automate everything, as opposed to understanding what needs automation.

Trying to automate everything or trying to observe everything using monitoring is possible, theoretically, not practically. I guess the thing that is most important to understand is whatever you’re trying to automate, do you fully understand the root cause of that? Are you automating the root cause or are you automating the symptom?

Sometimes the root cause is so very deep within that it might be a cascading effect, and you might be just seeing the symptom on the surface of things. So you might write some automation to fix the symptom, but are you truly understanding the root cause? I think this is something that less seasoned or less experienced SREs tend to do a lot more.

It’s like, oh, we’re running into this issue. This alert is firing 10 times a week or 10 times a day. Let’s go and write some automation to fix that alert. But is that alert really the problem? Or is it something much deeper that is causing the problem? That’s one of the biggest anti-patterns I’ve seen, which I hope people work to get better at.
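The symptom-versus-root-cause check Nash describes can be illustrated with a small sketch. This is only a rough illustration in Python, not something discussed in the episode: the alert names, services, and dependency map below are all invented, and the idea is simply to ask whether a noisy alert traces back to a shared upstream before automating a fix for the alert itself.

```python
from collections import Counter

# Hypothetical alert log: (alert_name, service) pairs. All names are invented.
ALERTS = [
    ("HighLatency", "checkout"),
    ("HighLatency", "checkout"),
    ("HighLatency", "search"),
    ("ErrorRate", "cart"),
    ("HighLatency", "checkout"),
]

# Hypothetical map from each service to the upstream dependencies it relies on.
DEPENDS_ON = {
    "checkout": ["orders-db"],
    "search": ["search-index"],
    "cart": ["orders-db"],
}


def noisy_alerts(alerts, threshold):
    """Return the (alert, service) pairs that fired at least `threshold` times."""
    counts = Counter(alerts)
    return {key: n for key, n in counts.items() if n >= threshold}


def firings_per_upstream(alerts, depends_on):
    """Count how many alert firings trace back to each upstream dependency."""
    firings = Counter()
    for _name, service in alerts:
        for upstream in depends_on.get(service, []):
            firings[upstream] += 1
    return firings


if __name__ == "__main__":
    print(noisy_alerts(ALERTS, threshold=3))
    print(firings_per_upstream(ALERTS, DEPENDS_ON).most_common(1))
```

In this made-up data the checkout latency alert is the noisiest one, but four of the five firings trace back to the same hypothetical upstream, `orders-db`. That is where an investigation would start before writing automation to silence or patch the alert.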

Ash Patel: Going back to standardization, there’s a high risk that if organizations take on a standardization drive, you’re going to get people being too rigid. So I propose to organizations: let your engineers experiment, and then once they have worked out a model that is effective and continues to give effective results, that’s when you say, this is how we do it.

Yep. Not going in as engineering management and saying, this is how we’re going to do it moving forward.

Nash Seshan: Completely agree. I think the engineers on the team definitely have to have a good understanding of what the environment is, have to be able to try things, have to be able to, like you said, build a model of what works and what doesn’t, and then iterate on that, right?

I think one of the key facets of SRE, or in general these days, software development of any sort, is incremental progress, right? Gone are the days where we’re all in the waterfall model, where you end up spending months and months writing a piece of software, then put it out there as a release, and then go back and iterate on it again to fix whatever issues or bugs.

People want MVPs, people want like really, really small scoped products to ship as quick as possible. Because they want to know and get instant feedback on what’s working and what’s not working. SRE is exactly the same, right? So if you identify a problem, can you fix it immediately or can you mitigate it immediately?

And then work on a longer term fix that might require more time to implement. I think it’s really down to that at the end of the day.

Ash Patel: Yeah, I recently spoke with a director of SRE and he was saying, there are teams who want an immediate fix. They want you to fix their problem yesterday.

I want the fix for what happened, like, now. And as an SRE, as a manager who leads SREs, you have to be ready for that kind of conversation. That’s when you say to your SREs, look, you have to be continuously looking at things. You need to be looking at the data that all these teams are working on, and don’t stretch yourself out too much.

But just keep your finger on the pulse of what’s happening with all these teams. That’s when you increase your responsiveness, and that’s when teams start loving those SRE guys around the corner.

Nash Seshan: Exactly. I think one of the things I always tell my teams is, I mean, this is usually for incidents that happen off hours, like off business hours or whatever. Focus on mitigating the problem. Let’s find something to stop the bleeding. You don’t have to work on building a fix during the incident.

You don’t have to stay up till 3 a.m. working on a fix. Can we stop the bleeding? If that means scaling up a machine, like, throwing money at the problem, do it until we get back to business hours the next day. Let’s focus on the fix during business hours, which allows people to think more about what could be a longer term fix.

If a good SRE is on call and an issue comes up, they would be able to stop the bleeding. But then typically I know a lot of people who are very good SREs who the problem kind of eats them up through the night. They always try to see what can they do? How can they resolve it? They’ll probably go ahead and go down a rabbit hole into trying and troubleshooting what might be the root cause.

Sometimes they’ll find it right away, sometimes they won’t, but at least they’ll come up the next morning with enough investigation done to know where we can start. And sometimes a fix is easy, sometimes it’s not. This empowers me and empowers them to have conversations with upper leadership, and with whoever is asking for a fix like yesterday, saying, Hey, we stopped the bleeding.

There’s no impact to the business. We’ll work on a longer term fix, but this is what we’re thinking. And here’s the data to prove it. I think having data powered conversations is always useful as opposed to saying, We need a fix. Let’s do whatever it takes. Let’s drop everything we’re doing and focus on a fix.

That’s not always the best idea at all times.

Ash Patel: There were two reasons why I wanted to talk about your startup experience. The first was what we just talked about. The second reason was I want to know your experience as a founder. Did it contribute to your ability to become a more effective engineering manager?

Nash Seshan: For sure. I think my experience as a founder reinforced my ability to develop. I think when you’re a startup, things change every day, right? It was kind of like walking into a war zone where priorities change every day. Customer asks change every day. We were a pre-product-market-fit startup.

So we’re always on the hunt to try to achieve product market fit or find that one silver lining that would just allow us to explode in that space. And because of that, our constant struggle or our constant battle was to find that, find that needle in the haystack.

It’s very similar to how SREs operate.

When you’re looking for improvements in your system, you’re always looking for that one needle in the haystack or a few needles in the haystack that will allow you to make significant improvements to performance or to capacity usage or to network bandwidth or whatever it is. As an engineering leader, what it allowed me to do is it allowed me to ask a bunch of questions to my team saying, okay, do we really need this now?

Is this the most minimally scoped version of what we’re trying to do? Can we cut the scope further? How many features do we need to build or how much code do we need to write? Before we can put this out there. So it allowed me to ask those kinds of questions better. It allowed me to get them to think about these things better.

And then it allowed me to tie business requirements directly to whatever projects we were working on as a team. People on my team always understand why we’re doing something rather than just the what. While they’re figuring out the how, they also know the why now. Because I’m able to tie that and be the liaison between their work and the business to be able to map that context back to them.

So it, it helped me become a better leader to a point where I was always offering context on why we’re doing something. The why may not always have been something that folks resonated with or agreed with, but I think at the end of the day, you have to align with business priorities. The business comes first.

So if a business decision is incorrect or someone feels it’s incorrect, that is a discussion to be had on a different day in a separate channel. But understanding why you’re doing something is always important.

Ash Patel: I’m sure you use that a lot in your current role at Wayfair when you’re managing a team that’s handling complex data infrastructure.

And I don’t have much exposure to e-commerce. I had an e-commerce company 14 years ago, but that was an offshoot of what we were working on for a long time. I am curious about how SRE fits into an e-commerce organization. So can you give me a breakdown of that?

Nash Seshan: Absolutely. I think an e-commerce company on the face of it is basically a marketplace where you have sellers trying to put their products online, buyers trying to buy from them.

But underneath the hood, it’s essentially a very complex piece of infrastructure that is powering this entire platform. You have a lot of services that are running, which require heavy compute usage. You have a lot of databases that are powering different parts of the platform, which are very critical to the entire platform.

You have a lot of networking that happens either on premise or in the public cloud, depending on how you’re set up. And there’s a lot of storage that’s required to store catalog information or to store a bunch of other pieces of data that are very critical to the business. So there is a lot of complexity in the infrastructure itself, granted that it’s not a machine learning type workload where you have tons and tons of GPUs that are operating at 100 percent efficiency or 100 percent CPU usage or GPU usage.

It’s not at that scale or that level of intensity, but at the same time, the scale is very different because Wayfair as a product or as a platform generates a lot of revenue, which means there’s a lot of orders, a lot of transactions, folks are buying stuff, folks are returning stuff. It’s a typical e-commerce workflow.

So when you have these pieces of infrastructure, you obviously need to keep them reliable. Any critical database that goes down or any piece of critical infrastructure that goes down impacts the business directly, which means lower revenue on that given day. Which means lower order volume on that given day.

So these are critical hits to the business which need a reliability engineering organization to make sure that things are up at all times, reliably, scalably, and highly available. And that’s where my team specifically comes in because we essentially are responsible for all the production databases that power Wayfair as a platform.

Ash Patel: And you have a lot of revenue that you’re trying to protect here because… Exactly. 12 billion dollars, that’s staggering for an e-commerce company that sells for… It’s a new world. It’s a new world.

Nash Seshan: It is. It is a new world. I think a lot of people are shopping online. And I think that’s where the market is moving.

Ash Patel: With this varied experience that you’ve had in so many different categories of organizations, have you experienced any organizational challenges in bringing in effective SRE, DevOps, infrastructure engineering practices?

Nash Seshan: That’s a great question, Ash. I think one of the things I’ve noticed is specifically at companies that are trying to establish an SRE org, not for companies that have already established or have been having an SRE org for a while now, but specifically companies that are trying to establish one, they tend to retrofit engineers with different skill sets into an SRE org because they want to bootstrap the org and they want to get people who have already been around at the company, had a lot of experience with the internals of the company and understand how the systems work, at least to some degree, to become an SRE and, you know, help kickstart.

While it’s not a bad idea, the problem that I see is SRE is actually a mindset and not a skill set. You can essentially learn the skills and the tools used by SREs like Kubernetes or containers or networking or storage or compute whatever. But SRE is really a mindset where it’s about being able to tinker, being fearless to experiment, being fearless to break stuff and try and see what resiliency is in your system.

I think a lot of folks who are not from that mindset or don’t have that mindset find it very hard to become really good SREs. If you’re someone who is already tinkering, or you’ve built stuff, or you’ve broken stuff, or you’ve torn stuff apart when you were a kid or along the way, you’re likely going to have that mindset coming in.

For me, this mindset really came to me when I did my first job at Cisco, which was in technical support. I was taught or trained to troubleshoot a problem methodically, but also figure out what might work and what might not work to fix the customer’s problem.

But a lot of folks don’t have that kind of experience. A lot of folks don’t have that mentality to try stuff out. And especially when you’re on call and you’re in a high pressure environment, you need that level of composure to be able to try stuff and see what works and what doesn’t work. Retrofitting ops engineers or developers or other kinds of skill sets into an SRE org sure will bring the technical abilities out, like folks will know the technical stuff, but would they really be able to be calm under pressure?

Would they have longevity in their SRE career? I think that’s something I feel is a huge organizational challenge. I wish organizational leaders understood that more and tried to balance the team out better with seasoned SREs and folks who have intrinsic internal knowledge, and then let them grow into that role over time. But bootstrapping an SRE org in a company with purely folks who have been around without a lot of tinkering experience? I don’t know if that’s the right way to go.

Ash Patel: There are a lot of sysadmins who are very disgruntled right now, saying, Hey, my title just got changed to SRE. What is that all about? It’s only been a title change. And some of them are saying, I just want to roll up VMs. I just want to do that.

And now they’re getting told, No, no, no, no. You’re going to do this thing that Google wrote about in its book in 2016.

Nash Seshan: Pretty much. And it is, it is kind of a mindset change. It’s like I said, it’s not, it’s not a skill. It’s a mindset. SRE is really a mindset. It’s not a skill.

Ash Patel: That’s a great piece of advice to give to SRE managers who are also trying to hire good people.

Is there any other piece of advice that you’d give to them for developing an effective SRE team?

Nash Seshan: I think to managers and leaders trying to build a team, I would say empower them to feel comfortable to go and break stuff, feel empowered to go and try stuff, feel empowered to experiment, and feel empowered to make your teams feel that they can go and do stuff without their job hanging over their heads.

Especially in this market, I think a lot of folks are very cautious about doing anything that might even remotely cost them their job, but I think good SRE leaders would be able to suck that up and not let it percolate through to upper leadership, and at the same time give their engineers space to become better SREs over time.

So that’s what I try to do. That’s what I’ve always done. That’s the only piece of advice I would give to other SRE managers out there.

Ash Patel: There’s another area that I’m getting a lot more interested in. I think a lot of people will get value from you giving advice in this. Individual contributors keep reaching out to me and saying, Hey, my career progression, it doesn’t make sense in this organization.

Essentially, what I’m saying to them is you have to be a business of one. You have to develop your own self to be an effective individual contributor for the market, not just for the situation you’re in. And I’m guessing managers are going to hate me for saying this because I’m saying, Hey, you know, they’re…

You know, they’re going to move on eventually, right? And they should be developing themselves for what the market dictates. Do you have any advice for those individual contributors who are feeling stuck in one company’s way of doing things? And I’m saying to them, expand out the knowledge. Do you have any advice for them?

Nash Seshan: I think I would completely agree with what you’re saying in the sense that you always have to be curious. I think curiosity is the only way that will allow you to bridge the gap between where you are as an individual and what is being asked of you in the industry. So if you’re looking for a career transition after having spent a good number of years at a company, try to see what’s out there.

How has the industry evolved in your particular domain, if you’re intending to stay in that domain, between when you started and now? And what is the delta between your current skill set and what is actually being asked of engineers? So for example, one of the things is that everyone’s jumping on this AI bandwagon.

Every company wants to have prompt engineers now. It’s a good thing, but at the same time, there’s going to come a time, maybe in the next few years, where prompt engineering is going to be so overloaded with folks learning about prompt engineering that there’s not going to be enough jobs for prompt engineers.

So, identify areas where you can truly deliver value. How can you build on your current skill set? Or how can you learn something new and expand out wide? I think a lot of companies are looking for T-shaped engineers and not I-shaped engineers anymore. So, if you’re an expert at technology X or Y, how can you learn about A, B, and C and increase your breadth as well?

You’re a subject matter expert in technology Y probably, but how can you learn about other things and show that you can do X, Y, and Z too? Not just Y anymore. I think that’s going to increase the market value for yourself a lot.

Ash Patel: Nash, I love what you just said, because you’re touching on all the topics I write about all the time. I’m loving it. Thank you so much for joining me and giving your insights about your experience, which, I think, is so useful for people on both the management side and the individual contributor side to learn about. Thank you so much.

Nash Seshan: Thanks for having me, Ash. This was a great conversation.

I’m so glad we could do it.

Author

Ash Patel

Reliability Nut at SREpath

Ash has an unhealthy obsession with software reliability. Maybe it’s got to do with the trauma of working at a few companies where software kept slowing or went down while he worked to turn it around. His ma hopes that he can one day turn this passion into a respectable job or business. Still waiting…
