#11 Rising to Staff Engineer in DevOps and SRE (with Rajesh Reddy N) – Boost software reliability

Episode 11 [SREpath Podcast]

Ash Patel interviews Rajesh Reddy N who is a senior DevOps architect at CoinDCX.

Rajesh shares his thoughts on effective patterns in SRE, DevOps, and platform engineering.

He emphasizes the importance of understanding the systems, prioritizing issues, and avoiding buzzword-driven decisions. Rajesh also highlights challenges like alert noise and treating all issues as critical.

Episode Transcript

Ash Patel: Great to have you on here, Rajesh. Let’s start talking about what your role is at CoinDCX.

Rajesh Reddy N: I’m currently working as a Senior Staff Engineer as part of the DevSecOps team. I take care of the entire design of how the SaaS platforms run in AWS.

And on the platform engineering side, how do we create the IDP to reduce the toil that internal developers, and also DevOps teams, are facing?

So we’ll be mostly looking at SRE principles, as to how we can tackle that. Actually, we don’t have a separate team for platform engineering, DevOps, or SRE.

We have a small team handling all three aspects together.

Ash Patel: You’re an individual contributor, right?

Rajesh Reddy N: That’s right.

Ash Patel: As a staff engineer, do you find it different to when you were a principal DevOps engineer in your previous role and as a solutions architect in the role before that?

Rajesh Reddy N: As you get into bigger roles in your career, you’ll face some difficulties. One difficulty is that you will not be doing the actual work on a day-to-day basis. You have to explore on your own and innovate on your own. That is the main difference I could see.

And the second difference, I would say, is that it’s not mostly about coding or just troubleshooting some issues. You have to understand the issues the system is facing, and you have to think about the long term to address those issues.

Those two are the main differences in thinking that I could see.

Ash Patel: I used to work as a senior manager in a company. One of the things that we used to talk about for senior individual contributors was, as your role expands and you get promoted, it’s about the time span that you’re looking out at.

So do you find that you’re looking out at longer time horizons as you’ve grown in seniority?

Rajesh Reddy N: That’s right. You have to catch up with the latest things happening in the industry. It’s very difficult to catch up. Unless you spend your own time on it, it’ll be really difficult.

Ash Patel: How did you get into SRE in the first place?

Rajesh Reddy N: I started my career as a network consulting engineer at Cisco. Eventually I joined a startup called Anuta Networks, which was into network automation and orchestration. I took on QA, product management, DevOps, and SRE, all the roles together.

That’s when I realized that once you are an SRE, you get to know what is happening in and out of the system. If you are a QA or a developer, you might not know what is happening across all the other systems.

But if you’re an SRE working in a startup, you’ll get to know what is the system, how the system is built, and how do you operate the system, what customer is expecting, what developers are building, and how do you collate those things.

That’s where I realized that my role is not just to code something, but to actually operate something. When you are just coding, you might be coding one microservice in your lifetime. But if you’re operating the systems, you get to know how systems behave in different workplaces, how to operate them, and what kinds of challenges you see in different tech domains.

I realized this is my interest, and I had to get into SRE and DevOps. After three years of my career, I switched completely to a DevOps and SRE role. And later on, all my roles were in DevOps only.

Ash Patel: It’s interesting you mentioned you’ve had DevOps and SRE roles.

Would you use those terms interchangeably or would you consider DevOps to be quite different from SRE in terms of how you’ve done them in your work history?

Rajesh Reddy N: For the first five to six years of my career, at Cisco and then Anuta Networks, they had only DevOps as a single role. But when I moved to Zeta, a fintech company, they had DevOps and SRE as separate roles.

What DevOps does is create the basic infrastructure for all the developers. The design aspect or R&D work is done by the DevOps folks. I would say they’re calling us infrastructure engineers, not DevOps as such. And SREs are mostly looking at capacity planning, and they make sure that applications are working reliably, as expected.

They also follow the escalation management and incident management processes. They tackle the issues in production and reduce the toil happening on a day-to-day basis.

So these are the four things SREs do.

The DevOps team is mostly focused on creating infrastructure.

That’s where I saw the difference between DevOps and SRE teams.

Ash Patel: What would you say to people who say DevOps is a philosophy. It’s not a role. What would you say to that?

Rajesh Reddy N: I definitely agree with that, because DevOps is a culture which you have to inculcate.

It can be an actual application developer, an SRE, or a program manager. Anybody.

Accommodating DevOps in your team doesn’t mean creating one team with a new name like DevOps or SRE. You have to accommodate the culture in the team, or inculcate that culture in your team.

When I say culture, it’s mostly about processes, tools, and people. All three have to work together to achieve the goal you are expecting. But DevOps is not one team. All the teams will be working for reliability, but the goal might be a little different at the granular level.

At a high level, though, reliability is the main focus for everybody.

Ash Patel: You now do DevSecOps in your current role.

That’s a change from doing DevOps. Have you found that now it’s no longer about DevOps? Now Sec is important in the Dev and Ops?

Rajesh Reddy N: Yeah. If you look at my career, I started in NetDevOps, where networking, development, and operations all three work together.

Then later on, I figured out there is security as well. We don’t say NetDevSecOps, because that would be a long name to use. When you are a developer, you can call yourself a full stack developer, a front end developer, or a back end developer.

But for DevOps engineers, we don’t have any “full stack DevOps”. That’s where names like DevSecOps, DevNetOps, or NetDevOps emerged. And security is a very important pillar in your entire application lifecycle. Unless you tackle security issues, you’ll not be a full-fledged engineer.

Ash Patel: We could go as far as saying that there’s no longer a DevOps role. You can say DevSecOps.

Rajesh Reddy N: I agree with that. The problem is when you have a DevOps team and assume they will take care of security. Security is the responsibility of every individual. All engineers have to take care of cost, security, and reliability.

But when you are in DevOps, you also have to take care of the security of your tools. That is where DevSecOps plays a pivotal role.

Ash Patel: There may be some engineering managers out there. They’re really intelligent people, but they’re just struggling to delineate the difference between DevSecOps people and people who are already in the organization doing cloud security and AppSec.

So how do you delineate what all these different teams are doing and they all have sec or security in their job titles?

Rajesh Reddy N: Job title wise, a few folks might say you need cloud security, Kubernetes security, container security, everything together as a single profile.

But I would say you need granular roles. I presume that within the next three to four years, there will be dedicated roles within DevOps or SRE. When I say dedicated roles, you might be a dedicated release management engineer, which you have seen in the past. Or you will be a dedicated engineer for the API gateway, or a cloud native networking specialist.

There will be a specialist role only for the security aspect of containers and Kubernetes. There will be security in the cloud, and there will be a cloud engineer designing the applications. There is definitely a separation between each and every team. And if you are merging all the titles together and asking one individual to handle all of them, then there is a void. Somebody has to fill that.

If you’re just looking at cloud security, you’ll be working mostly with the AWS tool set, or that of whichever cloud provider you are working on.

But who will take care of the Kubernetes or underlying platform security?

You have set up the infrastructure, but who will take care of the actual platform security?

That’s where the problem comes in. So you need to differentiate these roles.

Ash Patel: Essentially SRE might split, like you said, into all these various areas. And it makes sense because I’ve had two pieces of feedback from people I’ve spoken with over the years. The first is SRE is too broad a role.

It’s too much for one person to cover. So how do we hire for it? It’s just not possible to cherry pick different people and different skills and put it all together. It’s easier to actually have them specialized and then working within an SRE work group. And the second one is SREs are being pigeonholed into incident response.

Do you have any comments on these two things that I’ve learned from people as to how they’ve perceived this space?

Rajesh Reddy N: I agree with that, because I have seen SREs treated as reception engineers, as if they are sitting at the reception, taking in incidents and tackling those issues.

They should not just be looking at the repetitive task and handling only those issues. You have to treat SREs as another engineer. They do have an innovative mindset. And they have to evolve as an engineer. And also they have to troubleshoot issues. And they have to automate.

They have a lot of work to do compared to all other engineers. Because they are customer facing engineers. They have to take care of the production and non-prod. Both things.

Ash Patel: You’ve been at CoinDCX since January this year. Can you give me a brief overview of what their history is with SRE?

Rajesh Reddy N: As I mentioned, we don’t have a separate team for SRE or DevOps. We are just calling it a DevOps team, even though the name is eventually getting changed to platform engineering team.

The team size is maybe 10 to 15 engineers. That is one thing. And there is a mix of junior and senior engineers in our team. Why do we need both junior and senior engineers? There are a few things which a junior engineer can definitely do, where they’ll definitely add value.

They can come up with an innovative mindset. And there are a few things where the history or context is required. What has happened in the past, and if such issues happen in the immediate future, how do you tackle them? That’s where the senior engineers help us. So we have the right mix of both senior and junior engineers to evolve.

Ash Patel: I suppose then we can say it really doesn’t matter what you call your team as long as they’re doing what we want them to do, which is SRE, platform engineering, DevOps, and pulling practices from all these areas and making the software reliable, functional, and secure.

Rajesh Reddy N: That’s right. Because reliability or security is not only part of DevOps or SRE, as I mentioned.

It can be part of the actual app developers’ or QA folks’ work too. Every engineer needs to take care of application reliability. If I’m deploying an application in production, I’ll be taking care of how it is deployed and how securely I’m deploying it. And as an app developer, you need to take care of how you are developing your application.

Do we have any vulnerabilities? How do you fix them? So everybody is responsible for making sure the application is reliable and secure.

Ash Patel: I’m fully with you on that. I see this anti pattern of people saying, “hey, we are only responsible for this because this is my job title. I am a DevOps engineer, so I only handle this. I am a platform engineer, so I only handle this. I am an SRE, so I only do incident response.” Actually, I’ve never heard an SRE say that. But yeah, you get the idea.

This is one of the biggest issues that I’ve had in the past with people with specific job titles saying, this is the only thing I do, and if you actually have a broader mandate for the team, then people know that there’ll be a multidisciplinary group of people pulling in practices to achieve the goal of that particular team.

Rajesh Reddy N: If you have multidisciplinary engineers, you’ll be getting engineers with wide technical breadth.

But what about the depth? So we need to take care of the depth as well. If you’re troubleshooting any issues in the production, unless you have a subject matter expert in that particular area, you’re spending or you’re wasting lots of time troubleshooting those issues.

Ash Patel: I was actually considering calling this whole area cloud operations. Something like that. I’m sure some people do call it that. Which integrates practices that we’ve now known as SRE, platform engineering and DevOps, DevSecOps. What would you consider is the biggest anti pattern that you’ve seen in this whole broad field of all these various areas, these three particular areas that we keep talking about?

Rajesh Reddy N: I would say not just one anti pattern, I have seen more than one anti pattern in the industry.

The first one that always comes to my mind is not having a staging environment. If you are not treating staging as staging, then what is staging?

And if you are not developing a dedicated staging environment, where do the engineers play? If you are not testing anything in staging, and if you’re not allowing downtime or a maintenance window in the staging environment, how do the engineers learn how the system has been developed? How it has been deployed?

If you are deploying hundreds of enhancements directly into production without evaluating them properly in staging, you will spend 100 times as much downtime in production. So unless you treat staging as staging, you will face difficulties eventually. That is the first anti pattern which I have seen.

The second anti pattern is everybody is treating every issue as a P0.

SRE is not there for all the environments. An SRE’s primary task is to engage in incidents, and that means production incidents, not non-production incidents. So if you are treating every issue as a P0, this is the right time to change. Otherwise, you will face the same difficulties in the future.
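A minimal sketch in Python of the “not everything is a P0” idea described above. The field names, the environment labels, and the P0 to P3 scale are assumptions made for illustration; they are not taken from the conversation or from any specific incident tool.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    title: str
    environment: str         # assumed labels: "production", "staging", "dev"
    customer_impact: bool    # are real users affected right now?
    workaround_exists: bool  # can users or operators route around it?


def assign_severity(issue: Issue) -> str:
    """Return a severity from P0 (page someone now) to P3 (backlog)."""
    if issue.environment != "production":
        # In this sketch, non-production problems never become incidents.
        return "P3"
    if issue.customer_impact and not issue.workaround_exists:
        return "P0"
    if issue.customer_impact:
        return "P1"
    return "P2"


issues = [
    Issue("Checkout API returning 500s", "production", True, False),
    Issue("Staging deploy pipeline is red", "staging", False, True),
    Issue("Slow internal admin report", "production", False, True),
]
for issue in issues:
    print(assign_severity(issue), issue.title)
```

The point of encoding the rule, rather than leaving it to whoever files the issue, is that only production issues with customer impact can reach the top severities.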

The third thing is SRE teams being treated as tools engineers. If you are treating SREs as tools engineers, and you just hire a tools engineer who is well versed in Helm or Kustomize or some other tool, what if the tool itself is not sufficient to handle the issues you face going forward? Or what if the tool’s license changes, the way HashiCorp moved Terraform to the Business Source License?

What if a Terraform engineer has been hired, but you are no longer using Terraform? So it’s not about tools engineers as such. You need to hire an engineer who understands the system and who is okay to get into the DevOps culture and DevOps processes at your organization.

Ash Patel: So I’m guessing you can’t get behind the idea of a Kubernetes engineer.

Rajesh Reddy N: That’s right.

Ash Patel: Which there are quite a few job listings out there for. And I suppose to all those people out there, you’re going to get stuck if you decide, you’re not continuing with Kubernetes, and you’ve got people there.

Rajesh Reddy N: Yeah, I agree.

Ash Patel: I mean, they won’t get stuck, but you know what happens next. So what challenges have you personally faced in your work in DevOps, DevSecOps, SRE, and platform engineering in terms of trying to get the practices associated with these areas out into an organization?

Rajesh Reddy N: If you split it into organizational and technical challenges, every organization uses a different tool set. And if you’re changing from one tool to another, there might be some difficulty in understanding it. That is one change you will see.

The second change is that every organization has its own source code management and everything else, and they follow different CI and CD processes. Even for SRE or DevOps, whenever they want to make any change in production, they raise a PR, and that pull request has to be accepted by multiple approvers. Unless it is approved by multiple stakeholders, you’ll not be able to put that change in production.

In one way, that is very good. In another way, it creates some problems, like adding lead time when you need to deploy something urgent to production.

On the technical side, there might be a technical deficiency in the team, where you just onboard tools as is, based on whatever you find in articles or on YouTube.

You should not do that actually.

You have to understand how the technical stack works. Does it fit your requirements? Unless it fits your requirements, don’t put that tool directly into production. What so many organizations do is read about what Google or Apple or some other organization is doing, and then try to replicate the same in their environment. Not everybody is the same. You have to understand that, and that is the main technical challenge I could see: people look at someone else’s problem and try to simulate the same problem. It’s not the same.

You have to understand the problem, which is core for your organization and bring the right technical tool set and put the right processes for your organization.

Ash Patel: It’d be interesting if you could give some words of advice for people who are in your position or have recently come into a position like you’ve had for a few years now of a principal or staff SRE. Do you have any words of advice for them?

Rajesh Reddy N: One piece of advice, if you are an SRE manager or SRE leader: the first thing you need to understand is whether you have alert noise and whether you have the right incident management protocol set up. If you don’t, then you are making your engineers suffer, whoever is in that SRE or DevOps role.

First, tackle the issues.

Don’t have too much alert noise in your system.

Try to prioritize your issues. If you don’t prioritize your issues, then there will be alert fatigue. If you have alert fatigue, then you are in a very bad state.

And the second issue is don’t expect all the SRE engineers to develop everything as a dashboard.

So I have seen the two extremes in my experience so far. One extreme is we don’t have any dashboard to visualize what is happening in my system. How do I see that?

And the other extreme is I have too many panels or too many dashboards. Every metric I convert into a panel or dashboard or visualization.

These two issues are creating problems when you are troubleshooting any issues in production. So, as an SRE manager, you have to prioritize what is the SLO, what is the error budget, what is the burn rate, and how do you alert the systems, and do you have to wake up your engineers in the night or morning. Prioritize the issues, put the severity against each issue.

And use the automated tool set for the on call rotation.

Otherwise you might be forgetting who was the on call last week, who was on this week. So these three things, if you just put in a process, then you are a good SRE manager.
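The SLO, error budget, burn rate, and on-call points above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any team’s setup: the paging thresholds (10 and 2), the engineer names, and the rotation start date are made up, and a real setup would normally rely on a dedicated on-call tool rather than a hand-rolled function.

```python
from datetime import date


def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    return (failed / total) / error_budget(slo_target)


def alert_action(rate: float, page_at: float = 10.0, ticket_at: float = 2.0) -> str:
    """Decide whether to wake someone up. Thresholds are assumed values."""
    if rate >= page_at:
        return "page"
    if rate >= ticket_at:
        return "ticket"
    return "none"


def on_call_for(day: date, engineers: list[str], rotation_start: date) -> str:
    """Weekly rotation: whole weeks elapsed since rotation_start pick the engineer."""
    weeks_elapsed = (day - rotation_start).days // 7
    return engineers[weeks_elapsed % len(engineers)]


# Hypothetical numbers: 200 failures out of 10,000 requests against a 99.9% SLO.
rate = burn_rate(failed=200, total=10_000, slo_target=0.999)
print(alert_action(rate))

team = ["asha", "bilal", "chen"]  # hypothetical names
print(on_call_for(date(2024, 1, 10), team, rotation_start=date(2024, 1, 1)))
```

With these inputs the burn rate is roughly 20 times the budget, so the sketch pages; a slow burn of around 5 times would only open a ticket, which is one way to keep night-time pages for issues that actually threaten the SLO.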

Ash Patel: And how about people who are actually looking to get into the position you’re in? So some people who are currently, and I know this is not considered a position, a junior SRE, but they do exist, or a less experienced SRE looking to grow into a career where they eventually reach your position.

What kind of advice would you give to them to help them grow in their career?

Rajesh Reddy N: For any person who is looking at SRE as a first job, I don’t recommend that at all. Unless you understand the systems, how they have been developed and how they have been tested, you’ll not be able to understand how they are deployed and what you have to monitor in production versus non-production.

SRE should be your second goal, and SRE is already part of your app dev or QA work. So start with development and understand how the application has been developed. Understand how it is being tested, and pick up the best practices from how the existing organization does SRE or DevOps.

Career ladder wise, if you are just getting into SRE, don’t get directly into the cloud or Kubernetes ecosystem. First understand the systems underneath Kubernetes, like Linux. If you get directly into Kubernetes, you will face difficulties, really many issues while troubleshooting. You might clear the interviews for Kubernetes engineer roles, but the underlying issues are not just related to Kubernetes.

Kubernetes is built on top of systems like Linux, and on top of Docker or other container technologies. So I would recommend moving into the SRE position progressively.

Ash Patel: That’s what I see a lot of budding SREs actually do. They see a job description, well, they’re looking on job ads, and then they see you need to know Linux, you need to know Kubernetes, you need to know all the different CNCF type tools, and then they go and try and learn all of that at the same time.

It’s a recipe for disaster.

This has been a fascinating conversation. Any words of advice you’d have for people setting up SRE in their organization about what they can do better in this space?

Rajesh Reddy N: So one thing is you have to clearly define your career path.

So if you are an SRE, I don’t say you have to define your career path for the next 10 years. Just define it for the next two years. If you want to get into SRE, define your path and understand the systems. Then get hands-on experience and speak with your industry peers about how they are doing it.

So collaborate. In discussions with other industry folks, you’ll come across challenges you have not seen so far. The other thing is I have seen some lateral hires who are pretty experienced, maybe having spent more than 15 or 20 years in the industry.

They are used to older tactics for what they want to do in the system, for troubleshooting, or for how they want to put a process in place. Microservices were not there 10 years ago; it was completely monolithic. Now microservices are playing a major role in the ecosystem.

You need to understand how systems are working. It’s not just for the new engineers, even the old engineers who have spent a lot of time in the industry. They have to unlearn some of the things and learn new things, what is happening in the industry. Otherwise, microservices testing itself is very problematic.

If you have a distributed system, how do you deploy it? How do you test it? How do you release it? How do you promote it? Everything is a challenge here, but in the monolithic case, it was pretty straightforward. You had tools in place, you had automation in place. It was a very established process.

Your role was to just optimize the system and get it into production directly. But now, if you’re changing something in one microservice, that might affect another microservice. There is a cascading effect across microservices, so you need to understand the latest patterns in the microservices and cloud-native landscape.
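One way to picture the cascading effect is to walk a service dependency map and list everything that depends, directly or indirectly, on the service being changed. The sketch below does that in Python; the service names and the call map are hypothetical, invented only for this illustration.

```python
from collections import deque

# Hypothetical service map: each service lists the services it calls.
CALLS = {
    "web": ["orders", "auth"],
    "orders": ["payments", "inventory"],
    "payments": ["ledger"],
    "auth": [],
    "inventory": [],
    "ledger": [],
}


def affected_by(changed: str, calls: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively calls `changed`."""
    # Invert the map so we can walk from a service to its callers.
    callers: dict[str, list[str]] = {name: [] for name in calls}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    # Breadth-first search upstream from the changed service.
    seen: set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


print(sorted(affected_by("ledger", CALLS)))  # ['orders', 'payments', 'web']
```

In this made-up map, a change to the lowest-level service reaches three others, which is the kind of blast radius a monolith’s single release process never had to reason about explicitly.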

Sometimes I have seen buzzwords taking over, where a team goes directly with a trendy word. Kubernetes is trending in the industry now, so why don’t we implement Kubernetes? We don’t see any requirement for Kubernetes, but we see it as a trend, so we want to use Kubernetes as our default ecosystem.

This is completely an anti pattern, actually.

Even Amazon Prime shifted back to a monolith. So there is a clear distinction. Only if you are okay to handle these issues, or you need to handle that much scale, or you are expecting that much business growth in the near term, should you choose this kind of thing.

Otherwise, don’t opt for Kubernetes or any cloud native things. They are very catchy words in the industry.

Ash Patel: Funny you mentioned that Amazon Prime Video example. That was because they were using serverless, right? And serverless can get expensive.

Rajesh Reddy N: That’s right.

Ash Patel: I think people didn’t read that part of the whole headline, where they were saying we moved to monolith because, yeah, we were actually using serverless for some, I don’t know. Can you figure out why they would use serverless in the first place for such a data, compute, and bandwidth intensive application?

Rajesh Reddy N: There are a few things. With serverless, what they would be looking at is horizontal scaling. If you’re scaling a system from X to Y, you need to know the right autoscaling methodology. With serverless, the easy part is that it is handled by your cloud provider. Even for Amazon, however, that is not the case.

You still need to understand the right autoscaling principles. Unless you apply them in serverless, you’ll end up with too much cost. You don’t even know how much you’re spending there. And how do you scale it? That also you don’t know. Even if you want to upgrade something from X to Y, what is the underlying infrastructure there?

You don’t even know. But people look at it like this: there is a system which has to be upgraded from X to Y version, and just to avoid the upgrades, or to get the autoscaling functionality, they’ll blindly choose serverless. Eventually what they see is that serverless is not a completely different ecosystem.

It’s just like you are just shifting the responsibility from your organization to the cloud provider, or whoever is providing those services. You need to understand the distinction between that.

Ash Patel: I would use serverless for something like where I cannot predict how much usage there is going to be. But something like Prime Video I know people are going to be watching at specific times and certain levels of traffic are going to be coming.

So I’d be able to do capacity planning. Which is why it just boggled my mind. What I think could make people go, hey, see, look, they’re switching to monolith is because it was an Amazon company. Amazon Prime, where AWS was the cloud supplier. And they’re like, look, even them using AWS, they’re switching.

But they don’t realize that Amazon is still a client of AWS. It’s not actually telling AWS what to do. So each product of Amazon has to figure out their own architecture themselves. And they obviously learned that they didn’t use the right architecture the first time around while they were trying to figure out the actual product of Amazon Prime Video.

Rajesh Reddy N: That’s right.

Ash Patel: Rajesh, really appreciate you coming on and giving your insights about challenges in SRE, DevOps, DevSecOps, platform engineering, being a principal and a staff engineer. There was so much that got packed into this session.

So, really appreciate you joining me on this.

Rajesh Reddy N: Thanks for having me, Ash.

Author

Ash Patel

Reliability Nut at SREpath

Ash has an unhealthy obsession with software reliability. Maybe it’s got to do with the trauma of working at a few companies where software kept slowing or went down while he worked to turn it around. His ma hopes that he can one day turn this passion into a respectable job or business. Still waiting…
