How developers can survive “you build it, you run it” – Boost software reliability

Introduction

As a developer, your involvement with your code can range from nothing at all once it’s been committed, all the way to looking after it right up to production.

The latter is called the “you build it, you run it” model.

It’s not going away.

How far it applies to you depends on your organization.

It’s likely to increase in popularity across organizations, as:

software operations become more complex and
management calls for more accountability from developers

You might even get asked to respond to incidents when your code causes an outage or performance issue.

(Hopefully, there’s an SRE around to join you on the incident and guide you through it!)

That’s why it’s important to create reliable software from the beginning.

This way, you’ll have to respond to fewer and less severe incidents.

We will explore some ways you can bake reliability into your feature development work.

👇🏼 Let’s start off with a developer responsibility that makes obvious sense and go from there…

Code maintenance beyond the original commit

It often goes without saying that your responsibility for code doesn’t end at its first commit. From time to time, you will have to update your code to:

improve its efficiency, including performance optimizations
make it work with major changes in other parts of the software, e.g. changes affecting how an API will be consumed, and
fix bugs that have been found in testing or production

Not doing this effectively will increase technical debt.

And technical debt is the enemy of reliability.

Other than continuously improving your code, you can integrate reliability-supporting concepts into your workflow.

Get comfortable with reliability-supporting concepts

Here’s a hard truth: your infrastructure is flawed.

And it always will be flawed.

The best you can do is prepare your apps and services to deal with such flaws when they turn into outages and performance issues.

Here are a few reliability concepts you can dig into and adapt to your environment:

Observability through monitoring, tracing, and logging
Instrument code with OpenTelemetry
Add circuit breakers to the code
Integrate regular code reviews into your workflow
Prepare services for load-shedding behavior
Add feature flags for nuanced operational control
DIY passive guardrails to protect your code-in-production
Design code for graceful degradations and failure modes
Build test-driven development (TDD) into your planning
Optimize code to work with distributed database architectures
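
To make two of these concepts concrete, here is a minimal sketch of a circuit breaker combined with a fallback for graceful degradation. It is an illustration only, not a production implementation: the class name CircuitBreaker, its parameters (max_failures, reset_after, clock), and the fallback callable are all assumptions made up for this example.

```python
import time


class CircuitBreaker:
    """Illustrative breaker: opens after repeated failures, retries after a cooldown."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # failures tolerated before opening
        self.reset_after = reset_after    # cooldown in seconds while open
        self.clock = clock                # injectable clock, handy for tests
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        # While open and still cooling down, skip the dependency entirely.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cooldown over: let one trial call through (half-open state).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()  # degrade gracefully instead of raising
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap a risky dependency call, for example breaker.call(fetch_live_prices, use_cached_prices), where both function names are hypothetical. The point of the design is that a failing dependency stops being hammered with requests while users still get a degraded but usable response.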

I recently interviewed Pablo Bouzada of ViaPlay, who as an engineering manager espouses the reliability-focused thinking and practices above.

He’s not an SRE manager nor has he ever held an SRE title.

His team doesn’t have a single Site Reliability Engineer.

Yet he has put in time and effort to integrate reliability concepts into the work of his software engineering team.

They’ve reaped the rewards through highly reliable software.

Their legwork early in the SDLC reduces future workload, for them and for their operations peers, in terms of incident response and large refactoring projects.

Check out the episode here.

Incident response for code-related issues

Site Reliability Engineers (SREs) make up the bulk of first responders and coordinators for incidents such as outages and performance degradations.

2023’s tech recession has seen a reduction in the number of such SREs working in many organizations.

The evidence is still anecdotal, with little media coverage of specific layoff numbers for this role.

But I’ve had enough SRE managers tell me about having even more limited resources than before.

What this means is there are fewer SREs to go around.

And they tend to be the main or sole on-call responders to incidents.

Now there’s a deeper need to share responsibility for such incident responses.

Responding to incidents is now a fine balancing act between limited SRE resources and the often overworked developers who are closest to the code.

To make your life easier, it’s best if you have a well-integrated suite of tools and practices that can give you greater visibility into issues, as well as ideas to solve the problem.

Without naming names, there are tools that can help you:

automatically instrument your code with telemetry
set up incident response war rooms as incidents occur
monitor resource usage patterns by services
trace across services to find redlines for latency
trace within specific functions of code for the same
make sense of the big data from logs, traces, and metrics
use AI for troubleshooting issues with infrastructure components
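
As a rough idea of what tracing latency within specific functions involves, here is a hand-rolled sketch that uses only the Python standard library. Dedicated tools do far more than this. The decorator name timed, the threshold_ms parameter, the "latency" logger name, and the load_profile function are all hypothetical, chosen purely for illustration.

```python
import functools
import logging
import time

logger = logging.getLogger("latency")  # hypothetical logger name


def timed(threshold_ms=100.0):
    """Log a warning when the wrapped function runs longer than threshold_ms."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on success and on exceptions, so slow failures show up too.
                elapsed_ms = (time.perf_counter() - start) * 1000
                if elapsed_ms > threshold_ms:
                    logger.warning("%s took %.1f ms", func.__name__, elapsed_ms)
        return wrapper
    return decorator


@timed(threshold_ms=50.0)
def load_profile(user_id):  # hypothetical function, stands in for real work
    time.sleep(0.1)  # simulate a slow lookup
    return {"id": user_id}
```

Calling load_profile(7) returns the profile as usual and emits a warning through the logger, because the simulated work exceeds the 50 ms threshold. During an incident, warnings like this can point you at the slow function rather than leaving you to guess.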

What I am getting at is that there is a myriad of solutions to ensure you aren’t wading around in the dark when you have to respond.

Each can cover a very specific aspect of your incident response needs, to give you better and better coverage of even the most complex incident.

Wrapping up

This has been a high-level overview of your responsibilities as a developer for the reliability of software.

Keep in mind that you may need to support an incident response at some point in your career, and might even need to do it regularly.

To minimize the pain of doing this, I suggest you design your code with reliability concepts in mind and regularly maintain it to keep technical debt low.

Author

Ash Patel

Reliability Nut at SREpath

Ash has an unhealthy obsession with software reliability. Maybe it’s got to do with the trauma of working at a few companies where software kept slowing or went down while he worked to turn it around. His ma hopes that he can one day turn this passion into a respectable job or business. Still waiting…
