Agile software delivery is de rigueur in modern software. However, as complexity increases, there’s a high risk that frequent, high-velocity releases will break software-in-production.
Software-in-production refers to software that is live and accessible to users.
That’s where Site Reliability Engineers (SRE) can come to support the Agile software team’s efforts.
Who are Site Reliability Engineers?
Site Reliability Engineers (SREs) are specialist software operations engineers committed to ensuring your software remains reliable after being deployed across the release train.
They may help build and improve the platform that your software is launched into. However, it’s rare for SREs to own the platform work entirely. They are typically highly experienced engineers.
The unique value of Site Reliability Engineers is in:
- advocating better software release practices
- planning for better production systems and
- doing proactive project work to improve the software’s performance and reliability
In other words, Agile software work needs SRE support. But SREs also need Agile.
It’s not an equal relationship, however. Let’s modify the above statement for accuracy:
- Agile can benefit from SRE help to a great extent
- SREs can benefit from Agile practices to some extent – in their project work
My relationship with Agile practices
Over the years, I have moonlighted in several startups. These ventures depended on Agile methodology to release software at high frequency.
Being a certified Scrum Master, I am well-versed in Agile. The culture at these places made Agile work well, which was a saving grace in fast-moving environments.
My last role (before starting SREpath) was as an operations director at a healthcare company, certainly a more traditional and structured environment than a software startup.
I owned software vendor relations as part of my portfolio. Since the pandemic began in 2020, all of the software vendors we had relationships with have switched over to cloud and Agile delivery.
These vendors launched more features in the last two years than in the previous eight. Those same two years have also been the most unstable period for these systems.
Our end users constantly complained of not being able to access critical systems during business hours. The vendors’ ability to deliver features faster increased, but their ability to do so as a reliable service is open to question.
Part of this fragility comes from their lack of insight into increasing the software’s reliability in production.
Site Reliability Engineering excels at increasing the reliability of software.
If I were to sum up SRE work, it would be to reduce software fragility and increase resilience to black swan events while supporting Agile developer success at the same time.
Where did SRE start?
The concept of Site Reliability Engineers originated at Google way back in 2003.
An executive at Google, Ben Treynor Sloss, determined that the only way to handle Google’s mega-scale user requests was to create a new operations engineering discipline.
His north star early on was to reduce the likelihood of 500 errors, i.e., cases where the server encounters an unexpected condition that prevents it from fulfilling a request.
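As a rough illustration (this is not Google’s actual implementation, just a minimal Python sketch), a 500 is what a server returns when an unexpected error escapes a request handler:

```python
# Minimal sketch: a wrapper that turns any unexpected exception
# into an HTTP 500-style response. Function names are illustrative.

def handle_request(handler):
    """Run a request handler; surface unexpected errors as 500s."""
    try:
        body = handler()
        return 200, body
    except Exception as exc:  # the "unexpected condition"
        # An SRE's north star is making this branch as rare as possible.
        return 500, f"Internal Server Error: {exc}"

def healthy():
    return "OK"

def broken():
    raise RuntimeError("database connection lost")

print(handle_request(healthy))  # (200, 'OK')
print(handle_request(broken))   # (500, 'Internal Server Error: ...')
```

Reducing how often real systems hit that error branch is, in miniature, the job Treynor Sloss carved out for SREs.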
It seems apt that Ben’s vision succeeded. How often does Google’s service stop working for you? Rarely, compared to other services, right? SREs are the secret sauce behind this.
Many software outfits don’t care about reliability
The unfortunate problem with bringing SRE into the Agile mix is that reliability fits into the wrong software arena.
Reliability slots into the black box of non-functional requirements (NFRs). In a typical organization, no one wants to look at this risk until something goes wrong, that is, when it’s already too late.
I’ve had many conversations around addressing reliability, uptime, and error handling go the way of, “Uh huh, we’ll look into it next quarter. But first, let’s release this exciting new QR code tool!”
The problem then compounds.
The more Agile work these vendors do, the more fragile their software becomes. That is at least the case in production, where our many employees and I saw their work firsthand.
What is the point in building great features if users can’t load them?
The assertions that I’m making here are:
- Improper Agile practices (e.g. weak DevOps) can rapidly compound technical debt
- Higher technical debt subsequently increases the surface area for reliability, performance, and security issues in the software-in-production
In a standard Agile timeframe, you’re altering your software every 4-6 weeks through continuous deployments.
These continuous deployments compound over time, so software put into production on Day 0 will morph into a very different beast by Day 30, 60, 90, 180, etc.
By Day 365, you may not recognize the software at all compared to Day 0.
The more services you add or modify over time, the more drastic the difference will be. This long but apt quote describes the problem we face:
There is a fallacy in computer programming circles that all applications are ultimately decomposable – that is to say, you can break down complex applications into many more simple ones. In point of fact, however, you often cannot get more complex behaviors to actually start working until you have the right combination of components working, and even then you will run into problems with synchronization of data availability, memory usage and deallocation and race conditions – problems that will only become apparent when you’ve built most of the plumbing. This is why “but will it scale?” entered the lexicon of programmers everywhere. Scale problems only show up once you’ve built the system out almost completely and attempt to make it work under more extreme conditions. The solutions often entail scrapping significant parts of what you’ve just built, much to the consternation of managers everywhere. — Kurt Cagle, Community Editor @ Data Science Central
Increased complexity in microservices architecture means that software is less resilient to even minor conditional changes.
Extreme conditions are becoming the rule rather than the exception for commercial software, even software with a small-ish user base.
I will unpack this issue of extreme conditions in terms of performance:
- User demand for cloud-based software has risen dramatically in recent years, so workloads are significantly higher than before
- Even very simple cloud software makes multiple API calls from internal and third-party services before displaying all relevant data
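To illustrate the second point, here is a hypothetical Python sketch of the fan-out behind one “simple” page load. All function names are invented, and the third call stands in for a flaky third-party dependency:

```python
# Hypothetical fan-out for a single page view. None of these names
# come from a real codebase; they are stand-ins for typical calls.

def fetch_profile():
    return {"user": "alice"}         # internal service call

def fetch_orders():
    return [{"id": 1, "total": 42}]  # internal service call

def fetch_recommendations():
    raise TimeoutError("third-party API timed out")  # flaky dependency

def render_page_naive():
    # Naive aggregation: one failing dependency fails the whole page.
    return {
        "profile": fetch_profile(),
        "orders": fetch_orders(),
        "recommendations": fetch_recommendations(),
    }

def render_page_degraded():
    # With graceful degradation, the core data still renders even
    # when the non-critical dependency is down.
    page = {"profile": fetch_profile(), "orders": fetch_orders()}
    try:
        page["recommendations"] = fetch_recommendations()
    except Exception:
        page["recommendations"] = []  # fall back to an empty section
    return page
```

The naive version is how a lot of production software actually behaves: one slow or failing API call takes the whole page down with it.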
Despite this tinderbox, practices around software-in-production at many newly minted Agile houses remain as if users were still running software installed locally on their PCs or Macs.
Site Reliability Engineering can rescue production software
Site Reliability Engineering practices place a proverbial security blanket over the deployment mess highlighted above, which can otherwise grow unchecked. SRE gives:
- the ability to respond effectively to outages, performance degradation, etc.
- assurance that the various underlying services will be able to handle pressure when it occurs
- a proactive approach to running software successfully in production
The reduced risk of excessive downtime can justify the initial cost of creating an SRE function. Downtime can cost serious money in almost every industry now that so many production and service areas are heavily software-dependent.
Business models now depend on cloud internet infrastructure, which increases production risk due to its complexity. Downtime means money lost, even if it lasts mere minutes; it can cost thousands, sometimes millions of dollars.
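A back-of-the-envelope sketch of that cost, using made-up revenue figures and assuming revenue accrues evenly through the year:

```python
# Back-of-the-envelope estimate of downtime cost.
# The revenue figure below is an invented example, not real data.

def downtime_cost(annual_revenue, minutes_down):
    """Lost revenue for a business earning evenly year-round."""
    minutes_per_year = 365 * 24 * 60
    revenue_per_minute = annual_revenue / minutes_per_year
    return revenue_per_minute * minutes_down

# A hypothetical $50M/year online business, down for 30 minutes:
print(round(downtime_cost(50_000_000, 30)))  # roughly $2,854
```

Real losses are usually worse than this linear estimate, since outages tend to cluster around peak traffic and carry reputational cost on top of lost transactions.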
Users want great new features, but now they also need reliability, because what’s the point of a feature if you can’t even access it when you need it?