Agile software teams need site reliability engineers to support ongoing success

I originally wrote an article titled “Agile and SRE are not mutually exclusive” for Site Reliability Engineers (SREs).

Most of them told me, “We already know this. Go tell the people running Agile in our orgs!”

I can see their point.

So here’s my go at explaining why Agile-made software needs the support of SREs for ongoing success on the increasingly complex Internet.

Who are Site Reliability Engineers anyway?

Site Reliability Engineers are the folks focused on ensuring your software stays running after being deployed across the release train.

They combine reactive operations work (building platforms and responding to outages) with proactive project work to improve the software system’s resilience.

The field I just described is Site Reliability Engineering (SRE).

Let’s go back to my original post title, “Agile and SRE are not mutually exclusive”. You could infer from that title that:

  • Agile needs SRE
  • And SRE needs Agile

Well, neither statement is a simple yes or no. So let’s modify them a bit:

  • Agile needs SRE (to a great extent)
  • And SRE needs Agile (for enhancing its project work)

This article focuses on the first statement because the second one is already well known among the SRE community. (At least those lucky enough to spend time doing project work and not reactive ops work all the time.)

I will say this first because it’s important: I am not a critic of Agile.

If I were, I wouldn’t have sat through many a long workshop to get my Scrum Master certifications. Moving on.

Let’s begin with my own experience with Agile

As I mentioned earlier, I will aim to show you how Agile teams need the support of Site Reliability Engineering to continue their ongoing success.

So let’s look at my personal experience…

I’ve worked in 4 startups since 2008, all of which followed an Agile methodology. We could not function without our Kanban boards and standups.

It seems like software startups and Agile go hand-in-hand.

But startup life doesn’t pay the bills unless you’re lucky enough to exit (I almost made it in the last one!), so I have a day job that involves wrangling vendor software with custom tooling.

In my day job, all of the software vendors I “value-add” have switched over to cloud and Agile delivery since the pandemic began in 2020.

I’m now noticing an effectiveness gap in their ability to deliver cloud-based services reliably.

Most of these vendors have launched more features in the last two years than in the previous eight, and those two years have also been the most unstable period for these systems.

In my opinion, part of this fragility comes from their lack of insight into, or interest in, increasing the software’s reliability in production.

That’s where Site Reliability Engineering excels: reducing software fragility and increasing resilience to black swan events while supporting developer success at the same time.

A senior executive at Google, Ben Treynor Sloss, created the concept in 2003 to ensure the search engine could handle the high number of requests it was getting.

Ben’s vision succeeded: how often has Google not loaded up for you?

Software reliability slots into this black box of NFRs – non-functional requirements – no one wants to look at until something’s gone wrong.

Whenever I mention the need for a systematic view of addressing NFRs, uptime and error handling, my counterparts at vendors attempt to soothe me with “Mmm-hmm. Uh-huh.”

The more Agile work these vendors do, the more fragile their software becomes. At least, that’s the case in production, where I can see their work in action.

Built great features but users can’t load them?

The assumptions I’m making here are:

If you’re doing Agile, you’re changing your software (in production) roughly every 4-6 weeks.

The changes compound over time, so software put into production on Day 0 will morph into a very different beast by Day 30, 60, 90, 180 etc.

In some situations, by Day 365, you may not be able to recognise the software you shipped on Day 0.

And the more services you add or modify over time, the greater the complexity quotient will be. This long but worthwhile quote aptly describes the problem we face:

There is a fallacy in computer programming circles that all applications are ultimately decomposable – that is to say, you can break down complex applications into many more simple ones. In point of fact, however, you often cannot get more complex behaviors to actually start working until you have the right combination of components working, and even then you will run into problems with synchronization of data availability, memory usage and deallocation and race conditions – problems that will only become apparent when you’ve built most of the plumbing. This is why “but will it scale?” entered the lexicon of programmers everywhere. Scale problems only show up once you’ve built the system out almost completely and attempt to make it work under more extreme conditions. The solutions often entail scrapping significant parts of what you’ve just built, much to the consternation of managers everywhere. — Kurt Cagle, Community Editor @ Data Science Central

Extreme conditions are nowadays a regular event for many applications.

Even the ones that would otherwise never reach enterprise scale.

There are two factors contributing to this sense of “extreme conditions”:

1. User demand for software – especially cloud software – has gone through the roof in recent years.

2. Even simple business software makes multiple API calls to third parties before displaying valuable data.
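To make the second factor concrete, here is a minimal sketch of how availability compounds when a single request depends on several third-party APIs. It assumes the calls happen in series and fail independently, which real systems only approximate. The function name and the availability figures are hypothetical, chosen purely for illustration.

```python
from math import prod


def combined_availability(dependency_availabilities):
    """Availability of a request that needs every dependency to succeed.

    Assumption: dependencies are called in series and fail independently,
    so the combined availability is the product of the individual ones.
    """
    return prod(dependency_availabilities)


# Hypothetical example: one page load that calls five third-party APIs,
# each available 99.9% of the time (illustrative figures, not measurements).
page_availability = combined_availability([0.999] * 5)
print(f"Combined availability: {page_availability:.4%}")
```

Under these assumptions, five dependencies at 99.9% each leave the page at roughly 99.5%, so the whole is noticeably less reliable than any single part.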

Yet most software vendors run their production systems as if users still worked on on-premises software. The horror!

Site Reliability Engineering’s winning touch

Site Reliability Engineering (SRE) practices put a proverbial security blanket on top of the possible Agile mess that can grow and grow and grow.

  • SRE gives the ability to respond effectively to outages, performance degradation etc.
  • SRE gives assurance that the various underlying services will be able to handle pressure when it occurs
  • SRE is the proactive approach to the successful production of software

The higher cost of an SRE can be justified by the lower risk of excessive downtime (which costs real $$$ in almost every industry, now that so many production and service areas are heavily software-dependent).

Also, not needing developers to wake up at 3 am every other night is a perk you can bank in a tight talent market. That’s what site reliability can result in.

The call for SRE boils down to this

Users want features, but they also need reliability because business models are now dependent on cloud services.

Downtime means money lost, even if it’s for mere minutes. It can cost thousands, sometimes millions of dollars.
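As a rough illustration of that arithmetic, the sketch below converts an availability percentage into minutes of downtime per year and multiplies it by a cost per minute. The function names and the cost figure are hypothetical assumptions; plug in your own numbers.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year for a given availability (0 to 1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def downtime_cost_per_year(availability, cost_per_minute):
    """Yearly downtime cost, assuming a flat (hypothetical) cost per minute."""
    return downtime_minutes_per_year(availability) * cost_per_minute


# Hypothetical example: a service that is up 99.9% of the time and loses
# $1,000 for every minute it is down (illustrative figure only).
minutes = downtime_minutes_per_year(0.999)
cost = downtime_cost_per_year(0.999, cost_per_minute=1_000)
print(f"{minutes:.1f} minutes of downtime, costing about ${cost:,.0f} per year")
```

Even at "three nines", that is almost nine hours of downtime a year, which is why mere minutes add up quickly.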

At the very least, teams should adopt software reliability principles. If possible, teams making mission-critical software should consider adopting the entire Site Reliability Engineering capability set.

