#21 – Better SRE in 2024 is all we can hope for

Episode 21 [SREpath Podcast]

Show notes

Sebastian is back for this episode to help set out our direction for 2024.

We reflected during the holidays on the problems SREs faced in 2023 in terms of job insecurity, burnout, and “that really shouldn’t be my sole job”.

Sebastian and I talked about what we hope to bring to the community in 2024 to make SREs and SRE teams stronger, happier, and healthier at their work.

My reflections from this episode

In 2024, we need to end this madness. SREs are not purely incident responders.

But I have had far too many conversations that have given me evidence to the contrary. Far more than I could stomach.

I’m a curious sort, so I enquire how SREs do their work when I meet one.

Many have told me point blank that a majority of their time is spent on-call.

They get precious few hours to work on anything remotely proactive.

I consider this to be one of the – if not the – worst antipatterns in Site Reliability Engineering.

It’s like some managers read the chapter about incident response in the 2016 SRE book and forgot about everything else.

People! There are hundreds more pages on what else SREs need to do.

Things alluding to proactive work, working across the software system… you know, enabling team stuff.

Because SREs are enabling teams in the Team Topologies model.

“But wait, Ash, what if we simply need to do on-call because there are so many incidents?”

My friend, that’s a case for concern right there.

This tells me that too little is being done in terms of release engineering, capacity planning and early warning systems like pre-production observability.

And that’s just the tip of the iceberg.

I could rattle off at least 5-6 proactive things you could be doing to reduce future incident severity.

Something needs to be said about what it takes to achieve even a sliver of the above.

I align with the viewpoint of industry experts who assert that for SREs to excel, they need:

➡️ Direct access to business stakeholders

➡️ Direct access to developers and infrastructure experts

and in some cases…

➡️ Direct access to users and customers

There is another aspect to the issue and it puts the burden of guilt on people like me.

I feel like I often overwhelm people at SRE conferences about the sociotechnical aspects of the work.

You’d think I would remember they are not part of our little echo chamber where everything SRE is “obvious”.

SRE is far from an obvious practice: the incident response antipattern is a case in point.

So the honest truth is, it’s not obvious and we need to work harder.

Harder in what sense?

To make the easier, more productive path to SRE more visible to you and your boss and whoever else needs to know.

That’s my goal for 2024 alongside Sebastian, who will join me from time to time to share ways we can do this.

My first port of call will be to make a section on Observability, covering how to resolve critical problems in that area.

There are many other resources out there already, but I hope to share ideas in as simple a way as I can.

Oh, and for now…

Since you’ve read this far, here’s a plug for the SREpath podcast [Spotify link] that you need to take advantage of.

In Episode #21, Sebastian and I spoke about this incident responder antipattern [Spotify link], which is also the first episode of 2024!

Whether you join us on this fervent journey or pave your own path, we’ll keep working to move SRE to its true North Star.

And let’s hope that is NOT to be a full-time incident responder.