SRE is not a monolithic role – Boost software reliability

SRE is gaining traction, and with it a misconception is gaining steam among senior stakeholders: that SRE is a monolithic role, like “programmers” were in the 90s. Let’s bust that misconception…

SRE is a broad, overarching responsibility that takes a range of distinct roles to pull off properly.

It is not a monolithic role where all SREs do pretty much the same thing. Think of programmers in the 90s: they (supposedly) all pumped out code in similar strokes. Now we have front-end engineers, back-end engineers and everything in between.

SRE is the same — a mélange of diverse role opportunities.

I will cover the nuances of SRE roles in more detail below.

No, SREs are not…

working one-size-fits-all roles: their scope of work depends on the needs of the software systems they are responsible for, e.g. more alerting work if they look after critical services
ops-on-steroids: a highly skilled site reliability engineer should not get pigeonholed full-time into sysadmin tasks like running Bash scripts or spinning up VMs
stereotypical introverts: they are capable of being both technicians and leaders, with vocal contributions to areas like architecture, project management and team collaboration
able to offer turnkey SRE on their own: an individual may be able to “run SRE” for a smaller org with limited scope but won’t come close to covering the full scope of the SRE domain (it’s HUGE)

[Figure: Site Reliability Engineering capability map, a partial view of the work that SRE teams may engage in over the span of their function. Caption: a glimpse of the SREpath capability map that SRE leaders use to plan teams.]

SREs have a wide scope of available work

SRE work is more likely to call for T-shaped abilities, where you are a specialist in a certain area but have enough breadth of knowledge to be “dangerous enough” in many related areas
Solid SRE teams benefit from a combination of generalist and specialist engineers: you might have one SRE working across many responsibility areas while another focuses solely on chaos engineering
Roles may become more fluid in the future: SRE leaders may guide individual SREs toward broader responsibilities within 1-2 responsibility areas like performance engineering, QA etc.

SREs work on systems and software at the same time

Some SREs are systems pros with a reasonable ability to code their way out of trouble
Other SREs are code mavens who want to get their hands dirty with infrastructure work
Others still are neither: they learn enough code to modify open-source tools to their needs, and enough systems to make sure IPv6 won’t (hopefully) make life harder
Whatever they do, SREs should spend at least half of their time automating, moving away from toil and otherwise improving systems (proactive) rather than responding to incidents (reactive)
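
To make the "at least half" rule of thumb concrete, here is a minimal sketch of how a team might check its proactive share from a simple time log. The activity names, the hours and the `proactive_share` helper are all made up for illustration; they are not taken from any particular tool.

```python
def proactive_share(hours_by_activity: dict[str, float], proactive: set[str]) -> float:
    """Return the fraction of logged hours spent on proactive work."""
    total = sum(hours_by_activity.values())
    if total <= 0:
        raise ValueError("no hours logged")
    proactive_hours = sum(
        hours for activity, hours in hours_by_activity.items() if activity in proactive
    )
    return proactive_hours / total


# Hypothetical week for one SRE (hours per activity).
week = {
    "automation": 12,
    "tooling": 6,
    "reliability_reviews": 4,
    "incident_response": 10,
    "manual_tickets": 8,
}
# Assumption: these three activities count as proactive work.
PROACTIVE = {"automation", "tooling", "reliability_reviews"}

if __name__ == "__main__":
    share = proactive_share(week, PROACTIVE)
    status = "on target" if share >= 0.5 else "too reactive"
    print(f"Proactive share: {share:.0%} ({status})")
```

What counts as proactive is a judgment call for each team, so treat the category set as something to agree on rather than a fixed definition.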

SREs can be injected across the enterprise

SRE roles can be designed to embed into various levels of the enterprise. The Scaled Agile Framework (SAFe) is one of the most popular agile frameworks among mid-sized and larger companies. Its steering group has already worked out how SRE can fit into the various levels.

I’ve broken down the roles and responsibilities below:

Service-level SREs

Entry-to-mid-level SREs who are responsible for a single service

Provide app-level support for critical software services
Implement tools and coach teams for more seamless DevOps
Own SLOs and error budgets for their service
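
For readers new to the term, an error budget is simply the unreliability an SLO allows. The sketch below shows the arithmetic for an availability SLO, assuming a 30-day window and downtime measured in minutes; the function names are illustrative and not from any specific SRE tool.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"A 99.9% SLO over 30 days allows about {budget:.1f} minutes of downtime")
    print(f"Budget left after 10 minutes down: {budget_remaining(0.999, 10):.0%}")
```

A service-level SRE who owns the budget would typically watch the remaining fraction and slow down risky releases as it approaches zero.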

System-level SREs

Senior SREs who help release train engineers manage multiple services

Coordinate multiple product streams in the release train
Guide system architecture and production readiness
Own SLOs and error-budget tracking across the system

Enterprise-level SREs

The most senior SREs, reporting directly to the CTO

Run SRE center of excellence (CoE) for the enterprise
Develop SRE platform and best practices
Architecture support

In conclusion…

Doing SRE well means the difference between high-performance software and painful 50x errors. So let’s get our SRE team roles right and not mistake SRE for a monolithic role.

Author

Ash Patel

Reliability Nut at SREpath

Ash has an unhealthy obsession with software reliability. Maybe it’s got to do with the trauma of working at a few companies where software kept slowing or went down while he worked to turn it around. His ma hopes that he can one day turn this passion into a respectable job or business. Still waiting…
