SRE is not a monolithic role

SRE is gaining traction, and with it a misconception is gaining steam among senior stakeholders: that SRE is a monolithic role, like what “programmers” were in the 90s. Let’s burst that misconception…

SRE is a broad, overarching responsibility that needs a multitude of role considerations to pull off properly.

It is not a monolithic role where all SREs do pretty much the same thing, the way programmers in the 90s (supposedly) all pumped out code in similar strokes. Now we have front-end engineers, back-end engineers and everything in between.

SRE is the same — a mélange of diverse role opportunities.

I will cover the nuances of SRE roles in more detail below.

No, SREs are not…

  • working one-size-fits-all roles — their scope of work will depend on the needs of the software systems they are responsible for e.g. more alerting if responsible for critical services
  • ops-on-steroids: a highly skilled site reliability engineer should not get pigeonholed full-time into sysadmin tasks like running Bash scripts or spinning up VMs
  • stereotypical introverts — they are capable of being technicians and leaders with vocal contributions to areas like architecture, project management and team collaboration
  • able to offer turnkey SRE on their own — an individual may be able to “run SRE” for a smaller org with limited scope but won’t come close to the full scope of the SRE domain (it’s HUGE)
Site Reliability Engineering capability map - partial view of the work that SRE teams may engage in over the span of their function
Above: a glimpse of the SREpath capability map that SRE leaders use to plan teams

SREs have a wide scope of available work

  • The work is more likely to call for T-shaped abilities, where you are a specialist in a certain area but have enough breadth of knowledge to be “dangerous enough” in many related areas
  • Solid SRE teams benefit from a combination of generalist and specialist engineers: you might have one SRE working across many responsibility areas while another focuses solely on chaos engineering
  • Roles may become more fluid in the future — SRE leaders may guide individual SREs toward broader responsibilities within 1-2 responsibility areas like performance engineering, QA etc.

SREs work on systems and software at the same time

  • Some SREs are systems pros with a reasonable ability to code their way out of trouble
  • Other SREs are code mavens who want to get their hands dirty with infrastructure work
  • Others still are neither: they learn enough code to modify open-source tools to their needs, and enough systems to make sure IPv6 won’t (hopefully) make life harder
  • Whatever they do, SREs should spend at least half of their time on automating, reducing toil and otherwise improving systems (proactive) rather than responding to incidents (reactive)
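The "at least half" guideline is easy to check with simple arithmetic. Here is a minimal sketch in Python, assuming you already log hours per work category. The function name, the category labels and the hours are all made up for illustration.

```python
def proactive_share(hours_by_category: dict[str, float], proactive: set[str]) -> float:
    """Return the fraction of logged hours spent on proactive work."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    proactive_hours = sum(
        hours for category, hours in hours_by_category.items() if category in proactive
    )
    return proactive_hours / total


# Hypothetical week for one SRE (category names are assumptions, not a standard).
week = {
    "automation": 14.0,
    "toil_reduction": 6.0,
    "incident_response": 12.0,
    "manual_ops": 8.0,
}
share = proactive_share(week, proactive={"automation", "toil_reduction"})
meets_guideline = share >= 0.5  # the "at least half" rule of thumb from the list above
print(f"proactive share: {share:.0%}, meets guideline: {meets_guideline}")
```

Which categories count as proactive is a judgment call for each team, so treat the set passed in as a starting point for discussion rather than a fixed definition.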

SREs can be injected across the enterprise

SRE roles can be designed to embed into various levels of the enterprise. The Scaled Agile Framework (SAFe) is the most popular agile framework among mid-sized and larger companies. Its steering group has already worked out how SRE can fit into the various levels.

I’ve broken down the roles and responsibilities below:

Service-level SREs

Entry-to-mid-level SREs who are responsible for a single service

  • Provide app-level support for critical software services
  • Implement tools and teach practices for more seamless DevOps
  • Own SLOs and error budgets for their service
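
To make "owning an error budget" a little more concrete, here is a minimal sketch in Python, assuming a request-based availability SLO. The function name, the 99.9% target and the request counts are illustrative assumptions, not figures from a real service.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict[str, float]:
    """Summarise the error budget for a request-based availability SLO.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the number of requests allowed to fail within the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": failed_requests / allowed_failures,
    }


# Hypothetical 30-day window: 1,000,000 requests, 250 failures, 99.9% SLO.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"budget consumed: {report['budget_consumed']:.1%}")
```

A service-level SRE might watch the consumed fraction over the SLO window and slow down risky releases as it approaches 100%. The exact policy is up to each team.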

System-level SREs

Senior SREs who help release train engineers manage multiple services

  • Coordinate multiple product streams in the release train
  • Guide system architecture and production readiness
  • Own SLOs and error-budget tracking across the system

Enterprise-level SREs

The most senior SREs, who report directly to CTOs

  • Run SRE center of excellence (CoE) for the enterprise
  • Develop SRE platform and best practices
  • Provide architecture support

In conclusion…

Doing SRE well means the difference between high-performance software and painful 50x errors. So let’s get our SRE team roles right and not mistake SRE for a monolithic role.