Google’s Site Reliability Engineering hierarchy (Remixed)

This post contains an original SREpath visual summary. The visual simplifies Google’s highly academic “SRE hierarchy” into an easy-to-explain journey map format.

Here’s a sneak peek at the visual summary:

Before we continue, let’s cover some background information…

You most likely know that Google is the company that originated the Site Reliability Engineering (SRE) phenomenon in 2003.

Google has even released a handy book, free for public access, to help readers grasp the concept. The book is called Site Reliability Engineering (2016) and is edited by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Murphy.

All four are known heavyweights in the SRE space and are regular speakers at the annual SRECon event run by USENIX.

One part of the book that stood out to me was Google’s SRE hierarchy.

Let’s examine the noticeable features:

  • it’s structured like a pyramid
  • it’s a play on Maslow’s hierarchy of needs

Maslow’s hierarchy of needs is an abstract concept to grasp unless you are at least mildly interested in behavioral psychology.

Image: Maslow’s hierarchy of needs (Source: Wikimedia Commons)

Very briefly, let’s explore Maslow’s hierarchy, so we can understand Google’s SRE hierarchy better:

  • food, water, and shelter are at the bottom of the hierarchy, marking them as foundational needs
  • as you progress up the pyramid, you aim for needs that move progressively further away from mere survival
  • the highest levels of needs are aspirational and relate to character and more advanced states of being

Maybe now you can make better sense of Google’s SRE hierarchy. But you may not get the luxury of explaining it like this to peers who want a straight answer on how to do Site Reliability Engineering.

From my personal experience, this kind of hierarchy model will not make much sense to non-technical stakeholders.

Some of these stakeholders – like executive sponsors and higher management – control the trajectory for your SRE efforts.

In the SRE context, I learned that many senior leaders looked at the hierarchy and interpreted capacity planning as more important than incident response, so they focused resources on that effort more than paying for on-call engineers.

Don’t want to rely on my experience alone?

Read Chip and Dan Heath’s book on change management, Switch, to see why a pyramid-shaped hierarchy needs to be more clearly defined in order to make sense.

I’ve learned from years of trial-and-error that visual maturity models better highlight key components of a new initiative.

The pathway analogy in many such models is a more visually engaging method to describe all the necessary capabilities.


Sidenote: SREpath used to be

Site Reliability Engineering is one of those concepts that lends itself well to starting conversations in the form of a maturity model.

But be warned about pitfalls in following the path rigidly.

First of all, I don’t believe in maturity models for SRE, because SRE is a continually moving set of capabilities, not a finish line.

But a maturity model can start a clearer conversation about SRE org building than a prosaic technical conversation can.

Tech speak is catnip for engineers, boredom and/or confusion for non-technical leaders. Trust me, I’ve instigated confusion in the past 😉

You may get backlash from more experienced Site Reliability Engineers (SREs) if you follow the above path like a prescription.

I got a mixed response when I posted this visual on a popular discussion forum for Site Reliability Engineers. Most feedback was positive but there were some pessimistic comments.

These comments highlighted criticisms that I want to address here.

Critical feedback: “It’s a good theoretical model in a vacuum, but on-the-job experience has shown me that you don’t always get to handle these in that order.”

Response: I don’t intend to dictate that the model be followed in the order I’ve outlined. Every model is good in theory until it hits the real world of competing interests, capabilities, and ambitions. This simplified model of Google’s SRE hierarchy serves two purposes: (1) as a conversation starter and (2) as a reminder that new teams should address capabilities progressively rather than all at once.

Critical feedback:  “This is harmful. SRE is not a linear process, but a feedback loop where each of these areas (primarily 1-4) are improved incrementally with continuous effort. The notion that you shouldn’t automate “test & release process” until you’ve “mastered observability” is absurd.”

Response: I agree that SRE is not a linear process. This is one of many lenses through which I see SRE, and I want to offer it as an option that people can choose to take on. In my personal experience, a pathway approach makes the whole implementation an easier pill to swallow in complex organizations.

A few things to consider regarding observability vs test and release from an org/team design perspective:

  • Sure, if you have a strong case for tackling test and release procedures first, why not? But most org leadership I have personal experience with would come down on their engineers unless guiding data (the kind that logs, traces, and monitoring offer) was in place first
  • Another way to see it: observability is a technical exercise, and upper management neither knows nor cares how you do it. When you start talking about setting policy or procedures, that becomes a sociotechnical exercise involving input from developers, team leads, and managers, and ultimately their egos and ambitions. With raw data from observability, you can argue for a better path

Want a deeper understanding of Site Reliability Engineering culture?

👇 Take SREpath’s free 7-day SRE culture patterns course 👇
