Where in team topologies does Site Reliability Engineering fit in?

We will explore the workings of the Team Topologies model and how Site Reliability Engineering (SRE) teams can fit into it.

In more detail, I will share with you the following:

  • an overview of the team topologies model
  • the 4 team modalities it proposes, and finally…
  • where SRE teams fit in team topologies

Let’s get started.

Overview of team topologies

Team topologies is a relatively new model/framework, having been officially introduced in 2019.

It’s a response by the authors, Manuel Pais and Matthew Skelton, to fundamental and recurring issues in software delivery.

What business problem is being solved?

Modern software systems typically call for a fast flow of change. Examples of change include:

  • adapting to regulatory or market changes
  • bug fixes for known issues
  • general business process updates

Slower flow of change can lead to a backlog of work that keeps piling up. Meanwhile, the software falls behind in terms of meeting user needs and market competitiveness.

Here’s the thing: cognitively overloaded software engineering teams can only work to a certain speed of change.

This is why it’s important to reduce their cognitive load, in order to support higher release velocity.

How team topologies proposes to solve this

The authors have created a new level of clarity around how various software teams – from product teams all the way to platform teams – should operate.

A well-defined team topology within an org will give engineers the luxury of one of the most precious things lacking in modern business: focus.

Software teams in the last decade have experienced a great deal of change. They have:

  • been asked to be responsible for their service end-to-end
  • seen the emergence of the controversial DevOps Engineer role, which varies greatly in scope across orgs
  • wanted the right level of support from specialist and supporting engineers

Team topologies is a conceptual framework that overlays dynamic team structures. These structures can enhance the service that NFR-responsible software engineers like SREs provide to feature teams.

By NFR, I mean non-functional requirements, which refer to areas of software engineering other than developing features of the software.

Areas like platform, reliability, performance etc.

💭 Side note: I’m personally not a fan of the term NFR because the above-mentioned areas are critical to what end-users perceive as functional software.

The foremost aim of proposing unique and dynamic team structures is to optimise for team cognitive load, which can weigh heavily on a team’s effectiveness.

Cognitive load: The total amount of mental effort being used in the working memory — John Sweller

Failing to optimise for cognitive load can lead to lower work quality, delays and unmotivated engineers.

I suppose that means many teams today are poorly optimised for cognitive load!

A lot of the metrics that are now considered de rigueur in software delivery – like MTTR, deployment frequency etc – will be difficult to improve if teams are mentally overloaded.
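For readers who have not worked with these metrics, here is a minimal sketch of how two of them could be computed. I am reading MTTR as mean time to restore, and the function names, the weekly normalisation and the sample data are all my own assumptions for illustration, not definitions from the book.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Average time between an incident starting and being resolved.

    `incidents` is a list of (started, resolved) datetime pairs.
    """
    durations = [resolved - started for started, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def deployments_per_week(deploy_times, window_start, window_end):
    """Count deployments inside the window, normalised to a 7-day week."""
    in_window = [t for t in deploy_times if window_start <= t < window_end]
    weeks = (window_end - window_start) / timedelta(weeks=1)
    return len(in_window) / weeks


# Hypothetical sample data, not taken from any real system.
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 11, 0)),     # 60 minutes
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 14, 30)),  # 30 minutes
]
deploys = [datetime(2024, 1, day) for day in (2, 4, 9, 11, 12, 15)]

mttr = mean_time_to_restore(incidents)
freq = deployments_per_week(deploys, datetime(2024, 1, 1), datetime(2024, 1, 15))
print(f"MTTR: {mttr}, deployments per week: {freq}")
```

Both numbers depend on how quickly people can respond and ship, which is why an overloaded team tends to struggle to move them.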

Moving on.

Here’s a quick rundown of key learnings from the book on the concept of organization itself:

  • The org chart is always out of sync with reality
  • Formal titles may cover only the most visible work, but fail to address the holistic nature of the function they’re within
  • Informal connections among engineers are just as valuable as formal roles and chains of communication (if not significantly more)

The following parameters apply for ideal team dynamics:

  • Team size of between 5 and 9 members (inspired by Dunbar’s number)
  • Allowing anywhere from 2 weeks to 3 months for proper cohesion
  • Stable, long-lived teams, but not never-changing teams
  • Team owns one, clear aspect of the software system
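To make the size and ownership parameters concrete, here is a rough sketch of how a team roster could be checked against them. The `Team` structure, the `design_warnings` helper and the sample teams are hypothetical names I made up; only the 5 to 9 size range and the single-ownership rule come from the list above.

```python
from dataclasses import dataclass, field

# Size range quoted in the list above.
MIN_SIZE, MAX_SIZE = 5, 9


@dataclass
class Team:
    name: str
    members: list[str]
    owned_areas: list[str] = field(default_factory=list)


def design_warnings(team: Team) -> list[str]:
    """Return a warning for each parameter the team does not meet."""
    warnings = []
    size = len(team.members)
    if not MIN_SIZE <= size <= MAX_SIZE:
        warnings.append(f"{team.name}: size {size} is outside {MIN_SIZE}-{MAX_SIZE}")
    if len(team.owned_areas) != 1:
        warnings.append(
            f"{team.name}: owns {len(team.owned_areas)} areas, expected exactly 1"
        )
    return warnings


# Hypothetical teams for illustration.
payments = Team("payments", [f"eng{i}" for i in range(6)], ["payments-api"])
sprawl = Team("sprawl", [f"eng{i}" for i in range(12)], ["search", "billing"])

for team in (payments, sprawl):
    print(team.name, design_warnings(team) or "looks fine")
```

Cohesion time and team stability are not checked here because they are hard to capture in a simple roster.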

💡 My key highlight from the book: Effective value creation requires forming dynamic team structures that reflect the kind of value that needs to be released.

Several outside influences are at play in the team topologies model. They include:

  • Naomi Stanford’s thinking on organization design
  • Dunbar’s number influencing effective team size
  • Conway’s Law on how communication structure influences software outcomes

I have noticed that Conway’s Law is talked about often in DevOps and SRE circles. Perhaps because it’s a law and we analytical types love rules? I jest.

Very quickly, Conway’s Law states that organizations design systems that mirror their own communication structures. The upshot is that software architecture won’t change (effectively) without changing how the people working on it are organized.

It fundamentally comes down to how the people are organized around the work.

How many orgs have you worked in where there is a methodical process for making the above happen?

Taking an SRE angle for a moment, I find from my conversations with engineering managers that the SRE needs of software are not being effectively covered.

And so the endless hiring rat-race continues…

Perhaps identifying the work at a deep level then adapting the team structure to it can alleviate some of this tension.

(I am developing an SRE capability auditing service to help with this)

Team topologies nods to modern organizational design principles

I particularly love how the authors refer to Naomi Stanford, who is one of the foremost thinkers on organizational design.

They draw from her 5 rules for org design. These 5 rules are:

  1. Design the team/org when there’s a compelling reason
  2. Develop multiple options for the design
  3. Time the implementation of the design right
  4. Seek clues for lack of alignment with reality and needs
  5. Keep moving the design toward better alignment

As basic as the above sounds, very few orgs have the knowhow, bandwidth or interest to do this in a proper fashion. Let’s change that, shall we?

For a complex area of practice like SRE, it is so critical to implement the right team design and continue to evolve it as the software system’s needs change.

Now, let’s explore the 4 team modalities that team topologies proposes…

4 types of team topologies

1. Stream-aligned team

  • the most fundamental team type
  • aka the “features” or “product” team
  • owns end-to-end responsibility for part of the value stream i.e. a service
  • expected to deliver features continually
  • ideal state is to deliver work via an API with no hand-offs to other teams to complete the work

2. Enabling team

  • exist to support the stream-aligned teams
  • drive adoption of tooling and practices across teams
  • masters of facilitating knowledge transfer
  • evangelise the uptake of better practices
  • understand and manage the challenges that stream-aligned teams face
  • may plug in temporarily to stream-aligned teams to support unmet needs and fill gaps in their knowledge

3. Complicated subsystem team

  • exist to reduce cognitive load of stream-aligned teams
  • specialists in developing and supporting a specific aspect of the wider software system
  • they may release an x-as-a-service to be consumed by many or all product teams e.g. observability tooling, Chaos-as-a-service, APM suite etc.

4. Platform team

  • responsible for the underlying infrastructure and tooling that services operate on top of
  • ultimate goal is to make stream-aligned teams capable of autonomously meeting their own needs
  • examples of work include maintaining underlying cloud, running knowledge bases, owning platform tools etc.

How SRE fits into team topologies

Let’s try and work out which team topology Site Reliability Engineers and SRE teams can fit into, as one of them is better suited than others.

We will explore each team topology one by one.

SREs as stream-aligned teams

From a team topology perspective, an SRE team is unlikely to be considered a stream-aligned team.

The work of SREs goes across the full business domain, not a specific sliver of the value stream.

SREs are better considered as enabling teams supporting stream-aligned teams, which brings us to…

SREs as enabling teams

At their core, Site Reliability Engineers exist to support the ongoing reliability of software in production.

The existence of SRE teams for enabling the work of stream-aligned “feature” teams makes sense.

SRE teams can work of their own volition to assure system reliability by doing on-call work, capacity planning, code reviews etc.

Site Reliability Engineers themselves can embed into stream-aligned teams to:

  • help with fundamental reliability-focused issues in codebase and elsewhere within the stream-aligned team’s work
  • deepen DevOps – improve CI/CD practices etc when there is a lack of DevOps engineer bandwidth or presence
  • enhance you-build-it-you-run-it philosophy by giving feature teams training wheels for on-call and incident management practices

SREs as complicated subsystem teams

From an SRE purist’s perspective, Site Reliability Engineers would rarely own a complicated subsystem.

SREs are known to run APM tools and play around with the Chaos suite, but that would only be a part of their wider role in assuring reliable software-in-production.

They may work with specialists such as performance and chaos engineers to support the implementation of a service across the org, but rarely would an SRE team or individual SRE focus on developing the end-to-end service.

SREs as platform teams

Site Reliability Engineers are well aware of DIY efforts from developers that can bring down software services, so they will play a role in the platform.

They can support the platform by building passive guardrails that keep the developer’s workflows within safe confines.
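As a rough illustration of what a passive guardrail might look like, here is a sketch of a check that could run automatically before a deployment. The spec fields (`replicas`, `memory_limit_mb`, `health_check_path`) and the thresholds are hypothetical; a real platform would have its own schema and rules.

```python
def guardrail_violations(spec: dict) -> list[str]:
    """Return the reliability problems found in a deployment spec.

    An empty list means the spec stays within the guardrails.
    """
    problems = []
    if spec.get("replicas", 0) < 2:
        problems.append("fewer than 2 replicas: one instance failing causes an outage")
    if "memory_limit_mb" not in spec:
        problems.append("no memory limit: the service could starve its neighbours")
    if not spec.get("health_check_path"):
        problems.append("no health check: failed instances cannot be detected automatically")
    return problems


# Hypothetical deployment specs for illustration.
safe = {"replicas": 3, "memory_limit_mb": 512, "health_check_path": "/healthz"}
risky = {"replicas": 1}

for name, spec in (("safe", safe), ("risky", risky)):
    issues = guardrail_violations(spec)
    print(name, "ok" if not issues else issues)
```

The developer never has to ask an SRE for a review. The check simply flags risky settings as part of their normal workflow.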

The extent of their role in this will depend on past platform issues, the current developer climate and platform complexity.

SREs are not and should not become the main people for ownership of the underlying platform and its tooling.

Parting words

There is so much more to the Team Topologies model and book. I have covered the key concepts at a high level to give you an understanding of how the model applies to SRE.

Be sure to check out the book and the authors’ videos for a deeper understanding of the model.
