Starting an SRE team from scratch [Quick Guide]

Site Reliability Engineering (SRE) leaders face a myriad of responsibilities beyond what Site Reliability Engineers experience.

An SRE leader’s responsibilities include:

  • getting buy-in from various stakeholders like developers, engineering leaders, senior leadership, and more
  • defining the tools, processes, and backlog that Site Reliability Engineers will need to use in their day-to-day work and
  • building and organizing one or more SRE teams

None of the above responsibilities are ever deemed to be “complete” – they are continually moving targets that require ongoing assessment.

To keep it simple, this quick guide will briefly cover the last responsibility: building an SRE team, specifically from scratch.

The thing is, SRE may be a non-conversation for an organization’s leadership one day, and a mission-critical need the day after.

This account by Wayne Bridgman of the origin story of BT (British Telecom)’s SRE team describes how SRE can explode onto the scene for many companies:
“I was sitting at my desk when our digital engineering director came over and asked a seemingly casual question, ‘Have you ever heard of SRE?’”

That conversation snowballed into a flurry of meetings, which led to senior leadership saying BT must get into all things SRE.

Starting and funding an SRE team usually involves uncovering the burning platform in the organization.

Below, I’ve outlined some underlying burning platform issues that tend to prompt the creation of SRE teams.

“But wait, what is a burning platform?”

Very briefly, a burning platform implies that the problem is both urgent and bad enough to cause a strategic change effort.

Senior leadership may get actively involved, funding will magically appear out of nowhere, and more. A more thorough explanation of the phrase “burning platform” can be found here.

Burning platform issues that lead to forming an SRE team

Business issues

  • Is there a significant financial cost associated with downtime or poor performance?
  • Is there a regulatory requirement to demonstrate high service reliability?
  • Do operating partners require you to demonstrate high reliability of service?

Technical issues

  • Are your operations people hitting infrastructure roadblocks due to low or no coding skills?
  • Is your infrastructure struggling to keep up with demand?
  • Are your developers pumping out code with high technical debt?
  • Is the architecture of the system an afterthought to feature delivery?
  • Are your developers doing DIY with risky push-to-production practices?
  • Do critical services and systems seem to be getting increasingly fragile?

💡 Advice – there is a solid case for a dedicated SRE team if you ticked at least one box in the Business issues section plus one or more in the Technical issues section.
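
The rule of thumb above can be written as a tiny check. This is only a minimal sketch: the function name and the list-of-booleans input format are assumptions made for illustration, not part of any existing tool.

```python
def has_case_for_sre_team(business_answers, technical_answers):
    """Return True when at least one business issue and at least one
    technical issue from the checklist were answered "yes".

    Both arguments are lists of booleans, one per checklist question
    (a hypothetical input format chosen for this sketch).
    """
    return any(business_answers) and any(technical_answers)


# Example: one business issue ticked and two technical issues ticked.
business = [True, False, False]
technical = [False, True, False, False, True, False]
print(has_case_for_sre_team(business, technical))
```

Ticking only technical boxes, or only business boxes, returns False, which matches the advice that both kinds of pressure need to be present.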

What comes before an SRE team?

Now here’s the thing: not every organization can or wants to begin its Site Reliability Engineering journey with a dedicated team.

Many organizations experiment with various commitment levels of Site Reliability work before they commit to hiring and managing a full-fledged SRE team.

This often stems from the fact that there are many obstacles to starting an SRE function. In the beginning, the budding SRE advocate may face:

  • no team in place
  • no spare resources
  • no budget

Actions preceding a proper SRE function may include:

  • dealing with outages on an ad-hoc basis with no method or process for approaching incidents (which puts services at high risk of failing to meet customer and partner SLAs)
  • placing the onus of platform, observability instrumentation, and incident response on developers for their respective services (but then finding that they are spending too much time on this)
  • making an engineer responsible for Site Reliability work (but then finding that they are overloaded, spending most of their time doing reactive work like incident response rather than proactive work)

The above are a few problematic steps that lead to a natural progression toward a fully-fledged SRE team.

I’ve outlined these steps in the image below:

Critical factors in starting a successful SRE team from scratch

Solid budget

With an average salary expectation of $130,000+ in the US, you need a sizeable budget to hire a number of reasonably experienced Site Reliability Engineers.

The salary expectation is usually lower in other regions, including Western Europe, but given the rarity of the role and its scope of work, it is still likely to exceed what you’d pay a developer.

Strong roadmap

The pay and pray approach may be viable for running a team in a well-defined field of work, but SRE is anything but.

The work is ambiguous, volatile, and complex. So it’s important to have a roadmap of what you expect to achieve in 0, 3, 6, and 12 months – even if it’s just for the sake of individual contributor morale.

There are a myriad of opportunities in SRE. Where to start? Observability? Tooling? Platform enhancements? DevSecOps? Managing toil?

The key is to work on capabilities that can add value quickly – do things that enhance the team’s ability to run services safely and cost-effectively.

Getting more bang for the buck from a small team might mean working on automation to minimise manual work that compounds over time.

The team might also focus on observability, to know where to direct attention in a complex services environment.

As the team becomes more comfortable, they might work on more ambitious org-wide goals like:

  • Building guardrails that allow fast but reliable rollouts
  • Disseminating best practices to developers
  • Driving deeper automation of workflows
  • Identifying and focusing on problem zones in services

Leadership buy-in

This ties in reasonably well with having a solid budget. Without senior leadership approving the function, you’re not going to have a sizeable enough budget.

The other issue to consider is that Site Reliability Engineering teams need senior leadership to give them figurative and literal access to other parts of the org where they work to make an impact.

An example would involve teaching DevOps principles to developers – without a senior leader’s blessing, SREs may not be able to get across to the developer teams and their leaders.

One thing to do for sure: manage senior leadership expectations. Many senior leaders are confident and ambitious people, so they may carry a degree of survivorship bias and expect a straight line to success.

Make them aware that the path forward will not be a straight up-and-right arrow. It will be a series of wins and setbacks that will give a very squiggly line to success.

Capable SRE leader

A Site Reliability Engineering leader at the ground level is a critical component of the team’s success. This may be taken as a technical and people leadership role in one.

In the early days of your Site Reliability Engineering team, a dedicated technical SRE lead may not be available. So as a capable leader, you need to step in and set both the technical and people direction.

You may play the role of SRE advocate, to share knowledge and provide credibility to the cause by having a clear direction for the function from Day 1. (I can help with that too)

You may wish to read up on leadership advice from celebrated people leaders like Camille Fournier (JP Morgan) and Heidi Helfand (Kin Insurance).

You may also wish to read up on technical leadership advice from infrastructure leaders like Will Larson (ex-Uber) and SRE experts like Niall Murphy (Azure).

I’ve outlined a few questions below that aim to challenge your thinking as a capable Site Reliability Engineering leader.

5 questions to ponder before starting your SRE advocacy:

  1. Will we hire Site Reliability Engineers from an external talent pool, retrain internal candidates (like developers and SysAdmins), or both?
  2. How do we define what success will look like for a Site Reliability Engineer, for immediate impact, and in the foreseeable future?
  3. What capabilities does our system call for besides the core requirements of observability, incident response, etc.?
  4. How do we allocate responsibilities to each Site Reliability Engineer, to ensure all aspects of our required capabilities are covered?
  5. How do we continue to do all of the above on a continual basis as the Site Reliability function evolves?

Point #1, i.e. recruiting people into the SRE team, appears to be a major sticking point for many leaders and organizations.

You may wish to consider drawing people from areas beyond the usual search pools of “already an SRE”, SysAdmin, or developer.

Consider pulling people from various areas so that they can learn from each other and, most importantly, teach each other. Example areas include L2 support, FinOps, environments, and technology governance.

Now that you’ve reviewed the fundamental questions, here are some tips I share with subscribers and potential clients who are creating a new SRE team:

  • Allow for the development of subject matter experts in key areas
  • Don’t have a single-point-of-failure (SPoF) of one person who hoards all knowledge and access in one area
  • Have a regular cadence for reviewing high-level issues: architecture and service topology, development environments, and deployments
  • Double down on cultivating the mindset & culture because without it, people will start thinking SRE is a project with a start and stop
  • Never underestimate how much you need to do to create a shared responsibility model for on-call and observability instrumentation
  • Set aside time blocks for non-technical meetings like leadership coaching, retrospectives, and values reinforcing sessions
  • Invest in the ongoing development of your individual contributors’ abilities including capabilities you don’t cover yet
  • Make accessible playbooks for known issues and processes so that people don’t reinvent the wheel when doing routine work

Examples of automatable or playbook-able processes:

  • Open support incident
  • Scale services up or down
  • Run end-to-end tests
  • Collect logs/extend context
  • Rollback to a stable state
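
As a rough illustration of what "playbook-able" can mean in practice, here is a minimal sketch of a playbook registry covering two of the processes above. Everything in it is hypothetical: the `playbook` decorator, the `run_playbook` helper, and the `"version"` and `"healthy"` keys are names invented for this example, and a real rollback or scaling step would call your own deployment tooling rather than return a string.

```python
from typing import Callable, Dict, List

# Registry mapping a playbook name to the function that runs it.
PLAYBOOKS: Dict[str, Callable[..., str]] = {}


def playbook(name: str):
    """Decorator that registers a function as a named playbook."""
    def register(func):
        PLAYBOOKS[name] = func
        return func
    return register


@playbook("rollback")
def rollback_to_stable(releases: List[dict]) -> str:
    """Pick the most recent release marked healthy.

    `releases` is ordered oldest to newest; each entry is a dict with
    the hypothetical keys "version" and "healthy".
    """
    for release in reversed(releases):
        if release["healthy"]:
            return release["version"]
    raise LookupError("no healthy release to roll back to")


@playbook("scale")
def scale_service(current_replicas: int, delta: int, minimum: int = 1) -> str:
    """Compute a new replica count, never dropping below `minimum`."""
    target = max(minimum, current_replicas + delta)
    return f"replicas={target}"


def run_playbook(name: str, *args, **kwargs) -> str:
    """Look up a playbook by name and run it."""
    if name not in PLAYBOOKS:
        raise KeyError(f"unknown playbook: {name}")
    return PLAYBOOKS[name](*args, **kwargs)


history = [
    {"version": "1.4.0", "healthy": True},
    {"version": "1.5.0", "healthy": True},
    {"version": "1.6.0", "healthy": False},
]
print(run_playbook("rollback", history))  # prints 1.5.0
print(run_playbook("scale", 3, -5))       # prints replicas=1
```

Keeping playbooks behind one named entry point is one way to make them discoverable, so that routine work does not depend on a single person remembering the steps.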

I hope this quick guide gave you adequate food for thought for your approach to building an SRE team from scratch.

If you want to learn more about SRE culture, be sure to enrol in the free 7-day course via this link.