10 Tips for Onboarding New SRE Hires

How new SRE hires can get stuck

There’s more than one way to mess up your new SRE hire and get them stuck in a loop.

Here are 6 signs, from the new hire’s point of view, that onboarding has gone wrong:

  1. unclear role requirements
  2. going too advanced too soon
  3. not having any tangible, measurable things to do in the first few months
  4. not feeling connected with the rest of the SRE team
  5. no clarity on how SRE fits into the wider organization
  6. little to no collaboration with teams outside of SRE

This article will unpack these 6 sticking points and show how to solve them.

Later on, I will share 4 additional tips to improve the onboarding experience.

Let’s unpack and solve each of these sticking points

Unclear role requirements

SRE is a broad-spanning role.

Your definition of SRE might differ from other companies your new hire has worked at.

Or your organization might not have clearly defined its SRE goals at all.

This can result in a less-than-clear view of the work that needs to be done.

In a nutshell, the roles you are hiring for lose clarity.

New hires can only be as effective as their knowledge of what needs to be done.

I recommend that you consider this hierarchy of role clarity:

Novice – no role description at all

Beginner – written job description listing all duties

Advanced – detailed job description adding priority areas and key outcomes

God mode – a visual roadmap covering everything in the Advanced level and showing progression over a timeline

Going too advanced too soon

There’s an old saying that you can’t run before you can walk.

This could not be more true for making sure new hires are effective in the long run.

I’ve seen a lot of managers do this: give a grace period of 4-6 weeks of not-so-tough tasks and then throw the new hire into the deep end of the swimming pool.

This does not work in my experience.

The work your current SREs have been doing may be something they have been doing for years.

And it’s likely they worked their way up to the level of competence you see.

So step your new hire up to more advanced work gradually, increasing the complexity a little at a time.

This applies to SREs with prior experience too.

Not having tangible, measurable things to do in the first few months

This ties in with a lack of role clarity.

If you have a clear idea of what your new hire needs to do, it’s easier to set measurable goals of where they should focus their attention.

Setting goals is much harder if you don’t know where to start the new hire off.

Some tangible and measurable pieces of work can include:

  • Shadow other SREs on incident responses (measured in the number of incidents)
  • Write incident reports (measured in the number of reports and takeaways)
  • Document processes as they learn about them (measured in the number of processes)
  • Write code to improve tooling (measured in lines and commits)

These may sound banal, but they are tangible and measurable.
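If you want to keep score without much ceremony, the measures listed above can be tallied with a few lines of Python. This is only a sketch: the OnboardingActivity record, the activity labels, and the sample dates are all made up for illustration, so swap in whatever your team actually tracks.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class OnboardingActivity:
    """One completed piece of onboarding work (hypothetical record shape)."""
    day: date
    kind: str  # hypothetical labels, e.g. "incident_shadowed", "incident_report"


def tally(activities: list[OnboardingActivity]) -> Counter:
    """Count how many activities of each kind the new hire has completed."""
    return Counter(activity.kind for activity in activities)


# Made-up sample log for a new hire's first weeks.
log = [
    OnboardingActivity(date(2024, 1, 8), "incident_shadowed"),
    OnboardingActivity(date(2024, 1, 15), "incident_shadowed"),
    OnboardingActivity(date(2024, 1, 17), "incident_report"),
    OnboardingActivity(date(2024, 1, 22), "process_documented"),
]

summary = tally(log)
for kind, count in sorted(summary.items()):
    print(f"{kind}: {count}")
```

A spreadsheet does the same job. The point is that each item on the list reduces to a simple count you can review in a check-in.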

Not feeling connected with the rest of the SRE team

This can depend heavily on how the team is set up: in an office, hybrid, or fully remote.

It can also vary depending on whether or not you have team-building initiatives in place.

The key is to create an emotional connection between the new hire and the work/team.

Here are 3 ways you can increase the emotional connection with your team:

  • Rotating buddy programs. Connecting a new hire with a buddy is useful, but a single buddy means only one strong connection. Rotate buddies around the team at a reasonable interval, such as every 4 weeks, to increase the spread of what I call “work buddy feels”.
  • Participation in a collaborative project. Get the new hire in on any project or activity that involves multiple members of your team. This can include projects like capacity planning for traffic variations or setting up a new incident response protocol.
  • Regular manager check-ins. I feel this should go without saying, but managers get busy at times and forget that they have a new hire, or several. It’s a wise idea to set a schedule ahead of time for very quick check-ins.
    • Ideal mode: do verbal check-ins at scheduled intervals.
    • Autopilot mode: pre-write emails for sending at scheduled intervals.
    You can do both if you wish. I’ve done it myself over the years.
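For the rotating buddy program, a schedule is easy to generate up front so nobody has to remember whose turn it is. Here is a minimal sketch, assuming a 4-week rotation as suggested above. The function name, teammate names, and start date are hypothetical.

```python
from __future__ import annotations

from datetime import date, timedelta
from itertools import cycle, islice


def buddy_schedule(
    start: date,
    teammates: list[str],
    rotations: int,
    weeks_per_buddy: int = 4,
) -> list[tuple[date, str]]:
    """Return (start_date, buddy) pairs, cycling through the team in order."""
    if not teammates:
        raise ValueError("need at least one teammate to act as a buddy")
    buddies = islice(cycle(teammates), rotations)
    return [
        (start + timedelta(weeks=index * weeks_per_buddy), buddy)
        for index, buddy in enumerate(buddies)
    ]


# Hypothetical team and start date.
schedule = buddy_schedule(date(2024, 1, 8), ["Asha", "Ben", "Chen"], rotations=4)
for starts_on, buddy in schedule:
    print(f"{starts_on.isoformat()}: {buddy}")
```

The same dates can double as prompts for the scheduled manager check-ins, whether you do them verbally or by pre-written email.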

No clarity on how SRE fits into the broader organization

The last thing a new hire wants to do is step on toes.

Especially of people outside your SRE team.

This can happen if you don’t delineate the specific role of SRE in the organization.

What do I recommend?

Explicitly communicate to your new hire how the SRE function supports the broader engineering function including:

  • What developers are responsible for, e.g. whether you follow a “you build it, you run it” model, a “we run it for you” model, or a hybrid.
  • Whether other functions own a typical SRE capability, e.g. a dedicated observability team that works independently of the SRE team.
  • How much assistance other functions need, e.g. does AppSec need help navigating services during an incident? Do developers need incident support?

Little to no collaboration with teams outside of SRE

If you struggled with the previous point about how SRE supports the broader organization, collaboration with other teams is likely to suffer too.

Active collaboration with other teams is a key tenet of effective SRE teams.

I have heard many SRE managers describe a related antipattern:

→ their SRE function is centralized (bueno)

→ but also only works with “outsiders” through tickets (¡no bueno!)

The reality of SRE collaboration should be very different.

I highly encourage SRE leaders to create formal collaboration activities with other departments including:

  • education sessions for developers on various operational areas
  • collaborative special projects with other areas like quality assurance, AppSec, database analysts, etc.
  • embedding SREs within critical feature teams

More ways to make sure new SREs don’t get stuck

Start new hires off “small and steady”

The trick here is to start new hires off with a small part of the bigger SRE puzzle.

What puzzle am I referring to?

It’s all the work that your SRE team is doing like observability instrumentation, responding to incidents, and more.

My belief is that you don’t want to throw a new SRE hire into fixing your tricky capacity issues (for example) on Day 1.

This also goes for new hires who have worked as SREs before.

What you do want to do is give them work that is small but measurable and gives them a sense of achievement once completed.

They need to feel like they are progressively unlocking increasingly complex work.

Kind of like how you’d play a computer game.

Start slow with junior SRE hires

When it comes to junior SREs specifically, I suggest you give them a mix of both “thinking” work and “doing” work, which keeps things spicy.

It will help them stay engaged with learning by doing as well as develop their critical thinking around your processes.

Now, how would this pan out in real life?

I’d say give the newbie hire more concrete tasks initially, such as reviewing past incidents or reverse engineering a solution that your other SREs came up with.

They can also shadow one of your experienced SREs during an incident response.

Not only does this approach help to provide a sense of achievement, but it can also help to build up their process orientation over time.

Over time, a well-hired junior SRE will begin to gain more experience and grow their skillset.

This is when it becomes important to give them the freedom to explore beyond the boundaries of the process.

This is the time to foster creativity and innovation, allowing them to come up with new and better ways of doing things that might not have been considered before.

By providing a balance of structure and autonomy, you can help your junior SREs to grow and thrive in their roles.

This way, you can also ensure that they are contributing to the success of your team and organization as a whole.

Avoid trial-by-fire onboarding practices

Google’s Site Reliability Engineering book (2016) makes the case that trial by fire is not an effective way to onboard new SREs.

This is a valid point, as merely exposing new SREs to numerous tickets can be overwhelming, even counterproductive.

Instead, take a more structured approach to train new SREs.

One that includes ample guidance and mentorship opportunities.

This will not only enable new SREs to gain a better understanding of their role but also ensure that they feel supported and valued within their new organization.

You might combine this with a formal feedback mechanism that helps identify areas for improvement.

A thorough training program with an academy format can also be helpful in this situation.

Evolve the work as new hires become more effective

At the beginning of a project, give team members relatively low levels of autonomy to ensure that everyone is aligned with the project goals.

As the project progresses, gradually increase their levels of autonomy to foster creativity and innovation.

To avoid developing a monolithic role over time, team members should be given responsibility for a breadth of issues.

This not only helps prevent burnout but also ensures that team members are constantly challenged and engaged.

Team members should also have the choice to code their own solution or to use middleware. This allows for flexibility and promotes creativity.

While being on-call is an important part of the job, it should not become the primary responsibility of SREs.

Many SREs quit the industry because they never get past the reactive “dumpster fires” and onto the proactive work they signed up for.

It is important to ensure that on-call duties are balanced with other duties to prevent burnout.

It is important to reemphasize that at no point should SREs become the main destination for tickets.

This turns them into operations in the traditional sense of “responding to tickets as issues arise”.

You will not benefit from the SREs’ software engineering skills if they are constantly putting out fires.

Instead, SREs should be focused on proactive work to improve the system and prevent issues from arising in the first place.
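One way to keep an eye on this balance is to look at how an SRE’s week splits between reactive and proactive work. The sketch below is illustrative only: the category names, the sample hours, and the 0.5 threshold are my assumptions for the example, not figures from this article, so set a limit that suits your team.

```python
from __future__ import annotations

# Hypothetical labels for work that counts as reactive.
REACTIVE_CATEGORIES = frozenset({"on_call", "tickets"})

# Placeholder threshold (an assumption): flag weeks that are mostly reactive.
MAX_REACTIVE_SHARE = 0.5


def reactive_share(hours_by_category: dict[str, float]) -> float:
    """Return the fraction of logged hours spent on reactive work."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    reactive = sum(
        hours
        for category, hours in hours_by_category.items()
        if category in REACTIVE_CATEGORIES
    )
    return reactive / total


# Made-up week for one SRE: 16 reactive hours out of 40.
week = {"on_call": 10.0, "tickets": 6.0, "tooling": 16.0, "capacity_planning": 8.0}

share = reactive_share(week)
needs_attention = share > MAX_REACTIVE_SHARE
print(f"reactive share: {share:.0%}, needs attention: {needs_attention}")
```

If the share creeps up week after week, that is a signal to rebalance before the new hire settles into a pure ticket-handling role.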

Bringing it all together

New SRE hires can get stuck in a loop due to unclear role requirements, going too advanced too soon, a lack of tangible and measurable tasks, feeling disconnected from the team, no clarity on how SRE fits into the organization, and little to no collaboration with teams outside of SRE.

To avoid this, start new hires with small, measurable tasks, give junior SREs a mix of “thinking” and “doing” work, avoid trial-by-fire onboarding, and evolve their work as they become more effective.

On-call duties should be balanced with other duties and SREs should focus on proactive work to improve the system and prevent issues from arising.