Runbooks for better incident response

Introduction

Runbooks are a Site Reliability Engineer’s best friend. They are most useful when you expect to put out the same fires again and again and want to do it faster each time, or at least without a 🤯 feeling.

Why runbooks are useful in SRE incident response

Here are 3 reasons why:

  1. Automated processes don’t always protect against issues, so software still needs tens to hundreds of different activities carried out by skilled humans to keep the system running
  2. “30-40% of procedures require human judgement to resolve safely, so that’s still a bunch of run books that won’t go away – even if large parts of deployment are push-button / automated processes.”
  3. Runbooks prevent annoying experiences like this: “I recently ran into a situation where I spent 6 hours understanding how something works that would have taken 20 minutes if the relevant information was stored somewhere.”

Ways that teams have set up their runbooks

Confluence: not designed specifically for managing runbooks, but it is an open-ended tool that works well if you already have a solid idea of runbook design

Jupyter Notebooks: an open-source tool that combines text, images and live code snippets, so it is a decent option if you are happy to install and maintain it

Markdown files hosted in a git repo: maintenance might become an issue over time without strict guidelines within the team

Err… this ➝ “Sticky notes on someone’s desk. We’re thinking about getting a laminator to keep the coffee spills from being too serious of a problem.” 😅
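
For the Markdown-in-git option, one way to back up team guidelines is a small check that flags runbooks missing agreed sections. The sketch below is only an illustration: the section names in REQUIRED_SECTIONS and the function name missing_sections are hypothetical, not something any tool above prescribes.

```python
import re

# Hypothetical required headings; a team would agree on its own list.
REQUIRED_SECTIONS = ["Symptoms", "Diagnosis", "Resolution", "Escalation"]

# Matches Markdown headings such as "## Symptoms" and captures the heading text.
HEADING_PATTERN = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*#*[ \t]*$", re.MULTILINE)


def missing_sections(markdown_text, required=REQUIRED_SECTIONS):
    """Return the required headings that do not appear in the runbook text."""
    found = {m.group(1).strip().lower() for m in HEADING_PATTERN.finditer(markdown_text)}
    return [name for name in required if name.lower() not in found]
```

A check like this could run in CI or a pre-commit hook over each runbook file, so gaps show up at review time rather than in the middle of an incident.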

Factors to consider in your own runbook setup

  • Make a standard runbook template: a consistent layout makes information easier to process when you are in a pinch, such as when resolving an urgent incident
  • Take a collaborative approach to building the runbooks: don’t palm them off to technical writers. The people who design and build the systems should be the main authors, or at least participate in the process
  • Explain why the component of the system was designed the way it appears to the runbook user
  • Some runbooks have sub-processes: clarify what these are and how they relate to the parent process
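
The factors above can be sketched as a small data structure. This is a minimal illustration with assumed names (Runbook, design_rationale, sub_runbooks, outline), not a standard format: each runbook carries the same fields, records why the component was designed as it is, and lists its sub-processes explicitly.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Runbook:
    """Hypothetical template: every runbook fills in the same fields."""
    title: str
    design_rationale: str  # why the component was designed the way it is
    steps: List[str] = field(default_factory=list)
    sub_runbooks: List["Runbook"] = field(default_factory=list)  # child processes


def outline(runbook, depth=0):
    """Return indented lines showing a runbook and the sub-processes under it."""
    lines = [f"{'  ' * depth}{runbook.title} ({len(runbook.steps)} steps)"]
    for child in runbook.sub_runbooks:
        lines.extend(outline(child, depth + 1))
    return lines


# Example data is invented purely for illustration.
failover = Runbook(
    "Promote replica",
    "Replicas are asynchronous to keep writes fast.",
    ["Check replication lag", "Promote the replica"],
)
outage = Runbook(
    "Database outage",
    "A single primary keeps writes consistent.",
    ["Confirm the outage", "Fail over", "Verify traffic"],
    [failover],
)
print("\n".join(outline(outage)))
```

Printing the outline makes the parent and child relationship visible at a glance, which is the point of the last bullet: a responder can see which process they are in and which one it belongs to.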
