Runbooks for better incident response – Boost software reliability

Introduction

I can confidently tell you that runbooks form a critical part of the incident response toolkit. I will also tell you that SREs are well-placed to start and oversee the development of runbooks.

If you don’t have a runbook yet, let me entice you with the thought of checklist-type documentation to follow when you’re woken up to deal with a 3am production meltdown.

You won’t be the only one using the runbook. Its simplicity makes it easier to bring product teams into the incident response action. It gives clarity to those who may not be as experienced as you when investigating faults with their work-in-production.

Runbooks are most useful when your incident response has become a case of “putting out the same fires over and over again”. They remove unnecessary thinking from incident response and help you focus on the task at hand.

Or at least carry out the work without an overwhelmed 🤯 feeling.

Why runbooks are useful in SRE incident response

Here are 3 reasons why runbooks are superior to “I’ll figure it out as it comes” as a strategy:

Automated processes don’t protect against every possible issue, so software operations still needs tens to hundreds of different activities carried out by skilled humans to keep the system rolling
“30-40% of procedures require human judgment to resolve safely, so that’s still a bunch of run books that won’t go away – even if large parts of deployment are push-button / automated processes.”
Runbooks prevent annoying experiences like this: “I recently ran into a situation where I spent 6 hours understanding how something works that would have taken 20 minutes if the relevant information was stored somewhere.”

Ways that teams have set up their runbooks

Confluence – not designed specifically for managing runbooks, but an open-ended tool that works well if you already have a solid idea of how to design a runbook effectively

Jupyter Notebooks – an open-source tool that combines text, images, and live code snippets, so a decent option if you are happy to install and maintain it

Markdown files hosted in a Git repo – maintenance might become an issue over time without strict guidelines within the team

Err… this ➝ “Sticky notes on someone’s desk. We’re thinking about getting a laminator to keep the coffee spills from being too serious of a problem.” 😅

Factors to consider when developing your own runbook

Make a standard runbook template – a consistent layout makes information easier to process when you are in a pinch, such as when resolving an urgent incident
Take a collaborative approach to building the runbooks – don’t palm the task off to technical writers. The people who design and build the systems should be the main authors, or at least actively participate in the process
Explain to the runbook user why the component of the system was designed the way it was – ambiguity around intentions is a key reason people fail to come up with creative solutions to a tricky problem
Some runbooks have sub-processes – it’s important to clarify what these are and how they relate to their parent process
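
To make those factors a little more concrete, here is one way a standard template could look in Python. Treat it as a rough sketch rather than a recommended format: the `Runbook` class, its field names, and the example incident are all made up for illustration. It covers a fixed layout, named authors from the team that built the system, a note on why the component was designed the way it was, and sub-processes linked back to their parent.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """One runbook in a shared template. All field names are illustrative."""

    title: str
    trigger: str           # the alert or symptom that sends someone here
    design_rationale: str  # why the component was built the way it was
    steps: list[str]
    authors: list[str] = field(default_factory=list)
    sub_runbooks: list["Runbook"] = field(default_factory=list)

    def render(self, depth: int = 0) -> str:
        """Return the runbook as indented plain text, children included."""
        indent = "  " * depth
        lines = [
            f"{indent}{self.title}",
            f"{indent}Trigger: {self.trigger}",
            f"{indent}Why it is built this way: {self.design_rationale}",
        ]
        if self.authors:
            lines.append(f"{indent}Authors: {', '.join(self.authors)}")
        for number, step in enumerate(self.steps, start=1):
            lines.append(f"{indent}{number}. {step}")
        for child in self.sub_runbooks:
            # Spell out which parent process each sub-process belongs to.
            lines.append(f"{indent}Sub-process of '{self.title}':")
            lines.append(child.render(depth + 1))
        return "\n".join(lines)


# Hypothetical example data, purely to show the shape of the template.
restart_cache = Runbook(
    title="Restart cache node",
    trigger="Cache node stops responding",
    design_rationale="Nodes are stateless, so a restart loses no data",
    steps=["Drain traffic from the node", "Restart the cache service"],
)

checkout_latency = Runbook(
    title="Checkout latency high",
    trigger="Latency alert fires for the checkout service",
    design_rationale="Checkout reads through the cache to protect the database",
    steps=["Check the cache hit rate", "If the cache is unhealthy, run the sub-process"],
    authors=["checkout team"],
    sub_runbooks=[restart_cache],
)

print(checkout_latency.render())
```

Because every runbook goes through the same `render` method, each one reads in the same order at 3am, and a sub-process always appears indented under the process it belongs to.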

Author

Ash Patel

Reliability Nut at SREpath

Ash has an unhealthy obsession with software reliability. Maybe it’s got to do with the trauma of working at a few companies where software kept slowing or went down while he worked to turn it around. His ma hopes that he can one day turn this passion into a respectable job or business. Still waiting…
