Reduce software outage risk with passive guardrails

Shocking fact: only 10-25% of software outages are because of hardware or network failure. The rest are the result of human error like misconfiguration — paraphrasing Martin Kleppman, Designing Data-Intensive Applications In this article, I will share with you how setting up passive guardrails in and around developer workflows can reduce the frequency and severity … Read More

Renaming “post-mortems” of software outages for psychological safety

As a generative leader and mental health advocate, I am wary of seeing such a morbid term being thrown around for what should be a learning experience that advances culture. This post will differ from my usual positive posts about Site Reliability Engineering (SRE). Please bear with this because I’m an otherwise forward thinker. Two … Read More

Runbooks for better incident response

Introduction I can confidently tell you that runbooks form a critical part of the incident response toolkit. I will also tell you that SREs are well-placed to start and oversee the development of runbooks. If you don’t have a runbook yet, let me entice you with the thought of checklist-type documentation to follow when you’re … Read More