How developers can survive “you build it, you run it”

Introduction: As a developer, your involvement with your code can range from having nothing to do with it once it’s been committed to looking after it all the way through to production. The latter is called the “you build it, you run it” model. It’s not going away. But that depends on your organization. It’s likely to …

Reduce software outage risk with passive guardrails

Shocking fact: only 10-25% of software outages are caused by hardware or network failure. The rest are the result of human error such as misconfiguration (paraphrasing Martin Kleppmann, Designing Data-Intensive Applications). In this article, I will share with you how setting up passive guardrails in and around developer workflows can reduce the frequency and severity …

Renaming “post-mortems” of software outages for psychological safety

As a generative leader and mental health advocate, I am wary of seeing such a morbid term being thrown around for what should be a learning experience that advances culture. This post will differ from my usual positive posts about Site Reliability Engineering (SRE). Please bear with this because I’m an otherwise forward thinker. Two …

Runbooks for better incident response

Introduction: I can confidently tell you that runbooks form a critical part of the incident response toolkit. I will also tell you that SREs are well-placed to start and oversee the development of runbooks. If you don’t have a runbook yet, let me entice you with the thought of checklist-type documentation to follow when you’re …