Tag: incident response

  • #21 – Better SRE in 2024 is all we can hope for

    Episode 21 [SREpath Podcast] Show notes: Sebastian is back for this episode to help set out direction for 2024. We reflected during the holidays on the problems SREs faced in 2023 in terms of job insecurity, burnout, and “that really shouldn’t be my sole job”. Sebastian and I talked about what we hope to bring…

  • #14 Faster Incident Resolution through Data-Driven Notebooks (with Ivan Merrill)

    Episode 14 [SREpath Podcast] Ash Patel interviews Ivan Merrill, head of solutions engineering at Fiberplane. Ivan shares insights about making sense of the big data that comes from observability and incident response, to improve learning and drive faster incident resolution in the future. He also sheds light on the importance of fostering collaboration…

  • #12 From Incident Firefighting to Reliability First (with Robert Ross)

    Episode 12 [SREpath Podcast] Ash Patel interviews Robert Ross, founder and CEO of FireHydrant, an incident management platform. Robert talks about his experiences as an SRE and building tools that make developers’ lives easier. He also shares his insights from offering incident management software to SREs and other software incident responders. Highlights…

  • How developers can survive “you build it, you run it”

    Introduction: As a developer, your involvement with your code can range from having nothing to do with it once it’s been committed all the way to looking after it right up to production. The latter is called the “you build it, you run it” model. It’s not going away. But that depends on your organization. It’s likely to…

  • Reduce software outage risk with passive guardrails

    Shocking fact: only 10-25% of software outages are caused by hardware or network failure. The rest are the result of human error like misconfiguration (paraphrasing Martin Kleppmann, Designing Data-Intensive Applications). In this article, I will share with you how setting up passive guardrails in and around developer workflows can reduce the frequency and severity…

  • Renaming “post-mortems” of software outages for psychological safety

    As a generative leader and mental health advocate, I am wary of seeing such a morbid term being thrown around for what should be a learning experience that advances culture. This post will differ from my usual positive posts about Site Reliability Engineering (SRE). Please bear with me, because I’m otherwise a forward thinker. Two…

  • Runbooks for better incident response

    Introduction: I can confidently tell you that runbooks form a critical part of the incident response toolkit. I will also tell you that SREs are well-placed to start and oversee the development of runbooks. If you don’t have a runbook yet, let me entice you with the thought of checklist-type documentation to follow when you’re…