Articles on better software operations practices

Is platform engineering at risk of shiny object syndrome?

Much has been debated lately about the emergence of “Platform Engineering” as a solution to software operations problems. It’s an interesting proposition. However, it is not a silver bullet that will fix everything that didn’t work out with Dev versus Ops, DevOps, or SRE. We are missing something very important in our …

Reduce software outage risk with passive guardrails

Shocking fact: only 10-25% of software outages are caused by hardware or network failure. The rest are the result of human error like misconfiguration (paraphrasing Martin Kleppmann, Designing Data-Intensive Applications). In this article, I will share with you how setting up passive guardrails in and around developer workflows can reduce the frequency and severity …

Where in team topologies does Site Reliability Engineering fit in?

We will explore the workings of the Team Topologies model and how Site Reliability Engineering (SRE) teams can fit into it. In more detail, I will share with you the following: an overview of the Team Topologies model, the 4 team modalities it proposes, and finally… where SRE teams fit in Team Topologies. Let’s get …

How Jaeger tracing fits into software observability

In this article, I will share how tracing, and more specifically Jaeger tracing, can fit into your wider software observability strategy. Before we get into tracing, let’s define observability. What is observability? Observability is a comprehensive means of gaining data on how software services perform in production. This data gives you a picture of the …

SRE’s role in safer infrastructure-as-code

This article explores 2 simple ways for SREs to drive better practices and code hygiene within infrastructure-as-code (IaC) tooling like Terraform. Why bother? Because IaC is central to cloud infrastructure efficiency, it’s highly likely that you will get involved with an IaC problem at some point in your SRE career. I will mention Terraform from …

Building the case for starting a software reliability team

This article aims to help engineering leaders weigh the issues involved before starting a software reliability team. Since I am an advocate for Site Reliability Engineering (SRE), I will refer to such a team as the “SRE team” from here on. Besides creating a new team, leaders face many responsibilities that are often invisible to individual contributors and their …

Renaming “post-mortems” of software outages for psychological safety

As a generative leader and mental health advocate, I am wary of seeing such a morbid term thrown around for what should be a learning experience that advances culture. This post will differ from my usual positive posts about Site Reliability Engineering (SRE). Please bear with me, because I’m otherwise a forward thinker. Two …

Why Agile software teams need SRE support

Agile software delivery is de rigueur in modern software. However, as complexity increases, there’s a high risk of frequently breaking software-in-production at high velocity. Software-in-production is software that is accessible to users. That’s where Site Reliability Engineers (SREs) can come in to support the Agile software team’s efforts. Who are Site Reliability Engineers? Site Reliability Engineers (SREs) …

Review of Google’s Site Reliability Engineering Hierarchy

Google’s book on SRE, Site Reliability Engineering (2016), has won wide acclaim in the software operations world. One of the book’s most discussed aspects in SRE circles is its SRE hierarchy. The hierarchy has merit, but it’s also flawed in a way that could prevent you from educating people about SRE. I’ll get …