Category: Articles

Check out our written research on SRE and software operations topics ⬇️⬇️⬇️

Articles, Team Development

Building the case for starting a software reliability team

This article aims to help engineering leaders consider the issues involved before starting a software reliability team. Since I am an advocate for Site Reliability Engineering (SRE), I will refer to such a team as the “SRE team”. Besides creating a new team, leaders face many responsibilities that are often invisible to individual contributors and their…

May 31, 2022
Articles, Opinion

Renaming “post-mortems” of software outages for psychological safety

As a generative leader and mental health advocate, I am wary of seeing such a morbid term thrown around for what should be a learning experience that advances culture. This post will differ from my usual positive posts about Site Reliability Engineering (SRE). Please bear with me because I’m otherwise a forward thinker. Two…

May 8, 2022
Articles, Opinion

Why Agile software teams need SRE support

Agile software delivery is de rigueur in modern software. However, as complexity increases, there’s a high risk of frequent, high-velocity breakage of software-in-production. Software-in-production is software that is accessible by users. That’s where Site Reliability Engineers (SREs) can come in to support the Agile software team’s efforts. Who are Site Reliability Engineers? Site Reliability Engineers (SREs)…

April 29, 2022
Articles, Opinion, Visual Summaries

Review of Google’s Site Reliability Engineering Hierarchy

Google’s book on SRE, Site Reliability Engineering (2016), has won wide acclaim in the software operations world. One of the book’s most discussed aspects in SRE circles is its SRE hierarchy. The hierarchy has merit, but it’s also flawed in a way that would prevent you from educating people about SRE. I’ll get…

April 26, 2022
Articles, Mildly Technical

Site Reliability Engineering Glossary

April 26, 2022
Articles, Team Development, Visual Summaries

How cloud infrastructure teams evolve – from start to maturity

I recently read a post by Will Larson, who started SRE at Uber. The post is called “Trunks and branches model for scaling infrastructure organizations”. Several passages in the post covered how infrastructure teams can evolve from the startup phase. I felt it would be easier to comprehend the dense-and-rich advice with a visual…

April 19, 2022
Articles, Team Development, Visual Summaries

Cloud infrastructure success is a fine balance of budget and service quality

The visual summary below is based on a post by Will Larson, who started the SRE function at Uber. His post elaborates on a “trunks and branches” model for developing infrastructure-facing teams. It also covered an interesting perspective on the balancing act of budget and service quality. I will explain the visual summary underneath it.…

April 12, 2022
Articles, Mildly Technical

How 6 system resilience patterns increase software reliability

System resilience thinking can inform better Site Reliability Engineering decisions. Specifically, it can affect how the SRE culture unfolds and handles critical situations. The system resilience concept is rooted in theoretical computer science. Don’t panic. I will explain how it can – in a practical way – support increased software reliability in production. We…

April 1, 2022
Articles, Opinion, Team Development, Visual Summaries

Site Reliability Engineering Culture Patterns

Despite its now antiquated-sounding name, Site Reliability Engineering (SRE) as a discipline holds strong promise for proactively improving software reliability in production. As software complexity continues to increase, so will the need for better and better practice of SRE. It is undoubtedly an exciting but enigmatic field, with…

March 17, 2022
Articles, Case Studies

Rundown of Netflix’s SRE practice

A lot goes on in the background every time you load up your favorite Netflix movie or series. Engineers spread across Chaos Engineering, Performance Engineering and Site Reliability Engineering (SRE) are working non-stop to ensure the magic keeps happening. 📊 Here are some performance statistics for Netflix: When it was alone on top of…

February 9, 2022