Analysis of SRE and platform setup at 10+ tech companies

In this article, you will see a breakdown of the platform setup and SRE practices within 12 non-FAANG technology companies. This is based on the case studies by Andrios Robert. “There is a lot of content available on how Google did [Site Reliability Engineering]; let’s uncover what happens with the rest of the world.” — …

Is platform engineering at risk of shiny object syndrome?

Much has been debated lately about the emergence of “Platform Engineering” as a solution to software operations problems. It’s an interesting proposition. However, it is not a silver bullet that will fix everything that didn’t work out with Dev versus Ops, DevOps, or SRE. We are missing something very important in our …

Where in team topologies does Site Reliability Engineering fit in?

We will explore the workings of the Team Topologies model and how Site Reliability Engineering (SRE) teams can fit into it. In more detail, I will share with you the following: an overview of the Team Topologies model; the 4 team modalities it proposes; and finally… where SRE teams fit in Team Topologies. Let’s get …

Building the case for starting a software reliability team

This article aims to help engineering leaders consider issues before starting a software reliability team. Since I am an advocate for Site Reliability Engineering (SRE), we will now refer to such a team as the “SRE team”. Besides creating a new team, leaders face many responsibilities that are often invisible to individual contributors and their …

How cloud infrastructure teams evolve – from start to maturity

I recently read a post by Will Larson, who started SRE at Uber. The post is called “Trunks and branches model for scaling infrastructure organizations”. Several passages in the post covered how infrastructure teams can evolve from the startup phase. I felt it would be easier to comprehend the dense-and-rich advice with a visual …

Cloud infrastructure success is a fine balance of budget and service quality

The visual summary below is based on a post by Will Larson, who started the SRE function at Uber. His post elaborates on a “trunks and branches” model for developing infrastructure-facing teams. It also covered an interesting perspective on the balancing act of budget and service quality. I will explain the visual summary underneath it. …

25+ Site Reliability Engineering OKRs

Incident Response OKRs: Reduce MTTR for on-call engineers by 5%; Develop buffers to ensure incidents remain at < 75% of the error budget; Mitigate false-positive system alerts to reduce on-call staff costs; Speed up the resolution of critical incidents by 5%; Increase the coverage of 4-point SLIs from 90% of services to 100%; Reduce …

How SREs are unique in their approach to work

Site Reliability Engineers (SREs) are a rare bunch in the software community. But there’s little denying that the approach of Site Reliability Engineering is the future of software operations. Here are some things that make SREs a unique breed in software work: SREs look at the broader picture. Ask any developer what they’re working on …