Articles on better software operations practices

Cloud infrastructure success is a fine balance of budget and service quality

The visual summary below is based on a post by Will Larson, who started the SRE function at Uber. His post elaborates on a “trunks and branches” model for developing infrastructure-facing teams. It also covers an interesting perspective on the balancing act between budget and service quality. I will explain the visual summary underneath it. … Read More

How 6 system resilience patterns increase software reliability

Introduction: System resilience thinking can inform better Site Reliability Engineering decisions. Specifically, it can affect how the SRE culture unfolds and handles critical situations. The system resilience concept is rooted in theoretical computer science. Don’t panic. I will explain how it can – in a practical way – support increased software reliability in production. We … Read More

Site Reliability Engineering Culture Patterns

Who should read this: developers new to Site Reliability Engineering (SRE) who want to understand the culture; current SREs who are seeking to guide others, such as management, on key aspects of SRE culture; and technical leaders who want to create an ideal culture for effective software reliability practices. Introduction: Despite its now antiquated-sounding name, Site … Read More

Rundown of Netflix’s SRE practice

Introduction: A lot goes on in the background every time you load up your favorite Netflix movie or series. Engineers spread across Chaos Engineering, Performance Engineering and Site Reliability Engineering (SRE) are working non-stop to ensure the magic keeps happening. 📊 Here are some performance statistics for Netflix: When it was alone on top of … Read More

25+ Site Reliability Engineering OKRs

Incident Response OKRs: Reduce MTTR for on-call engineers by 5%; develop buffers to ensure incidents remain at < 75% of the error budget; mitigate false-positive system alerts to reduce on-call staff costs; speed up the resolution of critical incidents by 5%; increase the coverage of 4-point SLIs from 90% of services to 100%; reduce … Read More

Runbooks for better incident response

Introduction: I can confidently tell you that runbooks form a critical part of the incident response toolkit. I will also tell you that SREs are well-placed to start and oversee the development of runbooks. If you don’t have a runbook yet, let me entice you with the thought of checklist-type documentation to follow when you’re … Read More

SRE is not a monolithic role

As SRE gains traction, a misconception is gaining steam among senior stakeholders: that SRE is a monolithic role, like what “programmers” were in the 90s. Let’s burst that misconception… SRE is a broad, overarching responsibility that needs a multitude of role considerations to pull off properly. It is not a monolithic role where … Read More

How SREs are unique in their approach to work

Site Reliability Engineers (SREs) are a rare bunch in the software community. But there’s little denying that the approach of Site Reliability Engineering is the future of software operations. Here are some things that make SREs a unique breed in software work. SREs look at the broader picture: Ask any developer what they’re working on … Read More