6 system resilience patterns for increasing software reliability

Introduction: In this post, I will cover the following patterns of system resilience: Adaptive Response, Superior Monitoring, Coordinated Resilience, Heterogeneous Systems, Dynamic Repositioning, and Requisite Availability. Let’s cover the definition of system resilience before exploring these patterns in greater depth. System resilience is the ability of organizational, hardware and software systems to mitigate the severity and … Read more

Rundown of Netflix’s SRE practice

Introduction: A lot goes on in the background every time you load up your favourite Netflix movie or series. Engineers spread across Chaos Engineering, Performance Engineering and Site Reliability Engineering (SRE) are working non-stop to ensure the magic keeps happening. 📊 Here are some performance stats for Netflix: When it was alone on top of … Read more

25+ Site Reliability Engineering OKRs

Please read this before reviewing the Site Reliability OKRs below: Many of the OKRs below are ambitious examples – more than what most junior SREs should be given. Most OKRs would be the culmination of efforts by an entire SRE team, not a sole engineer. Numbers in the OKRs, e.g. 0.75%, have been arbitrarily … Read more

Runbooks for better incident response

Introduction: Runbooks are a Site Reliability Engineer’s best friend. They are most useful when you envisage putting out the same fires again and again, or at least want to do so without a 🤯 feeling. Why runbooks are useful in SRE incident response: Here are 3 reasons why: Automated processes don’t always protect against issues, so software needs 10s … Read more

SRE is not a monolithic role

SRE is gaining more traction, and a misconception is gaining steam among senior stakeholders: that SRE is a monolithic role, like what “programmers” were in the 90s. Let’s burst that misconception… SRE is a broad, overarching responsibility that needs a multitude of role considerations to pull off properly. It is not a monolithic role where … Read more

How SREs are unique in their approach to work

Site Reliability Engineers (SREs) are a rare bunch in the software community. But there’s little denying that the approach of Site Reliability Engineering is the future of software operations. Here are some things that make SREs a unique breed in software work: SREs look at the broader picture: Ask any developer what they’re working on … Read more