6 system resilience patterns for increasing software reliability

Introduction
In this post, I will cover the following patterns of system resilience: Adaptive Response, Superior Monitoring, Coordinated Resilience, Heterogeneous Systems, Dynamic Repositioning, and Requisite Availability.
Let’s cover the definition of system resilience before exploring these patterns in greater depth. System resilience is the ability of organizational, hardware and software systems to mitigate the severity and … Read more

25+ Site Reliability Engineering OKRs

Please read this before reviewing the Site Reliability OKRs below:
Many of the OKRs below are ambitious examples, more than what most junior SREs should be given.
Most OKRs would be the culmination of efforts by an entire SRE team, not a sole engineer.
Numbers in the OKRs, e.g. 0.75%, have been arbitrarily … Read more

Runbooks for better incident response

Introduction
Runbooks are a Site Reliability Engineer’s best friend. They are most useful when you envisage putting out the same fires again and again, or at least want to do it without a 🤯 feeling.
Why runbooks are useful in SRE incident response
Here are 3 reasons why:
Automated processes don’t always protect against issues, so software needs 10s … Read more