Introduction

In this post, I will cover the following patterns of system resilience:
- Adaptive Response
- Superior Monitoring
- Coordinated Resilience
- Heterogeneous Systems
- Dynamic Repositioning
- Requisite Availability
Let’s cover the definition of system resilience before exploring these patterns in greater depth.
System resilience is the ability of organizational, hardware and software systems to mitigate the severity and likelihood of failures or losses, to adapt to changing conditions, and to respond appropriately after the fact.
— Jackson, Scott. (2007). System Resilience: Capabilities, Culture and Infrastructure. INCOSE International Symposium.
I’ve given you an academic definition of resilience because it’s essential to be precise about what system resilience stands for.
Adaptive response for system resilience

Definition:
The ability to respond in a timely and appropriate manner.
Factors for effective adaptive response:
Superior monitoring for system resilience

Definition:
Monitor for and detect adverse events in a timely manner – before they become a critical issue.
Factors for effective superior monitoring:
Monitoring can be treated as one part of a broader observability capability, which covers logging, tracing and monitoring.
Engineers monitor for system issues, trace the origins of the problem and review logs to identify patterns.
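As a rough illustration of detecting an adverse event before it becomes critical, here is a minimal Python sketch that classifies metric readings against a warning threshold and a critical threshold. The `Thresholds` class, the function names, the disk-usage readings and the 80/95 limits are all hypothetical and chosen only for this example. A real setup would normally rely on a monitoring tool rather than hand-written checks.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Thresholds:
    warning: float   # level at which engineers should be alerted early
    critical: float  # level at which the issue is already critical


def classify(value: float, limits: Thresholds) -> str:
    """Return 'ok', 'warning' or 'critical' for a single metric reading."""
    if value >= limits.critical:
        return "critical"
    if value >= limits.warning:
        return "warning"
    return "ok"


def first_alert_index(readings: List[float], limits: Thresholds) -> Optional[int]:
    """Index of the first reading that is not 'ok', or None if all are fine."""
    for index, value in enumerate(readings):
        if classify(value, limits) != "ok":
            return index
    return None


# Hypothetical disk-usage readings in percent (assumed values, not real data).
disk_limits = Thresholds(warning=80.0, critical=95.0)
readings = [62.0, 71.5, 83.0, 90.2, 96.4]

statuses = [classify(r, disk_limits) for r in readings]
# statuses == ['ok', 'ok', 'warning', 'warning', 'critical']
alert_at = first_alert_index(readings, disk_limits)
# alert_at == 2, two readings before the critical level is reached
```

The gap between the warning and critical thresholds is what buys time to respond. If the two sit too close together, the alert arrives when the issue is already critical.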
Coordinated resilience for system resilience

Definition:
Increase the depth of resilience by increasing the number of obstacles a problem must pass before it impacts systems.
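One way to picture these obstacles is as layers that a failure must get through before users notice it. The Python sketch below assumes two layers, a retry and a cached fallback. The helpers `with_retries`, `resilient_fetch` and `make_flaky` are hypothetical names invented for this illustration, not part of any library.

```python
from typing import Callable, Optional


def with_retries(call: Callable[[], str], attempts: int) -> Optional[str]:
    """First obstacle: retry a failing call a fixed number of times."""
    for _ in range(attempts):
        try:
            return call()
        except ConnectionError:
            continue
    return None


def resilient_fetch(call: Callable[[], str], attempts: int,
                    cached: Optional[str]) -> str:
    """Second obstacle: fall back to a cached value if every retry fails."""
    result = with_retries(call, attempts)
    if result is not None:
        return result
    if cached is not None:
        return cached
    # Only when every layer has been passed does the failure reach the caller.
    raise RuntimeError("all resilience layers exhausted")


def make_flaky(failures: int) -> Callable[[], str]:
    """Build a simulated dependency that fails `failures` times, then works."""
    remaining = {"count": failures}

    def call() -> str:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise ConnectionError("simulated outage")
        return "fresh"

    return call


short_outage = resilient_fetch(make_flaky(2), attempts=3, cached="stale")
# short_outage == 'fresh': the retry layer absorbed the problem
long_outage = resilient_fetch(make_flaky(5), attempts=3, cached="stale")
# long_outage == 'stale': the cache layer absorbed the problem
```

Each extra layer covers a failure the previous one could not absorb, so the problem has to defeat all of them before it impacts the system.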
Factors for effective coordinated resilience:
Heterogeneous systems for system resilience

Definition:
Allows for redundancy of service delivery to reduce the occurrence of common-mode failures, e.g. those that arise from serving high-traffic applications using one CPU.
Factors for effective heterogeneous systems:
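To make the idea of a common-mode failure concrete, here is a small Python sketch. It assumes a fault that takes out every replica running on one platform. The replica names, the platform labels and the `surviving_replicas` function are hypothetical.

```python
from typing import Dict, List


def surviving_replicas(replicas: Dict[str, str], failed_platform: str) -> List[str]:
    """Return the replicas that are still up after one platform fails.

    `replicas` maps a replica name to the platform it runs on.
    """
    return sorted(
        name for name, platform in replicas.items()
        if platform != failed_platform
    )


# Assumed deployments for illustration only.
homogeneous = {"replica-a": "x86", "replica-b": "x86"}
heterogeneous = {"replica-a": "x86", "replica-b": "arm"}

same_platform = surviving_replicas(homogeneous, failed_platform="x86")
# same_platform == []: one fault removes every replica
mixed_platform = surviving_replicas(heterogeneous, failed_platform="x86")
# mixed_platform == ['replica-b']: the service can still be delivered
```

Adding more copies of the same thing does not help against a fault they all share. The redundancy only pays off when the copies differ in the dimension that fails.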
Dynamic repositioning for system resilience

Definition:
Increase the ability to recover from an incident by distributing and diversifying the network.
Factors for effective dynamic repositioning:
Requisite availability for system resilience

Definition:
Balancing act for making services and data available to users. Some systems must be available at all times, others not as much. Some data is privileged, other data is not.
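One common way to express how available a system must be is an availability target, which translates directly into a downtime budget. The Python sketch below shows that arithmetic. The function name, the 30-day period and the two service tiers are assumptions made for this example.

```python
def allowed_downtime_minutes(availability_percent: float,
                             period_days: int = 30) -> float:
    """Minutes of downtime permitted in a period for a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Hypothetical targets: a user-facing service and an internal one.
targets = {"checkout": 99.99, "internal-reporting": 99.0}

budgets = {
    service: round(allowed_downtime_minutes(target), 2)
    for service, target in targets.items()
}
# budgets == {'checkout': 4.32, 'internal-reporting': 432.0}
```

Under these assumed targets, the always-on service gets about four minutes of downtime per 30 days and the internal one gets just over seven hours. That difference is the balancing act in numbers.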
Factors for effective requisite availability:
Parting remarks
We have now covered six patterns for system resilience that can support more reliable software. Do you know of any others? Reach out if you do.