6 system resilience patterns for increasing software reliability

Introduction

System resilience depends on the orchestration of multiple patterns, including superior monitoring, adaptive response, coordinated resilience, dynamic repositioning, heterogeneous systems and requisite availability.

In this post, I will cover the following patterns of system resilience:

  • Adaptive Response
  • Superior Monitoring
  • Coordinated Resilience
  • Heterogeneous Systems
  • Dynamic Repositioning
  • Requisite Availability

Let’s cover the definition of system resilience before exploring these patterns in greater depth.

System resilience is the ability of organizational, hardware and software systems to mitigate the severity and likelihood of failures or losses, to adapt to changing conditions, and to respond appropriately after the fact.

— Jackson, Scott. (2007). System Resilience: Capabilities, Culture and Infrastructure. INCOSE International Symposium.

I’ve given you an academic definition of resilience because it’s essential to be precise about what system resilience stands for.


Adaptive response for system resilience

Definition:

The ability to respond in a timely and appropriate manner. 

Factors for effective adaptive response:

  • speed of action (but not so fast that you miss critical details)
  • not rigidly staying on a solution path once it has been chosen
  • understanding that problems can branch into escalating issues
  • pre-existing toolsets and processes to rapidly and accurately handle incidents

Superior monitoring for system resilience

Definition:

Monitor for and detect adverse events in a timely manner – before they become a critical issue.

Factors for effective superior monitoring:

  • gear your mindset toward reducing the likelihood and severity of system failures
  • develop systems so that you know when adverse events are happening
  • enhance this approach by knowing location, spread and extent of the event

This can be encompassed as part of a broader Observability capability, which covers logging, tracing and monitoring.

Engineers monitor for system issues, trace the origins of the problem and review logs to identify patterns.
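
To make the "location, spread and extent" idea concrete, here is a minimal sketch of my own, not a reference to any particular monitoring product. It keeps a sliding window of request outcomes per location and reports which locations have crossed an error-rate threshold. The class name, the window size and the threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class ErrorRateMonitor:
    """Hypothetical monitor: tracks recent request outcomes per location."""

    def __init__(self, window_size=20, threshold=0.25):
        self.threshold = threshold
        # One bounded window per location; old outcomes drop off automatically.
        self.windows = defaultdict(lambda: deque(maxlen=window_size))

    def record(self, location, success):
        """Record one request outcome (True = success, False = failure)."""
        self.windows[location].append(success)

    def error_rate(self, location):
        """Extent: the share of recent requests that failed at this location."""
        window = self.windows.get(location)
        if not window:
            return 0.0
        return sum(1 for ok in window if not ok) / len(window)

    def alerting_locations(self):
        """Location and spread: every location at or above the threshold."""
        return sorted(
            loc for loc in self.windows
            if self.error_rate(loc) >= self.threshold
        )
```

Feeding this a stream of outcomes from two locations would let you see that only one of them is degrading, which is the signal you want before the issue becomes critical.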


Coordinated resilience for system resilience

Definition:

Increase the depth of resilience by increasing the number of obstacles a problem must pass before it impacts systems.

Factors for effective coordinated resilience:

  • integrate multiple software design methodologies like failover design, security integration, BDD/TDD and DevSecOps
  • increase developer education in the above methods so that they are integrated effectively
  • adopt full-stack tracing to uncover and resolve issues at a multi-layer level
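
The definition above talks about the number of obstacles a problem must pass. As a rough sketch under my own assumptions, the function below puts three such obstacles in front of a single call: input validation, a bounded retry and a fallback. The function names, the `user_id` field and the fallback value are hypothetical.

```python
def call_with_layers(operation, payload, retries=2, fallback="cached response"):
    """Run `operation` behind three layers: validation, retry, fallback."""
    # Layer 1: reject malformed input before it reaches the operation.
    if not isinstance(payload, dict) or "user_id" not in payload:
        return {"status": "rejected", "result": None}

    # Layer 2: retry transient failures a limited number of times.
    for _attempt in range(retries + 1):
        try:
            return {"status": "ok", "result": operation(payload)}
        except ConnectionError:
            continue

    # Layer 3: degrade gracefully instead of surfacing the failure.
    return {"status": "degraded", "result": fallback}


def make_flaky(failures):
    """Build a hypothetical operation that fails `failures` times, then works."""
    state = {"remaining": failures}

    def operation(payload):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ConnectionError("transient failure")
        return f"hello {payload['user_id']}"

    return operation
```

A problem now has to get past all three layers before a user sees it. A real system would spread these layers across teams and tooling rather than one function, which is where the education and tracing points come in.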


Heterogeneous systems for system resilience

Definition:

Allow for redundancy of service delivery to reduce the occurrence of common-mode failures, e.g. the failures you risk when serving high-traffic applications using one CPU.

Factors for effective heterogeneous systems:

  • minimise reliance on single components at strategic junctions across the system’s design
  • have a broad range of CPUs, GPUs, etc. to support the system
  • where possible by budget or scope, create redundancies with multiple systems e.g. multicloud, multivendor
  • doing so helps prevent a single-vendor failure from causing downtime of applications or workflows, e.g. the Atlassian loss of service in April 2022
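
Here is a small sketch of what multivendor redundancy can look like in code, assuming each vendor is wrapped in a handler function. The exception class, the `serve` function and the provider names are all hypothetical and do not correspond to any real vendor SDK.

```python
class ProviderUnavailable(Exception):
    """Raised by a hypothetical provider handler when it cannot serve."""


def serve(request, providers):
    """Try each (name, handler) pair in order and return the first success."""
    errors = {}
    for name, handler in providers:
        try:
            return name, handler(request)
        except ProviderUnavailable as exc:
            # Remember why this provider failed, then move on to the next one.
            errors[name] = str(exc)
    raise RuntimeError(f"all providers failed: {errors}")


def vendor_a(request):
    # Simulates the primary vendor being down.
    raise ProviderUnavailable("vendor A outage")


def vendor_b(request):
    return f"served {request} from vendor B"
```

With `providers=[("vendor-a", vendor_a), ("vendor-b", vendor_b)]`, the request is served by the second vendor even though the first one is down. The redundancy only helps if the two vendors do not share the same underlying failure mode.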

Dynamic repositioning for system resilience

Definition:

Increase the ability to recover from an incident by distributing and diversifying where services run across the network.

Factors for effective dynamic repositioning:

  • most modern software already benefits from the foundation of repositioning, i.e. running on the cloud in distributed systems
  • Geographical repositioning – services are readily available from multiple zones
  • Cloud repositioning – readiness for services to go private cloud or on-premises as necessary
  • Dependency repositioning – altering connectivity between critical and non-critical systems, so that problems with the latter don’t affect the former
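
Dependency repositioning is the easiest of these to show in a few lines. In the sketch below, which is an illustration under my own assumptions, an order page depends on a critical order lookup and a non-critical recommendations lookup. The function and parameter names are hypothetical.

```python
def render_order_page(order_id, load_order, load_recommendations):
    """Build page data so a non-critical failure cannot break the critical path."""
    # Critical dependency: if this fails, the page cannot be shown at all,
    # so the exception is allowed to propagate.
    order = load_order(order_id)

    # Non-critical dependency: isolate it so its failure only degrades the page.
    try:
        recommendations = load_recommendations(order_id)
    except Exception:
        recommendations = []

    return {"order": order, "recommendations": recommendations}


def load_order(order_id):
    # Stand-in for the critical order service.
    return {"id": order_id, "total": 42}


def broken_recommendations(order_id):
    # Stand-in for a non-critical service that is currently failing.
    raise TimeoutError("recommendation service timed out")
```

Calling `render_order_page(7, load_order, broken_recommendations)` still returns the order, with an empty recommendations list. The problem in the non-critical system stays there.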

Requisite availability for system resilience

Definition:

Requisite availability is a balancing act in making services and data available to users. Some systems must be available at all times, others not as much. Some data is privileged, other data is not.

Factors for effective requisite availability:

  • set tags like service-priority to signify the critical points when you visualise the system architecture
  • employ techniques like redundancy to increase the availability of important systems
  • make data non-persistent where it does not need to be retained, so that it is not open to corruption or compromise
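
As a minimal sketch of the tagging idea, the snippet below maps a service-priority tag to a replica count, so that redundancy follows how critical each service is. The service names, the priority levels and the replica numbers are assumptions for illustration, not recommendations.

```python
# Hypothetical service inventory, each tagged with a service-priority.
SERVICES = [
    {"name": "checkout", "service-priority": "critical"},
    {"name": "search", "service-priority": "standard"},
    {"name": "newsletter", "service-priority": "low"},
    {"name": "legacy-report"},  # untagged service
]

# Assumed mapping from priority to the number of replicas to run.
REPLICAS_BY_PRIORITY = {"critical": 3, "standard": 2, "low": 1}


def plan_redundancy(services, replicas_by_priority=REPLICAS_BY_PRIORITY, default=1):
    """Return {service name: replica count} based on each service's tag."""
    return {
        service["name"]: replicas_by_priority.get(
            service.get("service-priority"), default
        )
        for service in services
    }
```

Untagged services fall back to the default here, which is a design choice worth questioning: you may prefer to fail loudly so that every service gets an explicit priority.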

Parting remarks

We have now covered 6 patterns for system resilience that can support increased reliability of software. Do you know more? Reach out if you do.
