{"id":385,"date":"2022-04-01T23:42:00","date_gmt":"2022-04-01T13:42:00","guid":{"rendered":"https:\/\/sysmit.com\/cf22\/?p=385"},"modified":"2023-12-13T15:28:02","modified_gmt":"2023-12-13T05:28:02","slug":"6-system-resilience-patterns-for-increasing-software-reliability","status":"publish","type":"post","link":"https:\/\/sysmit.com\/cf22\/6-system-resilience-patterns-for-increasing-software-reliability\/","title":{"rendered":"How 6 system resilience patterns increase software reliability"},"content":{"rendered":"

Introduction<\/h2>\n\n\n

System resilience thinking can inform better Site Reliability Engineering decisions. Specifically, it can affect how the SRE culture unfolds and handles critical situations. <\/p>\n\n\n\n

The system resilience<\/em> concept is rooted in theoretical computer science. <\/p>\n\n\n\n

Don’t panic. I will explain, in practical terms, how it can support increased software reliability<\/strong> in production. <\/p>\n\n\n\n

We will cover six patterns that make up system resilience:<\/p>\n\n\n

\n
\n\n
    \n
  1. Superior Monitoring<\/li>\n\n\n\n
  2. Adaptive Response<\/li>\n\n\n\n
  3. Coordinated Resilience<\/li>\n\n\n\n
  4. Heterogeneous Systems<\/li>\n\n\n\n
  5. Dynamic Repositioning<\/li>\n\n\n\n
  6. Requisite Availability<\/li>\n<\/ol>\n\n<\/div><\/div><\/div>\n\n
    \n\n
    \"System<\/figure>\n\n<\/div><\/div><\/div>\n<\/div>\n\n\n

    These terms may seem abstract at first, but we will unpack each in a moment. <\/p>\n\n\n\n

    First, let’s define system resilience in the software context:<\/p>\n\n\n\n

    \n

    System resilience is the ability of organizational, hardware and software systems to mitigate the severity and likelihood of failures<\/strong> or losses<\/strong>, to adapt to changing conditions, and to respond appropriately after the fact.<\/em><\/p>\n\u2014 Jackson, Scott. (2007). System Resilience: Capabilities, Culture and Infrastructure. INCOSE International Symposium.<\/cite><\/blockquote>\n\n\n\n

    The definition is academic but precise. The concept of system resilience matters for proactively addressing software performance and reliability<\/strong>. <\/p>\n\n\n\n

    Now, let’s unpack each of the six patterns of system resilience:<\/p>\n\n\n

    Resilience pattern #1: Superior monitoring<\/h2>\n\n

    What does it mean?<\/h3>\n\n\n

    Monitor for and detect adverse events in a timely manner, well before they can snowball into a critical issue. <\/p>\n\n\n

    How to apply it to SRE practice<\/h3>\n\n\n

    You can achieve superior monitoring by:<\/p>\n\n\n\n