Reduce software outage risk with passive guardrails

Shocking fact: only 10-25% of software outages are caused by hardware or network failure. The rest are the result of human error, such as misconfiguration (paraphrasing Martin Kleppmann, Designing Data-Intensive Applications).

In this article, I will share with you how setting up passive guardrails in and around developer workflows can reduce the frequency and severity of incidents and outages.

We will cover:

  • why passive guardrails are important and
  • how they can be implemented

Why passive guardrails are important

Passive guardrails save Site Reliability Engineers (SREs) from becoming the secret police for management to shame developers when they make mistakes. Let me explain.

In extreme cases, an incident or outage may lead management to openly chastise developers, either as a group or individually.

Higher managers might delegate this task to the people who own reliability. Guess who that is? Yes, you, the humble Site Reliability Engineer.

Reality check: we can push the blameless culture as much as we want within engineering circles, but we have to consider that a large part of management doesn’t buy into our cultural fancies.

A way to prevent this peril is to employ passive guardrails that keep developers within safe confines.

From a developer’s perspective, these guardrails would seem to be nothing more than a way to stay on a well-trodden “golden path”, as described by Spotify’s platform engineers.

By definition, guardrails are controls that prevent deviations from required behaviors.

Let’s delineate active versus passive guardrails.

  • Active guardrails can be seen as the rules and policies that govern day-to-day behaviors and must be consciously considered when deploying code or altering the system. They are seen as punishable if circumvented.
  • Passive guardrails, on the other hand, are a more subtle way to drive behavior. They create boundaries that form a relatively unconscious workflow after some time. They are seen as a mishap or accident if somehow bypassed.

Here’s a real-life, non-software example of a passive guardrail:

Think of when you drive along the highway. You are not thinking about the median strip and lines that set a passive boundary between your vehicle and traffic going in the opposite direction. But they’re there to keep your mind focused in the right direction.

Remember, developers are focused on launching features as quickly and efficiently as possible. Very few would (or at least should) be distraught at not having root access to production servers or being intentionally guided around tools and platforms.

Some benefits of passive guardrails for you and your SRE team will be:

  • less time, mental bandwidth, and energy spent on enforcing policies and procedures
  • less animosity from developers
  • reduced manual toil due to more automated processes and built-in mechanisms (passive guardrails inherently rely on automation)

Developers benefit as well, as they should. They can:

  • move faster with launching services to production without your active involvement
  • cut their risk of accidentally bypassing best practices & protocols because these will be already baked into the workflow

Techniques for implementing passive guardrails

We will cover in detail 7 techniques that support the passive software guardrail concept:

  1. Doubtless software system design
  2. Clone production to full-featured sandbox
  3. Pre-production checklist for developers to follow
  4. 2-person authentication for deploys
  5. Stagger rollout of code changes
  6. Have an early warning system for failure
  7. Service snapshots for rapid rollback

These techniques are an amalgamation of ideas I’ve noted across several books, including Seeking SRE (Blank-Edelman, 2018) and Designing Data-Intensive Applications (Kleppmann, 2017), as well as SREcon talks like Confessions of a Systems Engineer by David Argent.

Let’s begin.

Doubtless software system design

In my opinion, a well-designed system should be the first step to setting passive guardrails for developers. It should remove any ambiguity or doubt from their minds when they get around to launching their service.


This means having discretely developed and well-documented service boundaries, APIs, and admin interfaces.

When ambiguity is removed, the path to launch becomes crystal clear, and the potential for developer improvisation (and, subsequently, error) approaches zero.

Here’s an example from Netflix on system design acting as a passive guardrail:

“Some changes we incorporate into tooling might be called a guardrail. A concrete example is an additional step in deployment added to Spinnaker that detects when someone attempts to remove all cluster capacity while still taking significant amounts of traffic. This guardrail helps call attention to a possibly dangerous action, which the engineer was unaware was risky. This is a method of “just in time” context.” — in Seeking SRE: Conversations About Running Production Systems at Scale by David N. Blank-Edelman

Achieving this kind of result may involve creating tool features or prompts to guide developers through key steps. All of this will be a balancing act because you risk making all of these components too restrictive or minimalistic.
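To make the Netflix example more tangible, here is a minimal Python sketch of a pre-deployment check in the same spirit. It is not Spinnaker's actual implementation. The `ResizeRequest` type, the `check_resize` function, and the `SIGNIFICANT_RPS` threshold are all hypothetical names and values I made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ResizeRequest:
    cluster: str
    current_instances: int
    target_instances: int
    requests_per_second: float  # traffic the cluster is serving right now


# Hypothetical threshold: what counts as "significant" traffic is an assumption.
SIGNIFICANT_RPS = 100.0


def check_resize(req: ResizeRequest) -> list[str]:
    """Return warnings to show the engineer before the resize proceeds."""
    warnings = []
    # Only flag the dangerous combination: zero capacity while traffic is live.
    if req.target_instances == 0 and req.requests_per_second >= SIGNIFICANT_RPS:
        warnings.append(
            f"{req.cluster} is serving {req.requests_per_second:.0f} req/s; "
            "scaling to 0 instances would drop that traffic. Confirm to continue."
        )
    return warnings


risky = ResizeRequest("checkout", current_instances=12, target_instances=0,
                      requests_per_second=850.0)
for warning in check_resize(risky):
    print(warning)
```

Note that the check warns rather than blocks. That keeps it in "just in time" context territory: the engineer keeps the power to proceed, but not without seeing the risk first.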

Give enough power to these components for developers to continue being happy with them. This means consistently taking honest feedback from all stakeholders to improve the system design to meet their changing needs.

Clone production to full-featured sandbox

One of the most common complaints I have heard from operations engineers about developers is that “they code on ‘monster’ local machines that have 32GB RAM and then wonder why VMs in production with much less RAM allocation keep struggling”.

Developers are almost always working on features in isolation from production. There’s a good rationale for this: to prevent experimental work from negatively impacting real-world services and users.

And so, developers code away at their services with unknowns at play. Some developers may have:

  • some idea of what the data will look and behave like, but lack the complete picture
  • an indication of resource demand, but nowhere near the true numbers
  • little insight into how hard users are pushing the software in other areas of the system

The end result is that developers risk writing code for an idealized world that doesn’t truly reflect your software’s user base or data.

You can’t blame a developer for any of the above. They are working in a sandbox, after all.

A solution to this unfortunate problem is to continually give developers a realistic sandbox that reflects the service as it stands in production.

By doing this, you would be enabling a safe environment for developers to manipulate and test code in an environment that:

  1. reflects the “real world” system AND
  2. continues to safeguard the production system from experimental work
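One lightweight way to keep a sandbox honest is to regularly compare its settings against production and surface any drift. The Python sketch below assumes both environments can be described as flat dictionaries. The `find_drift` function and the spec keys and values are hypothetical, chosen only to echo the RAM complaint above.

```python
def find_drift(production: dict, sandbox: dict) -> dict:
    """Return {setting: (production_value, sandbox_value)} for every mismatch."""
    drift = {}
    # Walk the union of keys so settings missing on either side are reported too.
    for key in sorted(production.keys() | sandbox.keys()):
        prod_value = production.get(key)
        sandbox_value = sandbox.get(key)
        if prod_value != sandbox_value:
            drift[key] = (prod_value, sandbox_value)
    return drift


# Hypothetical environment specs; the keys and values are illustrative only.
production_spec = {"memory_mb": 2048, "cpu_cores": 2, "db_engine": "postgres-14"}
sandbox_spec = {"memory_mb": 32768, "cpu_cores": 2, "db_engine": "postgres-14"}

for setting, (prod_value, sandbox_value) in find_drift(production_spec, sandbox_spec).items():
    print(f"{setting}: production={prod_value}, sandbox={sandbox_value}")
```

A report like this could run on a schedule, so the sandbox gets corrected before a developer tunes their code against the wrong numbers.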

Pre-production checklist for developers to follow

Software teams launch services to production more frequently than ever before, and they make many ongoing tweaks to these services. It is critical to help them effectively launch changes.

Google has a dedicated team of launch coordination engineers (LCE) for this effort, but we are not Google. We’ll forgo the extra title and cost, as most organizations that aren’t Google-scale cannot justify it.

In many organizations, SREs review, consult, collaborate, and even contribute, but the final responsibility for production delivery remains with the product engineering team that owns a given service.

With tight resources in place, SREs can create pre-launch assessments that make sense for developers and prevent future mishaps in production.

The pre-launch assessment may go by many names depending on the organization, such as Production Readiness Checklist or Operational Readiness Review.

Google’s book, Site Reliability Engineering (2016), has a section on ensuring reliable product launches. Below are examples of the kinds of questions you’ll find in their pre-launch checklist:

  • Are you storing persistent data? If yes, make sure you back up the data. (here are instructions)
  • Could a user abuse your service? If yes, implement rate limiting and query quotas. (here’s the link to a service to help you do this)

The book’s authors assert that “in practice, there is a near-infinite number of questions to ask about any system, and it is easy for the checklist to grow to an unmanageable size.” So they follow a few logical rules to elucidate the right path:

  • importance of questions must be substantiated from experience, like a launch disaster
  • instructions given to developers must be concrete, practical and reasonable
  • stay on top of changes in the system and reflect these in the question/instruction set
  • run regular reviews (once to twice a year) of the pre-launch checklist to ensure the above
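The rules above can be partly automated if the checklist itself lives as data. Here is a rough Python sketch, not Google's tooling: `ChecklistItem`, `audit`, and `REVIEW_WINDOW` are names I made up, and I am assuming "once to twice a year" means an item should never go more than a year without review.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class ChecklistItem:
    question: str
    instruction: str                    # concrete action if the answer is "yes"
    motivating_incident: Optional[str]  # experience that justifies the question
    last_reviewed: date


# Assumption: an item must be reviewed at least once every 365 days.
REVIEW_WINDOW = timedelta(days=365)


def audit(items: list[ChecklistItem], today: date) -> list[str]:
    """Flag items that are unsubstantiated or overdue for review."""
    problems = []
    for item in items:
        if not item.motivating_incident:
            problems.append(f"No substantiating experience: {item.question}")
        if today - item.last_reviewed > REVIEW_WINDOW:
            problems.append(f"Overdue for review: {item.question}")
    return problems


# Illustrative items; the incident description and dates are invented.
items = [
    ChecklistItem("Are you storing persistent data?", "Back up the data",
                  "hypothetical data-loss incident", date(2024, 1, 10)),
    ChecklistItem("Could a user abuse your service?",
                  "Implement rate limiting and query quotas",
                  None, date(2022, 6, 1)),
]
for problem in audit(items, today=date(2024, 6, 1)):
    print(problem)
```

Running an audit like this before each review cycle helps stop the checklist from quietly growing to an unmanageable size.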

You may also develop your own SRE checklist for covering key infrastructure components that the service will rely on. That checklist may cover issues concerning some or all of the following:

  • Security – authentication, secrets management, TLS, vulnerability scanning
  • Observability – availability metrics, tracing, monitoring, alerting
  • Storage and backups – statefulness, backup availability, and practices
  • Networking – VPCs, subnets, IPs, service discovery, mesh, and more
  • Performance – benchmarking, load testing, tuning components, etc
  • Capacity – horizontal scaling, vertical scaling, availability zoning
  • Cost management – reserved vs. spot resources, closing underused resources
  • Testing – automated testing after commits, scheduled testing, test coverage

You will need to factor in how dependencies will evolve. New ones will emerge, and existing ones will change or be deprecated. These changes will have upstream and downstream impacts on the components that rely on them.

Implement 2-person authentication

Not to be confused with 2-factor authentication (2FA), which relies on a single user confirming their intent to perform a certain action through a secondary device. You may have seen 2FA when logging into sensitive systems like your banking service.

Why not have the same level of corroboration when, for example:

  • your system is expecting a major commit or
  • a critical service is due for an update?

But instead of the lone developer making these necessary commits by authenticating with their mobile device, have 2 people — both being engineers — sign off on the work.

What’s the rationale behind this?

  1. It puts a second pair of eyes on the code and pre-launch checklist
  2. It may cause an unconscious need within developers not to let their peers down

The humble engineer may think twice about their code quality before it reaches production. After all, “I don’t want sloppy code to get in the way of the good rapport I have with my colleague.”
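As a rough illustration, a deploy gate for 2-person sign-off can be only a few lines of Python. This is a sketch under my own assumptions: the `can_deploy` name is hypothetical, and I am assuming the author's own approval should not count toward the two sign-offs.

```python
def can_deploy(author: str, approvers: list[str], required: int = 2) -> bool:
    """Allow a deploy only when enough distinct engineers have signed off.

    Assumption: the author's own approval does not count toward the total.
    """
    # Normalize names so "Ben" and "ben " are treated as the same person.
    distinct = {name.strip().lower() for name in approvers}
    distinct.discard(author.strip().lower())
    return len(distinct) >= required


print(can_deploy("ana", ["ben", "chloe"]))  # two distinct peers signed off
print(can_deploy("ana", ["ana", "ben"]))    # self-approval is ignored
```

In practice you would likely lean on the review and approval features of your version control or deployment platform rather than rolling your own, but the rule being enforced is the same.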

Check out this infrastructure-as-code (IaC) example for 2-person authentication.

Stagger rollout of code changes

Pushing code to production has inherent risks. Pushing code across the entire customer base, region, or organization at once amplifies risk. How can you mitigate this risk?

I recommend a segment-based staggered rollout to minimize the blast radius of erroneous code changes in production. Examples of segments include:

  • the least fussy customer base, then through the general userbase all the way out to more discerning customers
  • a single VM to a group of VMs to a region, then across regions

This process should be as automated as possible and should force the developer to think. In order to implement this guardrail effectively, you will need to work out:

  • how to segment your users and VMs for appropriate blast radius size
  • how quickly to roll out the changes across these segmented groups

Set a reasonable time interval to allow for the detection of anomalies and failures in production.
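Here is a minimal Python sketch of how a segment-based rollout with a waiting interval might be wired together. The `staggered_rollout` function, its `deploy` and `is_healthy` callbacks, and the segment names are all hypothetical. A real pipeline would plug in its own deployment and monitoring calls.

```python
import time
from typing import Callable


def staggered_rollout(
    segments: list[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    bake_seconds: float,
) -> list[str]:
    """Deploy segment by segment and stop at the first unhealthy segment.

    Returns the segments that were deployed and passed their health check.
    """
    completed = []
    for segment in segments:
        deploy(segment)
        time.sleep(bake_seconds)  # give anomalies time to surface
        if not is_healthy(segment):
            print(f"Halting rollout: {segment} is unhealthy")
            break
        completed.append(segment)
    return completed


# Illustrative run: smallest blast radius first, with a simulated failure.
order = ["single-vm", "vm-group", "region-a", "all-regions"]
done = staggered_rollout(
    order,
    deploy=lambda s: print(f"deploying to {s}"),
    is_healthy=lambda s: s != "region-a",  # pretend region-a fails its check
    bake_seconds=0,                        # zero only to keep the demo fast
)
print(done)
```

The important design choice is that the rollout halts on its own. Nobody has to remember to stop it, which is what makes it a passive guardrail.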

Besides segmenting by audience or machine, there are other approaches to reducing blast radius. I will explore them in depth in a future write-up.

Have an early warning system for failure

In a way, software systems are not that different from weather systems. You want to know as early as possible when problems are beginning to surface so that you can prevent a problem from turning into a full-blown disaster.

Or at least minimize the damage from the incoming disaster.

Observability is the name of this game, and it should serve as the bread and butter of any software engineering team. You may consider setting up a full telemetry suite that tracks performance as well as service uptime. This would include:

  • logging relevant system events
  • tracing across services for issues
  • monitoring resource usage and performance

As an early warning system, observability can advise developers and SREs where tweaks and more resources are required — before a deluge of users and usage takes the system down.
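To show what "early" can mean in practice, here is a small Python sketch that projects when a resource will run out based on recent samples. It uses a deliberately naive linear projection and is not how any particular monitoring product works. The function names and sample values are hypothetical.

```python
from typing import Optional


def intervals_until_full(samples: list[float], capacity: float) -> Optional[float]:
    """Estimate how many sampling intervals remain before usage hits capacity.

    Uses the average change between the first and last sample (a simple
    linear projection). Returns None when usage is flat or falling.
    """
    if len(samples) < 2:
        return None
    growth_per_interval = (samples[-1] - samples[0]) / (len(samples) - 1)
    if growth_per_interval <= 0:
        return None
    return max(0.0, (capacity - samples[-1]) / growth_per_interval)


def should_warn(samples: list[float], capacity: float, horizon: float) -> bool:
    """Warn when the projected time to exhaustion falls within the horizon."""
    remaining = intervals_until_full(samples, capacity)
    return remaining is not None and remaining <= horizon


# Illustrative disk usage in percent, sampled at evenly spaced intervals.
disk_usage = [60.0, 65.0, 70.0, 75.0]
print(intervals_until_full(disk_usage, capacity=100.0))
print(should_warn(disk_usage, capacity=100.0, horizon=6))
```

A trend-based warning like this fires while there is still headroom, which gives developers and SREs time to add resources instead of reacting to an outage.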

I will cover observability in depth in a dedicated write-up in the future, as the breadth and complexity of the capability warrant it.

Service snapshots for rapid rollback

Let’s say you’ve done all of the above, and yet something still goes wrong. What will you do?

A viable solution is to take time-interval snapshots of services-in-production, as well as platform configurations (like .tf and YAML files). See an error crop up from the latest update? Roll back to a point that doesn’t break your software-in-production.

The time interval will depend on your ability and appetite to automate snapshots, as well as frequencies of deployment and platform changes.

You will essentially have multiple viable versions of the same code that can be shifted to and fro as necessitated by operational needs and challenges.
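The idea of keeping several viable versions on hand can be sketched in Python as a small bounded store that always knows which snapshot to fall back to. `Snapshot`, `SnapshotStore`, and the version labels are hypothetical. A real setup would snapshot actual service images and configuration files instead of in-memory dictionaries.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    version: str
    config: dict
    healthy: bool = True  # flipped to False once a version is known to be bad


class SnapshotStore:
    def __init__(self, max_snapshots: int = 5):
        # Oldest snapshots fall off automatically once the limit is reached.
        self._snapshots: deque = deque(maxlen=max_snapshots)

    def take(self, version: str, config: dict) -> None:
        # Shallow copy so later changes to the caller's dict keys
        # don't alter what was captured.
        self._snapshots.append(Snapshot(version, dict(config)))

    def mark_bad(self, version: str) -> None:
        for snap in self._snapshots:
            if snap.version == version:
                snap.healthy = False

    def rollback_target(self) -> Optional[Snapshot]:
        """Return the most recent snapshot not marked bad, if any."""
        for snap in reversed(self._snapshots):
            if snap.healthy:
                return snap
        return None


store = SnapshotStore()
store.take("v1", {"replicas": 3})
store.take("v2", {"replicas": 4})
store.take("v3", {"replicas": 4})
store.mark_bad("v3")  # the latest update started throwing errors
print(store.rollback_target().version)
```

The `max_snapshots` limit is where your appetite for automation and storage shows up: a larger window gives you more rollback points at the cost of keeping more history around.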

Netflix SREs employ this practice in their Spinnaker continuous delivery platform. The platform ensures new code changes are automatically deployed in a blue-green fashion.

I think we’ve all worked at companies where we upgrade something, and it turns out it was bad, and we spend the night firefighting it because we’re trying to get the site back up because the patch didn’t work. [Instead of that] when we go to put new code into production… we just push a new version alongside the current code base. It’s that red/black or blue/green – whatever people call it at their company… if something breaks in the canary… we immediately roll back. This brings recovery down from hours to minutes. — Coburn Watson, Director of Reliability, Performance and Cloud Infrastructure at Netflix, in the book “Seeking SRE: Conversations About Running Production Systems at Scale” (2018)

In effect, older code is still active on one VM while the newly deployed code is running on another. If the new code deployment fails, the system reverts to the most recent reliably-running code.

Bibliography

  1. Kleppmann, M. (2017). Designing data-intensive applications: the big ideas behind reliable, scalable, and maintainable systems. O’Reilly Media.
  2. Blank-Edelman, D.N. (2018). Seeking SRE. O’Reilly Media.
  3. SREcon Conversations with David Argent, Amazon (August 2020). [online] Available at: https://www.youtube.com/watch?v=F1HLaTUJy_s [Accessed 20-22 Oct. 2022].
  4. 5 Lessons Learned From Writing Over 300,000 Lines of Infrastructure Code. [online] Available at: https://www.youtube.com/watch?v=RTEgE2lcyk4 [Accessed 18-20 Oct. 2022].