SRE’s role in safer infrastructure-as-code (IaC)

Every Site Reliability Engineer will get involved in an infrastructure-as-code problem at some point in their career.

IaC is a tricky space. Despite its promise of automation, it has plenty of issues that can generate ongoing maintenance toil.

Let’s explore two areas of IaC where you can reduce the potential toil burden.

Keep your IaC code clean and organized

Jim Brikman is a consultant who works with Terraform, a popular infrastructure provisioning tool. He learned a few interesting things from writing 300,000 lines of IaC code alongside many teams that use the tool.

One of these learnings was that many teams keep their IaC code as a single monolithic file. It blew my mind when I heard this, but I can understand that a legacy team culture can lead to this.

For more than a few teams, all the environment configurations (development, QA, testing, staging, and production) were stored in one gigantic file. This led to several problems, including:

  • difficulty finding the line of code that needed to be updated
  • long wait times (5+ minutes) for the monolithic IaC code to execute
  • increased risk of missing red-line warnings, e.g. “executing code but the database will be deleted”

The most concerning aspect was that everyone was given admin permission to execute the file. Many SREs might put their hands up at this point and say, “And so they should.”

But let’s consider the scenario where less experienced developers gain admin-level access and unwittingly make undesirable changes to production infrastructure.

In many scenarios like consumer apps or eCommerce, this might be an acceptable risk. But having worked extensively with healthcare systems, I can tell you it is not.

The same goes for finance, government, and many “old-school” sectors.

In summary, with a monolithic IaC file, a minor change risks breaking everything. So make sure it doesn’t happen. And if it is happening, change it, pronto.
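One way to keep a “database will be deleted” warning from slipping past a reviewer is to check the plan mechanically before anyone applies it. The sketch below is a minimal example, assuming the plan has been exported with `terraform show -json`, whose output lists each planned change under `resource_changes` with a `change.actions` array. The function name and the sample resource addresses are hypothetical.

```python
def find_destructive_changes(plan: dict) -> list[str]:
    """Return the addresses of resources the plan would delete or replace."""
    flagged = []
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change.get("change", {}).get("actions", [])
        # A replacement shows up as a delete paired with a create,
        # so checking for "delete" covers both cases.
        if "delete" in actions:
            flagged.append(resource_change["address"])
    return flagged


# Hand-written sample in the shape of `terraform show -json` output.
# The resource addresses are made up for illustration.
sample_plan = {
    "resource_changes": [
        {"address": "aws_db_instance.main",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.assets",
         "change": {"actions": ["update"]}},
        {"address": "aws_vpc.core",
         "change": {"actions": ["no-op"]}},
    ]
}

for address in find_destructive_changes(sample_plan):
    print(f"WARNING: {address} would be destroyed or replaced")
```

A check like this could run in CI against a real plan file (for example one produced by `terraform plan -out=tfplan` followed by `terraform show -json tfplan`) and fail the pipeline until someone explicitly acknowledges the deletion.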

Most teams don’t set out to build a monolithic codebase, but it is clearly happening if IaC consultants keep finding them.

Maybe the team works with Oracle and SAP systems. It’s hard to drop the taste for monoliths after doing that for a long time.

Here are Jim’s recommendations for banishing monolith risk:

  1. Isolate each environment as a separate IaC file or folder
  2. Break down each environment into various components e.g. database, VPC, frontend
  3. Make all the above components reusable

Let’s quickly unpack #2. A VPC changes maybe once or twice a year, while the frontend might change 10 times a day. By separating them, you stop exposing the VPC to the risk of a config change 10 times a day.

Modular architecture with smaller components reduces the surface area for attacks and errors affecting uptime.
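To make the isolation concrete, here is a small sketch of how a split layout limits what a change can touch. It assumes a hypothetical folder convention of `live/<environment>/<component>/` with reusable building blocks under `modules/`; the folder names and the `affected_units` helper are illustrative assumptions, not something Jim or Terraform prescribes.

```python
from pathlib import PurePosixPath

# Assumed layout (hypothetical names):
#   live/
#     staging/
#       vpc/  database/  frontend/
#     prod/
#       vpc/  database/  frontend/
#   modules/        <- reusable components shared by the environments


def affected_units(changed_files, root="live"):
    """Map changed file paths to the (environment, component) pairs they touch."""
    units = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Expect at least: root / environment / component / file
        if len(parts) >= 4 and parts[0] == root:
            units.add((parts[1], parts[2]))
    return units


frontend_change = [
    "live/prod/frontend/main.tf",
    "live/prod/frontend/variables.tf",
]
print(affected_units(frontend_change))
```

With this layout, a frontend change maps to the prod frontend unit only, so the VPC and database definitions are never part of that run. The sketch deliberately ignores edits under `modules/`, which can affect every environment that uses the module and deserve closer review.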

Safeguard your IaC templates

After implementing IaC, feature teams can, and are expected to, increase the number and velocity of their deployments.

In this situation, platform engineers (if there are any around) will equip developers with IaC templates. Otherwise, developers can source their own.

The latter brings with it an inherent risk.

Both externally and internally sourced templates may have vulnerabilities built in, sometimes by mistake and sometimes intentionally.

The end result remains the same: potential production-level security risks that may get overlooked and cause future downtime.

In fact, an analysis done by Unit 42 of Palo Alto Networks in 2020 found 200,000+ vulnerabilities in IaC templates. The risk is higher in teams that give developers full control of the IaC.

Sidenote: Sorry, developers, if you’re reading this. I’m not trying to berate you here, but not all developers are made the same.

Take the encryption standards of the PCI (payment card industry) for example. Since 2018, PCI has treated TLS 1.0 as a non-compliant encryption protocol. Attackers can intercept and read traffic encrypted with it.

A 2021 study by Azure operators found that developers were still trying to launch services to production with TLS 1.0 encryption! That was almost 3 years later, and for services that handled financial data.

The message here isn’t to “avoid IaC if you don’t know what you’re doing”. It’s that teams should tread carefully and handle risks before they escalate.

I’m not talking about old-school IT practices like management approval workflows. That would hinder deployment speed too much.

Three methods can provide a less stifling experience for developers while still assuring good IaC security:

  1. institute 2-person verification that the IaC has been coded correctly
  2. have the platform or SRE teams write up IaC templates and vet them with security
  3. set policy controls at the cloud service provider level

Let’s now explore these three methods in more detail…

Institute 2-person verification of IaC code

One person – for example, a developer – writes the infrastructure code while the other reviews and comments on it with security in mind.

SREs are well-positioned to be reviewers, as they understand that vulnerabilities can lead to downtime, which would eat into their error budgets and impact their SLOs.

After all, SREs have a vested interest in ensuring infrastructure has been derisked. Less infra risk means less likelihood of unexpected outages.
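The rule itself is simple enough to enforce in tooling rather than by convention. Here is a minimal sketch of the check, assuming a hypothetical `ChangeRequest` record; in practice your code review platform would supply the author and approver data.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """Hypothetical record of an infrastructure change awaiting merge."""
    author: str
    approvers: set[str] = field(default_factory=set)


def has_independent_review(change: ChangeRequest) -> bool:
    """True only if someone other than the author has approved the change."""
    return bool(change.approvers - {change.author})


reviewed = ChangeRequest(author="dev-alice", approvers={"sre-bob"})
self_approved = ChangeRequest(author="dev-alice", approvers={"dev-alice"})

print(has_independent_review(reviewed))
print(has_independent_review(self_approved))
```

The point of subtracting the author from the approver set is that a self-approval never counts, which is the whole idea behind having a second person look at the change.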

Platform/SRE teams collaborate with security

In some settings, SREs don’t interface directly with security teams, but in this instance, they should consider it.

It would give them a perspective on the kinds of infrastructure vulnerabilities that can bring down the system in production. They can also turn what they learn from security people into code that’s baked into IaC templates, thus assuring security moving forward.
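As one example of turning a security learning into code, take the TLS 1.0 problem described earlier. The sketch below scans a template for resources that allow a protocol version below an agreed floor. The template shape, the `min_tls_version` key, the version labels, and the TLS 1.2 floor are all assumptions made for illustration; a real check would read whatever setting your provider and template format actually expose.

```python
# Ordering of protocol labels, weakest first (labels are illustrative).
TLS_ORDER = {"TLS1_0": 0, "TLS1_1": 1, "TLS1_2": 2, "TLS1_3": 3}
REQUIRED_MIN = "TLS1_2"  # assumed floor agreed with the security team


def weak_tls_resources(template: dict) -> list[str]:
    """Return names of resources whose minimum TLS version is below the floor."""
    weak = []
    for name, resource in template.get("resources", {}).items():
        declared = resource.get("min_tls_version")
        # Fail closed: a missing or unrecognised value is treated as weak.
        if TLS_ORDER.get(declared, -1) < TLS_ORDER[REQUIRED_MIN]:
            weak.append(name)
    return sorted(weak)


# Hypothetical template with one compliant and two non-compliant resources.
template = {
    "resources": {
        "payments_api": {"min_tls_version": "TLS1_0"},
        "reports_store": {"min_tls_version": "TLS1_2"},
        "legacy_queue": {},
    }
}

print(weak_tls_resources(template))
```

Treating a missing setting as a failure is a deliberate choice here: it forces template authors to state the minimum version explicitly instead of relying on a provider default that may change.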

Set policy controls at the cloud provider level

Cloud (IaaS) providers may have built-in policy controls, such as Azure Policy, for the kind of infrastructure that should get provisioned.

But tread carefully: overly onerous policy controls will restrict teams from deploying fast and often.
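One way to strike that balance is to separate rules that only report from rules that block. Azure Policy, for instance, offers both audit and deny effects. The sketch below is not the Azure Policy engine or its rule syntax; it is a simplified model with hypothetical rules and resource fields, meant only to show how a report-only tier keeps most deployments moving.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    violates: Callable[[dict], bool]  # returns True when the resource breaks the rule
    effect: str                       # "audit" reports only, "deny" blocks the deployment


def evaluate(resource: dict, rules: list[Rule]) -> tuple[bool, list[str]]:
    """Return (allowed, findings) for a single resource."""
    blocked = False
    findings = []
    for rule in rules:
        if rule.violates(resource):
            findings.append(f"{rule.effect}: {rule.name}")
            if rule.effect == "deny":
                blocked = True
    return (not blocked), findings


# Hypothetical rules: block only the truly dangerous case, report the rest.
RULES = [
    Rule("publicly reachable database",
         lambda r: r.get("type") == "database" and r.get("public", False),
         "deny"),
    Rule("missing owner tag",
         lambda r: "owner" not in r.get("tags", {}),
         "audit"),
]

untagged_db = {"type": "database", "public": False, "tags": {}}
public_db = {"type": "database", "public": True, "tags": {"owner": "payments"}}

print(evaluate(untagged_db, RULES))
print(evaluate(public_db, RULES))
```

Starting a new rule in the audit tier and promoting it to deny only once the findings have been cleaned up lets teams keep shipping while the control is rolled out.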

By nature, SREs need to have at least some degree of skill or understanding of IaC. This understanding needs to be constantly refined, as the potential IaC vulnerabilities evolve over time.

Most SREs are too busy for day-to-day checks and balances. Instead, they can provide guidance to developers and run checks on the infrastructure code on a regular cadence, perhaps monthly.

Wrapping up

This article covers only the tip of the IaC iceberg. There are many more considerations to weigh to ensure that infrastructure is provisioned effectively and securely. As always, keep learning.