Cost-benefit analysis of infrastructure-as-code (IaC) – Boost software reliability

You might have heard that Infrastructure-as-code (IaC) contributes to better cloud-native software architecture.

But what is IaC, what are its benefits & trade-offs and how can it be improved?

This guide aims to give clarity around IaC through:

a rundown of infrastructure-as-code (IaC)
cost-benefit view of implementing IaC

It can serve as a starting point for business-specific conversations with stakeholders.

At some point, senior management may want to know what exactly they’re buying into (or have bought into already).

Introduction to IaC

Fact #1: Running reliable software in the real world depends in significant part on properly provisioned and configured infrastructure.

Fact #2: Your codebase is constantly evolving to support new features and feature improvements.

Fact #3: Infrastructure these days is a complicated combination of bare metal, VMs, networks and software-based abstractions like Kubernetes.

Fact #4: The combination of infrastructure and evolving code needs careful orchestration to support the growing and volatile demands of users on production software systems.

Infrastructure-as-code (IaC) allows engineers to operate effectively with these facts in mind.

With IaC, engineers define and manage infrastructure using code. This lets them automate what was previously a series of costly manual processes.

The changing infrastructure landscape that spawned IaC

In the past, the infrastructure that supported applications was seen as something that didn’t change often — at least not without noticeable capital investment.

This was a time when engineers were limited to the physical infrastructure tucked away in the basement or at a nearby data center.

The software development lifecycle (SDLC) was similarly restrictive: developers wrote code, which then had to pass unit tests.

If all was okay, the code got deployed with the help of systems administrators who manually set up the infrastructure.

The work was clunky because:

A lot of the time, code was wrangled back and forth between developers and operations, as it failed tests or performed poorly in the production environment.
If software needed to scale up, system administrators needed to physically connect more bare-metal servers to the network and then provision them.

The type of infrastructure in this olden-day scenario is called on-prem, short for on-premises infrastructure.

The speed and scalability of this approach would not be able to cope with the volatile infrastructure demands of today.

Demands on infrastructure are highly fluid

The on-prem model was adequate for most software at a time when users were fewer and less dependent on technology.

But there are far more users today, and they demand a consistently high level of application performance.

There is a clear need to be fluid and react quickly to corresponding changes in demand for infrastructure.

With a cloud-first architecture, scaling up to meet demand and subsequently winding back down is much easier.

The machines are no longer in your data center, and with virtual machine (VM) technology, they’re not even physical.

You can spool up 100s of new VMs at your cloud provider in minutes.

But spooling up VMs regularly and at scale comes with its own challenges, in particular what can become a lot of manual work.

Manual infrastructure work is fine — until it isn’t

Ask any Site Reliability Engineer (SRE) what their unique superpower is, and they will tell you it’s their relentless pursuit of reducing toil, i.e. manual, repetitive work.

Without IaC, which I’ll get to in a moment, the engineer manually loads up VMs through command line prompts or clicks around on an administrator panel.

Either that or they send a ticket to the assigned infrastructure engineer to spool up a new VM.

This of course adds significant human-related lag time, as well as error risk from the handoff effect.

Very briefly, the handoff effect means that work passed from one person to another is more likely to contain errors.

The errors in handoffs can occur because:

each person along the chain may interpret the work differently
requirements can change in the time gap between request and action, but only the original request gets actioned

The above might seem like an acceptable risk if you’ve got a non-critical use case with small infrastructure needs, but for anything else with a reasonable scale, this becomes burdensome.

From my experience, high 10s of VMs and up can start chipping away at the cognitive load capacity of an infrastructure-focused engineer.

At 100s of VMs and up, manual work and handoffs are debilitating to an engineer’s workflow and hinder providing reliable services to end-users.

This is compounded by the fact that VMs that used to run for months to years now only run for a few days to weeks.

The frequency of change to infrastructure has accelerated, so a build-up of tickets to spool up and spin down VMs is a real possibility.

To me, this seems like a clear case of onerous manual work that wastes highly paid engineer time at best, and a high risk for compounding human error at worst.

IaC reduces toil in infrastructure provisioning

IaC solves the toil problem by maintaining the infrastructure’s properties as code within a file. When you need to change the infrastructure, you modify the code within the file. No endless clicking around control panels. No tickets to send.

With IaC, the code you add and amend automatically drives the infrastructure changes you need.

I know I repeated myself there, but this distinction of IaC is important to let sink in.
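To make the idea concrete, here is a minimal sketch in Python of how a provisioning tool might turn a declared state into a list of changes. It is not how any particular IaC tool works internally. The `VM` fields, the `plan` function and the VM names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    """Hypothetical VM definition; the fields are illustrative only."""
    name: str
    cpus: int
    memory_gb: int


def plan(desired: list[VM], actual: list[VM]) -> dict[str, list[str]]:
    """Compare the declared VMs with the running ones and list the changes needed."""
    desired_by_name = {vm.name: vm for vm in desired}
    actual_by_name = {vm.name: vm for vm in actual}

    # Declared but not running: needs to be created.
    create = sorted(desired_by_name.keys() - actual_by_name.keys())
    # Running but no longer declared: needs to be removed.
    delete = sorted(actual_by_name.keys() - desired_by_name.keys())
    # Present on both sides but with different properties: needs an update.
    update = sorted(
        name
        for name in desired_by_name.keys() & actual_by_name.keys()
        if desired_by_name[name] != actual_by_name[name]
    )
    return {"create": create, "update": update, "delete": delete}


# The "file" an engineer edits: this list is the single description of the fleet.
desired_state = [
    VM("web-1", cpus=2, memory_gb=4),
    VM("web-2", cpus=2, memory_gb=4),
    VM("worker-1", cpus=4, memory_gb=8),
]

# What is running right now (a real tool would read this from the provider).
actual_state = [
    VM("web-1", cpus=2, memory_gb=4),
    VM("worker-1", cpus=2, memory_gb=8),
    VM("old-batch-1", cpus=1, memory_gb=2),
]

changes = plan(desired_state, actual_state)
print(changes)
# {'create': ['web-2'], 'update': ['worker-1'], 'delete': ['old-batch-1']}
```

The engineer only ever edits `desired_state`. Working out what to create, resize or remove is left to the tool, which is where the clicking and ticketing used to be.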

Cost-benefit view of implementing IaC

For implementations with a reasonable scale — high 10s of VMs as I mentioned earlier — the benefits outweigh the time, money, and energy costs required to do IaC.

If you’re starting out on IaC, your transformation cost will be minimized by 2 factors:

(1) your primary infrastructure is already in the cloud and

(2) there’s institutional or team knowledge of IaC tools and methods

The former is already a given in most modern organizations and the latter can be developed rapidly with an effective IaC capability development approach.

Here’s a quick rundown of the benefits:

code integrity – lowers technical debt of change through auditable, version-controlled code
lower cost – less engineer time ($$$) is spent on toil (repetitive, manual tasks)
faster deployments – little lag time for new VMs to spool up once code changes are deployed
lower human error – no handoffs means less risk of human errors that lead to downtime and performance degradation
higher availability – reduction in non-availability of infrastructure during spikes in demand

Another sellable benefit of IaC is that it supports DevOps, which is very in right now.

This is the case because an easy-to-share code paradigm allows developers to get more involved in configuration and collaborate with production-focused engineers.

Now, let’s cover some costs and benefits of IaC in more detail.

Cost: IaC takes time to learn

IaC is a new paradigm for engineers who may be used to SSHing into a server and directly making modifications.

With IaC, engineers will note an additional step between their writing a change and the change being deployed to the infrastructure.

The engineer makes the necessary code addition or adjustments and pushes them to the provisioning tool, which then directs the changes to the infrastructure.

Engineers first need to learn the IaC tooling and its syntax. They then need to keep the habit of changing infrastructure through code and avoid the temptation to make direct “dial-in” changes to infrastructure.
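Direct changes matter because they leave the running infrastructure out of step with the code, which is often called drift. Below is a rough sketch of a drift check, assuming you can read the live settings into a dictionary. The `find_drift` function, the resource names and the settings are hypothetical.

```python
def find_drift(declared: dict[str, dict], live: dict[str, dict]) -> dict[str, dict]:
    """Return, per resource, the settings whose live value differs from the declared one."""
    drift = {}
    for name, declared_settings in declared.items():
        live_settings = live.get(name, {})
        changed = {
            key: {"declared": value, "live": live_settings.get(key)}
            for key, value in declared_settings.items()
            if live_settings.get(key) != value
        }
        if changed:
            drift[name] = changed
    return drift


# Settings as written in the (hypothetical) IaC files.
declared = {
    "web-1": {"open_ports": [443], "instance_size": "small"},
    "db-1": {"open_ports": [5432], "instance_size": "large"},
}

# Settings as found on the running infrastructure. Someone opened port 22
# on web-1 by hand, so it no longer matches the code.
live = {
    "web-1": {"open_ports": [22, 443], "instance_size": "small"},
    "db-1": {"open_ports": [5432], "instance_size": "large"},
}

drift = find_drift(declared, live)
for resource, changes in drift.items():
    print(resource, changes)
# web-1 {'open_ports': {'declared': [443], 'live': [22, 443]}}
```

A check like this, run on a schedule, is one way to spot the habit slipping before the next deployment silently overwrites someone's manual fix.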

The engineering group will need to invest in the ongoing development of engineers to ensure this happens.

One path involves implementing a culture (change) that fosters continuous development.

This could manifest as ongoing feedback and learning loops.

Benefit: IaC is flexible to many kinds of infrastructure

Infrastructure-as-code isn’t limited to public cloud computing use cases. You can use it to define the physical infrastructure that you have on premises.

The benefit of using IaC in this situation is that every application gets assigned a distinct set of resources from the outset.

You gain greater visibility and granularity into how resources get allocated to applications.
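As a rough illustration of that visibility, the sketch below declares which application gets which resources on which host, then reports what is left on each host. The host names, capacities and applications are made up for the example, and the data layout is an assumption rather than the format of any real tool.

```python
from collections import defaultdict

# Hypothetical on-premises hosts and their capacity.
HOSTS = {
    "rack1-host1": {"cpus": 32, "memory_gb": 128},
    "rack1-host2": {"cpus": 16, "memory_gb": 64},
}

# Each application declares the resources it is assigned, and where.
ALLOCATIONS = [
    {"app": "billing", "host": "rack1-host1", "cpus": 8, "memory_gb": 32},
    {"app": "search", "host": "rack1-host1", "cpus": 16, "memory_gb": 64},
    {"app": "reports", "host": "rack1-host2", "cpus": 12, "memory_gb": 48},
]


def remaining_capacity(hosts: dict, allocations: list[dict]) -> dict[str, dict]:
    """Subtract every declared allocation from the capacity of its host."""
    used = defaultdict(lambda: {"cpus": 0, "memory_gb": 0})
    for allocation in allocations:
        for resource in ("cpus", "memory_gb"):
            used[allocation["host"]][resource] += allocation[resource]
    return {
        host: {resource: capacity[resource] - used[host][resource] for resource in capacity}
        for host, capacity in hosts.items()
    }


def overcommitted(hosts: dict, allocations: list[dict]) -> list[str]:
    """Return the hosts where the declared allocations exceed capacity."""
    return sorted(
        host
        for host, left in remaining_capacity(hosts, allocations).items()
        if any(amount < 0 for amount in left.values())
    )


remaining = remaining_capacity(HOSTS, ALLOCATIONS)
print(remaining)
# {'rack1-host1': {'cpus': 8, 'memory_gb': 32}, 'rack1-host2': {'cpus': 4, 'memory_gb': 16}}
print(overcommitted(HOSTS, ALLOCATIONS))
# []
```

Because the allocations live in one reviewable file, an over-allocation shows up when the change is proposed, not after the hardware is already struggling.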

Benefit: IaC assures consistency across environments

In my own experience, I’ve seen way too much code crumble in production.

The cause was sometimes as simple as differing environments between stages of the software development lifecycle.

Developers were testing on a different environment (“localhost”) from what production would be.

The localhost was often souped up in comparison with the production environment planned by operators.

The concept of having a single source of infrastructure code for all stages reduces the risk of different resource allocations — and subsequently different performance — for the same feature or story.

This works all the way down to the granular level of matching OS versions, patch level, etc.

Differences in these granular properties are often the culprit behind code working well in testing, but not in production.
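One way to picture the single-source idea: every environment is rendered from the same base definition, and only scale is allowed to differ. The sketch below assumes a simple dictionary layout. The property names and values are illustrative, and `render` and `mismatches` are hypothetical helpers.

```python
from copy import deepcopy

# Single base definition shared by every stage (values are illustrative).
BASE = {
    "os": "ubuntu",
    "os_version": "22.04",
    "patch_level": "2024-05",
    "cpus": 2,
    "memory_gb": 4,
    "instance_count": 1,
}

# Per-environment overrides are meant to change scale only,
# never the OS version or patch level.
OVERRIDES = {
    "test": {},
    "staging": {"instance_count": 2},
    "production": {"instance_count": 6},
}

# Properties that must be identical in every environment.
MUST_MATCH = ("os", "os_version", "patch_level", "cpus", "memory_gb")


def render(environment: str) -> dict:
    """Build the full definition for one environment from the shared base."""
    definition = deepcopy(BASE)
    definition.update(OVERRIDES[environment])
    return definition


def mismatches(reference: str = "production") -> dict[str, list[str]]:
    """List, per environment, the properties that differ from the reference environment."""
    expected = render(reference)
    result = {}
    for environment in OVERRIDES:
        rendered = render(environment)
        differing = [key for key in MUST_MATCH if rendered[key] != expected[key]]
        if differing:
            result[environment] = differing
    return result


print(mismatches())  # an empty dict means every stage matches production
```

If someone later adds an `os_version` override for the test environment, `mismatches()` reports it, so the difference is caught in review rather than in production.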

A live environment clone, created using the exact same IaC as the live environment, has the absolute guarantee that if it works in the cloned environment it will work in live. — Dan Merron & Shanika Wickramasinghe, DevOps consultants at BMC

IaC also ensures that different layers of infrastructure supporting your code are defined appropriately to suit your production requirements. These layers include:

IaaS artifacts like VMs, load balancers, databases
On-premises hardware
Platforms like Kubernetes

Cost: IaC needs consistent maintenance

The IaC code that you have today may not be viable in the near future. This is because the underlying infrastructure is constantly changing.

Kubernetes is releasing updates all the time. Operating systems need constant patching. New security rules get recommended.

IaC is a constantly moving target.

Consequently, the code needed to control your infrastructure today often differs from what worked earlier.

This calls for a consistent testing routine, so you can ensure that the code keeps pace with your current IaaS artifacts and platforms.
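One small piece of such a routine could be a check that flags definitions pinned to versions your team no longer supports. This is only a sketch. The supported-version lists, the cluster names and the `outdated` function are assumptions for illustration, and a real check would pull its list from wherever your team tracks supported versions.

```python
# Versions the team currently supports (hypothetical list, maintained by hand here).
SUPPORTED = {
    "kubernetes": {"1.29", "1.30"},
    "os_image": {"ubuntu-22.04", "ubuntu-24.04"},
}

# Versions pinned in the (hypothetical) IaC definitions.
DEFINITIONS = [
    {"name": "cluster-a", "kubernetes": "1.30", "os_image": "ubuntu-24.04"},
    {"name": "cluster-b", "kubernetes": "1.27", "os_image": "ubuntu-22.04"},
]


def outdated(definitions: list[dict], supported: dict[str, set[str]]) -> list[str]:
    """Return a message for every pinned version that is not on the supported list."""
    problems = []
    for definition in definitions:
        for component, allowed in supported.items():
            pinned = definition.get(component)
            if pinned is not None and pinned not in allowed:
                problems.append(
                    f"{definition['name']}: {component} {pinned} is not supported"
                )
    return problems


problems = outdated(DEFINITIONS, SUPPORTED)
for problem in problems:
    print(problem)
# cluster-b: kubernetes 1.27 is not supported
```

Run on every change and on a schedule, a check like this turns "the code has quietly gone stale" into a failing test that someone has to look at.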

It also goes without saying that the engineers responsible for IaC will need to stay on top of the changes that occur — all the time.

Author

Ash Patel

Reliability Nut at SREpath

Ash has an unhealthy obsession with software reliability. Maybe it’s got to do with the trauma of working at a few companies where software kept slowing or went down while he worked to turn it around. His ma hopes that he can one day turn this passion into a respectable job or business. Still waiting…
