This article is intended to help non-technical stakeholders better understand Site Reliability Engineering. It is part of the SRE Digital Transformation series exploring how to integrate SRE into your organization.
I highly recommend that you start by listening to this episode of the SREpath Podcast for a deeper discussion on “What is Site Reliability Engineering and What It is Not”.
The definitions below will make more sense with that context.
What is SRE?
Definition of Site Reliability Engineering
Site Reliability Engineering (S.R.E.) is an aspect of software engineering that aims to ensure the ongoing reliability of software systems.
Site Reliability Engineers (SREs) are production-level engineers focused on software performance once it enters the real world.
Their north star is software reliability.
“We want to keep our site up, always.” — JC Van Winkle, Site Reliability Engineer at Google Zurich.
Some organizations like Meta refer to this type of field as production engineering, and the individuals employed in it as production engineers.
But here’s the thing: the title “Site Reliability Engineer” is more versatile than “production engineer” because it doesn’t limit you to the production part of the software system.
It also avoids confusion with the vastly different production engineer roles that are prominent in the manufacturing industry.
Let’s unpack the SRE definition further
Site Reliability Engineering is a crucial discipline within software engineering.
Its aim is to increase the reliability of software once it has been put into production, ensuring that it functions optimally and is accessible to users.
S.R.E. aims to do the following:
- increase the uptime of software (because software does go down)
- boost the performance of software so that it runs at an optimal speed
- enhance the quality of the code that runs the software and
- fortify the security of software to protect it from intruders
In the words of its originator, Ben Treynor Sloss:
This statement highlights the importance of treating software operations as a complex and challenging engineering problem that requires a strategic approach and specialized expertise.
This approach breaks free from the outdated paradigm of “submit the code and we’ll run it on the servers.”
As a result, it is philosophically aligned with the widely-adopted DevOps movement, which seeks to eliminate barriers between software development and operations.
To accomplish this, Site Reliability Engineering entails fine-tuning the software and its underlying infrastructure, frequently by developing, tailoring, or designing bespoke tooling, as well as advocating for superior work practices.
The biggest misconception about Site Reliability Engineering is that it views reliability only through a narrow lens, i.e., whether the software is accessible.
But as you can see from all of the above factors, there’s more to it.
Issues like performance, quality of code, and security affect the quality of user experience and the perceived reliability of the software.
Origin of Site Reliability Engineering
Site Reliability Engineering, as we know it today, was founded in 2003 by the visionary Ben Treynor Sloss, who currently serves as the Vice President of Engineering at Google.
He transformed a small team of just seven “production engineers” into a formidable force of more than 1,200 Site Reliability Engineers as of 2016.
The rationale behind Site Reliability Engineering (S.R.E.) is a compelling one.
As Google continued to grow, it recognized that it would face reliability challenges.
The company realized that as software complexity increased, it would become increasingly challenging to ensure reliability. This recognition led to the development of S.R.E., which is a proactive approach to address the growing problem of software complexity.
At its core, S.R.E. is a set of practices that prioritize reliability in software development. The objective is to ensure that the software systems are reliable and scalable, even as they continue to grow and become more complex.
One point is critical to understand: S.R.E. is not just a reactive approach to fixing problems. Instead, it’s a proactive approach that seeks to prevent problems before they occur.
Now that we know the origin story of S.R.E., let’s explore the concept of reliability.
Definition of “Reliability”
Reliability, simply put, is the absence of errors.
More precisely, it is the ability of a system to function correctly and consistently under various conditions.
Within the context of S.R.E., reliability refers to the software system’s ability to perform as expected, without any downtime or disruptions.
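For readers who want to see how “performing as expected, without downtime” can be turned into a number, here is a minimal Python sketch. It is an illustration only: the function names, the 30-day period, and the 99.9% target are assumptions made for this example, not something prescribed by S.R.E. or taken from Google’s practice.

```python
def availability(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests that were served as expected (0.0 to 1.0)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def allowed_downtime_minutes(target: float, period_days: int = 30) -> float:
    """Minutes of downtime that an availability target permits over a period.

    `target` is a fraction such as 0.999 (i.e. 99.9%). The 30-day default
    is an assumption chosen for illustration.
    """
    total_minutes = period_days * 24 * 60
    return (1.0 - target) * total_minutes


if __name__ == "__main__":
    # Hypothetical numbers: 999,000 good responses out of 1,000,000 requests.
    print(f"Availability: {availability(999_000, 1_000_000):.3%}")
    print(f"Downtime allowed at 99.9%: {allowed_downtime_minutes(0.999):.1f} minutes per 30 days")
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime in a 30-day period, which shows how little room there is for things to go wrong.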
With this in mind, it is clear that change poses a significant threat to reliability.
This is why many critical systems, such as those in banking and government, still rely on legacy software written in COBOL, a language that dates back to around 1960.
In these applications, any change presents the risk of potentially catastrophic errors.
Every change introduces the possibility of an error arising and compromising the production system. Changes can include new code deployments, updates to infrastructure configurations, and more.
Software engineering roles are highly specialized, which means some of the work needed to assure this reliability can be missed.
This specialization leads to a problem further down the software development lifecycle (SDLC).
Due to the complex and dynamic nature of software development, it can be easy to overlook measures that ensure the quality of the software in production. This is where Site Reliability Engineering (SRE) comes in.
SREs are responsible for ensuring that the software is reliable, scalable, and efficient. They work closely with developers, operations teams, and other stakeholders to achieve these goals.
By doing so, SREs play a crucial role in ensuring that software products are delivered on time and meet the high standards demanded by users and customers alike.