Articles on better software operations practices

Success factors for Site Reliability Engineering digital transformation

This guide will help you better engage in business-level conversations about Site Reliability Engineering with key stakeholders. It is part of the SRE Digital Transformation series exploring how to integrate SRE into your organization. Introduction Site Reliability Engineering (SRE) is a powerful tool for achieving high software performance and reliability in enterprises, as well as … Read More

How to pitch Site Reliability Engineering to executives and stakeholders

This article will help you communicate the advantages of SRE to stakeholders through 3 arguments. It is part of the SRE Digital Transformation series exploring how to integrate SRE into your organization. Introduction It takes confidence and conviction to introduce significant changes that may affect the entire team or organization. You will naturally face resistance … Read More

Reaffirming the value of SREs amid ongoing tech layoffs

I’ve been curious about the prospects for Site Reliability Engineers (SREs) as companies scale back headcount across the board. This opinion piece will unpack this pressing issue. Many experts predict an ongoing downturn in the tech job market that could last for the next 3-5 years. An unfortunate turn for many employed in the tech … Read More

Inside Disney’s Site Reliability Engineering practice

Introduction It is no small feat to run an ecosystem of entertainment experiences to delight a wide range of people, from young children to older “Disney adults”. Almost every Disney experience relies on a sophisticated technology stack working in the background. “Steve Jobs once said technology amplifies human ability. At Disney, we use technology to … Read More

Site Reliability Engineering 101

This article is intended to help non-technical stakeholders better understand Site Reliability Engineering. It is part of the SRE Digital Transformation series exploring how to integrate SRE into your organization. What is SRE? Definition of Site Reliability Engineering Site Reliability Engineering (SRE) is an aspect of software engineering that aims to ensure the ongoing reliability … Read More

Recruiting developers into Site Reliability Engineering (SRE)

In this article, you will learn the following: Introduction Hiring in the Site Reliability Engineering (SRE) space is notoriously difficult. So it makes sense to figure out how to expand the hiring pool beyond existing SREs. One way to increase the hiring pool is to recruit developers (also known as SWEs) and gradually advance them … Read More

Rundown of LinkedIn’s SRE practices

Introduction LinkedIn has one of the most robust Site Reliability Engineering (SRE) practices around. After all, as the social network of record for jobseekers and salespeople, it is the 6th most trafficked website in the world, with over 1.5 billion unique visits per month. LinkedIn’s Site Reliability Engineers (SREs) ensure all that traffic gets served … Read More

Analysis of SRE and platform setup at 10+ tech companies

In this article, you will see a breakdown of the platform setup and SRE practices within 12 non-FAANG technology companies. This is based on the case studies by Andrios Robert. “There is a lot of content available on how Google did [Site Reliability Engineering]; let’s uncover what happens with the rest of the world.” - … Read More

Is platform engineering at risk of shiny object syndrome?

So much has been debated lately about the emergence of “Platform Engineering” as a solution to software operations problems. It’s an interesting proposition. However, it is not a silver bullet that will fix everything that didn’t work out with Dev versus Ops, DevOps, or SRE. We are missing something very important in our … Read More

Reduce software outage risk with passive guardrails

Shocking fact: only 10-25% of software outages are caused by hardware or network failure. The rest result from human error such as misconfiguration (paraphrasing Martin Kleppmann, Designing Data-Intensive Applications). In this article, I will share with you how setting up passive guardrails in and around developer workflows can reduce the frequency and severity … Read More