Articles on better software operations practices

Site Reliability Engineering Glossary

4 “Golden Signals” in Site Reliability Engineering: Latency is the delay before data is completely transferred from one end to another. It is typically measured in milliseconds (ms). Throughput is the amount of data that can be transferred within a given period. It can be measured in bits per second. Error rate measures errors occurring in the system, such as … Read More

How cloud infrastructure teams evolve – from start to maturity

I recently read a post by Will Larson, who started the SRE function at Uber. The post is called “Trunks and branches model for scaling infrastructure organizations”. Several passages in the post cover how infrastructure teams can evolve from the startup phase. I felt it would be easier to comprehend the dense, rich advice with a visual … Read More

Cloud infrastructure success is a fine balance of budget and service quality

The visual summary below is based on a post by Will Larson, who started the SRE function at Uber. His post elaborates on a “trunks and branches” model for developing infrastructure-facing teams. It also covers an interesting perspective on the balancing act between budget and service quality. I will explain the visual summary underneath it. … Read More

How 6 system resilience patterns increase software reliability

Introduction: System resilience thinking can inform better Site Reliability Engineering decisions. Specifically, it can affect how the SRE culture unfolds and handles critical situations. The system resilience concept is rooted in theoretical computer science. Don’t panic. I will explain how it can, in a practical way, support increased software reliability in production. We … Read More

Rundown of Netflix’s SRE practice

Introduction: A lot goes on in the background every time you load up your favorite Netflix movie or series. Engineers spread across Chaos Engineering, Performance Engineering and Site Reliability Engineering (SRE) are working non-stop to ensure the magic keeps happening. 📊 Here are some performance statistics for Netflix: When it was alone on top of … Read More

25+ Site Reliability Engineering OKRs

Incident Response OKRs: Reduce MTTR for on-call engineers by 5%; Develop buffers to ensure incidents remain at < 75% of the error budget; Mitigate false positive system alerts to reduce on-call staff costs; Speed up the resolution of critical incidents by 5%; Increase the coverage of 4-point SLIs from 90% of services to 100%; Reduce … Read More

Runbooks for better incident response

Introduction: I can confidently tell you that runbooks form a critical part of the incident response toolkit. I will also tell you that SREs are well-placed to start and oversee the development of runbooks. If you don’t have a runbook yet, let me entice you with the thought of checklist-type documentation to follow when you’re … Read More

SRE is not a monolithic role

SRE is gaining more traction, and with it a misconception is gaining steam among senior stakeholders: that SRE is a monolithic role, like what “programmers” were in the 90s. Let’s burst that misconception… SRE is a broad, overarching responsibility that needs a multitude of role considerations to pull off properly. It is not a monolithic role where … Read More

How SREs are unique in their approach to work

Site Reliability Engineers (SREs) are a rare bunch in the software community. But there’s little denying that the approach of Site Reliability Engineering is the future of software operations. Here are some things that make SREs a unique breed in software work. SREs look at the broader picture: Ask any developer what they’re working on … Read More