Jaeger tracing for observability beginners [Quick Guide]

Jaeger is a tracing tool that lets engineers track issues across tens, hundreds, and even thousands of services and their dependencies. In technical terms, Jaeger collects “tracing data” from distributed services to populate Grafana dashboards that highlight errors and the risk of downtime or slow loading. This makes it an essential component of a strong observability practice. Observability depends …

Starting an SRE team from scratch [Quick Guide]

Site Reliability Engineering (SRE) leaders face a myriad of responsibilities beyond what Site Reliability Engineers experience. An SRE leader’s responsibilities include: getting buy-in from various stakeholders such as developers, engineering leaders, and senior leadership; defining the tools, processes, and backlog that Site Reliability Engineers will need in their day-to-day work; and …

Agile software teams need site reliability engineers to support ongoing success

I originally wrote an article titled “Agile and SRE are not mutually exclusive” for Site Reliability Engineers (SREs). Most of them told me, “We already know this. Go tell the people running Agile in our orgs!” I can see their point. So here’s my go at explaining why Agile-made software needs the support of SREs …

Google’s Site Reliability Engineering hierarchy (Remixed)

This post contains an original SREpath visual summary. This visual simplifies Google’s highly academic “SRE hierarchy” into an easy-to-explain journey map format. Here’s a sneak peek at the visual summary: Before we continue, let’s cover some background information… You most likely know that Google is the company that originated the Site Reliability Engineering (SRE) phenomenon …

Site Reliability Engineering Glossary

4 “Golden Signals” in Site Reliability Engineering: Latency is the delay before data is fully transferred from one end to another. It is typically measured in milliseconds (ms). Throughput is the amount of data that can transfer through in a given period. It can be measured in bits/second. Error rate measures errors occurring in the …

Evolution of cloud infrastructure teams

Cloud infrastructure teams in detail: The image above is an original SREpath summary of Will Larson’s “trunk and branch model”. It outlines the evolution of the team or teams focused on cloud infrastructure provisioning and management. While the original piece is very erudite, I thought I’d put my spin on the concept to support your understanding of this …

Cloud infrastructure success factors

Image explained in more detail: Long-term cloud infrastructure success depends on high Service Quality and a reasonable Investment Budget. Service Quality can get torpedoed by morale busters like non-stop grunt work, not enough engineers, etc. The budget can get torpedoed by continuously high cloud costs and an excess of branch teams working on problems that …

6 system resilience patterns for increasing software reliability

Introduction: In this post, I will cover the following patterns of system resilience: Adaptive Response, Superior Monitoring, Coordinated Resilience, Heterogeneous Systems, Dynamic Repositioning, and Requisite Availability. Let’s cover the definition of system resilience before exploring these patterns in greater depth. System resilience is the ability of organizational, hardware and software systems to mitigate the severity and …

Rundown of Netflix’s SRE practice

Introduction: A lot goes on in the background every time you load up your favourite Netflix movie or series. Engineers spread across Chaos Engineering, Performance Engineering and Site Reliability Engineering (SRE) are working non-stop to ensure the magic keeps happening. 📊 Here are some performance stats for Netflix: When it was alone on top of …