Jaeger tracing for observability beginners [Quick Guide]

Jaeger is a tracing tool that allows engineers to track issues across tens, hundreds, and even thousands of services and their dependencies. In technical terms, Jaeger collects “tracing data” from distributed services to populate Grafana dashboards that highlight downtime risk, slow-load risk, and errors. This makes it an essential component of a strong observability practice. Observability depends … Read more

SRE’s role in safer infrastructure-as-code (IaC)

Every Site Reliability Engineer will get involved in an infrastructure-as-code problem at some point in their career. IaC is a tricky space with lots of issues that, despite its automation promise, can generate ongoing maintenance toil. Let’s explore two areas in IaC where you can reduce the potential toil burden. Keep your IaC code clean and organized … Read more

Starting an SRE team from scratch [Quick Guide]

Site Reliability Engineering (SRE) leaders face a myriad of responsibilities beyond what Site Reliability Engineers experience. An SRE leader’s responsibilities include: getting buy-in from various stakeholders like developers, engineering leaders, senior leadership, and more; defining the tools, processes, and backlog that Site Reliability Engineers will need to use in their day-to-day work; and … Read more

Agile software teams need site reliability engineers to support ongoing success

I originally wrote an article titled, “Agile and SRE are not mutually exclusive” for Site Reliability Engineers (SREs). Most of them told me, “We already know this. Go tell the people running Agile in our orgs!” I can see their point. So here’s my go at explaining why Agile-made software needs the support of SREs … Read more

Google’s Site Reliability Engineering hierarchy (Remixed)

This post contains an original SREpath visual summary. This visual simplifies Google’s highly academic “SRE hierarchy” into an easy-to-explain journey map format. Here’s a sneak peek at the visual summary: Before we continue, let’s cover some background information… You most likely know that Google is the company that originated the Site Reliability Engineering (SRE) phenomenon … Read more

Site Reliability Engineering Glossary

4 “Golden Signals” in Site Reliability Engineering: Latency is the delay before data is fully transferred from one end to another. It is typically measured in milliseconds (ms). Throughput is the amount of data that can be transferred in a given period. It can be measured in bits/second. Error rate measures errors occurring in the … Read more
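As a rough illustration of the first three signals in the excerpt above, here is a minimal Python sketch that computes them from a list of request records. The `Request` record, its field names, and the `summarize` function are hypothetical and exist only for this example. The excerpt cuts off before it finishes defining error rate, so the formula used here (failed requests divided by total requests) is an assumption rather than the glossary's own definition.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request (illustration only)."""
    duration_ms: float  # time until the response was fully transferred
    bytes_sent: int     # payload size of the response
    ok: bool            # False if the request ended in an error


def summarize(requests, window_seconds):
    """Compute latency, throughput, and error rate for one time window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")

    total = len(requests)
    # Latency: average delay per request, in milliseconds.
    mean_latency_ms = sum(r.duration_ms for r in requests) / total
    # Throughput: data transferred per second, in bits/second.
    throughput_bps = sum(r.bytes_sent for r in requests) * 8 / window_seconds
    # Error rate (assumed definition): share of requests that failed.
    error_rate = sum(1 for r in requests if not r.ok) / total

    return {
        "mean_latency_ms": mean_latency_ms,
        "throughput_bps": throughput_bps,
        "error_rate": error_rate,
    }
```

In practice a monitoring system would usually report latency percentiles rather than a plain mean; the mean is used here only to keep the sketch short.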

Evolution of cloud infrastructure teams

Cloud infrastructure teams in detail: The image above is an original SREpath summary of Will Larson’s “trunk and branch model”. It outlines the evolution of teams focused on cloud infrastructure provisioning and management. While the original piece is very erudite, I thought I’d put my spin on the concept to support your understanding of this … Read more

Cloud infrastructure success factors

Image explained in more detail: Long-term cloud infrastructure success depends on high Service Quality and a reasonable Investment Budget. Service Quality can get torpedoed by morale busters like non-stop grunt work, not enough engineers, etc. The budget can get torpedoed by continuously high cloud costs and an excess of branch teams working on problems that … Read more