
Site Reliability Engineering Glossary

4 “Golden Signals” in Site Reliability Engineering

Latency is the time it takes to service a request, from the moment it is sent to the moment the response is completely received. It is typically measured in milliseconds (ms).

Throughput (also called traffic) is the amount of work the system handles within a given period. It can be measured in requests per second for a web service, or in bits per second for a network link.

Error rate measures how often requests fail, for example with an HTTP 500 response. Failures may stem from bugs in code, network outages, or other faults. It is expressed as a percentage of total requests.

Saturation is a measure of how close your server resources are to their limits. It can include measures like CPU utilization, memory used, and storage used.
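To make the four signals concrete, here is a minimal sketch that summarizes one window of requests. The `Request` record, the `golden_signals` function, the choice of the 95th percentile for latency, and the use of CPU utilization as the saturation measure are all assumptions made for illustration, not a standard API.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request."""
    duration_ms: float  # time taken to service the request
    status: int         # HTTP status code returned


def golden_signals(requests, window_seconds, cpu_utilization):
    """Summarize a window of requests into the four golden signals.

    cpu_utilization is a fraction between 0 and 1 (assumed to come
    from the host's monitoring agent).
    """
    if not requests:
        raise ValueError("need at least one request in the window")

    # Latency: nearest-rank 95th percentile of request durations.
    durations = sorted(r.duration_ms for r in requests)
    rank = math.ceil(0.95 * len(durations))
    latency_p95 = durations[rank - 1]

    # Errors: here, any 5xx response counts as a failed request.
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": latency_p95,
        "throughput_rps": len(requests) / window_seconds,
        "error_rate_pct": 100 * errors / len(requests),
        "saturation_cpu_pct": 100 * cpu_utilization,
    }


sample = [
    Request(120, 200),
    Request(80, 200),
    Request(200, 500),
    Request(95, 200),
    Request(1500, 200),
]
print(golden_signals(sample, window_seconds=10, cpu_utilization=0.72))
```

A percentile is used for latency rather than an average because one slow request (1500 ms in the sample) is easy to miss in a mean.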


SLAs, SLOs and SLIs for measuring SRE success

Service Level Agreements (SLAs) are contractual obligations between the service provider and the service consumer/payer for a certain level of performance. The consumer may be entitled to compensation, such as a refund or service credits, if the SLA is breached.

Service Level Objectives (SLOs) are the target levels of performance for engineers to aim for. They are typically set at or above the level the SLA requires, so that a missed SLO gives warning before the SLA is breached. For example, an SLO can be a goal for a certain level of availability for a service over a given period.

Service Level Indicators (SLIs) are measures of performance that allow engineers to understand whether they are meeting the SLOs for the system and, in turn, the business-level SLAs. For example, an SLI can be the uptime metric for a particular service.
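The relationship between the three terms can be sketched in a few lines: the SLI is the measured value, and it is compared against the SLO target and the SLA commitment. The function names and the thresholds of 99.9% and 99.5% below are hypothetical, chosen only to illustrate the comparison.

```python
def availability_sli(good_requests, total_requests):
    """SLI: percentage of requests that were served successfully."""
    if total_requests == 0:
        raise ValueError("no requests in the measurement period")
    return 100 * good_requests / total_requests


def evaluate(sli_pct, slo_pct, sla_pct):
    """Compare a measured SLI against the SLO target and SLA commitment.

    Assumes the SLO is at least as strict as the SLA (slo_pct >= sla_pct).
    """
    if sli_pct < sla_pct:
        return "SLA breached"
    if sli_pct < slo_pct:
        return "SLO missed, SLA still met"
    return "SLO met"


# Hypothetical month: 999,500 successful requests out of 1,000,000.
sli = availability_sli(999_500, 1_000_000)
print(sli, evaluate(sli, slo_pct=99.9, sla_pct=99.5))
```

Keeping the SLO stricter than the SLA leaves a middle band where engineers are alerted to a problem before any contractual commitment is broken.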


Software incident response lingo

On-call means that the engineer must be available to respond to incidents, including when they arise outside the engineer's normal working hours. This may mean evenings or weekends.

Follow the sun refers to an incident response model in which on-call responsibility is handed between teams in different time zones, so that each team responds to incidents during its own daytime working hours.

Mean time to acknowledge (MTTA) is the average time between an incident being identified and paged to an engineer and the engineer acknowledging it.

Mean time to recovery (MTTR) is the average time taken to resolve an incident, measured from the moment the alerting system picks up the incident.

Mean time to failure (MTTF) is the average time a system or service is expected to function before it experiences a failure, such as a performance-degrading bug or outage.

Mean time between failures (MTBF) is the average time elapsed between consecutive failures across a series of incidents.
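As a rough illustration, MTTA, MTTR, and MTBF can be computed from incident timestamps as follows. The `Incident` record and its field names are hypothetical, and MTBF is taken here as the average gap between the detection times of consecutive incidents, which is one of several conventions in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record with three timestamps."""
    detected: datetime      # alerting system picks up the incident and pages
    acknowledged: datetime  # engineer acknowledges the page
    resolved: datetime      # incident is resolved


def _mean(deltas):
    """Average a non-empty list of timedelta values."""
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents):
    return _mean([i.acknowledged - i.detected for i in incidents])


def mttr(incidents):
    return _mean([i.resolved - i.detected for i in incidents])


def mtbf(incidents):
    """Average gap between consecutive detections (needs >= 2 incidents)."""
    ordered = sorted(incidents, key=lambda i: i.detected)
    gaps = [b.detected - a.detected for a, b in zip(ordered, ordered[1:])]
    return _mean(gaps)


incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5),
             datetime(2024, 1, 1, 11, 0)),
    Incident(datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 15),
             datetime(2024, 1, 3, 10, 30)),
    Incident(datetime(2024, 1, 7, 10, 0), datetime(2024, 1, 7, 10, 10),
             datetime(2024, 1, 7, 11, 30)),
]
print("MTTA:", mtta(incidents))
print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents))
```

With these sample incidents, acknowledgement takes 5, 15, and 10 minutes, resolution takes 60, 30, and 90 minutes, and the gaps between incidents are 2 and 4 days.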


What SREs do after an incident

Postmortems are reviews that engineers may undertake after an incident has been resolved or contained. They may go through logs and other evidence to analyse the root cause, identify patterns, and prevent similar incidents in future.

Blameless is the cultural mindset that (most) Site Reliability Engineering teams aim to have when reviewing an incident. They aim to find out what caused the problem, not who. Even if a person's action triggered the incident, they seek not to blame that person.


How SREs make better systems