Episode 29 [SREpath Podcast]
Show notes
Sebastian and I continue our breakdown of notable passages from Chapter 1 of Google’s Site Reliability Engineering book (2016) by Betsy Beyer, Jennifer Petoff, Niall Murphy, et al.
We covered passages like:
- Monitoring is one of the primary means by which service owners keep track of a system’s health and availability.
- Efficient use of resources is important anytime a service cares about money.
- Humans add latency. Even if a given system experiences more actual failures, a system that can avoid emergencies that require human intervention will have higher availability than a system that requires hands-on intervention.
- SRE has found that roughly 70 percent of outages are due to changes in a live system. Best practices in this domain use automation to implement progressive rollouts.
- Demand forecasting and capacity planning can be viewed as ensuring that there is sufficient capacity and redundancy to serve projected future demand with the required availability.
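
To make the passage on humans adding latency a little more concrete, here is a rough sketch using the common steady-state availability formula, MTBF / (MTBF + MTTR). The function name and all of the numbers are made up for illustration; none of them come from the book or the episode.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: time between failures over total cycle time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system A: fails rarely, but a human has to be paged
# and needs about an hour to restore service.
manual = availability(mtbf_hours=1000, mttr_hours=1.0)

# Hypothetical system B: fails four times as often, but recovers
# automatically in about three minutes (0.05 hours).
automated = availability(mtbf_hours=250, mttr_hours=0.05)

print(f"manual recovery:    {manual:.5f}")
print(f"automated recovery: {automated:.5f}")
```

Under these assumed numbers, the system that fails more often still comes out with the higher availability, because each failure costs far less downtime when no human is in the loop.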