25+ Site Reliability Engineering OKRs

Incident Response OKRs

Reduce MTTR for on-call engineers by 5%
Develop buffers to ensure incidents remain at < 75% of the error budget
Mitigate false positive system alerts to reduce on-call staff costs
Speed up the resolution of critical incidents by 5%
Increase the coverage of 4-point SLIs from 90% of services to 100%
Reduce …

Runbooks for better incident response

Introduction

I can confidently tell you that runbooks form a critical part of the incident response toolkit. I will also tell you that SREs are well-placed to start and oversee the development of runbooks. If you don’t have a runbook yet, let me entice you with the thought of checklist-type documentation to follow when you’re …