Read me before reviewing the Site Reliability OKRs below
Please review these guidelines before you consider adapting the OKRs:
- Many of the OKRs are ambitious examples – certainly more than what most junior SREs should be given or could handle
- Most OKRs would be the culmination of efforts by an entire SRE team and not a sole engineer
- Numbers in the OKRs, e.g. 0.75%, are for illustrative purposes only; substitute figures based on your own metrics and goals
Incident Response OKRs
- Reduce MTTR for on-call engineers by 5%
- Build buffers so that incidents consume < 75% of the error budget
- Reduce false-positive system alerts to lower on-call staffing costs
- Speed up the resolution of critical incidents by 5%
- Increase the coverage of 4-point SLIs from 90% of services to 100%
- Reduce manual toil from 25% of responder time to 20%
- Reduce incident recurrence from 8 out of 10 to 6 out of 10 incidents
- Ensure SLA targets are realistic and in line with current SLIs for > 97.5% of accounts
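Two of the incident response OKRs above, the MTTR reduction and the error budget ceiling, depend on simple arithmetic that is worth pinning down before setting a number. The following is a minimal sketch, not a definitive implementation: it assumes MTTR means the average time from detection to resolution, that the error budget is measured in "bad" minutes over a 30-day window, and that the SLO is 99.9%. The function names and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Average time from detection to resolution.

    `incidents` is a list of (detected_at, resolved_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def error_budget_consumed(slo_target, window_minutes, bad_minutes):
    """Fraction of the error budget used within the window.

    The budget is the share of the window the SLO allows to be bad,
    e.g. a 99.9% SLO over 30 days allows about 43.2 bad minutes.
    """
    budget_minutes = (1.0 - slo_target) * window_minutes
    if budget_minutes <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return bad_minutes / budget_minutes


# Hypothetical sample data, for illustration only.
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),
    (datetime(2024, 1, 9, 22, 15), datetime(2024, 1, 9, 23, 15)),
]
mttr = mean_time_to_restore(incidents)   # 45 minutes for this sample
mttr_target = mttr * 0.95                # what a 5% reduction would require

# Assumed 99.9% SLO over a 30-day window with 21.6 bad minutes so far.
consumed = error_budget_consumed(0.999, 30 * 24 * 60, 21.6)
within_goal = consumed < 0.75            # compares against the 75% ceiling
```

Computing the baseline this way first makes it easier to judge whether a figure such as 5% or 75% is realistic for your own services.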
System performance and resilience OKRs
- Reduce 5xx errors from 1% down to 0.75%
- Increase the share of microservices designed for failover from the current 60% to 65%
- Reduce network latency among the top 5 services by 2.5%
- Increase average load speed of application by 0.25%
- Reduce open-source-software-related errors by 10%
- Increase black swan event awareness among developers to 90%
- Plan for handling unexpected high demand up to 25% burst capacity
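The server error OKR in the list above is only meaningful once the error rate is defined consistently. As a rough sketch, assuming the rate is the share of all HTTP responses with a status code in the 500-599 range, it could be computed from per-status request counts like this. The function name and the sample counts are hypothetical.

```python
def server_error_rate(status_counts):
    """Share of responses with a server-side error status (500-599).

    `status_counts` maps an HTTP status code to the number of responses.
    """
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(
        count for status, count in status_counts.items()
        if 500 <= status <= 599
    )
    return errors / total


# Hypothetical counts for one reporting period, for illustration only.
sample = {200: 9915, 404: 25, 500: 40, 503: 20}
rate = server_error_rate(sample)
meets_target = rate <= 0.0075   # compares against the illustrative 0.75% goal
```

Whether client errors such as 404 belong in the denominator is a design choice; this sketch counts every response, so agree on the definition before comparing periods.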
Developer support OKRs
- Drive rail-guided services from 40% to 50% of all new launches
- Speed up time to production for images by 20%
- Improve developer speed-to-publish by 10%
- Consolidate tooling to < 2 same-purpose tools per category across teams
DevSecOps OKRs
- Reduce build security issues by 25%
- Drive DevSecOps awareness among developers to 75% of the headcount
- Harden database architecture security to keep major incidents at < 1 per year
FinOps (Cloud Cost Control) OKRs
- Reduce the cost of stateful storage capacity by 10%
- Reduce total cloud billing by 1%
- Reduce vendor-based tool costs by 10%
- Reduce routine downtime maintenance costs by 3%
Work practices OKRs
- Increase increment velocity in SRE project work by shortening delivery by one sprint
- Reduce operational work from 65% of total work time to 55%
Feel free to reach out if you have any questions about the above OKRs or want us to add a new OKR.