Rundown of Netflix’s SRE practice


A lot goes on in the background every time you load up your favorite Netflix movie or series.

Engineers spread across Chaos Engineering, Performance Engineering and Site Reliability Engineering (SRE) are working non-stop to ensure the magic keeps happening.

📊 Here are some performance statistics for Netflix

When it was alone on top of the streaming world in 2016…

Netflix microservices performance statistics for 2016

For context, the average HD video connection runs at about 8 Mbps, so 30+ terabits/second means there were 3,750,000+ connections at high-definition video bitrate at any given time.
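
As a sanity check, the arithmetic behind that estimate can be reproduced in a couple of lines (the 30 Tbps and 8 Mbps figures are taken from the statistics above):

```python
# Back-of-the-envelope check of the concurrent-connection estimate.
traffic_bps = 30e12   # 30+ terabits/second of streaming traffic
hd_bitrate_bps = 8e6  # ~8 Mbps for an average HD connection

concurrent_hd_streams = traffic_bps / hd_bitrate_bps
print(f"{concurrent_hd_streams:,.0f} concurrent HD-quality connections")
# 3,750,000 concurrent HD-quality connections
```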

In 2022, this number must be significantly higher, but I don’t have access to their current numbers.

SREs played an essential role in making sure all of this performance ticked over smoothly.

🔱 How SRE fits into the Netflix culture

Team formation

The SRE team at Netflix is known as CORE (Cloud Operations Reliability Engineering). It belongs to a larger group known as Operations Engineering.

SREs work alongside specialist roles interrelated with SRE work, such as Performance Engineers and Chaos Engineers.

What is Netflix’s SRE culture like?

The culture at Netflix is freedom and responsibility — both are important to effective SRE work.

CEO Reed Hastings’s radical candor approach (since popularised by Kim Scott) has shaped this seemingly contradictory pairing of freedom and responsibility.

The premise of radical candor is “be critical because you care about the other person”. This makes it easier for SREs to call out poor production decisions — to tell their counterpart engineers when they are not following a suitable path for solving a problem, without looking like jerks.

Developers must follow the “you build it, you run it” model. SREs act as consultants for developers, supporting them in achieving the “you run it” part of the equation.

Of course, SREs will also act as the last line of defense when issues affect production.

For example, if a testing service goes down, it affects the ability to push code to production. SREs may join in to resolve the issue by following incident response protocols.


Most of the time, Netflix’s SREs work on solving problems that don’t have a straightforward fix. In such instances, RTFM may not work and a willingness to experiment and seek novel solutions may help.

Fixes can take minutes, hours, days, weeks, or months — there is no fixed time to solve — and can be larger projects that other teams don’t have time for.

There is a lot of reading source code and documentation, sourcing experiment ideas, running experiments, and then measuring the outcomes.

This work can be done as a solo mission or in a temporary, problem-specific team.

🧰 How Netflix SREs support production tooling

Tooling ethos

Operations engineers at Netflix have spent years developing “paved paths”. These paths are designed to help developers leverage advanced tooling without reinventing the wheel.

Examples of what these paths cover include service discovery, application RPC calls and circuit breakers.

These paved paths are not prescriptive or enforced. Developers are allowed – even empowered – to deviate if they want to create a better path for their service.

SREs are of course there to help developers work out a better path’s design. Using their services is a good idea, because path deviators are still subject to attacks by the Simian Army.

The Simian Army is Netflix’s suite of chaos engineering tools, which test a working system’s resilience by attacking it from an internal source.

It’s all useful because Netflix practices extreme DevOps — you build it, you run it — engineers do the full job of developing software, deploying it through the pipeline and running the code in production.

SREs codify best practices from past deployments to make sure production is optimal.

Tooling examples

Netflix is best known in the SRE world for its Chaos Monkey tool in Chaos Engineering. But wait, there’s more!

Netflix’s SREs also work extensively with the following tools:

  • Canary tools for developers to check code and make sure there is no performance regression
  • Dashboards to review service performance like upstream error rates, alerts for supporting services
  • Distributed system tracing to trace performance across the microservices ecosystem
  • Chat rooms and pagers and ticket systems for the fun engineer-level support work
  • Actionable alerts — check the right things, go off when appropriate, quiet when not
  • Spinnaker — allows for blue-green deployments with a multi-cloud setup (insanely powerful)
  • Pre-production checklist that scores each aspect of service before going into production
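
The canary idea in the first bullet can be sketched as a simple error-rate comparison between the canary and baseline fleets (the thresholds and numbers here are illustrative only; Spinnaker's automated canary analysis is far more statistically rigorous):

```python
def canary_passes(baseline_errors, baseline_requests,
                  canary_errors, canary_requests,
                  max_relative_regression=0.10):
    """Pass the canary unless its error rate regresses more than
    max_relative_regression (10% here) relative to the baseline."""
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    if baseline_rate == 0:
        return canary_rate == 0  # baseline is perfect, so the canary must be too
    return canary_rate <= baseline_rate * (1 + max_relative_regression)

# Baseline at 0.5% errors, canary at 0.8%: a clear regression.
print(canary_passes(50, 10_000, 80, 10_000))  # False
```

A real system would compare many metrics (latency, saturation, error rates) over a time window, but the go/no-go shape of the decision is the same.
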

Below is an example of an SRE-codified tool, a pre-production checklist — is your service production ready?

Source: Jonah Horowitz, SRECon 2016
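
A checklist like this is straightforward to codify. Below is a minimal sketch of a scoring script; the item names are illustrative, drawn from the paved-path topics above, not Netflix's actual checklist:

```python
# Hypothetical production-readiness checklist: each item is marked
# True/False by the service owner before launch.
CHECKLIST = {
    "alerts are actionable and routed to the owning team": True,
    "dashboards cover load, errors, latency and saturation": True,
    "service registered with service discovery": True,
    "RPC calls wrapped in circuit breakers": False,
    "canary stage configured in the deployment pipeline": True,
}

score = sum(CHECKLIST.values())
print(f"Production readiness: {score}/{len(CHECKLIST)}")
for item, done in CHECKLIST.items():
    if not done:
        print(f"  TODO: {item}")
```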

Netflix SRE capability highlights

🔥 Incident Response

Netflix engineering’s #1 business metric is SPS — Starts Per Second — the number of people successfully hitting the play button.

Incident response practices are designed to keep this SPS metric as high as possible.

Here are some of the practices that Netflix has affirmed in its incident response capability:

  • Get the right people into the room and make sure they can troubleshoot the incident
  • Document everything during the incident to help with post-mortem analysis
  • Post-mortems aren’t necessarily “blameless” — something went wrong because someone did something, but rather than punish that person, make them own it as a learning process
  • Short and to-the-point checklists for handling emergencies are codified in readily accessible manuals
  • Developers can assign metrics for their services to be addressed by SRE once certain thresholds are hit
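
The last practice, developer-assigned metric thresholds, can be sketched as a simple rule check (the metric names and limits here are hypothetical, for illustration only):

```python
# Developer-declared thresholds: when a metric crosses its limit,
# the issue is escalated to SRE.
THRESHOLDS = {
    "upstream_error_rate": 0.05,  # fraction of failed calls
    "p99_latency_ms": 1500,
    "cpu_load_average": 8.0,
}

def breached(metrics: dict) -> list:
    """Return the names of metrics that crossed their threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

current = {"upstream_error_rate": 0.09, "p99_latency_ms": 900, "cpu_load_average": 3.2}
print(breached(current))  # ['upstream_error_rate']
```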

๐ŸŽ๏ธ Support performance engineers

For Netflix operations, it’s not only about uptime but also about having the right level of performance for solid playback.

There is a need for consistently good service performance rather than one-off wins — users should have acceptably low TTI and TTR.

Here’s what these two terms mean:

  • TTI (time-to-interactive) – user can interact with app contents even if not everything is fully loaded or rendered
  • TTR (time-to-render) – everything above the fold is rendered

SREs support performance engineers with activities like:

  • autoscaling for on-demand scaling — saving money versus pre-purchased on-prem compute — for encoding, precompute, failover and blue-green deployments
  • handling tricky autoscaling issues such as under-provisioning of resources, sticky traffic, bursty traffic and uneven traffic distribution
  • supporting performance dashboards that cover load issues, errors, latency issues, saturation of resources (e.g. CPU load averages) and instance counts
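
The autoscaling decision in the first bullet can be caricatured in a few lines (the target utilization and fleet bounds are illustrative; real policies such as AWS target-tracking add smoothing, cooldowns and predictive elements):

```python
import math

def desired_instances(current_instances, avg_cpu_utilization,
                      target_utilization=0.5,
                      min_instances=3, max_instances=500):
    """Scale the fleet so average CPU utilization approaches the target.
    The min bound guards against under-provisioning; scaling in
    proportion to observed load helps absorb bursty traffic."""
    desired = math.ceil(current_instances * avg_cpu_utilization / target_utilization)
    return max(min_instances, min(max_instances, desired))

print(desired_instances(10, 0.8))  # 16: scale out under load
print(desired_instances(10, 0.2))  # 4: scale in when idle
```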

👾 Run Chaos Engineering at scale

Netflix is famous for its extensive use of chaos engineering to ensure that all of the above metrics — SPS, TTI and TTR — keep moving in the right direction.

What is chaos engineering?

Experimenting on a distributed system in order to build confidence in the system’s ability to withstand turbulent conditions in production — Nora Jones, ex-Senior Chaos Engineer, Netflix

Chaos engineering is a capability heavily based on Netflix’s work from 2008 through the early 2010s. It builds on the value of common tests like unit testing and integration testing.

Chaos work takes it up a notch from these older methods by adding failure or latency on calls between services.

Why do this? Because it helps uncover and resolve issues typically found when services call one another, such as network latency, congestion, and logical or scaling failures.

Adding chaos to Netflix engineering has led to a culture shift from “What happens if this fails?” to “What happens when this fails?”.

How chaos engineering can be done

  • Graceful restarts and degradations using the Chaos Monkey tool
  • Targeted chaos for specific components of a system e.g. Kafka โ€” “let’s see what happens when we delete topics”
  • Cascading failure checks – see how the failure of one part of the system triggers failure in other parts of the system
  • Injecting failure into services in an automated manner with limits on the number of users affected — an experiment can be cut short if SPS risks dropping below acceptable levels
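
That last point, automated failure injection with a guardrail on SPS, can be sketched as a wrapper around a service call (a simplified illustration with hypothetical parameters; Netflix's production chaos tooling is far more careful than this):

```python
import random

def run_chaos_experiment(call_service, inject_failure_rate=0.05,
                         sps_floor=0.95, get_sps_ratio=None):
    """Wrap a service call so a small fraction of requests fail.
    The experiment aborts if the SPS ratio (current SPS relative to
    its expected baseline) drops below an acceptable floor."""
    def chaotic_call(*args, **kwargs):
        # Guardrail: stop injecting failures if the business metric suffers.
        if get_sps_ratio is not None and get_sps_ratio() < sps_floor:
            raise RuntimeError("experiment aborted: SPS below floor")
        if random.random() < inject_failure_rate:
            raise ConnectionError("chaos: injected failure")
        return call_service(*args, **kwargs)
    return chaotic_call
```

Wrapping, say, 5% of calls to a downstream service this way reveals whether callers degrade gracefully, while the SPS guardrail bounds the blast radius, mirroring the user-impact limits described above.
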

Parting remarks

Netflix SREs perform all of these amazing feats to make sure that you can easily binge-watch your fave show this weekend!