
Inside Spotify’s Site Reliability Engineering (SRE) practice

You’ve undoubtedly caught wind of the latest Netflix series, dubbed “The Playlist,” a show loosely inspired by the birth of Spotify.

Chances are, you may have already devoured it in one glorious binge-watching session.

As for me, I only got around to it recently.

I was enticed by a YouTube ad that hinted at a captivating tale of the inner workings behind Spotify’s software operations.

And boy, was I hooked.

What fascinated me was how much of Spotify’s early success hinged on the wizardry of their server operations.

Fear not, dear reader, I will delve deeper into this in just a moment.

It had me wondering whether Spotify’s practice of Site Reliability Engineering (SRE) would be just as enthralling.

And let me assure you, it most certainly is.

I’ve got an interesting story related to this toward the end of this piece.

Brace yourselves as I take you on a journey through the intricate web of Spotify’s SRE practice.

History of SRE at Spotify

Before SRE arrived at Spotify

The magic of server-side work was part of Spotify’s early charm.

The Netflix series showed Daniel Ek (Spotify’s CEO) challenging then-CTO Andreas Ehn to make Spotify fast, with a song load time of less than 200ms.

Sub-200ms is the threshold below which playback is perceived by the human ear as instantaneous.

Remember this was in 2006 when Internet capabilities did not readily and consistently allow for sub-second latency.

To pull off the “trick”, the engineers at Spotify created a hybrid fetch model.

This approach predicted what the user would want to listen to next and prefetched it through the peer-to-peer network, decreasing server load by 90%.

For the remaining 10% of the time, requests went to Spotify’s servers to fetch related songs.

Client-level caching and prefetching songs 30 seconds before changeover also helped optimize playback and achieve a latency of 245ms.
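
Spotify never published the client code behind this, but the shape of the idea is easy to sketch. Here’s a rough TypeScript illustration; the class, the type names, and the timings are invented for illustration rather than taken from Spotify’s implementation:

```typescript
// Illustrative sketch of a hybrid fetch with prefetching. The names here are
// hypothetical; Spotify's real client logic was never published in this form.

type TrackId = string;

interface TrackSource {
  fetch(track: TrackId): Promise<ArrayBuffer | null>;
}

class HybridFetcher {
  private cache = new Map<TrackId, ArrayBuffer>();

  constructor(
    private peers: TrackSource,   // peer-to-peer swarm (handles most requests)
    private servers: TrackSource, // Spotify's servers (fallback)
  ) {}

  // Try the local cache, then peers, then fall back to the servers.
  async getTrack(track: TrackId): Promise<ArrayBuffer> {
    const cached = this.cache.get(track);
    if (cached) return cached;

    const fromPeers = await this.peers.fetch(track);
    const data = fromPeers ?? (await this.servers.fetch(track));
    if (!data) throw new Error(`track ${track} unavailable`);

    this.cache.set(track, data);
    return data;
  }

  // Prefetch the predicted next track ~30 seconds before the current one ends,
  // so the changeover feels instantaneous.
  schedulePrefetch(predictedNext: TrackId, msUntilTrackEnds: number): void {
    const lead = 30_000;
    setTimeout(
      () => void this.getTrack(predictedNext),
      Math.max(0, msUntilTrackEnds - lead),
    );
  }
}
```

The point is the ordering: cache first, peers next, servers last, with prefetching hiding the changeover latency.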

But this server sorcery proved not to be enough to support future growth.

SRE was a response to hypergrowth

In 2011, Spotify faced the inevitable challenges of growth.

Monthly active users (MAU) more than doubled around that time, from approximately 3 million in 2010 to over 7.4 million in 2011.

This rapid expansion meant the supporting infrastructure had to be built out quickly, which in turn increased its underlying complexity.

Growing complexity constantly challenged the reliability and scalability of the system.

To tackle this challenge, Spotify officially introduced Site Reliability Engineering, a strategic move to combat their growing pains and conquer the obstacles that lay ahead.

It was a pivotal moment as Spotify’s software-in-production was on the cusp of reaching hyper-scale proportions.

The stage was set for an audacious leap into uncharted (SRE) territory.

Spotify was inspired by Google’s success with Site Reliability Engineering practices, which were developed by Ben Treynor.

Its engineering leaders aimed to adopt a similar approach but one that was tailored to its unique challenges.

As Spotify’s user base and infrastructure continued to grow, the company scaled its SRE practices accordingly.

Scaling SRE practices at Spotify included:

  • expanding the size and scope of SRE teams
  • introducing more sophisticated practices like observability and internal developer platforms and
  • investing in building more scalable and reliable infrastructure

How SRE has helped Spotify’s tech work

Spotify SREs automate to cut repetitive work

The automation practices devised by Spotify’s SREs have proved to be a godsend for developers and product teams alike.

By relieving developers of the tiresome burden of manual and repetitive tasks, these practices allow them to channel their energy into the craftsmanship of feature design and higher-quality code.

The result?

A surge in productivity that paves the way for the efficient delivery of code, without compromising on quality.

Spotify SREs support DevOps practice adoption

SREs at Spotify excel in collaborating with software engineers, seamlessly integrating reliability and operational considerations into the development process.

This collaboration helps regular developers gain a deeper understanding of the operational aspects of their code and encourages them to write more reliable and resilient software from the outset.

Part of this effort involves regular measurement and monitoring of service and system performance.

This helps SREs give developers valuable insights into the behavior and performance of their applications.

By leveraging metrics and monitoring tools, developers can:

  • proactively identify bottlenecks within services
  • optimize code performance and
  • enhance the overall user experience of their applications
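
To make that concrete, here’s a minimal sketch of the kind of per-endpoint latency instrumentation that gives developers this visibility. It’s a hypothetical illustration, not Spotify’s actual tooling:

```typescript
// Minimal latency instrumentation sketch (hypothetical; not Spotify's tooling).
// It records request durations per endpoint and reports p50/p99 so developers
// can spot bottlenecks in their own services.

const samples = new Map<string, number[]>();

async function timed<T>(endpoint: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    const durations = samples.get(endpoint) ?? [];
    durations.push(Date.now() - start);
    samples.set(endpoint, durations);
  }
}

function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.min(sorted.length - 1, Math.floor(p * sorted.length))];
}

function report(): void {
  for (const [endpoint, durations] of samples) {
    console.log(
      `${endpoint}: p50=${percentile(durations, 0.5)}ms p99=${percentile(durations, 0.99)}ms`,
    );
  }
}

// Usage (hypothetical handler): await timed("GET /v1/playlists", () => handlePlaylists(req));
```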

Spotify SREs support developer response to incidents

At Spotify, both developers and Site Reliability Engineers (SREs) play a role in responding to incidents.

When it comes to the question, “Who goes in first?” in incident response, SREs take the lead and are the first responders.

There are several reasons for this at Spotify:

  • SREs have the necessary tools, training, and knowledge to diagnose and mitigate issues promptly
  • SREs follow well-defined incident management processes and participate in on-call rotations to ensure 24/7 coverage
  • SREs leverage their specialized expertise in system reliability and operations

However, developers also play a vital role in this process.

They actively contribute their skills and knowledge, working hand in hand with SREs to tackle and resolve incidents effectively.

Depending on the nature and severity of the incident, developers provide their expertise in understanding the codebase and identifying potential root causes.

They collaborate closely with the SREs to investigate the incident, analyze relevant logs, metrics, and system behavior, and contribute to resolving the issue.

It’s a unified effort where both parties bring their strengths to the table, ensuring a comprehensive response to any challenges that arise.

Post-incident retrospectives, also known in Google’s SRE model as postmortems, involve a broader group of stakeholders, including developers.

These postmortems provide an opportunity for developers to contribute their insights, share lessons learned, and collectively work toward preventing similar incidents in the future.

By participating in incident response and postmortem processes, developers gain a deeper understanding of system failures and root causes.

This knowledge helps them:

  • improve their coding practices
  • make informed design decisions and
  • implement preventive measures

This combined effort toward effective response and continuous improvement ultimately leads to more reliable and robust software.

What is Spotify’s SRE culture like?

At Spotify, everyone cares about reliability

Spotify has cultivated a culture of reliability engineering that permeates every nook and cranny of the organization.

The company has instilled in its engineers the values of:

  • prioritizing reliability
  • thinking ahead about system resilience and
  • taking ownership of operational aspects

But how does this culture manifest itself?

Spotify involves both its SREs and developers in incident response and postmortem.

This approach leverages the expertise of SREs while harnessing the deep understanding of the codebase possessed by developers to address incidents effectively and enhance the overall reliability of Spotify’s services.

It’s a true collaboration fostering a culture of learning and shared responsibility for the reliability of their systems.

SRE spread thanks to the famous “Spotify model”

In true Spotify fashion, the company not only revolutionized how music is consumed, it also revolutionized organizational structures.

SREs are embedded within cross-functional product development teams known as “Squads”.

But they are also part of communities of practice known as “Guilds”.

This “Spotify model” has created quite a buzz in the last 5 or so years.

It’s a topic that permeates conversations on Agile practices, with the idea of squads, tribes, chapters, and guilds taking up significant mindshare.

But here’s the kicker: Spotify’s ingenious model has transcended the tech realm.

It has spread like wildfire to unexpected domains such as supermarket chains and, believe it or not, even banks.

It’s a testament to the far-reaching impact of Spotify’s innovative practices.

Despite detractors trying to undermine it, I doubt the model is going anywhere.

Let’s cover them briefly for context:

Squads

A group of individuals with different skills working together for a specific objective

They have the right to make their own decisions while aligning their roadmap with the company’s vision

Example squad: “Recommendation algorithm” squad, which is specifically focused on developing and optimizing algorithms

Tribes

Several squads come together to form a tribe, which works towards a shared mission, promoting alignment and collaboration

Example tribe: “Discovery” tribe, which is focused on the broader mandate of enhancing music discovery and recommendations

Chapters

Individuals with similar skills or interests gather in chapters to exchange knowledge and develop their expertise

Example chapter: “Machine Learning and Data Science”, which is focused on enhancing work with data and algorithms

Guilds

Guilds unite individuals across squads and tribes who share a common interest or passion

They allow knowledge and creativity to flow freely, resulting in breakthrough ideas and cross-pollination of talents.

Example guild: “Site Reliability Engineering”, which is focused on increasing the reliability of systems at scale

Using the examples listed above, the “Recommendation Algorithm” squad members might learn about data reliability by being part of the “SRE” guild.


They could then cross-pollinate this idea with their “Machine Learning and Data Science” chapter.

SRE as a guild within Spotify spans across:

  • multiple cross-functional teams as well as
  • collectives of value-stream-aligned teams

This allowed for effective seeding of the SRE practices throughout the organization.

It also enabled close collaboration between SREs and other technologists.

Spotify’s greatest gift to software operations

Introducing the Backstage internal developer platform (IDP)

Few tools are as public a testament to Spotify’s engineer-first culture as one in particular.

I am referring to the Backstage platform, Spotify’s born-and-bred internal developer platform (IDP).

Backstage plays a crucial role in fostering Spotify’s engineering culture by promoting developer autonomy and end-to-end service ownership.

How does Backstage help developer autonomy?

Through it, Spotify engineers gain access to a centralized hub for managing and supporting services, as well as for knowledge sharing.

In terms of tangible examples, the platform provides a space for engineers to:

  • contribute to shared libraries
  • access monitoring and deployment tooling
  • write and document best practices
  • exchange ideas around improvements

This provides a few key benefits. Backstage helps:

  • reduce duplication of efforts
  • accelerate onboarding for new team members
  • promote increased cross-team collaboration on improvements
  • increase the capability to experiment, innovate, and iterate on projects
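
As a rough illustration, a service registered in Backstage carries metadata along these lines. In real Backstage this lives in a catalog-info.yaml file in the service’s repository; the TypeScript rendering below, along with the service and squad names in it, is illustrative and worth checking against the Backstage docs:

```typescript
// Rough TypeScript rendering of the information a Backstage catalog entry
// carries. The field names are recalled from the Backstage format, and the
// service/squad names are hypothetical.

interface CatalogEntry {
  apiVersion: string;
  kind: "Component" | "API" | "Resource";
  metadata: { name: string; description?: string; tags?: string[] };
  spec: { type: string; owner: string; lifecycle: "experimental" | "production" | "deprecated" };
}

const playlistService: CatalogEntry = {
  apiVersion: "backstage.io/v1alpha1",
  kind: "Component",
  metadata: {
    name: "playlist-service",            // hypothetical service name
    description: "Serves user playlists",
    tags: ["nodejs", "grpc"],
  },
  spec: {
    type: "service",
    owner: "squad-playlists",            // the squad that owns it end to end
    lifecycle: "production",
  },
};
```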

Golden paths form the backbone of Backstage

Backstage incorporates the concept of “golden paths” as a way to provide streamlined and standardized processes for developers.

Golden paths are predefined and recommended paths that guide developers through the necessary steps and best practices for common tasks or workflows.

In the context of Backstage, golden paths are predefined templates, workflows, and guidelines that help developers follow proven practices.

This ensures consistency across projects.

Golden paths serve as a starting point or steps for specific tasks such as:

  1. creating new services
  2. deploying services and
  3. setting up monitoring and alerting
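
Conceptually, a golden path for “creating a new service” can be thought of as an ordered list of vetted steps with sensible defaults. Everything in the sketch below (step names, defaults, helper functions) is invented for illustration; Backstage actually expresses golden paths as software templates rather than code like this:

```typescript
// Conceptual sketch of a golden path for creating a new backend service.
// Step names, defaults, and helper functions are invented for illustration.

interface GoldenPathStep {
  name: string;
  run: (serviceName: string) => Promise<void>;
}

// Placeholder actions; in reality each would call internal tooling.
async function scaffoldRepo(service: string, template: string): Promise<void> {}
async function createPipeline(service: string, stages: string[]): Promise<void> {}
async function configureMonitoring(service: string, slo: { availability: number }): Promise<void> {}
async function registerInCatalog(service: string): Promise<void> {}

const newBackendServicePath: GoldenPathStep[] = [
  { name: "scaffold repo from the approved template",
    run: (s) => scaffoldRepo(s, "backend-service-template") },
  { name: "wire up a CI/CD pipeline with standard stages",
    run: (s) => createPipeline(s, ["build", "test", "deploy"]) },
  { name: "set up monitoring and alerting with a default SLO",
    run: (s) => configureMonitoring(s, { availability: 0.999 }) },
  { name: "register the service in the catalog",
    run: (s) => registerInCatalog(s) },
];

// Running the path applies every vetted step in order.
async function followGoldenPath(serviceName: string): Promise<void> {
  for (const step of newBackendServicePath) {
    console.log(`[golden path] ${step.name}`);
    await step.run(serviceName);
  }
}
```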

By following golden paths, developers can:

  • save time and effort by leveraging preconfigured and tested configurations
  • better align with the company’s engineering standards
  • reduce repetitive manual setup and decision-making (in other words, avoid reinventing the wheel)
  • reduce their potential for errors arising from a DIY approach

These paths are not static. Developers can build upon the golden paths to enhance their effect.

What does this have to do with Site Reliability Engineering or more broadly speaking, software operations?

Here’s the answer: a consistent approach to launching services means they are more likely to be reliable in production.

Backstage serves as a valuable resource, contributing to increased productivity, code quality, and overall reliability of the Spotify service ecosystem.

Developer empowerment led to cloud cost savings

Spotify hit a critical crossroads in its growth story. At one point, the cost of infrastructure outpaced revenue growth.

Management scrambled to find ways to curb cloud costs.

But they felt that they couldn’t impose new cost controls from above. After all, Spotify cherishes engineer autonomy above all.

So they looked at the issue as an engineering problem.

Spotify’s Insights Cost team devised a brilliant strategy that leveraged the popularity of the Backstage platform.

James Governor of RedMonk spoke with this team to learn more.

They developed a Cost Insights plugin that integrated directly within Backstage.

The premise is simple: engineers and their squads are put in charge of the costs associated with running their service(s).

Cost control becomes part of the engineering workflow rather than an afterthought for finance teams to manage.

Was this premise successful?

Yes.

Helping engineers get directly involved in cost decision-making helped Spotify cut its annual cloud spend by millions of dollars.

Very briefly, here are some things that help developers control costs:

  • labeling cloud provider resources to match their own internal component and service names, rather than relying on the cloud provider’s default billing info
  • an internal chargeback model to bill other teams for costs incurred by a jointly-owned service
  • the ability to drill down into the cost of specific components and cloud provider services
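
Here’s a toy sketch of the underlying idea of label-based cost attribution and chargeback. The labels, component names, squads, and numbers are all invented; the real Cost Insights plugin is far richer:

```typescript
// Toy sketch of label-based cost attribution and chargeback. All labels,
// services, squads, and numbers are invented for illustration.

interface BillingLineItem {
  cloudService: string;               // e.g. "Compute Engine", "BigQuery"
  costUsd: number;
  labels: Record<string, string>;     // resources labeled with internal names
}

// Attribute each line item to the internal component named in its labels.
function costByComponent(items: BillingLineItem[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const item of items) {
    const component = item.labels["component"] ?? "unattributed";
    totals.set(component, (totals.get(component) ?? 0) + item.costUsd);
  }
  return totals;
}

// Simple chargeback: split the cost of a jointly-owned component across the
// squads that use it, according to agreed usage shares.
function chargeback(totalCost: number, shares: Record<string, number>): Record<string, number> {
  const bill: Record<string, number> = {};
  for (const [squad, share] of Object.entries(shares)) {
    bill[squad] = totalCost * share;
  }
  return bill;
}

// Example: a shared "search-index" component split 70/30 between two squads.
const items: BillingLineItem[] = [
  { cloudService: "Compute Engine", costUsd: 1200, labels: { component: "search-index" } },
  { cloudService: "Cloud Storage", costUsd: 300, labels: { component: "search-index" } },
];
const total = costByComponent(items).get("search-index") ?? 0;
console.log(chargeback(total, { "squad-search": 0.7, "squad-discovery": 0.3 }));
```

In practice, Spotify surfaces this kind of breakdown inside Backstage, right where engineers already work.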

According to Janisa Anandamohan of Spotify’s Cost Engineering, engineers are natural optimizers.

Giving them the task of tweaking costs is yet another opportunity to optimize parameters for the better.

Engineers wanted to save costs on their services for 2 other reasons:

  1. as a matter of pride among their peers, showing off their wins as a competitive game and also
  2. realizing that savings would mean greater profit in the organization, which would boost personal shareholdings

The good news for you is that the Cost Insights plugin is open-source just like the entire Backstage portal. [Before you think it, this is not a sponsored post]

Think of this as a PSA.

Because I know that Spotify isn’t the only organization dealing with high cloud costs.

The mood will only get bleaker as we migrate more workloads to the cloud, build new services there, and use more abstractions like serverless.

Who will be responsible for managing all this as it unravels?

The first key is to engage and enable the people who built the services.

The second key is to enable their new workflow where they already hang out. In the case of Spotify, it was the Backstage platform.

Through all this, Spotify has fostered a culture where cost optimization becomes an enjoyable endeavor for the people best placed to make it happen.

Culture of Continuous SRE Improvement

By now, you may have noticed that Spotify embraces a culture of ongoing learning and growth.

SREs are encouraged within their Guild to:

  • experiment with new technologies
  • share knowledge across the organization and
  • stay updated with industry best practices

This ethos manifests in the following tangible ways:

Hack Weeks

Spotify organizes regular Hack Weeks, one of which spawned the Backstage platform as an open-source project.

Spotify’s teams and individual contributors are encouraged to:

  • prototype new features
  • explore emerging technologies and
  • address pain points in the existing systems

Retrospectives

I previously mentioned how SREs and developers participate in retrospectives of incidents.

Spotify’s approach to retrospectives creates a safe space for:

  • starting open and honest discussions
  • enabling teams to celebrate successes
  • investigating challenges and potential areas of optimization

Teams review successes and shortcomings and suggest practical measures to apply lessons learned in future projects.

Peer Feedback

Spotify encourages a culture of constructive peer feedback and knowledge sharing.

Squads engage in regular feedback sessions where individuals provide insights, suggestions, and observations to help their colleagues grow and improve.

There are two more aspects to Spotify’s culture of continuous improvement: psychological safety and game days. I will cover them in the story below.

When a Spotify engineer deleted its US Kubernetes cluster

Spotify’s culture of continuous improvement is best embodied in a story told by one of its infrastructure engineers, David Xia.

In 2018, he was working on a test project and created a test cluster that emulated the configuration of a 50-node US production cluster.

He had tabs open for both the US and test clusters on the same screen.

By accident, he went into the wrong tab and deleted the production cluster that served Spotify’s US users.

This would set off almost any manager I’ve worked with in the past. Not at Spotify. The engineer was not berated for making the mistake and instead was calmly asked to bring it back online.

He jumped into action and spent the next three and a quarter hours working to bring that US cluster back online.

Along the way, he learned that their recovery process was inadequate, with:

  • bugs in cluster creation scripts
  • incomplete and incorrect documentation
  • inability to resume a cluster creation process

This experience allowed him to take his learnings back to the next retrospective. Engineers then created a guardrail to prevent accidental cluster deletion.
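
Spotify hasn’t described what that guardrail looks like, but the basic idea is easy to sketch. The snippet below is purely hypothetical: it refuses to delete a production cluster unless the operator re-types its exact name.

```typescript
// Hypothetical guardrail: refuse to delete a cluster marked as production
// unless the operator re-types its exact name. Not Spotify's actual tooling.

import * as readline from "node:readline/promises";

interface Cluster {
  name: string;
  environment: "test" | "staging" | "production";
}

async function confirmDeletion(cluster: Cluster): Promise<boolean> {
  if (cluster.environment !== "production") return true;

  const rl = readline.createInterface({ input: process.stdin, output: process.stdout });
  const answer = await rl.question(
    `You are about to delete PRODUCTION cluster "${cluster.name}". ` +
    `Type the cluster name to confirm: `,
  );
  rl.close();
  return answer.trim() === cluster.name;
}

async function deleteCluster(cluster: Cluster): Promise<void> {
  if (!(await confirmDeletion(cluster))) {
    console.error("Deletion aborted: confirmation did not match.");
    return;
  }
  // ...call the real deletion API here...
  console.log(`Deleting cluster ${cluster.name}`);
}
```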

It also became routine to practice disaster scenarios and perform recoveries: essentially, game days.

This would not have been possible if the engineer did not have the psychological safety to look at his mistake with a learning mindset.

Parting words

You have now experienced the masterpiece that is Spotify’s Site Reliability Engineering (SRE) practice.

Born out of the relentless growth of its user base and infrastructure, it’s a testament to Spotify’s commitment to engineering excellence.

What makes this engineering marvel truly remarkable is how Spotify has crafted a culture of reliability engineering.

It has intertwined the expertise of SREs with the ingenuity of developers. Both factions join forces with DevOps practices at their foundation, responding to incidents with confidence.

What’s remarkable is the effort they’ve put into codifying practices like communities of practice, developer experience, and cost control. These practices are crucial to ensuring the ongoing success of cloud computing.

It’s only through the company’s commitment to continuous improvement that it’s been able to birth and continuously innovate these practices.

And it has reaped the rewards.