Inside Disney’s Site Reliability Engineering practice

Introduction

It is no small feat to run an ecosystem of entertainment experiences to delight a wide range of people, from young children to older “Disney adults”.

Almost every Disney experience relies on a sophisticated technology stack working in the background.

“Steve Jobs once said technology amplifies human ability. At Disney, we use technology to create digital experiences that bring magic to people all around the world.” — Jason Cox, Director of Global SRE at Disney

Disney’s SRE teams have ensured that the magic keeps happening, even as experiences and their underlying technology become more and more complex.

History of SRE at Disney

Jason Cox has been the Director of Global SRE at Disney since 2011.

He has helped Disney remain a global leader in the entertainment industry by keeping the company up to date as technology has become a critical driver in entertainment.

His promotion to Systems Reliability Engineering czar at the company came with its challenges.

📌 For the record, SRE at Disney is denoted as Systems Reliability Engineering.

He had already been at the company for several years, working in the operations team of Disney’s Internet group. The biggest hurdle to effectiveness he saw was that his team and every other group operated in silos.

The silos weren’t only among departments like technology and product. Disney had four large divisions with separate autonomous CTOs. These divisions were: Studios, Consumer Products and Interactive, Parks and Support, and Media Networks.

While this allowed each division to operate independently, it also resulted in institutionalized Shadow IT. Every division had its own way of doing things, even for very trivial work.

However, rapid growth necessitated cross-company changes, so a DevOps transformation was undertaken. Jason and his team applied DevOps principles to each part of the business, not just technically but culturally, by breaking down silos and scaling DevOps.

The goal was to improve communication and collaboration, getting technologists from various disciplines to work together and embrace new ideas.

The following issues were top of mind for the SRE group:

As Disney’s businesses expanded digitally, the workload and firefighting increased. Development teams faced immense increases in workload as their server counts jumped from tens of servers to hundreds and thousands.
Bureaucracy and manual processes, like tickets for even the smallest of requests, slowed down engineering teams’ ability to deal with business needs and customer demands for rapid iteration. For example, cloud accounts took weeks to provision rather than minutes
Production systems were suffering from low reliability, security, resiliency, and quality
Agile work practices were improving the velocity of development that was hitting production, but that itself was challenging operations because of the scale and speed of changes to production systems
Engineers were burning out due to high cognitive load trying to perform operational heroics during extended periods of firefighting and having no time to improve on the work – they only had time to react and move on to the next problem

Disney’s servers were nicknamed after Snow White’s dwarves, such as Grumpy, Sleepy, and Dopey, to reflect server behaviors. The amusing naming revealed a more significant issue: how difficult it was to stay on top of server configurations.

😅 Servers began to take on the personality they were named after

Grumpy regularly showed errors
Sleepy kept suffering from high latency
Bashful would disappear from the network for days

To address these issues, the newly minted SRE teams worked to create a more centralized IT infrastructure that would streamline operations across all divisions.

They championed several initiatives including:

the implementation of company-wide systems of operation — like self-service portals — that would allow different departments to work more efficiently
adoption of newer technologies such as cloud computing and virtualization, which have allowed Disney to scale its operations more effectively
building infrastructure-as-code (IaC) and coupling it with the application code while leveraging technology such as containerization and serverless architecture
shifting focus to reliability with the aim of delivering more reliable applications and experiences through platform abstraction

“Our DevOps transformation at Disney focused on technology, leadership, and community. Technology is crucial because it amplifies human ability.” — Jason Cox, Director, Global SRE at Disney

With Jason’s timely vision and his team’s hard work, Disney has been able to stay ahead of the competition and remain a leader in the entertainment industry.

How SRE has helped Disney’s tech operations

The Systems Reliability Engineering (SRE) team at Disney revolutionized its technology landscape by emphasizing the importance of core DevOps processes.

By incorporating best practices such as continuous integration and delivery, automated testing, and monitoring, the team was able to improve the efficiency and reliability of various systems and applications.

The team also worked closely with cross-functional teams to identify and address key pain points, resulting in a more streamlined and effective workflow.

As a result of these efforts, the Disney SRE team was able to significantly enhance the overall performance and scalability of Disney’s technology infrastructure, leading to improved customer experiences.

In particular, they addressed 2 key challenges…

Disney’s challenge: poor visibility across systems

With operations spread out across multiple locations and environments, it was crucial for Disney SREs to have a way to track and analyze data from all of their systems in one place. To achieve this, they invested in sophisticated technology and trained their staff to use it effectively.

They recognized the importance of keeping a close eye on their operations, so they sought to implement systems that could provide them with comprehensive monitoring and observation capabilities.

They also established clear protocols for identifying and addressing issues, as well as for reporting on progress and performance.

By taking these steps, Disney was able to ensure that their systems were running smoothly and efficiently, which in turn allowed them to provide the high-quality experiences that their customers have come to expect.

↪️ Solution: comprehensive observability

Disney employs a variety of methods and technologies to ensure the effectiveness of their monitoring and observability processes. In addition to Splunk, which they use for log analysis, they also utilize Grafana for metrics visualization and PagerDuty for incident management.

Disney’s use of Splunk allows them to efficiently analyze logs, which helps to identify potential problems and expedite the process of resolving them.

By using Grafana to visualize metrics, Disney can gain a better understanding of their systems’ performance and proactively address any issues that arise.

Furthermore, PagerDuty’s incident management capabilities ensure that the appropriate teams are notified in real time when any critical events occur.
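To make the log-analysis-to-notification flow a little more concrete, here is a minimal sketch in Python. It does not use the Splunk, Grafana, or PagerDuty APIs, and nothing here is known to reflect Disney’s actual setup. The log format, the `error_rates` and `services_to_page` helpers, and the 50% threshold are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def error_rates(lines: Iterable[str]) -> Dict[str, float]:
    """Compute the share of ERROR lines per service.

    Assumes a hypothetical log format: "<service> <LEVEL> <message>".
    """
    totals: Counter = Counter()
    errors: Counter = Counter()
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue  # skip malformed lines
        service, level = parts[0], parts[1]
        totals[service] += 1
        if level == "ERROR":
            errors[service] += 1
    return {service: errors[service] / totals[service] for service in totals}


def services_to_page(rates: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the services whose error rate meets the (assumed) paging threshold."""
    return sorted(s for s, rate in rates.items() if rate >= threshold)


# Hypothetical log lines, for illustration only.
SAMPLE_LOGS = [
    "grumpy ERROR upstream timeout",
    "grumpy ERROR upstream timeout",
    "grumpy INFO request served",
    "sleepy INFO request served",
    "sleepy INFO slow response",
]

rates = error_rates(SAMPLE_LOGS)
print("page on-call for:", services_to_page(rates))
```

In a real pipeline the log query, the dashboard, and the paging step would each live in their own tool. The point of the sketch is only the shape of the flow: aggregate, compare against a threshold, then notify the owning team.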

In summary, Disney’s strategic use of these tools and techniques enables them to maintain a highly effective monitoring and observability system, which is essential to ensuring the efficient operation of their complex technological infrastructure.

Disney’s challenge: driving consistent reliability at scale

Disney faced a significant challenge in achieving consistent reliability at scale across all their environments and locations. This was especially challenging due to the sheer size of the organization and the diverse range of locations where they operate.

↪️ Solution 1: deploy configuration management

To address this challenge, Disney turned to the use of configuration management tools such as Puppet and Chef. These tools have been crucial in helping Disney achieve consistent infrastructure across their numerous environments and locations.

In fact, with the help of Puppet and Chef, the average time to deploy a new environment was reduced from 2 weeks to just 2 hours.

By having a centralized system for managing configurations, Disney is able to ensure that all of its systems are running smoothly and are up-to-date.

Let’s go through 2 examples where configuration management has made a positive impact:

Example 1 of configuration management

Before implementing configuration management, employees would spend eight hours each night manually updating the 100 servers involved in the “Toy Story Mania” attraction. But now, thanks to configuration management, a single person can update the entire fleet in just 30 minutes!

By enforcing configuration and converging each system together, configuration management has also helped reduce system drift for Disney. This means that each system is more consistent and performs at a higher level, leading to improved operations and better results.
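The idea of converging systems toward a desired state can be sketched in a few lines of Python. This is not how Puppet or Chef are implemented, and it is not Disney’s code. The `detect_drift` and `converge` functions, the setting names, and the sample fleet are hypothetical and exist only to show the principle.

```python
from typing import Dict, List, Optional, Tuple

Config = Dict[str, str]
Drift = Dict[str, Tuple[Optional[str], Optional[str]]]


def detect_drift(desired: Config, actual: Config) -> Drift:
    """Return {key: (actual_value, desired_value)} for every setting that differs."""
    drift: Drift = {}
    for key in sorted(set(desired) | set(actual)):
        have, want = actual.get(key), desired.get(key)
        if have != want:
            drift[key] = (have, want)
    return drift


def converge(desired: Config, actual: Config) -> Tuple[Config, List[str]]:
    """Bring one server's settings in line with the desired state.

    Returns the converged settings and a log of the changes made.
    Running it again on the result makes no changes (idempotent).
    """
    converged = dict(actual)
    changes: List[str] = []
    for key, (have, want) in detect_drift(desired, actual).items():
        if want is None:
            del converged[key]  # setting is not part of the desired state
            changes.append(f"remove {key}")
        else:
            converged[key] = want
            changes.append(f"set {key}: {have} -> {want}")
    return converged, changes


# Hypothetical desired state and fleet, for illustration only.
DESIRED = {"app_version": "2.4.1", "ntp_server": "time.example.internal"}
FLEET = {
    "grumpy": {"app_version": "2.3.0", "ntp_server": "time.example.internal"},
    "sleepy": {
        "app_version": "2.4.1",
        "ntp_server": "time.example.internal",
        "debug": "on",
    },
}

for name, settings in list(FLEET.items()):
    FLEET[name], log = converge(DESIRED, settings)
    print(name, log)
```

Because every server is compared against the same desired state, a second run reports no changes. That property is what keeps drift from creeping back in between updates.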

Example 2 of configuration management

Disney was also able to ensure consistency across the 220 stores they have across the U.S., each with multiple point-of-sale devices. By converging these devices through configuration management, employees could easily verify that everything was working as intended.

→ This is important because it allows the stores to provide a consistent experience for customers, regardless of which store they visit.

In addition, configuration management helps to ensure that employees are able to spend more time helping customers and less time troubleshooting technical issues. By streamlining its technical infrastructure, Disney has also been able to reduce costs associated with maintenance and support.

→ This has allowed them to invest more in other areas of their business, such as marketing and product development.

↪️ Solution 2: bespoke automation tools

In addition to using open-source tools, Disney SREs also developed their own internal configuration management tool called the “Disney Deployment Framework.”

This framework allows them to automate the application deployment process and ensure consistency across different environments. By having a tailored solution that fits the unique needs of the organization, Disney is able to achieve even greater levels of reliability and consistency.

Disney SREs also emphasize the importance of testing configuration management code rigorously. They have developed a tool called “Simba” that allows them to test changes to infrastructure code before deploying it to production. By doing so, they can catch any issues before they cause problems for the business.
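As a rough illustration of what checking infrastructure changes before deployment can look like, here is a minimal Python sketch. It is not Simba and makes no claim about how that tool works. The `validate_change` function, the required keys, the allowed environments, and the replica rule are all hypothetical.

```python
from typing import Dict, List

# Hypothetical rules, for illustration only.
REQUIRED_KEYS = {"service", "environment", "replicas"}
ALLOWED_ENVIRONMENTS = {"dev", "staging", "production"}


def validate_change(config: Dict[str, object]) -> List[str]:
    """Return the problems found in a proposed config; an empty list means it passes."""
    problems: List[str] = []
    for key in sorted(REQUIRED_KEYS - set(config)):
        problems.append(f"missing required key: {key}")

    env = config.get("environment")
    if env is not None and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment: {env}")

    replicas = config.get("replicas")
    if replicas is not None and (not isinstance(replicas, int) or replicas < 1):
        problems.append(f"replicas must be a positive integer, got {replicas!r}")

    # Assumed policy: production must not run on a single replica.
    if env == "production" and isinstance(replicas, int) and replicas == 1:
        problems.append("production needs at least 2 replicas")
    return problems


# A proposed change is checked before it is allowed anywhere near production.
proposed = {"service": "ride-queue", "environment": "production", "replicas": 1}
problems = validate_change(proposed)
if problems:
    print("blocked:", problems)
else:
    print("ok to deploy")
```

Running checks like these in a pipeline means a bad change is rejected with a readable reason instead of being discovered on a live system.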

The impact of configuration management has been “truly magical” for Disney. By streamlining their IT processes and ensuring that everything is running smoothly, they are able to focus on delivering an exceptional experience to their customers.

What is Disney’s SRE culture like?

The 3 C’s value system

The value system at the Disney company consists of better, faster, safer, happier.

“How do we go for higher quality? That is taking it to the next level of quality.”

“Go faster. Got to get it to market faster.”

Disney’s SRE team culture takes inspiration from this for three values of its own, the 3 C’s: collaboration, curiosity, and courage.

This value set has allowed operations-facing engineers at Disney to:

become less transactional and more integrated with the work
do less manual work and drive more self-service and automation

“Let’s fix the job title”

Until 2017, Disney’s operations engineers / SREs were called “Systems operators”. They changed the naming to “Systems Engineers” to reflect that they weren’t just the people “operating the train” but also those who designed the train, built the train, the tracks, and the bridges.

Disney’s SREs began their journey at the company by working with other teams to espouse the above ethos — that they were integrated with the value chain and not outside of it.

“We began to engineer our future, and as part of that, we became embedded with the teams that we’re supporting, the product teams and the business teams.” — Jason Cox

Toward a generative culture

SRE leadership has eschewed the traditional management mantra of fear, power orientation, command-and-control, and bureaucracy. They aim to lead with a generative culture that empowers knowledge workers to achieve their edge.

At the same time, the SRE teams have a distinctive way of keeping engineers from becoming arrogant and seeing themselves as A-player rockstars above everyone else.

Fostering a service mindset

“My team sits at the bottom of the corporate hierarchy” — at least that’s the mindset that Jason Cox has instilled in his SRE teams. They are there to be of service to the business stakeholders and the technologists on the ground.

In Jason’s words, “My goal is to say, ‘How can I help?’ So as I go into each one of these segments, I say, ‘I’m with corporate. I’m here to help.’”

People on the ground are still apprehensive when he and his team approach them, especially with the “I’m with corporate” phrase being part of their approach.

He adds that they work and continually communicate to make sure that people and especially engineers on the ground don’t see them as an imposing force, “here to take away your fun”.

Developing T-shaped skillsets

SRE teams acted as tech evangelists who helped champion the positive changes mentioned earlier across the organization, and they augmented product teams with their T-shaped skills.

T-shaped skills refer to the concept that SREs should possess a broad range of skills and knowledge across different fields, while also having deep expertise in a particular area.

For instance, a Site Reliability Engineer may have experience in software development, operations, project management, security, and more. Even a very thin layer of experience in one or several of these areas is enough to be empathetic and to understand the work well enough to lean in and help.

However, they should also have a deeper understanding of one area, such as cloud computing, security, networks, or automation. By having this broad and deep skill set simultaneously, SREs can better understand and empathize with their colleagues in different disciplines.

A culture of continuous learning

T-shaped skills have in turn helped Disney’s SREs foster more effective collaboration and problem-solving. Furthermore, possessing a diverse range of skills and knowledge can help SREs identify and address issues that may arise in a complex system.

To continue developing their skills, Disney ensures its SREs participate in a community of practice, nicknamed Jedi Engineering Training (JET).

This community has two distinct benefits:

New technologies are promoted while a community of people is fostered around discussing technologies and collaborative problem-solving.
External and internal experts visit to speak about their current projects. The community uses these insights to work through problems, adopt innovations, and connect with each other.

Parting words

As you have read, Disney’s Systems Reliability Engineering (SRE) team has revolutionized the company’s technology landscape by emphasizing the importance of core DevOps processes such as continuous integration and delivery, automated testing, and observability.

They have also worked to create a more centralized IT infrastructure that would streamline operations across all divisions.

Disney’s SRE culture consists of collaboration, curiosity, and courage. They aim to lead toward a generative culture that empowers knowledge workers to achieve their edge.

What Walt Disney himself said about Disney’s secret rings true for how a successful Site Reliability Engineering team can operate:

“There’s really no secret about our approach. We keep opening new doors and doing new things — because we’re curious. And curiosity keeps leading us down new paths.”

Author

Ash Patel

Reliability Nut at SREpath

Ash has an unhealthy obsession with software reliability. Maybe it’s got to do with the trauma of working at a few companies where software kept slowing or went down while he worked to turn it around. His ma hopes that he can one day turn this passion into a respectable job or business. Still waiting…
