Rundown of Uber’s SRE practice

Introduction

Every time you push a button to request an Uber ride, you set off a sequence of (micro)service requests.

You’d never know unless you look under the hood because most of these services run solely in the background.

But almost every such service is critical to the start and completion of the Uber ride experience.

We’ll explore how Uber’s engineers ensure that this goes on without trouble. The effort ties into Uber’s SRE practice, which has been part of the broader infra practice since 2014.

📊 Here are some performance stats for Uber

Uber has come a long way since its early days in San Francisco as a black-car-for-hire service…

[Image: statistics for Uber’s performance in 2021, including number of rides completed, number of microservices and the database write rate]

SREs played an essential role in making sure all of those rides, microservices, and writes worked reliably.


What is Uber’s contribution to better SRE?

Just as Netflix is known for Chaos Engineering, Uber is best known for its engineering prowess in distributed tracing.

It makes sense. Let me explain why this is the case.

Netflix is focused on the playback of media files, so the start of playback matters most. Hence its chaos engineering, which uses the “Simian Army” to ensure high playback start rates.

For Uber, this is not the key metric. Yes, they have a resilience suite similar to the Simian Army, but tracing is their holy grail. Here’s why…

Uber’s key metric is requests handled, i.e. a rider request matched with driver availability and a destination mapping request. That is a simplistic but reasonable view.

This means a lot of API calls to many services. What kinds of services are called per ride? Here are just some of them:

  • Logistics and routing
  • Maps and navigation
  • Supply & demand market
  • Fraud detection
  • Payments
  • Surge calculation
  • SMS, calling, messaging
  • Data warehousing + modeling

Completed requests across the entire service chain are the bread and butter of Uber. When things go wrong, it’s about tracing the multitude of API calls to find the faulty service(s).

That’s why one of Uber’s engineers, Yuri Shkuro, developed a distributed tracing tool that could handle Uber’s heavy service load and traffic. This tool is called Jaeger (video).
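
To make the idea concrete, here is a minimal sketch of what tracing a request across services means. It is not Jaeger’s API or Uber’s implementation. The `Span` and `Trace` classes, the `simulate_ride_request` helper and the service names are all hypothetical, loosely borrowed from the list above. The only assumption is the general tracing idea: every span carries a shared trace ID and a pointer to its parent, so the call chain can be rebuilt when something fails.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Span:
    """One unit of work performed by one service for one request."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    error: Optional[str] = None


@dataclass
class Trace:
    """Groups every span that shares the same trace ID."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: List[Span] = field(default_factory=list)

    def start_span(self, service: str, parent: Optional[Span] = None) -> Span:
        # A child span records its parent's ID so the call chain can be rebuilt.
        span = Span(
            trace_id=self.trace_id,
            span_id=uuid.uuid4().hex,
            parent_id=parent.span_id if parent else None,
            service=service,
        )
        self.spans.append(span)
        return span

    def faulty_services(self) -> List[str]:
        # Services whose own span recorded an error.
        return [s.service for s in self.spans if s.error is not None]

    def path_to(self, span: Span) -> List[str]:
        # Walk parent links back to the root, then reverse into call order.
        by_id = {s.span_id: s for s in self.spans}
        path: List[str] = []
        current: Optional[Span] = span
        while current is not None:
            path.append(current.service)
            current = by_id.get(current.parent_id) if current.parent_id else None
        return path[::-1]


def simulate_ride_request() -> Tuple[Trace, Span]:
    """Hypothetical call chain for one ride request; service names are made up."""
    trace = Trace()
    root = trace.start_span("ride-request")
    trace.start_span("matching", parent=root)
    trace.start_span("routing", parent=root)
    payments = trace.start_span("payments", parent=root)
    fraud = trace.start_span("fraud-detection", parent=payments)
    fraud.error = "timeout"  # simulated failure deep in the chain
    return trace, fraud


if __name__ == "__main__":
    trace, failed = simulate_ride_request()
    print(trace.faulty_services())
    print(" -> ".join(trace.path_to(failed)))
```

With the parent links in place, finding the faulty service is a lookup rather than a hunt through the logs of every service the request touched.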

I wrote about Jaeger a couple of years ago, so I’ll link to that post once I’ve reposted it here. But that is beyond the scope of this overview of Uber’s SRE practice. Let’s continue…


🤝 How SRE fits into the Uber org and culture

Team formation

SRE teams at Uber can be found in 2 forms: as embedded teams or as handlers of infrastructure-at-large.

Embedded SRE teams work alongside or within engineering teams responsible for services like data engineering, frontend/backend and real-time matching.

Infrastructure-at-large SRE teams take ownership of an aspect of infrastructure across the entire Uber organization. Examples of aspects include security, storage, compute and observability.

It appears that an SRE team owns observability across the entire org and instruments it into all services!

According to Uber’s first SRE hire, Rick Boone, the observability function is too critical to Uber’s 99.99% uptime goal for implementation responsibility to be distributed.

It’s much easier to have one focused team do it well than ask and pray for hundreds of disparate teams to do it well.

How did SRE start at Uber?

The SRE practice at Uber was started in October 2014, a very tumultuous time for the company.

An earlier systems meltdown served as a wake-up call to shift toward a reliability-focused culture.

Until this point, Uber’s 500 or so services were the responsibility of the overloaded “core infra” team. Will Larson, founder of SRE at Uber, wrote at great length about this.

Prior to the existence of SRE in Uber, the core infra team had 15 engineers reacting to a myriad of operational issues including on-call, platform work and more.

There was little to no time to improve the systems and prevent the recurring issues that were, in the long run, chewing up a lot of engineering time.

Culture for Uber SRE

The SRE culture at Uber can be summed up in 4 words: small team, outsized impact. Every. effort. must. scale.

Give me a lever and a fulcrum on which to place it, and I shall move the world.

– Archimedes quote that aptly describes SRE work

In 2016, for every SRE at Uber, there were:

  • 14 services
  • 30 engineers
  • 640 servers

Compared to SRE work at other tech companies, Uber’s SREs have interesting challenges like persistent trip tracking.

From the rider’s side, persistent trip tracking is what you see when the driver’s car icon moves on the map toward you and, after you’ve been picked up, toward your destination.

Uber is keeping real-time data on hundreds of thousands of trips across the world at any given time. It is a real challenge to ensure all of this data continues to stream smoothly.

In order to achieve this, SREs are given a high degree of autonomy to work on solutions they feel the system needs.

We’ll shortly cover the 3 cultural traits of Uber SRE that make this happen.



🎯 Uber SRE goal parameters

Mission – Highly performant, highly available systems

Uptime goal – 99.99% uptime

Performance target – subsecond end-user latency (<500ms)

Business goal – minimize operational cost per ride
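
To put the 99.99% figure in perspective, here is a small sketch that turns an uptime target into a downtime budget. The function name and the 30-day and 365-day windows are my own assumptions for illustration. I have not seen the window Uber measures against.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(uptime_target: float, window_days: int) -> float:
    """Minutes of downtime allowed in `window_days` for a given uptime target."""
    if not 0.0 <= uptime_target <= 1.0:
        raise ValueError("uptime_target must be a fraction between 0 and 1")
    # The budget is the fraction of the window that may be spent down.
    return (1.0 - uptime_target) * window_days * MINUTES_PER_DAY


if __name__ == "__main__":
    target = 0.9999  # the 99.99% uptime goal
    print(f"per 30-day month: {downtime_budget_minutes(target, 30):.2f} min")
    print(f"per 365-day year: {downtime_budget_minutes(target, 365):.2f} min")
```

That works out to roughly 4.32 minutes per 30 days, or about 52.56 minutes per year, which leaves very little room for slow incident response.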


3 cultural traits of Uber SRE work

⏩ Continue to move fast

Uber’s services call for high reliability. At the same time, in its early days, the SRE team faced a challenge that now forms a large part of the wider SRE ethos.

They had to ensure highly available, performant services without slowing developers down, as tends to happen when more traditional ops workflows are in place.

They achieved this by having SREs serve the needs of service teams with leverage in mind. Their goal was to automate critical workflows to scale their work.

🙏🏽 Engineer autonomy is sacred

I’ve reviewed blog posts by everyone from senior Uber engineering leaders down to SRE interns. There is a clear pattern of respecting, to an extent, an engineer’s need to do their own thing.

The thinking behind this is “make what you think or know will make things better and we will try to love it and make it work with the wider system”.

⚠️ This kind of thinking doesn’t come without its challenges, and they surfaced as Uber grew very fast into a very large company. Quoting Will Larson:

“Uber’s wider infrastructure organization had an organic approach to problem solving. It was fundamentally a bottoms-up organization that had local strategies without a common overarching strategy, which created quite a few points of friction.” (source)

I’d say give engineers autonomy, but have working groups in place to ensure all of this autonomous work stays aligned with the overall direction.

🚩 Hit the trifecta of influence and impact

There are 3 factors that play into the kind of influence and impact an engineer can have. These are:

  1. high internal impact
  2. high ability to influence decisions and actions
  3. high external impact

In a small startup, engineers experience #1 and #2. Every action they take affects the internal mechanics of the startup. Plus they get to influence a lot of decisions.

In a large tech company, engineers experience #1 and #3, but not a lot of #2. Their work touches a lot of people inside and outside the org, but they rarely influence decision-making.

At Uber, engineers experience all 3 points. Their work touches a lot of people inside the org plus millions outside the org, but they also get to influence decisions within the org.


Parting remarks

Uber SREs have rightly been given a lot of autonomy to fix the unique problems that the application can face.

After all, with 3000+ services as of June 2022, there’s a huge surface area for things to go wrong. It truly is amazing what they do in the background so you can safely ride from A to B.