{"id":635,"date":"2022-07-20T22:48:51","date_gmt":"2022-07-20T12:48:51","guid":{"rendered":"https:\/\/sysmit.com\/cf22\/?p=635"},"modified":"2023-12-13T15:28:02","modified_gmt":"2023-12-13T05:28:02","slug":"rundown-of-uber-sre-practice","status":"publish","type":"post","link":"https:\/\/sysmit.com\/cf22\/rundown-of-uber-sre-practice\/","title":{"rendered":"Rundown of Uber’s SRE practice"},"content":{"rendered":"

Introduction<\/h2>\n\n\n

Every time you push a button like the one below to request an Uber ride, you activate a sequence of (micro)service requests.<\/p>\n\n\n\n

\"\"<\/figure>\n\n\n\n
<\/div>\n\n\n\n

You’d never know unless you looked under the hood, because most of these services run solely in the background. Yet every service contributes to the start and completion of the Uber ride experience.<\/p>\n\n\n\n

We’ll explore how Uber’s engineers keep this sequence of requests running without trouble. The effort ties into Uber’s SRE practice, which has been part of the broader infrastructure practice since 2014.<\/p>\n\n\n

\ud83d\udcca Performance statistics for Uber<\/h2>\n\n\n

Uber has grown significantly since 2010, when it was a car-for-hire service in San Francisco.<\/p>\n\n\n\n

\"\"<\/figure>\n\n\n\n
<\/div>\n\n\n\n

Site Reliability Engineers (SREs) ensure rides, microservices, and writes work as reliably as possible. <\/p>\n\n\n

What is Uber’s contribution to Site Reliability Engineering?<\/h2>\n\n\n

Just as Netflix is known for Chaos Engineering<\/a>, Uber is best known for its contribution to distributed tracing<\/strong> through its open-source Jaeger tool.<\/p>\n\n\n\n

How is it that Netflix developed chaos engineering and Uber developed tracing? <\/p>\n\n\n\n

It comes down to the internal problems each organization needed to solve. Netflix is focused on the playback of media files, so the start of playback matters most.<\/p>\n\n\n\n

Hence Netflix focused on sustaining a high starts-per-second rate through chaos engineering, and developed the “Simian Army” tool suite to achieve this.<\/p>\n\n\n\n

For Uber, starts-per-second is not the key metric. They have a resilience suite similar to the Simian Army, but tracing is their holy grail. <\/p>\n\n\n\n

That’s because Uber’s golden metric is requests handled<\/strong>, i.e., rider requests connecting with driver availability and geolocation. Handling each such request requires many API calls across many services.<\/p>\n\n\n\n

What kind of services are called for every ride? Here are a handful of examples:<\/p>\n\n\n\n