Analysis of SRE and platform setup at 10+ tech companies

In this article, you will see a breakdown of the platform setup and SRE practices within 12 non-FAANG technology companies.

This is based on the case studies by Andrios Robert.

“There is a lot of content available on how Google did [Site Reliability Engineering]; let’s uncover what happens with the rest of the world.”

— Andrios Robert, founder of Runops.io

You might be thinking, “Why don’t I read Andrios’ writeups instead of this?” You should at some point, but chances are that you are pressed for time.

I have trawled through Andrios’ case studies and pulled together common threads, which I’ll explain in my analysis toward the end of this article.

Let’s first cover a few specifics about the companies being covered:

  • growth or late-stage startups and mid-sized companies — no FAANG or early-stage startups are represented below
  • they are all Brazilian companies — founded there or having a branch there — but their experiences should still translate across to other regions

Expanding on the last part:

Yes, these companies are based in a faraway region. However, I believe their experiences can highlight what companies in similar industries would be doing elsewhere in the world.

I doubt there is an appetite in Latin America to reinvent the wheel when serving cloud applications similar to what we see elsewhere.

Let’s have a business-level picture of the companies in a table format:

| Company | Industry | What they do | Founded | Funding/Revenue |
|---|---|---|---|---|
| Hash | FinTech | Brazil’s Square | Feb 2017 | $58.7m, Series C |
| Dafiti | eCommerce | Fashion store for all | Nov 2010 | $250m, Series C |
| Creditas | FinTech | Consumer loans | Apr 2012 | $1.1b, Series F |
| TOTVS | ERP | Enterprise systems | Sep 1983 | $3.8b market cap |
| Empiricus Research | FinTech | Investment tracker | 2009 | Unknown |
| Leroy Merlin (Brazil) | eCommerce | Home/Hardware | 1923 | $84m revenue 2020 |
| Dock | FinTech | Banking APIs | 2014 | $280m funding |
| Loggi | Transport | Delivery app | 2013 | $507m, Series F |
| Delivery Center | FoodTech | Delivery app management | 2016 | Unknown |
| Quintoandar | Property | Long-term rental marketplace | 2012 | $700m, Series E |
| Natura | eCommerce | Beauty shop | 1969 | $9.6b revenue in 2021 |
| PicPay | FinTech | Neobank | 2012 | Acquired for undisclosed sum |
What they do and Funding/Revenue data obtained from Crunchbase

As you can see, these are not early-stage startups or small businesses.

Now let’s collate the common DevOps and SRE elements across these organizations:

Platform setup

Platform

  • K8s (Hash, Dafiti, Creditas, TOTVS, Empiricus, Dock, Loggi, Delivery Center, Quintoandar, Natura, PicPay)
  • AWS Lambda functions (Empiricus, Dock)
  • Istio (Hash, Dafiti, Empiricus, Loggi)
  • Terraform for IAC (Dafiti, Creditas, TOTVS, Natura)

Monitoring

  • Prometheus + Grafana (Hash, Creditas, TOTVS, Empiricus, Quintoandar, Natura)
  • NewRelic (Dafiti, Creditas, Delivery Center)
  • Datadog (Empiricus, Dock)
  • PagerDuty (Empiricus, Natura)

Codebase

  • Pre-microservices monolith (Leroy Merlin)
  • Diverse multi-language codebase (Dafiti, Empiricus, Delivery Center, Quintoandar, PicPay)
  • Single language codebase across org (Creditas, Leroy Merlin, Loggi)

SRE practices

  • Developers can autonomously deploy to production (Hash, Creditas, Empiricus, Dock, Natura, PicPay)
  • Product teams are on-call to support their product (TOTVS, Leroy Merlin, Dock, Delivery Center, Quintoandar, PicPay)
  • Product teams participate directly in platform/tooling matters (Hash, Loggi, Delivery Center, Quintoandar)
  • Feature flags enable safer rollouts (Leroy Merlin, Loggi)
  • Emulate Google SRE principles to an extent (Hash)
  • Deployments must pass the readiness checklist (Hash)
  • SRE teams set alerting based on the product team’s specifications (PicPay)
  • SRE teams responsible for after-hours incidents (PicPay)
  • Break SRE work into multiple streams, e.g. observability (PicPay)
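
To make the readiness checklist practice listed above more concrete, here is a minimal sketch of a deployment gate. The case studies do not describe the items on Hash's checklist, so every field name below is a hypothetical placeholder, and the gate logic is an assumption about how such a check could work rather than a description of any company's pipeline.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReadinessReport:
    """Hypothetical checklist items; a real checklist is defined by each team."""
    has_health_check: bool
    has_dashboard: bool
    has_alerts: bool
    has_runbook: bool
    has_rollback_plan: bool


def failed_items(report: ReadinessReport) -> list[str]:
    """Return the names of the checklist items that are not satisfied."""
    return [name for name, ok in asdict(report).items() if not ok]


def can_deploy(report: ReadinessReport) -> bool:
    """Allow the deployment only when every checklist item passes."""
    return not failed_items(report)


if __name__ == "__main__":
    report = ReadinessReport(
        has_health_check=True,
        has_dashboard=True,
        has_alerts=False,
        has_runbook=False,
        has_rollback_plan=True,
    )
    print("Deploy allowed:", can_deploy(report))
    print("Missing:", failed_items(report))
```

Returning the list of failed items, rather than only a yes or no, tells the deploying team what to fix before trying again.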

My brief analysis

Platform

Kubernetes – It shouldn’t be surprising that Kubernetes (aka K8s) is the platform of choice for 11 of the 12 organizations studied. Leroy Merlin was the only exception, as it is still migrating from a legacy monolith to a microservices architecture.
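
The adoption figures can be rechecked with a short script. The company lists below are copied from the Platform bullets earlier in this article; the function names are my own, chosen only for illustration.

```python
# Company lists copied from the "Platform" bullets in this article.
ALL_COMPANIES = {
    "Hash", "Dafiti", "Creditas", "TOTVS", "Empiricus", "Leroy Merlin",
    "Dock", "Loggi", "Delivery Center", "Quintoandar", "Natura", "PicPay",
}

PLATFORM_USERS = {
    "K8s": {
        "Hash", "Dafiti", "Creditas", "TOTVS", "Empiricus", "Dock", "Loggi",
        "Delivery Center", "Quintoandar", "Natura", "PicPay",
    },
    "AWS Lambda": {"Empiricus", "Dock"},
    "Istio": {"Hash", "Dafiti", "Empiricus", "Loggi"},
    "Terraform": {"Dafiti", "Creditas", "TOTVS", "Natura"},
}


def adoption(tool: str) -> tuple[int, int]:
    """Return (companies using the tool, companies studied)."""
    return len(PLATFORM_USERS[tool]), len(ALL_COMPANIES)


def non_adopters(tool: str) -> set[str]:
    """Return the studied companies that are not listed under the tool."""
    return ALL_COMPANIES - PLATFORM_USERS[tool]


if __name__ == "__main__":
    for tool in PLATFORM_USERS:
        used, total = adoption(tool)
        print(f"{tool}: {used}/{total}")
    print("Not on K8s:", non_adopters("K8s"))
```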

Lambda – A small subset of companies (2 of 12) uses Lambda functions for executing code. Its lesser use makes sense: Lambda ties you to AWS, requires learning another AWS service, and can drive up cloud costs without persistent oversight.

Istio – Service meshes seem to be getting visible use in production systems now, with 4 of the 12 organizations running Istio. I wasn’t sure they would make it, given how K8s practitioners viewed them in 2020.

Terraform – No surprise that Terraform is used by a third of these organizations (4 of 12) for its infrastructure provisioning capability.

Monitoring

As expected, there is a fair spread of tooling used for monitoring. Nonetheless, the open-source Prometheus and Grafana combination holds the largest share, reflecting the wider CNCF community uptake.

I am curious whether any of these use commercial implementations of Grafana.

The market among commercial monitoring tools (NewRelic, PagerDuty, Datadog) is divided, with two or three companies taking up each offering.

Codebase

Of the organizations whose codebase was described, five run multi-language codebases and three standardize on a single language.

Organizations with a single language may have promoted their hiring brand as a single language shop or hired from within a network of engineers who all worked in that one language.

There are 2 ways to look at language diversity in multi-language organizations:

  1. Language popularity
  2. Different service types

Regarding language popularity – some organizations scaled up while some languages were gaining popularity and others were fading; their engineering workforce would reflect this transition.

Regarding different service types – some services are better suited to certain languages than others, e.g. Golang for infrastructure services, Python for data scripting, C for Linux services, JavaScript for frontend, etc.

Feature flags seem to have a way to go before they are widely discussed and adopted. That may be because they add another layer of complexity to deploys.
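
For readers who have not worked with feature flags, here is a minimal sketch of a percentage-based rollout, which is one common way flags make releases safer. It is not how Leroy Merlin or Loggi implement theirs (the case studies don't say); the flag name `new-checkout` and the function names are hypothetical.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Enable the flag for roughly `rollout_percent` percent of users."""
    return bucket(flag, user_id) < rollout_percent


def checkout(user_id: str, rollout_percent: int) -> str:
    # The new code path is deployed but only switched on for some users.
    if is_enabled("new-checkout", user_id, rollout_percent):
        return "new checkout flow"
    return "old checkout flow"


if __name__ == "__main__":
    for percent in (0, 10, 100):
        enabled = sum(
            is_enabled("new-checkout", f"user-{i}", percent) for i in range(1000)
        )
        print(f"rollout {percent}%: enabled for {enabled} of 1000 users")
```

Because the bucket is derived from a hash rather than a random draw, each user keeps the same experience between requests, and raising the percentage only ever adds users. The extra branch in `checkout` is also a small picture of the added complexity mentioned above.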

SRE practices

I found a few common threads but many unique practices too. It seems that how an organization practices software reliability remains unique to its environment.

I’m sure a lot more common practices exist, but they are not as obvious to discuss as, say, “We run EKS with Redis and Kafka monitored by NewRelic”.

Many organizations were open to letting developers push to production autonomously, while also making them responsible for any resulting incidents.

Some organizations, like Quintoandar, additionally incentivize their developers to sit on incident call rotations.

It was also interesting to note that many organizations have their developers actively participate in platform and tooling matters.

Here are interesting examples of how SREs are utilized:

  • as consultants to support incident capability – Delivery Center’s SREs are responsible for structuring the product team’s on-call schedules
  • as last-resort support, to be called upon to solve critical or tricky problems
  • to create a center of excellence for highly secure, reliable, and performant software practices, as at Natura

Only one organization made SREs responsible for after-hours incidents.
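
As a rough illustration of that split, the sketch below routes an incident to a team based on when it occurs. The business hours, the weekday rule, and the team labels are all assumptions made for this example; the case study only tells us that the SRE teams cover after-hours incidents.

```python
from datetime import datetime, time

# Assumed business hours; the case study does not specify them.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)


def on_call_team(incident_time: datetime) -> str:
    """Return the (hypothetical) team label that should be paged."""
    is_weekday = incident_time.weekday() < 5  # Monday=0 ... Friday=4
    in_hours = BUSINESS_START <= incident_time.time() < BUSINESS_END
    return "product-team" if is_weekday and in_hours else "sre"


if __name__ == "__main__":
    for moment in (
        datetime(2024, 1, 3, 10, 0),  # Wednesday morning
        datetime(2024, 1, 3, 22, 0),  # Wednesday night
        datetime(2024, 1, 6, 10, 0),  # Saturday morning
    ):
        print(moment.isoformat(), "->", on_call_team(moment))
```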

Concluding words

I trust that this brief analysis of Andrios’ SRE team case studies has given you a new or clearer perspective on how organizations can set up their platform and SRE service culture.

Ash Patel