Analysis of SRE and platform setup at 10+ tech companies – Boost software reliability

In this article, you will see a breakdown of the platform setup and SRE practices within 12 non-FAANG technology companies.

This is based on the case studies by Andrios Robert.

“There is a lot of content available on how Google did [Site Reliability Engineering]; let’s uncover what happens with the rest of the world.”

— Andrios Robert, founder of Runops.io

You might be thinking, “Why don’t I read Andrios’ writeups instead of this?” You should at some point, but chances are that you are pressed for time.

I have trawled through Andrios’ case studies and pulled together common threads, which I’ll explain in my analysis toward the end of this article.

Let’s first cover a few specifics about the companies being covered:

growth or late-stage startups and mid-sized companies — no FAANG or early-stage startups are represented below
they are all Brazilian companies — founded there or having a branch there — but their experiences should still translate across to other regions

Expanding on the last part:

Yes, these companies are based in a faraway region. However, I believe their experiences can highlight what companies in similar industries would be doing elsewhere in the world.

I doubt there is an appetite in Latin America to reinvent the wheel when serving cloud applications similar to what we see elsewhere.

Let’s have a business-level picture of the companies in a table format:

Company	Industry	What they do	Founded	Funding/Revenue
Hash	FinTech	Brazil’s Square	Feb 2017	$58.7m, Series C
Dafiti	eCommerce	Fashion store for all	Nov 2010	$250m, Series C
Creditas	FinTech	Consumer loans	Apr 2012	$1.1b, Series F
TOTVS	ERP	Enterprise systems	Sep 1983	$3.8b market cap
Empiricus Research	FinTech	Investment tracker	2009	Unknown
Leroy Merlin (Brazil)	eCommerce	Home/Hardware	1923	$84m revenue 2020
Dock	FinTech	Banking APIs	2014	$280m funding
Loggi	Transport	Delivery app	2013	$507m, Series F
Delivery Center	FoodTech	Delivery app management	2016	Unknown
Quintoandar	Property	Long-term rental marketplace	2012	$700m, Series E
Natura	eCommerce	Beauty shop	1969	$9.6b revenue in 2021
PicPay	FinTech	Neobank	2012	Acquired for undisclosed sum

What they do and Funding/Revenue data obtained from Crunchbase

As you can see, these are not early-stage startups or small businesses.

Now let’s collate the common DevOps and SRE elements across these organizations:

Platform setup

Platform

K8s (Hash, Dafiti, Creditas, TOTVS, Empiricus, Dock, Loggi, Delivery Center, Quintoandar, Natura, PicPay)
AWS Lambda functions (Empiricus, Dock)
Istio (Hash, Dafiti, Empiricus, Loggi)
Terraform for IAC (Dafiti, Creditas, TOTVS, Natura)

Monitoring

Prometheus + Grafana (Hash, Creditas, TOTVS, Empiricus, Quintoandar, Natura)
NewRelic (Dafiti, Creditas, Delivery Center)
Datadog (Empiricus, Dock)
PagerDuty (Empiricus, Natura)

Codebase

Pre-microservices monolith (Leroy Merlin)
Diverse multi-language codebase (Dafiti, Empiricus, Delivery Center, Quintoandar, PicPay)
Single language codebase across org (Creditas, Leroy Merlin, Loggi)

SRE practices

Developers can autonomously deploy to production (Hash, Creditas, Empiricus, Dock, Natura, PicPay)
Product teams are on-call to support their product (TOTVS, Leroy Merlin, Dock, Delivery Center, Quintoandar, PicPay)
Product teams participate directly in platform/tooling matters (Hash, Loggi, Delivery Center, Quintoandar)
Feature flags enable safer rollouts (Leroy Merlin, Loggi)
Emulate Google SRE principles to an extent (Hash)
Deployments must pass the readiness checklist (Hash)
SRE teams set alerting based on the product team’s specifications (PicPay)
SRE teams responsible for after-hours incidents (PicPay)
Break SRE work into multiple streams e.g. observability etc (PicPay)
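
The case studies do not spell out what is on Hash's readiness checklist, so the following is only a minimal sketch of how a "deployments must pass the readiness checklist" gate could work. The `Service` fields and the three checklist items are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical readiness facts about a service awaiting deployment."""
    name: str
    has_runbook: bool
    has_alerts: bool
    has_rollback_plan: bool


# Hypothetical checklist items; a real checklist would be defined by the team.
CHECKLIST = {
    "runbook written": lambda s: s.has_runbook,
    "alerts configured": lambda s: s.has_alerts,
    "rollback plan documented": lambda s: s.has_rollback_plan,
}


def failed_checks(service: Service) -> list[str]:
    """Return the labels of every checklist item the service does not meet."""
    return [label for label, check in CHECKLIST.items() if not check(service)]


def can_deploy(service: Service) -> bool:
    """A deployment is allowed only when no checklist item fails."""
    return not failed_checks(service)


payments = Service("payments", has_runbook=True, has_alerts=False, has_rollback_plan=True)
print(can_deploy(payments), failed_checks(payments))
```

Returning the list of failed items, rather than a bare yes or no, tells the developer what to fix before retrying the deploy.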

My brief analysis

Platform

Kubernetes – It shouldn’t be surprising that Kubernetes (aka K8s) is the platform of choice for 11 of the 12 organizations studied. Leroy Merlin was the only exception, as it is still migrating from a legacy monolith to a microservices architecture.

Lambda – A small subset of companies (Empiricus and Dock) uses Lambda functions for executing code. The lower uptake makes sense: Lambda ties you to AWS, means learning another AWS service, and can drive up cloud costs without persistent oversight.

Istio – Service mesh seems to be getting visible use in production systems now. I wasn’t sure if it would make it based on how K8s practitioners viewed it in 2020.

Terraform – It is no surprise that Terraform shows up at four of these organizations for infrastructure provisioning.

Monitoring

As expected, there is a fair spread of tooling used for monitoring. Nonetheless, the open-source Prometheus and Grafana combination holds the largest share, reflecting the wider CNCF community uptake.

I am curious whether any of these use commercial implementations of Grafana.

The market among commercial tools (NewRelic, Datadog, PagerDuty) is divided, with two or three companies taking up each offering.
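
To make the shares above concrete, here is a small sketch that tallies the monitoring lists from the breakdown earlier in this article. The company lists are copied from that breakdown. The function names are my own, and the total of 12 assumes every studied company counts toward the denominator even when no monitoring tool was listed for it.

```python
from collections import Counter

# Company lists copied from the "Monitoring" breakdown earlier in this article.
monitoring = {
    "Prometheus + Grafana": ["Hash", "Creditas", "TOTVS", "Empiricus", "Quintoandar", "Natura"],
    "NewRelic": ["Dafiti", "Creditas", "Delivery Center"],
    "Datadog": ["Empiricus", "Dock"],
    "PagerDuty": ["Empiricus", "Natura"],
}
TOTAL_COMPANIES = 12  # assumption: all studied companies form the denominator


def adoption_share(tools, total):
    """Return (tool, count, share) tuples, most-adopted tool first."""
    rows = [(tool, len(users), len(users) / total) for tool, users in tools.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def multi_tool_companies(tools):
    """Return companies that appear under more than one tool, sorted by name."""
    counts = Counter(company for users in tools.values() for company in users)
    return sorted(company for company, n in counts.items() if n > 1)


for tool, count, share in adoption_share(monitoring, TOTAL_COMPANIES):
    print(f"{tool}: {count}/{TOTAL_COMPANIES} ({share:.0%})")
print("More than one tool:", multi_tool_companies(monitoring))
```

The second function highlights that some companies appear under several tools, so the per-tool counts overlap rather than adding up to 12.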

Codebase

Of the organizations whose codebase was described, five run a multi-language codebase and three standardize on a single language.

Organizations with a single language may have promoted their hiring brand as a single language shop or hired from within a network of engineers who all worked in that one language.

There are 2 ways to look at language diversity in multi-language organizations:

Language popularity
Different service types

Regarding language popularity – some organizations scaled up while some languages were gaining popularity and others were fading, and their engineering workforce reflects this transition.

Regarding different service types – some services may function better in one language than in others, e.g. Golang for infrastructure services, Python for data scripting, C for Linux services, JavaScript for frontend, etc.

Feature flags seem to have a way to go before they are widely discussed and adopted. That may be because they add another layer of complexity to deploys.
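
For readers who have not worked with feature flags, here is a minimal sketch of a percentage-based rollout. None of the case studies describe how Leroy Merlin or Loggi implement their flags, so the function names, the `new-checkout` flag, and the hashing scheme are all assumptions made for illustration.

```python
import hashlib


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user inside or outside a percentage rollout."""
    if rollout_percent <= 0:
        return False
    if rollout_percent >= 100:
        return True
    # Hash flag + user so each flag gets its own stable bucketing of users.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < rollout_percent


def checkout(user_id: str, rollout_percent: int) -> str:
    """Hypothetical call site: the flag decides which code path runs."""
    if is_enabled("new-checkout", user_id, rollout_percent):
        return "new checkout flow"
    return "old checkout flow"


print(checkout("user-42", 0))
print(checkout("user-42", 100))
```

The `if` branch in `checkout` is the extra layer of complexity mentioned above: both code paths have to be kept working until the flag is removed. In exchange, raising `rollout_percent` gradually exposes the new path to more users without a redeploy.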

SRE practices

I found a few common threads but many unique practices too. It seems that how an organization practices software reliability remains specific to its environment.

I’m sure a lot more common practices exist, but they are not as obvious to discuss as, say, “We run EKS with Redis and Kafka monitored by NewRelic”.

Many organizations were open to having developers autonomously push to production, but also made them responsible for any incidents that resulted.

Some organizations, like Quintoandar, additionally incentivize their developers to sit on incident call rotations.

It was also interesting to note that many organizations have their developers actively participate in platform and tooling matters.

Here are interesting examples of how SREs are utilized:

as consultants who support incident capability – Delivery Center’s SREs are responsible for structuring the product teams’ on-call schedules
as last-resort support, called upon to solve critical or tricky problems
as a center of excellence for highly secure, reliable, and performant software practices, like at Natura

Only one organization, PicPay, made SREs responsible for after-hours incidents.

Concluding words

I trust that this brief analysis of Andrios’ SRE team case studies has given you a new or clearer perspective on how organizations can set up their platform and SRE service culture.

Author

Ash Patel

Reliability Nut at SREpath

Ash has an unhealthy obsession with software reliability. Maybe it’s got to do with the trauma of working at a few companies where software kept slowing or went down while he worked to turn it around. His ma hopes that he can one day turn this passion into a respectable job or business. Still waiting…
