Cloud infrastructure teams in detail
The image above is an original SREpath summary of Will Larson’s “trunk and branch model”.
It outlines the evolution of teams focused on cloud infrastructure provisioning and management.
While the original piece is very erudite, I thought I’d put my own spin on it to support your understanding of this valuable concept.
Teams evolve from an early stage, through growth-phase chaos, to a mature and stable state.
The final mature stage is arguably the (lower-stress) goal of every manager responsible for cloud deployments.
The typical cloud infrastructure initiative starts simply with one team that handles almost everything.
This is mostly generalist work, such as provisioning virtual machines on AWS and responding to incidents when outages occur.
Over time, as you get more users and/or develop more intensive applications, you’ll start noticing prickly problems emerging all over your cloud footprint.
The original team handling the overall system might not have the mental bandwidth or skills to resolve these emerging issues.
Most “trunk teams” (the earliest form of cloud infra teams) are proficient in setting up a few of the core AWS, GCP, or Azure services.
Taking AWS as an example, that might mean they know how to run EC2 instances with S3 storage and VPC networking.
Anything beyond that will pose a challenge or even be impossible for them.
For example, data may need to be stored in distributed databases – a highly complex situation.
AWS alone has a suite of 200+ services to serve very specific use cases, most of which would be beyond the scope of a generalist trunk team.
This is when cloud infrastructure managers may add branch teams to the main trunk to solve specialized problems like internal workflow blockers and recurring outages.
Examples of branch teams may include storage, compute, chaos, and performance management.
Site Reliability Engineers may join these specialist teams as they are specialists in handling complex software systems. Their scope would cover reliability, scalability, and more.
Branch teams may also have specialized knowledge to identify which of today’s minor problems will become tomorrow’s big problems.
This level of ability and knowledge means that branch teams don’t normally participate in routine provisioning work or incident response.
They need their beauty sleep to work out the next evolution of the application’s cloud infrastructure.
Nonetheless, branch team engineers may occasionally rotate in and out of trunk teams on temporary assignments.
They may use this opportunity to identify issues that can be resolved with their specialist expertise.
A mature cloud infrastructure organization will have the thinnest possible trunk of engineers.
This thin trunk might be branch engineers rotating in and out to do on-call work (because it’s not glamorous work).
At the same time, there may be many specialized branch teams solving long-range problems.
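To make the thin-trunk rotation concrete, here is a minimal sketch of how on-call weeks might be spread across branch teams. It is an illustration only, not part of the trunk and branch model itself: the function name `build_trunk_rotation`, the team names, and the engineer names are all hypothetical, and the one-engineer-per-week cadence is an assumption.

```python
from itertools import cycle, zip_longest


def build_trunk_rotation(branch_teams: dict[str, list[str]], weeks: int) -> list[tuple[int, str, str]]:
    """Return (week, team, engineer) slots for a hypothetical trunk on-call rota.

    Engineers are interleaved across branch teams so that consecutive weeks
    draw from different teams where possible, then the order repeats.
    """
    # One list of (team, engineer) pairs per branch team.
    per_team = [[(team, eng) for eng in engineers] for team, engineers in branch_teams.items()]

    # Take the first engineer of every team, then the second, and so on.
    # zip_longest pads shorter teams with None, which we drop.
    order = [slot for column in zip_longest(*per_team) for slot in column if slot is not None]
    if not order:
        return []

    # Repeat the interleaved order until every week has someone on call.
    return [(week, team, eng) for week, (team, eng) in zip(range(1, weeks + 1), cycle(order))]


if __name__ == "__main__":
    # Hypothetical branch teams and engineers, purely for illustration.
    teams = {"storage": ["ana", "bo"], "compute": ["cy"]}
    for week, team, engineer in build_trunk_rotation(teams, weeks=4):
        print(f"week {week}: {engineer} ({team}) covers trunk on-call")
```

Interleaving by team, instead of working through one team at a time, keeps any single branch team from losing several engineers to trunk duty in a row, so its long-range work can carry on.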