#30 Clearing Delusions in Observability (with David Caudill) – Boost software reliability

Episode 30 [SREpath Podcast]

Show notes

How critical is observability (o11y) to SRE work?

To me, observability is the core foundation practice for all other SRE practice areas. Without it, you’re flying blind.

David Caudill is not afraid of voicing controversial viewpoints about this area. But he doesn’t do it for glory. To him, it’s about driving better practices.

After all, it’s not in his interest to promote (sell) shiny objects. He runs engineering teams at Capital One, one of America’s largest banks, so he wants a functional offering more than overdone hype.

He believes that delusions are getting in the way of our success in observability. We explore some of them in this episode of the SREpath podcast.

You can connect with David via LinkedIn

More about our conversation

David’s stance is simple: observability itself is not bad; it’s just that we are often not doing it with the right mindset! We are seeking elegant technical solutions to complicated real world problems. Notice the disconnect?

I had a chat with him a few weeks back about all of this.

Our conversation touched on ideas you likely won’t have heard anywhere else including:

➡️ David’s analogy for observability of software architectures “from monoliths to rocks, pebbles and gaseous clouds”

➡️ The need to handle cognitive load effectively when it comes to your observability system by simplifying measures (inspired by Google)

➡️ Moving toward event-based SLOs rather than leaning too heavily on time-based metrics for your observability

Let’s unpack each of these:

Observability of software architectures “from monoliths to rocks, pebbles, and gaseous clouds”

Did I mention that David likes to use fun and peculiar analogies to highlight his ideas? I love it, and you might too.

But what does he mean by this particular analogy (above)? 🤔

David put it like this (quote truncated for brevity):

“They start with a monolith, and they decide, ‘we could break this up into a few service oriented architecture’ blobs [rocks]… then go to microservices [pebbles], and then into lambdas [gaseous clouds]. Before long, you’re down to really, really intense atomicity in this architecture.”

This is a common enough pattern; I have seen the inside details of more than a few organizations that followed it. It makes sense, as microservices are the shift that cloud pundits have been pushing for a decade now.

Serverless these days is not the quiet one in the corner.

So all seems well and good. How much can atomicity hurt?

Well, it can be painful if your observability system can’t handle it.

David added that this kind of atomicity adds cardinality to observability. It adds overhead that itself needs to be observed. This overhead can start to feel as large as the application itself.

I’ve heard people echo David’s sentiment that this overhead can become a completely invisible cloud of noise around your application that has nothing to do with your application.

A cautionary tale indeed for carefully reworking your software architecture. Be sure to keep your o11y capabilities in mind when doing so!
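
To make the cardinality point concrete, here is a rough sketch in Python. The label names and counts are hypothetical and not from the episode. It only illustrates that the upper bound on distinct time series for a single metric is the product of the distinct values of each label, so every extra dimension introduced by a more atomic architecture multiplies the total.

```python
from math import prod


def max_series(label_values: dict[str, int]) -> int:
    """Upper bound on distinct time series for one metric:
    the product of the number of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label counts for the same system at two granularities.
monolith = {"instance": 4, "endpoint": 50, "status_class": 5}
microservices = {"service": 30, "instance": 4, "endpoint": 5, "status_class": 5}

print(max_series(monolith))       # 1000
print(max_series(microservices))  # 3000
```

The numbers themselves are made up; the point is that each new label (service, function, container) is a multiplier, which is how the overhead can quietly grow to rival the application.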

Handling cognitive load when it comes to observability systems

Our discussion of architectural woes in o11y drove me to bring up an area that is so critical to successful engineering work but is often neglected: cognitive load.

Observability systems are overloading engineers with too much data and things to see.

David’s response to my assertion about cognitive load made me chuckle and grimace at the same time:

“I’m very much of the opinion that you want to preserve simplicity as long as you can. There’s a lot of anticipatory architecture changes that happen really on the optimism like, ‘Oh, it’s going to blow up!’ I have worked in environments where it didn’t blow up and it was still really complex because we were prepared for a legion of people that never showed up.”

He added that engineers need to “avoid complexity like the plague” and keep things as simple as they possibly can.

Moving toward event-based SLOs rather than solely time-based metrics

Drawing from experience, David recalls his attempts at making time-based SLOs work. Over time and several failures, he learned that event-based SLOs are more appropriate in many situations.

He stresses the rationale for not using time-based SLOs through this quote:

“Because not every minute of time is the same as every other minute. And if your service goes down when no one is using it, who cares? That’s not a problem. And, you know, you’re not, in most cases, contractually obligated to provide this many minutes a month.”

Doing so can become “a really confusing side quest” that’s difficult to connect to reality. David emphasizes that SLOs are a sociotechnical construct and because of this, they need people to buy into what you’re trying to achieve.

His experience has shown that it’s a lot easier to get people behind simpler SLOs like error rate rather than time-based SLOs. For example, you could set SLOs for the number of 50x errors occurring in your services.

Your service might normally return a 0.1% error rate, but if that number doubles, it can start a conversation. The key behind doing any of this is knowing that you can’t do bottom-up with SLOs. You need senior leadership support to make it happen.
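
Here is a minimal sketch of the difference between the two styles, assuming per-minute request and error counts are available. The `Minute` type, the function names, and the traffic numbers are all hypothetical. It shows that a time-based measure scores an outage the same whether it hits a quiet period or peak traffic, while an event-based measure (the share of requests that succeeded) does not.

```python
from dataclasses import dataclass


@dataclass
class Minute:
    requests: int  # total requests served in this minute
    errors: int    # requests that failed (for example, 50x responses)


def time_based_sli(minutes: list[Minute]) -> float:
    """Fraction of minutes with no failed requests."""
    good = sum(1 for m in minutes if m.errors == 0)
    return good / len(minutes)


def event_based_sli(minutes: list[Minute]) -> float:
    """Fraction of requests that succeeded, regardless of when they happened."""
    total = sum(m.requests for m in minutes)
    failed = sum(m.errors for m in minutes)
    return 1.0 if total == 0 else (total - failed) / total


# Hypothetical window A: the outage lands on ten quiet minutes.
quiet_outage = [Minute(1000, 0)] * 10 + [Minute(2, 2)] * 10
# Hypothetical window B: the outage lands on ten busy minutes.
peak_outage = [Minute(1000, 1000)] * 10 + [Minute(2, 0)] * 10

for name, window in [("quiet", quiet_outage), ("peak", peak_outage)]:
    print(name, time_based_sli(window), round(event_based_sli(window), 4))
```

Both windows score 0.5 on the time-based measure, but the event-based measure is roughly 0.998 for the quiet outage and roughly 0.002 for the peak one, which matches David’s point that not every minute is the same as every other minute.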

Getting that buy-in can be really tough, and without it SLO efforts tend to stall. This could explain the sheer volume of failed SLOs in the industry today.

You need to invest your time in SLOs, but keep it simple. David recommends not jumping into the hype surrounding SLOs in the market.

Here are 10 more takeaways from the show:

Understand your observability billing model: Gain a clear understanding of your service provider’s billing model to avoid unexpected expenses. This helps in planning and optimizing costs associated with high cardinality data and log retention.
Employ cost optimization tactics: Tactics like sampling and selective logging can help manage costs without sacrificing the visibility needed for effective monitoring. This is particularly valuable in high-traffic scenarios where full-resolution logs are not financially justifiable.
Enhance your log retention policies: Use automated tools to manage log retention, ensuring that you’re only keeping what’s necessary and utilizing cost-effective storage solutions like Amazon S3 for long-term storage.
Make querying and data access cost-effective: Look into using tools like Amazon Athena for querying large datasets at a low cost, which enables you to access your logs as needed without incurring high storage and processing fees.
Take the time to prioritize your data: Focus on logging and monitoring the most critical aspects of your system to avoid information overload and reduce costs. This involves understanding what data is truly valuable for your specific context.
Learn to differentiate status vs. diagnostic information: Status indicators and diagnostic data often get mixed up as solving the same problem. They don’t. Simplify your monitoring by providing clear status indicators (red, yellow, green) for quick assessments and reserve detailed diagnostic data for deeper analysis.
Effectively leverage the power of metrics: Develop work metrics that reflect the actual performance and health of your platform. This approach helps in quickly identifying issues without sifting through irrelevant data.
Simplify your incident management: Aim for a “single pane of glass” dashboard that offers a straightforward view of system health, reducing the time spent identifying and debating the presence and scope of issues.
Educate your stakeholders and manage their expectations: Help your various stakeholders understand the cost-benefit analysis behind observability practices, including data logging and retention strategies, to foster support for cost-effective practices.
Go beyond the metrics mindset: Recognize the value of logs and stack traces as critical diagnostic tools that can support your monitoring and alerting. Ensure that your observability strategy integrates these elements effectively for pinpointing issues.
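
As one illustration of the sampling and selective logging tactic from the list above, here is a minimal sketch using Python’s standard `logging` module. The `SamplingFilter` class, the `checkout` logger name, and the 10% rate are assumptions for illustration, not something discussed in the episode. The filter keeps every record at WARNING or above and passes only a random fraction of lower-level records.

```python
import logging
import random
from typing import Optional


class SamplingFilter(logging.Filter):
    """Keep every WARNING-or-higher record; keep only a fraction of the rest."""

    def __init__(self, sample_rate: float, rng: Optional[random.Random] = None):
        super().__init__()
        self.sample_rate = sample_rate       # 0.0 drops all, 1.0 keeps all
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True                      # never sample away problems
        return self.rng.random() < self.sample_rate


logger = logging.getLogger("checkout")       # hypothetical service name
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.addFilter(SamplingFilter(sample_rate=0.1))  # keep ~10% of INFO logs
logger.addHandler(handler)

logger.info("order placed")     # emitted roughly one time in ten
logger.error("payment failed")  # always emitted
```

Random sampling like this trades completeness for cost, so it suits high-volume, low-value records; anything you need for diagnosis should stay outside the sampled path.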

You will not want to miss his insights (and his charismatic humor).

In Episode #30 of the SREpath podcast, David Caudill gives me his take on clearing delusions in observability.

I’m sure your team will appreciate you clearing any accidental misgivings they may have developed about observability 😁

Author

Ash Patel

Reliability Nut at SREpath

Ash has an unhealthy obsession with software reliability. Maybe it’s got to do with the trauma of working at a few companies where software kept slowing or went down while he worked to turn it around. His ma hopes that he can one day turn this passion into a respectable job or business. Still waiting…
