#14 Faster Incident Resolution through Data-Driven Notebooks (with Ivan Merrill) – Boost software reliability

Episode 14 [SREpath Podcast]

Ash Patel interviews Ivan Merrill, head of solutions engineering at Fiberplane.

Ivan shares insights about making sense of the big data that comes from observability and incident response, to improve learning and drive faster incident resolution in the future.

He also sheds light on the importance of fostering collaboration and sharing knowledge within the Site Reliability Engineering (SRE) community, emphasizing how open communication and information exchange can lead to faster incident resolution and more effective post-incident retrospectives.

We also take a quick detour halfway in to learn more about what Fiberplane and Autometrics can do to turn vast data into actionable insights for quicker incident resolution and more insightful retrospectives.

Episode Transcript

Ash Patel: Great to have you on Ivan, on this informal talk about Fiberplane and how you contribute to the SRE space. So, to get started, tell me about you, your company, and how both help Site Reliability Engineering and the engineers that work in this space.

Ivan Merrill: Firstly, hello, thanks for having me. I’m Ivan Merrill and I work at Fiberplane. So what does Fiberplane do? Fiberplane make collaborative notebooks for SRE. We have a product called Fiberplane Studio, which aims to integrate a few other tools and provide this kind of notebook concept, maybe taking a little bit from data science and things like that, to provide a single place where SREs can work together.

So think of highly collaborative tools, at a simple level Google Docs or Notion or something like that, but applied to the concept of SRE. From my point of view, the thing that excited me about this was that I’ve been on so many incidents where every team is working in silos. They’ve all got their own data that they’re looking at, and it’s difficult for them to share that within an incident. Maybe they’re just posting screenshots or links into a chat. Maybe they’re not sharing anything, which can breed distrust amongst teams about what people are telling each other.

There’s this concept that I came across a while ago called the watermelon effect, where you have an incident and every team says everything looks green here, everything looks green here, everything looks green here. So, you know, it’s green on the outside, but actually, it’s red underneath. I really like that analogy at the moment. That’s my favorite analogy. And I think that allowing teams to do everything in the open and share everything with each other helps get past that situation.

So, that’s Fiberplane. We’ve also got an open source micro framework called Autometrics which is designed to help teams instrument their code. And we can maybe cover that in a little bit. But I’ll move on to me. So my background is in monitoring and observability.

I worked for some pretty large financial institutions running monitoring teams and implementing tools on some very important banking systems. And I started getting into SRE particularly when I started people leadership and wanted to build sustainable processes for supporting the tools that my team was responsible for. So now I’m passionate about it and trying to bring some new ideas, new thinking around some of the principles that we follow in SRE, particularly in the incident and monitoring space, to people. I’ve done a few conference talks around more scientific approaches to incidents and how we can change the paradigm for monitoring and instrumenting code to make it more accessible for developers.

I’m particularly interested in the human side of SRE. Not just the tooling and the practices, but as I said, the sustainability of what we do and making it good for people. In a nutshell, that’s Fiberplane and that’s me.

Ash Patel: So you said you’ve worked in the financial space for a while.

That would be an interesting place to bring SRE into. There are a lot of places that are working on it. And in the UK, I’ve noticed that there’s a huge interest in the space, and people are actually quite open about it. That may not necessarily be the case in other regions. What would you have to say to them if they were saying, Hey, look, I don’t think we need to worry about this reliability thing. We’ve been okay without it for so long?

Ivan Merrill: I think it’s kind of an interesting one, right? Because if you look at financial institutions, they have always cared about reliability. I mean, they have to. They are regulated. Companies will get in trouble with the regulators if they lack reliability.

And that’s painful for them. And, you know, there’s huge reputational risk for failing to meet their reliability goals. And, historically, a lot of them have been performing some really good SRE type practices for a long time now.

Things like switching over to secondary sites and continual practices of disaster recovery and all these kind of things are actually if you think about it, really important stuff that they’re doing to ensure reliability, so I think they’ve always been interested in reliability, maybe not kind of necessarily having a specific role.

You know, the SRE role and everything else like that. But what SRE does, as a core set of practices, is allow people to really understand what it is to be reliable, and for the first time it codifies a lot of best practice and a lot of really good ideas that they can adopt. I don’t think you need to get into whether or not they have the role, as I said.

It’s more about making sure that they do have these ideas. They’re probably going to adopt these ideas and everything else like that.

It’s a long rambling answer to say, you know, banks and all financial institutions have always been interested in reliability, maybe not in this kind of codified way. But I think as soon as they see that other companies are having success with it, they’re naturally going to go towards it.

Because, as we all know, reliability is a product’s number one feature, right? So they are at the top of the scale, pretty much, in terms of organizations that care about reliability.

What do you think? Because you said you’ve noticed the SRE role kind of have a greater adoption in the UK.

I think that’s kind of global, not just the UK.

Ash Patel: I think it really helps in the UK that it’s a very open network. People are communicating ideas with each other a lot more often.

I find in some other regions there is a bit more closed offness about how we can actually bring about things. And so I’ve noticed that people are traveling from other regions to the UK to actually try and learn what actually is happening, or how they can do things. Because they’re not learning from their peers within wherever they are.

It’s been one of those learning experiences where I’ve noticed that a lot of people are traveling here, and I wanted to figure out why. I’ve had some amazing conversations in London in particular. I’ve had some great conversations in the U.S. as well, but there’s a degree of openness that I was quite surprised by in London.

Ivan Merrill: I think the UK is both a center for technology and a center for banking, right? So it’s one of the leading places for both of these things.

So you are going to find a lot of experts in this place and a lot of people who are really passionate about it, who want to share what they’re doing. And I think that probably does make it more open, right? Which is a really good thing.

I think there are some interesting ideas around SRE best practice sharing and stuff like that going on right now. I think it’s a sign of maturity in this space. When I first started getting into it, it was like we read the book, you know, the famous book, we understand the practices, and then we see the tooling start to follow. Looking at the types of talks that we’re now seeing come into conferences in this space, they are moving more into the human side of things and everything else like that, which is really important. It’s really, really great to see.

Ash Patel: The socio technical aspects. I believe that’s what it’s called, right?

Ivan Merrill: Exactly. So it’s something that I’ve been fortunate to do a bit of research in and kind of look into. I’ve been really, really fascinated by this idea that Richard Cook was the one that kind of talked about this above the line, below the line thing, whereby, you know, humans are part of the system, right? These socio technical systems complex ones, humans are part of it.

And the way I like to think about it, the way I present it in conferences, is like, how long would your system run without any human intervention? What would happen if you still had users, but no one from your organization was allowed to touch it? How long would it run?

How long would it be secure? All these things. So we are part of it, right? And I think that it’s important that we recognize that we are above the line. We interact with the actual infrastructure, we update the code, we do releases and deployments. We are the bit of the system that introduces change, and it’s right that we think about the human processes and practices and make sure those are sustainable and reliable.

If we look after our humans, they’re more likely to be reliable humans. In a simple way, I mean someone that’s had more sleep because they’re not constantly being woken up by on-call, someone that’s not stressed from having too much work on, someone that’s not feeling the pressure because there’s a lack of psychological safety in their environment and they’re going to get blamed for doing something wrong. The people that are in a good place are likely to be more reliable.

And there’s plenty of psychological research to show that you are more likely to make better decisions when you’re happier, healthier, not under stress, have had a good night’s sleep and all those things. It’s really great that we are getting there.

I think that’s why collaboration, from the Fiberplane point of view, was something that grabbed me, because for me it is this intersection between technology and humans that is the most interesting thing.

It’s humans that are working together, right? Technology is supposed to be here to enhance our lives, to improve our lives, not to take over them or consume them. And looking at it from that angle, this ability to make it simpler for humans to interact with each other is a really interesting thing for me.

There’s still plenty to go. We’re still pretty new, still pretty early in the SRE space, certainly nowhere near as mature as other areas. I mentioned Dr. Richard Cook’s name, and he comes from a medical background, and you’ve also got aviation and everything else like that.

So these industries have far more experience in this space and are much more mature, but we’re getting there.

Ash Patel: You used to be a teacher, right? Because you mentioned all these professions and I just remembered you mentioned in your talk that you were a teacher. So I would love to understand how those experiences have shaped you into how you understand the world and how you think about SRE in the way that you do.

So do you think there are some parallels between your experience as a teacher and what you do now?

Ivan Merrill: I’d almost forgotten I’d mentioned that. So many, many years ago when I finished my degree and I was looking for a job, I spent a year doing mainly supply teaching. I was teaching technology actually, doing the European Computer Driving Licence, which is basic kind of word skills and everything else like that.

All my family are teachers. My mom and dad are now retired teachers. My brother and sister are both teachers. My oldest and one of my closest friends is a teacher. My aunt was a teacher.

I also did some tennis coaching in the education space, same as my brother. Some of the best bits of my work now are helping others and educating others, and I really do enjoy that. What I realized about teaching, seeing the rest of my family do it, is that they had a passion that perhaps I didn’t quite have.

And for me, education is just such an important thing. I’ve always actually wanted to work in technology, so I’m going to leave teaching for the people that really have the passion for it, because it is so important. But it did reinforce to me the importance of actually training others, the importance of education within our industry.

I think that far too many times tech can be in a situation where it’s quite scary for newcomers.

There is this kind of thing of, Oh, I’m not going to get into that because I don’t know enough about it. Or think about open source projects. I saw a great talk from someone who was helping the CNCF communities and she said, please don’t think that you have to know this project inside out before you can contribute to the community. Please don’t be scared. And that’s a really good point, because there is this kind of idea that we all need to be masters in everything we do, and I think there isn’t as much of the education side as I would like to see.

It’s really, really important that we do help people because, otherwise we’re making the same mistakes over and over and over again. We’re just different people in different forms. And that would be a shame.

And I think within any organization, it really is incumbent on the people that have been there a while, that do understand the systems, to help new people understand some of the things that they’ve learned about those systems over time. You can’t share experience, but you can share expertise.

And that’s something that we need to remember. Also, make a space where people understand that it’s okay to make mistakes. From my point of view, I’ve made some pretty large mistakes. I’ve kicked thousands of people off internet banking at one point.

Some pretty bad, big errors. I’ve learned from those, but I was also fortunate at the time that I had someone, my manager, who understood and helped me and supported me in those situations.

So yeah, education. In general life, it’s a hugely important thing, but also, as I said, particularly within tech where it can appear really, really daunting, right?

Ash Patel: I’ve got some pretty controversial takes on education in this space because I am in disagreement with some people’s views on what they think an education is in this.

In that, you’re talking about sharing expertise. I feel that some people are actually hoarding expertise because they feel like it’s protecting their role if they don’t share it with someone else. And I’m telling them, no, it’s not. The more people you have with this knowledge, the more secure your role is, because if you’re the only one who’s carrying what you know, it’s no longer the case that you’re the invaluable asset.

It will look to people higher up like, well, if only one person is actually doing this job, maybe it’s not that important. I know it sounds really strange, but I’ve had conversations with senior management people who are like, well, we’ve got one or two people working on this. It doesn’t really mean that it’s a big thing for us, right?

We need to have a whole team of people who are pretty good at this for us to actually give any real attention to it.

Ivan Merrill: Yeah, but also the impact to the organization, right? If you’re one person, you can only have one person’s impact on an organization. If you can educate five people, well, now you’ve got you and maybe five other people having the same impact on the organization. So if there’s stuff that you’re good at, stuff that you believe in, help others understand it. Help others get there, and you’re more likely to make what you think is really important matter to the organization because you’ve got more skilled people.

As you say, it’s more likely to be deemed something that’s important to an organization and less likely to be deemed something that is either not important or too high risk because you don’t have enough skills in that space.

So, yeah, I absolutely agree. I’ve always found more value in enabling others than in necessarily doing things myself. I think that’s one of the joys of some of the people leadership stuff that I’ve done historically, but also of going to conferences and talking about things.

Whenever I do a talk, I try and present something that maybe people won’t have heard of. So it’s great when you get feedback of people saying, Oh, that was interesting to me. I hadn’t thought about it that way. Just try and change people’s thinking a little bit around things.

We met at a conference in London where I was talking about Karl Popper’s theory of falsifiability: how we should look to disprove things in an incident rather than prove things. It’s just a simple mindset switch to try and help people, because I know from personal experience that I’ve been on incidents where we’ve spent several hours going down an avenue of investigation.

We think it’s really good. We’re looking around for supporting evidence, and everything we seem to be finding says, yeah, this is absolutely the case, this really looks like it’s most likely to be the cause. And then, just like that, something happens and we’re like, Oh, no, that’s not it.

It can’t be the problem.

Whereas if we’d have just thought about it a bit more and thought, okay, let’s try and disprove this as early as we possibly can and keep trying to disprove it rather than prove it, we might’ve removed that as an avenue of investigation earlier and ultimately got closer to the actual problem. So things like that are things that I’ve learned from my experience. And so, I want to share with people. I think it’s important that I do and hopefully help others.

Ash Patel: And of course, your collaborative notebooks can help in achieving this, right?

Ivan Merrill: It’s really important that people do work together. Coming back to this human side, I think that incidents generally involve a number of people or even a number of teams, and everyone has their own view of the system, right? Their own mental model of the way that the system works, based on their interactions and experience and everything else like that. So maybe a developer, or someone that’s written some code contributing to one of the APIs or whatever, has a particularly good understanding of the code base of that API.

But me, as more of an observability person, I have a better understanding of it in terms of what the data that the system emits looks like, and of interpreting that into something. So, in order for us to really understand what’s going on, in order for us to really resolve an incident quickly, that’s going to take some kind of collaboration, right?

It’s going to take some of your knowledge, some of my knowledge, some of someone else’s knowledge. And historically that’s been difficult, or we’ve been using tooling that is not ideal for the situation. Chat tools are great and there’s a lot to be said for them, but they’re not designed for incident resolution.

And then, historically, we had the infamous war room or NOC. That certainly worked when everyone was in the same location in the office. But that’s not the case anymore. Increasingly, we’re seeing incidents where you’ve got many people located around the world, and your infrastructure can be located all the way around the world.

It’s hard to bring everyone together in that way. So really important that we think about technology as an enabler and what we want is collaboration. You know, the ability to bring these diverse opinions, these diverse views, these diverse mental models, diverse sets of experience together into one place.

That’s where Fiberplane comes in, really. And coming back to that education piece: when you’re working together and you’re writing a PromQL query and getting the data back and everything else, it’s actually there in the notebook and it’s recorded forevermore. So what you’re doing as you investigate is creating this incident artifact, as I like to call it.

It’s a record of what happened, so that you can learn from it later on, because you understand what people were doing. And if you have someone more junior or less experienced in this particular part of the system, or just tired and struggling to think about things, then they have a record of what was done previously.

And then you can start thinking about post incident analysis and seeing what people did as well and starting to think about avoiding things like some kind of postings and it’s easy for people who maybe weren’t involved in the incident to provide some level of judgment.

Some normative language: this should have been obvious to them if we look at the information, or something like that. And clearly it wasn’t to the people at the time, right? So why wasn’t it obvious to them? Actually having a record of what they did, the order they did it in and everything else really helps enable further learning opportunities. It avoids getting caught in this quite natural thing where the person looking back, who maybe has more experience in that particular part of the system, thinks it was obvious but isn’t able to understand the context of the decision making that was happening during the incident.

Ash Patel: It definitely beats going with Confluence and just trying to make your wikis from scratch, because you can pull so much data from different places.

Ivan Merrill: Yeah, exactly. Confluence is a great tool. In my old team, we used it heavily for documentation and things like that.

It’s a really great place to do all of that more static stuff. But again, it’s not domain specific, right? It’s not designed for SREs to integrate with their SRE tooling. What we’ve built is around integrating with Prometheus, Elasticsearch and Loki, primarily those open source tools so far.

We are seeing more now a new generation of tools coming around the incident space. Right. Historically we’ve had companies like PagerDuty become really synonymous with the incident space.

But a lot of it has been around the incident flow, the incident management process, right? How we communicate incidents out to a wider audience, but not actually how we investigate incidents. And that’s definitely where we sit. From a Fiberplane point of view, I should also mention that we’ve got an open source thing, Autometrics, which is quite interesting as well.

Ash Patel: Okay, let’s say someone’s looking at Fiberplane and they say, how is this different from any of these ChatOps tools that have been approaching us and that are integrated with Slack?

We can do an incident response with them. You would say you’re different, but it might not be clear to them from just looking at what you have to offer. How would you differentiate that? I know you’re not in the marketing and sales part of your company, but I think you can really get this across in a very elegant way, and I would love to hear it from you.

Ivan Merrill: I’m not sure if I can say it elegantly, but I’ll try. So firstly, ChatOps tools are really great and they can help you actually go through the incident flow and bring in the right people, identify who is the incident commander and all that good stuff, right? And I think that’s all important, but again, it’s around the overall managing of an incident.

It is who’s got what role within an incident, right? What that doesn’t help you with is the actual resolution of an incident, the investigation of an incident. With the managing of an incident, we can all know who’s got the right role and who’s got the right thing. That’s great, but someone does actually need to go and look at the data and work out what’s going on.

And that’s where we come in. That’s where we really excel. As I said before, we’re dealing with increasingly complex systems. To use the old analogy of cattle versus pets, when we had those old pets, there was maybe one system that we could look at, and we had a sysadmin that really understood it, and they had all the information needed to manage that system in their head. But that’s obviously long since not been the case.

It quite often takes several teams, multiple groups of people that maybe have their own data stores. With the advent of OpenTelemetry, it’s never been easier to have different teams using different systems that work best for them.

And so how do you actually bring that data together in a way that helps people work together and speeds up that incident analysis? That’s where we come in: allowing people to run Prometheus queries, run Elasticsearch queries or whatever.

They bring their metrics and their logs into one place and work together in the open. It’s removing some of the blockers that people have with this stuff, because it is possible to reach a situation where teams are interacting with each other, but because they can’t understand what’s going on with the other team, it becomes: they say their system’s fine, but how do we actually know their system is fine, right?

I don’t believe their system is fine. We’ve definitely seen that happen before, and that’s a problem, right? There is distrust there, that is very much a human thing. How can we actually help people to work in the open because fundamentally, I think that one of the biggest differentiators between an incident that’s resolved really quickly and an incident that’s taken a long time is, it’s generally going to be the people.

It’s who’s involved in that incident, how quickly you got the right teams into that incident, how quickly they can get up to speed and understand what’s gone on so far with the investigation and everything else like that. So having a tool that, as I said, is domain specific, that’s designed for people to work on these incidents all in one place and then record everything that’s happened so that you can learn from it going forward, is a really big thing for me. And as I said, that’s different from having the overall management and the role stuff defined, and major updates and stakeholder communication and stuff like that.

Ash Patel: I think it’s an important thing to do as well. That’s what I thought when your colleague Nele was talking about data and the kind of data that you can obtain from systems. She was mentioning something that we’re going to talk about in a sec, Autometrics, but I want to just get back to the collaborative notebooks, because with the notebooks, you’re managing data a lot more effectively.

Organizations love the concept of data, they just struggle to handle it. And I think that’s what excited me about what you guys are working on, and that’s why I wanted to have this conversation with you, record it, and share it with people. Because for me, it’s not about having a product that goes out and says, Hey, cool.

This is a product. It’s about people who are actually doing something interesting in the space and they have something interesting to say and do and that’s what you guys are doing.

Ivan Merrill: I think there’s a massive thing companies are doing where they talk about how much data they’re collecting. The millions of metrics. How they’ve scaled their monitoring system so big and stuff like that. And I think, well, you know, well done you, that is a really great and I’m sure complex engineering challenge that you have overcome.

But is it useful? How quickly does an engineer get to the right bit of data that they actually need?

Quickly, right?

You’ve solved the engineering problem, but that doesn’t actually give any value to your organization.

The value to your organization is how efficiently you can use that data. And, that’s to me, the more interesting thing and the thing that actually gets you to the value, right?

Ash Patel: This brings me on to Autometrics. And that was the other thing that got me going. I actually went into this breakout session. You were actually doing one of the larger sessions. And then your colleague, Nele, was in a breakout session in the conference. And I thought, yeah, okay, let me sit through this.

I’ll just see what’s up. About 10 minutes into it, I was blown away by what Autometrics can actually do. I would like it if you could explain Autometrics.

Ivan Merrill: So when we started talking about Fiberplane Studio with different companies and organizations and speaking to lots of SREs at conferences, we realized that actually there was a more fundamental problem, which definitely resonates with me, where companies are struggling to actually instrument their code properly in the first place, or well enough. Maybe developers don’t know how to do it, or aren’t interested in it, unfortunately, in some cases.

In order for companies to be able to effectively use Fiberplane Studio, they need a good data set, right? They need useful, actionable data. So how are we going to get to that place, when this one team does it really well, but this other team doesn’t?

And again, going back to my previous point, the value for your organization isn’t in instrumenting your code. It’s in being able to use the data that it provides. Autometrics is the end result of that. It’s something that we released open source. It’s completely open. We’re really pleased to see that we’ve had some other people outside of the organization begin to get involved and contribute as well, which is really, really great to see. It’s aimed at putting metrics in the hands of developers, at making, I think we’re calling it, developer-first observability, which is a really interesting concept.

So I did a talk actually this week at Kubernetes Community Days Austria, and my talk was around how the mental models that we all have are different.

The mental model that someone who’s written a large part of the system that understands the code base has, will be very different to someone like me, more on the traditional ops side of things, but the data that we generate from our monitoring systems, from our observability tools, are much more of something that I would understand, but bear little resemblance in most cases to the understanding and the mental model that a developer would have.

And that makes it more difficult for them to get involved in incidents or even to understand the value of monitoring and observability. Historically with the whole DevOps movement, we talked about this left shift about moving stuff earlier in the software development life cycle.

And if we want to do that, one of the best ways to do that is to show people in that development phase that there is value in their actions. There is value in them instrumenting the code. But it’s difficult when they have to do a huge amount of translation between their mental model, which is the functions, the modules of their code base, and the data that is being generated.

So Autometrics allows people to decorate the functions of their code base, and for every function that’s decorated, you will get three metrics. You’ll get the rate, errors and duration. So the RED metrics, as it were. Three useful metrics that you can use to understand how many invocations of each function there have been, the percentage of errors for it and how long it’s taking.

These are all designed to be Prometheus first, but they’re OpenTelemetry based and compatible, so you can actually send them wherever you want them. But the point here is that this is something that developers can understand.

These are metrics that are useful to them. They can make sense of these metrics. And you can build service level objectives (SLOs) off that, and then alerts off those SLOs, using the SRE book best practice.
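As a rough illustration of building an alert off an SLO, the sketch below computes an error-budget burn rate from an error ratio and pages only when both a long and a short window are burning fast. The function names and the two-window check are assumptions made for this example, not something Autometrics itself defines. The default threshold of 14.4 is the burn rate that uses up 2% of a 30-day error budget in one hour (0.02 × 720 hours), a value commonly used in multi-window burn-rate alerting.

```python
def error_budget(slo_target: float) -> float:
    """Allowed fraction of failed requests, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / error_budget(slo_target)


def should_page(long_window_ratio: float,
                short_window_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows exceed the burn-rate threshold.

    The long window (say, 1 hour) shows the problem is significant;
    the short window (say, 5 minutes) shows it is still happening.
    """
    return (burn_rate(long_window_ratio, slo_target) > threshold
            and burn_rate(short_window_ratio, slo_target) > threshold)
```

For a 99.9% SLO, a 2% error ratio in both windows is a burn rate of about 20, so `should_page(0.02, 0.02, 0.999)` returns `True`, while a 0.5% error ratio (burn rate about 5) does not page.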

So really important stuff and trying to make it much, much easier, much simpler to get to the point of value. Which is, we can do something with this data. We use Autometrics ourselves on our own code base, and if we get an alert, we can look at the SLO that’s causing that alert and say which functions are contributing to this SLO.

And then we use the labels to say which functions are called by this function. So we then have this metrics version of tracing, whereby we can go from our top level function that’s contributing to our SLO, see that that’s either got a high error rate or a particularly long latency at the moment, and go down a level to see which functions that’s calling, and see if any of those have got a high error rate.
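The walk from a top-level function down through its callees can be sketched as follows. This assumes the metrics have already been collected into a plain dictionary keyed by `(caller, function)` pairs, which stands in for the caller labels described above; the data layout, the function names and the 5% threshold are all hypothetical.

```python
from typing import Dict, List, Tuple

# (caller, function) -> (calls, errors); a stand-in for caller-labelled metrics.
Samples = Dict[Tuple[str, str], Tuple[int, int]]


def error_ratio(calls: int, errors: int) -> float:
    return errors / calls if calls else 0.0


def callees(samples: Samples, caller: str) -> Dict[str, float]:
    """Error ratio of every function that `caller` was observed calling."""
    return {
        fn: error_ratio(calls, errors)
        for (c, fn), (calls, errors) in samples.items()
        if c == caller
    }


def trace_errors(samples: Samples, top: str, threshold: float = 0.05) -> List[str]:
    """Follow the callee with the highest error ratio, one level at a time.

    Stops when no callee of the current function is at or above `threshold`.
    """
    path = [top]
    current = top
    while True:
        candidates = {
            fn: ratio
            for fn, ratio in callees(samples, current).items()
            if ratio >= threshold and fn not in path  # skip cycles
        }
        if not candidates:
            return path
        current = max(candidates, key=candidates.get)
        path.append(current)
```

Given samples where `handle_request` calls `charge_card`, which in turn calls a failing `call_payment_api`, `trace_errors(samples, "handle_request")` returns the chain of function names ending at the failing function, without the reader needing to know the code base.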

So we can actually traverse our code base pretty quickly to understand where we are seeing the issues. It’s really designed to be useful. We are not generating metrics for metrics’ sake. We are generating metrics that are designed to be used, and we spent a lot of time trying to make sure that they are genuinely useful and are presented in a way that is useful for people.

That’s what Autometrics is. It’s another thing that, certainly for me, is really exciting, because I have genuinely seen it change the way that we can troubleshoot our application. Don’t forget that the incident or the issue might not be caused by a code issue.

It might be that we find the function that’s causing us the problems is calling an external dependency or something like that. But being able to go from alert to where in the system very quickly, even without necessarily having much knowledge of the code base. I mean, I can troubleshoot an incident on our system and it’s Rust based and I do not know Rust at all.

But I can understand which function is causing the problem and have a pretty good idea about what’s going on, which is great and makes it all much more accessible to people. And again, particularly coming back to helping people at 3am when they’ve just woken up, it removes a lot of the domain knowledge that’s required.

You don’t need to understand PromQL, you don’t need to understand any of these things. You can just get the metrics that are useful and use them, which is really important.

Ash Patel: With a level of granularity that you don’t get from other tools.

This is what really got me when I was looking at the demo by Nele.

She was able to, like you said, instrument a particular function and show this function could be the cause of this problem within these thousands of lines of code, but it could be this specific function. Are there any other tools that do this?

Ivan Merrill: I don’t know. I don’t know. I think there’s some kind of profiling tools that are looking to do something around developer based stuff.

But nothing that I have seen has tried to use quite a simple idea, which is metrics based on functions, using quite simple technology. We’re just using OpenTelemetry or Prometheus client libraries, generating some simple metrics and doing this stuff.

So I’ve not seen anything. And I think, you know, the fact that it is, as I said, fully open source and everyone can use it is really great.

We’ve talked about education. I’d love to help people get to a really good place in terms of instrumenting their application.

And with no experience necessary in the observability space, have a pretty good idea of what’s going on with their application, and be able to do a level of troubleshooting without needing some external help or spending a long time understanding the many gotchas around PromQL or anything else like that.

It’s really exciting. I’m really hopeful that that’s going to take off and get some adoption.

Ash Patel: Well, that’s what got me excited, because I went into this knowing about instrumenting services, but instrumenting specific functions was something I had never seen before. I actually went to another conference in Amsterdam, the DevOps Enterprise Summit, and I told some SRE managers who were struggling with issues related to this about what the Autometrics tool does, and they couldn’t believe it. So I knew I needed to get you on to explain it like you have.

Ivan Merrill: As I said, completely open source, so people can go to autometrics.dev and have a look at it. If you do want to see it, there is an ability to run it all locally and test it on your local machine, so you can actually see it, have a little play with it yourself and understand what it’s doing.

I really do recommend it, because when we look at the SRE world, and I think this is probably the reason that I got so excited about SRE in the first place, there is Dickerson’s Hierarchy of Reliability.

What is the most foundational part of this kind of reliability pyramid? It’s monitoring. And that’s always been this thing that for many years now has been my real focus, and I’ve spent a long time and done many talks and spoken to many engineers and tried to get them excited about monitoring and everything else like that.

And it can be quite a hard sell at times, but seeing this stuff and seeing how the SRE space was calling out that this was so foundational and so needed was such an important thing to me. I really do have this massive passion for helping other people reach the situation of,

Hey, we can observe our application.

Hey, we understand what’s going on.

And I think it’s something that can work for all aspects. You know, it’s not just for production systems. It’s for all phases of your SDLC. So from the moment you’re writing your code, being able to observe it is really important.

If you are going to do some testing and maybe you’ve got a pre-prod environment, don’t be in that situation, which I’ve seen quite a few times, where something breaks in production and you go back and look at it in pre-production and the monitoring systems are showing you that, yeah, it was definitely breaking in pre-production as well.

No one was looking at the monitoring because that’s a production concern, right? So the more we can reach this situation where people understand the value of it, the better, but one of the best ways to do that is to make it easier for people to get that value, to make it just like so easy to use, so easy to get value from that it’s just silly not to.

And I think Autometrics really, really does do that. I really do strongly recommend people to check it out.

Ash Patel: I like that you brought up the fact that it’s not just about production, because it comes up with a lot of people I’ve been talking with, especially Sebastian, who cohosts this podcast.

Obviously we haven’t recorded that part yet, but we do talk about needing to do observability well prior to getting into production. And there was actually a VP of engineering who spoke about this in Toronto a couple of months ago.

It’s so important and I’m glad you brought it up.

Now, that brings me on to the next thing I want to ask you. You’ve been working with SREs for a while now. What is one of the biggest anti patterns you see in this space?

Ivan Merrill: Now I’m going to give you a monitoring observability answer because I’m from the monitoring observability space.

I think one of the biggest things is thinking that all incidents are almost the same or relying too much on a static kind of either runbook or dashboard or something like that. And not realizing that an incident is an ever evolving thing. You aren’t just having the same incident over and over and over and over again, because if you are, that’s an engineering priority thing. You should be going and fixing that.

And you’re causing your people pain through that. That’s not sustainable for your people. And not giving them a good thing. Right at the beginning of the call, I mentioned the watermelon effect.

And I think we just need to be mindful of that. Understanding that my dashboard can be green, but that does not mean that my service is fine. And we must remember that dashboards are great. They’re great at telling a particular story. And we can’t rely on one dashboard. It can’t be all things to all people.

It should be there for a specific reason. And actually in incidents, we need to constantly use new data and hone our troubleshooting skills. Don’t just think that there is something that we can read the rule book and everything is going to be fine.

Troubleshooting is a skill and it’s something that people can develop. And there is a reason that some people just seem to know what’s going on. They seem to have this kind of smell of what’s going on in an incident. And it’s because they’ve probably been involved in quite a lot of incidents.

And more specifically, probably around this particular system or service or whatever. And so their ability to create a hypothesis and disprove it very quickly is much more elevated, and they’re able to do that really, really quickly.

The quicker we can start creating good hypotheses and disproving them, the quicker we can get to a solution. And that’s not gonna happen if we’re just there thinking, well, I’m gonna look at this screen and it’s gonna tell me what’s going wrong, or I’m gonna read from a playbook or a runbook or anything else like that.

I don’t in any way want to say that runbooks aren’t great. I certainly had them when I was a leader managing a team, but we had them as a very light touch thing, like:

If you see these symptoms, you might want to consider these things.

This is how you can check these things.

Not a rigid “if you see A, do B.” That’s not going to happen.

And similarly with dashboards. Dashboards are really, really good, really powerful stuff. A great way to bring value to your data. You know, don’t expect them to be all things to all people.

That’s really hard. Build them for a specific reason. But when it comes to incidents, to troubleshooting stuff, go where the data takes you. Don’t stick to the same old thing.

Ash Patel: For 80 percent of incidents, maybe you can use the runbooks, and you can just follow that step-by-step checklist. But for 20 percent of the incidents, probably the ones that are really going to hit you hard, you’re going to need to do a little more investigating, you’re going to need a better skill set in terms of troubleshooting, and that requires building experience, learning from data, and being able to follow a path that’s not so linear.

And it’s a non linear pathway, you’re going and finding new solutions, going down new rabbit holes. And seeing where that takes you.

Ivan Merrill: Yeah, absolutely. And to follow through with Dr. Richard Cook, if you consider the fact that every time you see a problem you fix it, and you are continually fixing this stuff, then it actually takes more to break your system next time.

If you are constantly adding more resilience and reliability to your service because you know you are doing the right SRE approach, you’ve had an incident, you’ve seen it, you’ve resolved that incident and then gone back and fixed some of the underlying contributing causes for that.

Then it’s gonna take more for your system to break next time.

Which means it could be an even more catastrophic issue, so we have to bear that in mind.

With the complex systems that we’re building and that we’re evolving and upgrading and improving and adding more to, the opportunity for catastrophic failure to occur is always there. So don’t get too relaxed, but just make sure that you understand the data, that you are able to interpret it and understand what’s going on.

Ash Patel: I was thinking about some things and I was going to ask you what kind of tip you would give to SREs about their working lives. And I feel like you have answered that in a way.

Be a better troubleshooter, learn how to do that better, learn how to learn.

Or would you give another tip as well?

Ivan Merrill: I think those are really, really, really important. I think if I was going to sum it all up into something, I would say everything you do has to be sustainable. And it’s something that certainly having been an SRE, kind of team manager myself, that was something that really drove me.

The first thing I did on taking over the team was looking at how many times we get called out and what we can do to remove that.

There are some things we could invest engineering time in making it so that this particular type of issue wasn’t impacting us or enabling other teams to be more self service.

All of those things are really important because fundamentally, as SREs, we are driven to want to solve interesting problems.

I think that’s why most of us work in technology. We have the capability to solve new problems in innovative ways. And that’s what we like doing. So we talk in SRE culture about removing toil.

And that really has to happen. And I think it has to happen in all ways and to make it more sustainable for your team because ultimately the people are part of the system and we need to make sure that we always always remember that.

So make sure that everything is sustainable for you and your team.

I think if you’re doing that, then you’re pretty much on your way in terms of the whole SRE world and way of working.

Ash Patel: Great to have you on Ivan, and thank you for joining me.

Ivan Merrill: Thank you very much for having me on. It’s been great to talk about all things SRE, and I’ve really enjoyed it, so, thanks again.

Author

Ash Patel

Reliability Nut at SREpath

Ash has an unhealthy obsession with software reliability. Maybe it’s got to do with the trauma of working at a few companies where software kept slowing or went down while he worked to turn it around. His ma hopes that he can one day turn this passion into a respectable job or business. Still waiting…
