#12 From Incident Firefighting to Reliability First (with Robert Ross)

Episode 12 [SREpath Podcast]

Ash Patel interviews Robert Ross, founder and CEO of FireHydrant, an incident management platform.

Robert talks about his experiences as an SRE and building tools that make developers’ lives easier.

He also shares his insights from offering incident management software to SREs and other software incident responders. Highlights include defining the broader concept of reliability, making smarter choices for handling incidents, and more.

Episode Transcript

Ash Patel: Thank you for joining me, Robert. You founded FireHydrant several years ago to deal with incident management as a service.

Can you tell me about what your relationship is with SRE?

Robert Ross: My relationship with SRE started well before FireHydrant. I was the catalyst for FireHydrant, but I was a site reliability engineer at Namely, which was a payroll benefits company.

And we were managing the systems that handled all of the HR software, the payroll software that moved a lot of money: we did well over a billion dollars a month in payroll. So that is a system that has to be reliable. I had a lot of work there.

Before Namely, I worked at DigitalOcean and was an on-call engineer there as well, so I dealt with a lot of incidents, a lot of fires.

Sometimes I was accidentally an arsonist. I started FireHydrant on the side, as a side project, to be the tool that I selfishly wanted for myself as an on-call engineer. And then, lo and behold, I met a couple of the right people at exactly the right time and was able to turn it into a company.

And we’re almost five years old now.

Ash Patel: Five years. Wow. That’s a long time.

Robert Ross: Almost.

Ash Patel: It’s been a pretty natural transition for you though, because you’ve mentioned now your history with site reliability. You worked as a site reliability engineer.

Why SRE in the first place?

How did you fall into this wonderful world of site reliability?

Robert Ross: I think the reason is that I like building tools for other developers. When I did build products for the end user in my career, I enjoyed it, and at DigitalOcean, your end user is predominantly a developer. I like building software that developers use and get value out of.

And SRE is kind of this wonderful land where you’re always building software for other developers, software that gives them safety, or elevates them to be able to do other things. They don’t have to build things like automated rollbacks for themselves. They can just build the software they want to build.

And I get a lot of enjoyment seeing those engineers be successful without having to think about certain things that I get to think about. That’s always been a happy medium for me. I get to build cool things with really cool technology and elevate other teams as well.

Ash Patel: What can FireHydrant do specifically to make an SRE’s life easier?

Robert Ross: A principle of mine is if I come on a podcast, I try not to be a salesperson as much as I humanly can. But I do think that you should know a little bit about what we’re up to. So FireHydrant: we’re a full-cycle incident management tool.

What does that mean? We go from the moment you declare an incident all the way through a retrospective (postmortem or incident analysis, if that’s your nomenclature).

We allow people to define the process that they want to have for the assembly phase of an incident, for the mitigation phase of an incident, and for the retro phase of an incident.

We do that with a feature we’ve had since 2019 called runbooks. That was the defining feature that really gave us our grasp on the market. It allowed teams to say: every time we have a Sev1 incident, we want to create a Slack channel, a Jira ticket, a Zoom bridge, and light up a bat signal.

It allowed teams to do that without having to think about anything. And while each of those things independently might only take five minutes (how hard is it to create a Slack channel or a Jira ticket?), if an engineer gets paged late at night, at the end of their workday, or while they’re out on a date with their significant other, those five minutes really do matter.
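
The runbook pattern Robert describes (declare a Sev1, and the Slack channel, Jira ticket, and Zoom bridge appear automatically) can be sketched roughly like this. The function names and URLs below are hypothetical stand-ins, not FireHydrant’s actual API:

```python
# Hypothetical sketch of automating the "assembly" steps of a Sev1 incident.
# Each helper stands in for a real Slack/Jira/Zoom integration call.

def create_slack_channel(incident_id: int) -> str:
    # In a real integration this would call the Slack API.
    return f"#inc-{incident_id}"

def create_jira_ticket(incident_id: int, summary: str) -> dict:
    # In a real integration this would call the Jira API.
    return {"key": f"INC-{incident_id}", "summary": summary}

def create_zoom_bridge(incident_id: int) -> str:
    # Hypothetical meeting URL for the incident bridge.
    return f"https://zoom.example.com/inc-{incident_id}"

def run_sev1_runbook(incident_id: int, summary: str) -> dict:
    """Run every assembly step the moment a Sev1 is declared,
    so the paged engineer doesn't have to do any of it by hand."""
    return {
        "channel": create_slack_channel(incident_id),
        "ticket": create_jira_ticket(incident_id, summary),
        "bridge": create_zoom_bridge(incident_id),
    }

artifacts = run_sev1_runbook(42, "Checkout service returning 502s")
print(artifacts["channel"])  # #inc-42
```

The point of the pattern is that the whole bundle runs in one step, so the five minutes per artifact that Robert mentions drop to effectively zero.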

The assembly time is what we really pull down for folks. Mean time to resolution changes depending on where you start measuring it, so it’s not really a great measurement on its own, but we do think that bringing assembly time down will, over time, bring down your MTTR.

So we help folks really focus on getting the right people in the room and tagging the right services that are broken, which can then bring in the right teams, because those services have owners. And we can automatically update a status page from a runbook. It takes all of the process work off the engineer’s plate and lets them think about the most important thing, which is mitigating the incident.

The way I think of it is: imagine you call 911, the United States emergency services, because you have a fire in your apartment. I live in New York City, so this is a very real thing. Imagine that fire truck has to navigate through a lot of traffic before it can get to your house to help put out the fire.

Imagine if that traffic could just disappear every time you call 911. That’s what FireHydrant does. We aim to make it as easy as possible for your fire department, your on-call engineers, whoever fights your fires, to get to the incident as fast as possible. No traffic.

Ash Patel: There are a few people who are trying to solve this problem for first responders or incident responders, whatever you want to call them.

I wanted to learn a little bit more about what the competitive space looks like to you. What are your thoughts on what the incident response market looks like? If that’s even what you would call it.

Robert Ross: I think the incident response market is colliding. There are a few different problems to be solved when it comes to incidents.

You have alerting tools that have been around for quite some time. You have tools like automatic rollbacks, with a couple of vendors in that space. Then you have the incident response tools, which are very simple: open an incident and automate some things. And then there are even retrospective tools.

There are chaos engineering tools. There are a lot of tools all focused on this concept of what I would call reliability. And I think what’s going on right now is the space is seeing that these point solutions can’t exist much longer, at least not at the scale a venture-backed company should be at by now.

So we’re going to see a lot of what I think will be a platform maneuver. We’re going to see folks go: well, we’re not an incident response tool, we’re an incident platform. We call ourselves a reliability platform a lot because we do so many things that are not in-the-moment incident work. We have status pages.

We have the retrospective parts. We have a lot of components of reliability that are not just incidents, including a service catalog.

I also think there are some older players in the space that haven’t maneuvered with the change in the way we build software and the reliability requirements that go with it.

So overall, I think what’s happening is that there are a couple of planets: vendors, each in their own specific niche, solving their own thing. And we’re going to start to see a galaxy form, similar to how we saw that happen with observability many years ago. We used to ask: what logging tool do you use?


Maybe you said Splunk, maybe you said something else, maybe you rolled your own. We don’t really say that anymore. We say: what observability tool do you use? And I think that is indicative of the path the reliability space will also take. I do think the two are big enough to stay separate.

You could make a case for observability also consuming it at some point, but I don’t think that’s going to happen in the next five years.

Ash Patel: You’ve written articles about incident management. One of them is about delineating the type of incident management we have: whether it’s centralized or not.

Can you elaborate on what you mean by centralized versus distributed incident command?

Robert Ross: When we think about centralized versus decentralized, it really comes down to: do you have a team that is specifically assigned to incident management? We can see older forms of this.

Maybe it’s a network operations center. Maybe it’s the incident command team. Some larger organizations that are more mature in the SRE mindset are just more mature around incident management teams. You can actually see a trend: if you have LinkedIn insights and you type in “incident commander”, the job openings have gone up.

A good number of companies are hiring specifically for incident commanders. Stripe has a team for incident response. Twilio has a team for incident response. What’s happening is that for these companies, incidents are such a big deal, and any amount of downtime can cost so much money, that they will hire people specifically to help manage those problems.

And then there’s the decentralized world, where you actually train all of your engineers and potentially, as one example, your technical program managers and project managers. They all get incident command training and are independently responsible for incidents for their pod.

It really depends on what stage of company you’re at to justify one or the other. And at the end of the day, it is a cost concern. If you have enough incidents at sufficient scale, it may make sense to pay someone that kind of money; they will probably have enough work. We work with some companies that have teams of over 10 incident commanders. That might seem like a lot, but they are so big, with market caps in the billions, that it makes sense for them to do that.

A lot of other companies we work with are moving more into the service ownership model. Not that the other companies aren’t, but if you’re in a service ownership model, it’s actually a lot easier to have that decentralized approach, where every team is on the hook for the incidents that come up in their services.

So I think it’s an interesting model. Which one fits your company is what we consult people on when they’re in the beginning of their journey.

Ash Patel: So if you look at the downsides of a centralized team, you’d be looking at things like those incident responders getting very exhausted from being on call. Or is it that you’d have to hire a lot of people to deal with that?

Robert Ross: Some of the organizations we work with are at that huge scale, billions of dollars, and they follow the sun. They have incident commander teams 10 hours apart from each other, and that works well for them, because they can do a very natural handoff between those teams when the hours align. At a global scale, they need to follow the sun.

Ash Patel: I figured as much, because you cannot have a bunch of people in the same time zone trying to do that. It just doesn’t work. It’s like the healthcare space, where people do night shifts and then switch to day shifts. It’s exhausting for the people working.

I mean, in healthcare, people just have to do it. But if you have an option in the tech space, you shouldn’t be asking your people to do anything like that.

Robert Ross: It’s a good point, because I think a lot of the time with incidents, we forget the human toll. If you don’t sleep, that’s bad, not only for your work relationships; it’s physically bad to not sleep.

So you shouldn’t have an incident command team of, say, three people and expect 120 hours of incident management coverage each week. You should expect much less and keep additional capacity to flex when you need it. If you’re constantly running a sprint, you’re going to burn out, and you’re going to burn that team out.

So if you are going to go for centralized command, don’t just do the math of “how many incidents do we have, times the hours we think command should spend working on each.”

You should bake in a lot of flexibility, a lot of slack.
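
Robert’s warning about naive incident math can be made concrete with a back-of-the-envelope calculation. The numbers and the sustainable-hours threshold below are illustrative assumptions, not figures from the conversation:

```python
# Illustrative staffing math: naive "incidents x hours" planning versus
# planning with a slack buffer baked in, as recommended above.

SUSTAINABLE_HOURS_PER_PERSON = 20  # incident hours/week that leave room for everything else

def commanders_needed(incidents_per_week, hours_per_incident, slack_factor=2.0):
    """Estimate incident-commander headcount with headroom for surges.

    slack_factor > 1 bakes in the flexibility Robert argues for;
    a factor of 1.0 is the naive math he warns against.
    """
    expected_load = incidents_per_week * hours_per_incident
    buffered_load = expected_load * slack_factor
    # Ceiling division: you can't hire a fraction of a person.
    return int(-(-buffered_load // SUSTAINABLE_HOURS_PER_PERSON))

print(commanders_needed(10, 3, slack_factor=1.0))  # naive plan: 2 people
print(commanders_needed(10, 3, slack_factor=2.0))  # with slack: 3 people
```

The difference between the two answers is exactly the flex capacity: the team that planned with slack absorbs a bad week without anyone running a permanent sprint.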

Ash Patel: One of the things I’ve seen a few incident responders do over time is say: well, I’m a night owl, so yeah, I’ll handle the night shift.

And then they end up realizing: okay, that was fine when I was actually just playing video games or hanging out at night, but actually being on call at that time is exhausting. Have you met people like that?

Robert Ross: I have met people who were just at a different stage of their life, I would say. I was that person.

I would take on-call shifts on the weekend or whatever it was, because I was just at a different stage of my life. I was younger, I had fewer worries and concerns, I didn’t have kids, I didn’t have pets. It was just easier for me to say yes. But now, I think it would be harder.

And I think that’s why you’re starting to see regulation come up around things like paying for on-call. The European Union is starting to push this very heavily, and I think you’re going to see that more and more; eventually we’ll likely see legislation in the United States for it too.

Because if you’re on call at 2am, even if you’re not doing anything, you’re still working. Your brain doesn’t just turn off; nobody’s brain really can. So I do think we’re going to see some change there. But there’s always going to be someone who’s more willing to take that on-call shift because they stay up a little later anyway, playing video games or whatever.

And if they get paged, it’s just easier for them to jump in, I guess. I have worked with people like that.

Ash Patel: I suppose a lot of companies, seeing the shortage of people willing to do incident response full time, be SREs, or do on-call, whatever you may call it, are finding it difficult.

So they’re considering whether this “you build it, you run it” model that became big at Netflix is a good thing, and more and more companies are taking it on. How are you finding developers on the ground responding to this kind of distributed incident response model?


Robert Ross: I don’t know if I can speak to the true emotional response of engineers at a broad scale. What I can say is that once it does pick up, we see people have a more effective response. It doesn’t mean they’re resolving things faster, but maybe the learnings are better.

Because the team that solved it, being the owners of that service, can have a much more comprehensive retrospective. They can speak about it in different terms than a team that doesn’t know anything about the service and just jumped in and fixed something.

I think “you build it, you run it” was first said in 2006 by Werner Vogels, the CTO of Amazon. And we’re seeing it happen more and more now because the tooling is allowing it to happen.

I think “you build it, you own it” has taken a little bit longer to become the standard because, one, you had engineering teams that were already in flight and had their own cultures, and shifting to a “you build it, you own it” culture is a great deal of work, energy, and cultural change. You might need to hire different people to work in that model. So because of that, we’re only just now really seeing “you build it, you own it” take broad effect.

I realize that if you’re a mid-market company or an early enterprise, you might have been “you build it, you own it” for a long time, but the financial institutions and large-scale companies like that are only just beginning that journey. And I think it helps with reliability, because you’re able to own your metrics more, own your quality more, and it points all of the accountability at that team.

And that might sound scary, but it’s honestly a better model, especially as we work in this more microservice world where everything is a monolith with moons, and those moons are the microservices that people own. You’re responsible for your territory. You’re responsible for making sure it’s 12-factor and that its production readiness checklist is checked off.

And overall, all of those patterns are going to lead to more effective incident response. The way I explain it (we’re going back to the firefighter analogy; when you start a company called FireHydrant, there’s just no escaping it, but it works here): if I call 911 right now and give them my address, they’re going to send me the fire department down the street.

They’re not going to send me a fire department from the Bronx, or Manhattan, or Queens, or Staten Island. They’re going to send a station from Williamsburg, Brooklyn. And why is that? Well, one, they’re closer. They know the area; they probably don’t even need to look at a map.

They know which streets are one-way and which are not. They know where the fire hydrants are. So it’s much better to have the station nearest to where fires break out be the one that’s always responding. Now, in the unfortunate scenario where that station is already out in the neighborhood fighting some other fire, there are two issues it has to respond to.

Then guess what? You have a neighboring station that can come in. Maybe there’s a team in your organization that understands the dependencies and has a tool chain around their services similar to yours. Maybe they can jump in while you’re dealing with something else. So I think it’s just a better model for incident response to also live in the service ownership world.

That’s why FireHydrant has had a service catalog since day one. We’ve always been able to track which teams own which services, and which changes are in there too, because change is always the biggest contributor to an outage.

Ash Patel: Aren’t there whole companies built around just that one thing: the service catalog?

Robert Ross: There are. You see some great companies out there like Cortex. We have a lot of friends there and they’re building a whole company on this because the problem is that big for companies.

Ash Patel: So how do you not tread on each other’s toes in that situation?

Robert Ross: When you look at a company like Cortex, they are very much a catalog company, and very much a scorecard company as well. They’re helping organizations build, I would say, just better software. The touch point is: you have this list of services, and when they break, you can kick off an incident in FireHydrant. They have an integration with us, and they can track all of the incidents for services in Cortex that are linked to services in FireHydrant.

So that’s the integration point today.

Ash Patel: I’m glad we clarified that, because I wasn’t sure where they finished and where you started. Well, I was, but I wanted to clarify it for people listening.

Robert Ross: Well, it goes back to the beginning, right? What’s happening in the competitive space.

We’re not competitive with Cortex, and we have no intention of being, but we’re in the same universe right now: the reliability universe. And I think you’re going to see that a lot more in the next five years.

Ash Patel: I suppose they’re not your competitors, but there are competitors, and you’re trying to carve out your own niche within the incident response space. So let’s talk about your customers, your ideal customer, to understand what they would be like. What kinds of problems are they facing when they first reach out to you?

Robert Ross: When folks first reach out to us for incident response, incident management, and all of the bits and pieces in between, we commonly hear: we don’t have a process today, or we have a process and it’s very manual. It’s also often a healthy blend of the two: we have a process, and we don’t like it because it’s too manual.

We also hear: we’re trying to put more people into incident management responsibilities, going toward that service ownership trend. So the first thing we’ll do is try to get a sense of what level of maturity they’re at. More often than not, we’ll see that folks can respond to incidents, but it’s a very manual process, or they have a homegrown tool that does one thing and they now want to upgrade it.

So we’ll help them codify that process in a FireHydrant runbook: when an incident happens, we want a Slack channel, a Zoom bridge, the talk track, all of the pieces that an on-call engineer has had to do manually in the past. That’s usually the first thing folks are trying to solve.

The second thing is that they want a central place to do it. They want it to be in Slack, with one tool as the canonical store of all of their incidents, with the retros and analytics associated with them.

You have to have some ROI behind it, so we have to provide folks with the analytics portion as well. We just did a report on this: when all is said and done, we help them get their assembly time down very fast, in some cases to almost nothing. We saw that within the first four months, you pay for our product with the savings we return to you.


So that’s reducing assembly time, and with it the level of burnout for engineers and the opportunity cost. That’s the first thing folks want to know: how we can help them with that.

Ash Patel: Do you deal with SMEs or enterprises? Is there a particular space that you’re more interested in helping?

Robert Ross: We have companies that range from 10 engineers on the platform to over 3,000.

We work with financial institutions and small businesses, from a very large bank to a very small developer tool company. There’s really no scale at which the tool doesn’t have some level of impact. I think that’s a very unique thing about this space: incidents impact any company.

You don’t need to be at any scale to have incidents. You’re going to have incidents no matter what.

Ash Patel: So where do you see your category heading in the next couple of years?

Robert Ross: I think the category is going to revolve much more around reliability and less about incidents over time.

We are firm believers that incident management is going to be a core tenet of people’s businesses in the next few years, and in many cases it already is. That is our major belief. We also think you’re going to need a consolidated platform to do a lot of this incident work, which will eventually flow into reliability.

And again, the universe is going to start to form and become a tangible thing. We think reliability is where the puck is headed, not just incidents.

The only reason we call them incidents is that they impact reliability. We care about the reliability part.

If we didn’t have incidents, we wouldn’t think about them at all. That’s an obvious statement, but the only reason we respond to incidents is that they impact our reliability.

Ash Patel: I suppose that answers the next thing I was going to ask you, where do you see SRE heading? It’s going to be more and more reliability, right?

But do you have any specific input?

Robert Ross: I don’t know if I have anything earth-shattering or innovative here. I mean, in SRE, the R is reliability, right? That’s the thing we care about. I’ll probably say what a lot of really smart folks are also saying. Charity Majors has one of my favorite quotes: the nines don’t matter if the customers aren’t happy.

My version of that is that your reliability is, unfortunately, not something you get to define. It’s something you can influence in a positive way, but you have no control over what people say about your reliability, and what they say is, I would argue, your true reliability.

For example, a lot of companies out there have SLOs and SLAs, and those are valuable in their own right, but they are not your reliability. Your reliability is what your customers say and feel about you. And there are a lot of things your product will do, or not do, that will make them feel it is less reliable.

Incidents are the biggest one. If customers try to use your product and get a 502 error, obviously they’re going to say it’s not a reliable product. But there are aspects of your product that will impact your reliability that you will never track in an SLO, never even think of, or brush off as not mattering.

Those are things like a page taking an extra 500 milliseconds to load for a subset of customers. That might never trip an SLO in some companies, but that pocket of users felt it, and it reduced your reliability score just that much. It’s like customers driving along an otherwise very smooth road who hit a pothole, then another, then a third: they’re going to say this city’s roads are not smooth, even though they just had five miles of smooth road. The moment they hit three potholes, it’s “these roads aren’t great.” So you have to think about how people feel using your product. That’s your reliability.

Everything we’re building at FireHydrant, from now to forever, is to help your end users feel like your product is reliable, because at the end of the day, that’s the only thing that matters. It doesn’t matter if you were up for 99.99% of the year if, for the five minutes they tried to use you, you were down. (For reference, it’s five nines, 99.999%, that allows only about five minutes of downtime a year.)
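
The downtime budget behind each availability target is simple arithmetic: allowed downtime is (1 - availability) times the minutes in a year. A quick check:

```python
# Allowed downtime per year for common availability targets ("nines").
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes, averaging leap years

def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR

for nines in (0.999, 0.9999, 0.99999):
    print(f"{nines:.5f} -> {downtime_minutes_per_year(nines):.1f} min/year")
```

Three nines allows roughly 526 minutes a year, four nines about 53 minutes, and five nines about 5.3 minutes, which is why a single five-minute outage can wipe out a five-nines budget from the customer’s point of view.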

I think that’s where the world is going to go. It’s going to be: how do we make users happier using our products, which will then increase that fictitious reliability score.

Ash Patel: Let’s wrap things up. What piece of advice would you give to SREs regarding their work? Having been one, is there anything you’d like to give as advice? It can be anything related to their working life.

Robert Ross: I think you will see in a good number of organizations that there has been a title change from DevOps to SRE. It’s potentially slowed down in the last couple of years, but speaking from experience: I was on the developer operations team, and then the next week I was on the SRE team.

The difference was zilch. There was no quantifiable “and now you do things differently.” It was really just a title change. Maybe it was to help with recruiting; maybe it was to, I don’t know. But if you have the title SRE, or even if you are given SRE responsibilities, I think it’s important to realize that your job is to build software.

Your job is to build software for the purpose of more reliable end user experience.

And I see a lot of teams get caught in the trap of not being able to build anything because they’re constantly firefighting. I would say really take into account how much time you’re spending doing ad hoc work versus building software.

I think Google plants their flag at 50%.

If you’re spending over 50 percent of your time toiling and not building software to help with reliability, they can hit the stop button and say something has to change: we need to hire, whatever it is. Because right now I’m not doing the engineering part of my job. I’m doing the ad hoc, duct-tape, mud-and-sticks fixing, and that’s not going to be a good thing over the long run.

So I would say take a stand and say: this is not SRE. What we are doing right now is not SRE. I have the title of SRE, but what we are doing is not SRE. And if more people feel empowered to say that, I think we’ll actually see the industry shift.

Because we talk to a lot of people; we ask, “What’s your title?” “SRE.” And then we ask what they actually do, and it’s not a lot of SRE. It’s a lot of ad hoc work.

Ash Patel: Thank you, Robert, for taking the time to talk about interesting things in SRE and incident management. Really appreciate it.

Robert Ross: You’re welcome. Thanks for having me. This was fun.