
#8 Software Reliability Ninja Who is NOT An SRE (with Pablo Bouzada)

Episode 8 [SREpath Podcast]

Ash Patel interviews Pablo Bouzada about his beliefs on software reliability as a non-SRE software engineering leader. They discuss the importance of leadership in driving effective reliability changes in the software system, as well as the challenges of providing a reliable service at video streaming giant Viaplay.

Read the Episode Transcript

Don’t want to listen right now? You can read the full conversation below.

Ash: This is an interview episode of the SREpath podcast. Pablo Bouzada is a software engineering manager at Viaplay and has a considerable interest in Site Reliability Engineering (SRE).

He will join me to share his learnings from driving SRE principles within his organization.

Thank you for joining me, Pablo. You’re an engineering manager at Viaplay, and I understand SRE is a big part of your work. Can you explain to me what Viaplay is, and what your role is specifically at the company?

Pablo: Viaplay is a Swedish streaming company. We are mainly in the Nordic market.

That means Sweden, Denmark, Finland, Norway, Iceland, and also other countries in continental Europe, like the Netherlands. My role as a manager is to support teams in working the best way they can to create the best experience for our users.

I think the main goal for the whole company is to create the best experience for our users. And this draws from Site Reliability because we are able to connect technology KPIs to demands of the users.

For instance, one of our KPIs is the time between the user hitting the play button and the user getting the first frame and being able to watch the film, the series or the Formula One race they want to see. This is, in some sense, a connection between the technology and the business.

And for me, it’s the most important part of SRE: that we are connected with the core business of the company.

It’s not something like doing technology for technology’s sake. It’s more that we are getting the best system possible for our users.

And this is fun. This is something that all the company are aligned to do together because it’s not only one team or one individual, but the whole effort of all the company.

Ash: It really is a full effort for the company to get behind something like SRE.

We met at a conference in London a couple of months ago where you spoke about SRE and culture.

But there’s an interesting fact. Your title doesn’t specifically mention SRE. Yet you have a strong relationship with SRE because you talk about the cultural aspects and you have ideas related to the concept of reliability.

How did you get into thinking about SRE and reliability practices?

Pablo: I think that the point is it’s not the title. We talk about the principles we try to achieve.

It’s more like we want to create the best environment for our developers. We want to create the best environment for our operational crew. And with that, we want to get our goal. That is to create a good streaming platform to get more users or retain the users that we have.

It’s more like that. We don’t expect to be a site reliability engineering certified company, but we want to grow in the number of users we have, also earn money, also create better content, and connect or be aligned with all that.

Sure, we have a team that has the name “reliability team”, but they are the ones that are connected with the stability of the platform in the moments when we have really crucial events.

For instance, a Formula One race or a Premier League match.

We have people that come together and check that everything is working and then if something is not working as expected, try to figure out how to solve and so on and so on.

But for sure, I think that one of the main reasons that I joined the company was the challenge that I wanted to achieve for myself as a manager to be in that kind of complex context.

We have not only static content like a film or series or something like that. But sports – that makes things really complex to manage.

In the last Formula One, we had more than 4 million users in the same race at the same time.

That’s our peak of complexity.

And for that, we work a lot with trying to improve our tools, internal tools, for sure, our system with ideas of how to improve from several points, like DevOps, platform engineering, and site reliability engineering.

And then we use those principles in the way that we work, but we don’t have Site Reliability Engineers, for instance.

We have developers in a team that care about something specific in the system. We could call them Site Reliability Engineers for sure. But we prefer to maintain focus on the thing they are working on, not on the title they have.

And for sure, my title is generic like a new manager that covers a lot of topics but we try to get more in mind that we care about the specific systems to support our main platform, or maybe a platform for other users, or tools for our internal developers, or whatever.

Ash: I agree with you on that point of not needing labels.

Even though I have a bias, having a site called SREpath.com and talking about Site Reliability Engineering, and helping people develop the role, I do agree with you that you don’t need to have that specific title to do things that are related to SRE.

100%. And a lot of companies, if they are too scared to jump into SRE, should consider it.

How long have you been thinking about things like this in terms of reliability of software? Because it’s not something that comes naturally to engineering managers, and I’ve worked with many over the years. So how long has it been?

Pablo: I think that, as you say, it is something like a journey that you start to try to improve your skills.

And trying to help the teams that you work with, to improve the way they are working. And then you get ideas from other companies, but also from the trends that we have in the industry.

That’s something that sometimes you have to take carefully because sometimes you are getting something that is so trendy in the industry or so trendy in some point of time but does not make complete sense for you.

But some companies try to do, “Okay. We want to do that because it’s the trendy thing. Then we want to follow that like fashionistas. We want to follow that wave.”

I started as a developer, worked as a backend developer for maybe more than 15 years, something like that. And then the typical path of team lead, then more like architect, and then manager.

And then I got ideas as part of the development team to introduce continuous integration, continuous deployment, DevOps and then the natural path to Site Reliability.

It’s something that you get in a natural way that you will try to improve things and then you get that idea so that you get that mindset of thinking of observability, think of reliability.

It’s like you’re doing a puzzle then you get, “Okay that piece fits on my puzzle”.

This was for me, but now I ensure that people maybe with less experience could start from the very beginning doing those things. This is a good thing that I think that we have in the industry.

We don’t have to start everything from scratch. We are moving on the shoulders of giants. And then we get people that support other companies or individuals to improve.

You work on a solution several times and then you’re sure that that gets the impact that you expect and from that, you could start to build something.

We don’t need to start every time from, “Okay, we have a server, and then we have to install Tomcat and then we have to install MySQL database, and then we have to manage all the versioning, and so on, so on, so on,” because we have a lot of tools that cover all that.

Ash: Okay, so you’re saying you’re exploring ideas on solving problems and then you come up with a solution and then you work it a few times.

And from that, if it’s actually something that’s really useful, you build on it and you make it open source and bring it out to everyone else.

Pablo: And the same with the way to work, we have a lot of companies, a lot of blogs, books, and people that are able to explain the meaning of things that are working for several times.

And then it’s something to start from. It’s not something like, “Okay, I have to understand microcomputing in order to set up a mobile application in the market”.

Ash: Yes, there would be a lot of people out there who are not even in your company willing to share ideas with you on how to solve operational challenges.

You’d have some very unique challenges in your organization.

Because you did mention that Viaplay does video streaming when you’ve got 4 million concurrent users. It may not sound like a lot to some people, but it is a lot because video is data-intensive and bandwidth-intensive.

You probably have to think a lot about capacity planning. Are there any interesting things that your company does in particular?

I know you mentioned earlier before we started this interview that you measure a lot of things. So can you enlighten the listeners about what you measure?

Pablo: Sure. We have, I think, the typical things that all the streaming companies measure like time to recovery, bandwidth, and those kinds of things. But also our operational team defines their own metrics.

And this is something interesting in the way that we work because I think we are a quite open company.

The first week of each month, the operation team releases a document with the SLAs, SLOs, and SLIs that we got the past month, and today, I have that document in my email waiting for review.

This is interesting because we are in very good numbers in our metrics.

And this is something amazing because it’s not something that only the technical people get, but the whole company gets and then it’s something that means a lot for us.

For sure, we have latency as a metric. We have the initial buffer.

Failures in the login. We have a bunch of things that we take care of and then with that, we work, the next month. We continually get that feedback from our own system.

This work is not only once per month, but this is a way that we could get trends in our system.

If we are improving something, for example increasing the SLAs or whatever, and we are at risk of getting bad numbers, then we have to push to improve something.

This is something that drives us. It’s something that is part of Site Reliability Engineering that the metrics, the data drives you to define the next steps.

It’s not something like, “Oh, I want to improve that animation that we have in the login.” Maybe this means nothing for a user. It’s not affecting any of our KPIs. But we try to improve based on data. I think this is the important thing.
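
As a rough illustration of the data-driven loop Pablo describes, a monthly report could compare each SLI with its SLO and flag the metrics whose error budget is running out. This is a minimal sketch, not Viaplay’s actual tooling: the metric names, event counts, SLO targets, and the 25% warning threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MonthlySli:
    """One service level indicator measured over a month (hypothetical shape)."""
    name: str
    good_events: int
    total_events: int
    slo_target: float  # e.g. 0.999 means 99.9% of events must be good

    @property
    def sli(self) -> float:
        # Fraction of events that met the success criterion.
        return self.good_events / self.total_events

    @property
    def error_budget_remaining(self) -> float:
        # 1.0 means no budget used; below 0 means the SLO was missed.
        allowed_bad = (1 - self.slo_target) * self.total_events
        actual_bad = self.total_events - self.good_events
        return 1 - actual_bad / allowed_bad


def at_risk(report, threshold=0.25):
    """Names of metrics with less than `threshold` of their budget left."""
    return [m.name for m in report if m.error_budget_remaining < threshold]


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    report = [
        MonthlySli("login_success", 999_500, 1_000_000, 0.999),
        MonthlySli("playback_start", 998_900, 1_000_000, 0.999),
    ]
    for metric in report:
        print(metric.name, round(metric.sli, 4),
              round(metric.error_budget_remaining, 2))
    print("needs attention:", at_risk(report))
```

Expressing the result as remaining error budget rather than a raw percentage makes the month-over-month trend easier to read: a metric that keeps losing budget is the one to prioritise next.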

Ash: My takeaway from what you just said is that to have an effective SRE team, you have to have a data-driven organization.

Netflix has a few metrics, and I’m wondering if you think about those metrics at all. They obviously are the pioneers in streaming technology, or the first people to be noticed for having a big streaming service.

That’s probably the better way to put it.

They have starts per second, the number of people who can successfully hit the play button; time to interactive, how long it takes for the contents of the actual app to load; and time to render, when everything above the fold is rendered.

Do you have any thoughts around any of those kinds of metrics?

Pablo: We see Netflix like a mirror. It’s a mirror for us. It’s something that we try to reflect on and then get similar metrics.

In the document here, we have the “joined time”.

That is the time from the user starting the application to having something interactive to use, and the same with the content, at the time that someone looks for something and then hits play.

For us, hit the play is the thing. All of our effort is set to get the content to that person in that moment.

I think the main difference that we have with Netflix, is that we have live streaming of sports. That’s more complex because we have to deal with, for instance, 4 million concurrent users in a Formula One race.

We have our metrics and we want to stay below our targets.

We measure that a lot. We get something like two seconds to show the first thing or the interaction.
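
A startup-time target like the one Pablo mentions is usually checked against a percentile rather than an average, so that a slow tail of sessions is not hidden. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the two-second target comes from Pablo’s remark above, and the choice of the 95th percentile is an assumption, not something stated in the conversation.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a list of durations (pct is 0-100)."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of
    # samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_target(startup_seconds, target_seconds=2.0, pct=95):
    """True if the chosen percentile of startup times is within the target."""
    return percentile(startup_seconds, pct) <= target_seconds


if __name__ == "__main__":
    # Made-up startup times in seconds, from app launch to first interaction.
    samples = [1.0] * 18 + [1.8, 3.0]
    print("p95:", percentile(samples, 95))
    print("within target:", meets_target(samples))
```

The same check works for the play-to-first-frame time; only the samples and the target change.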

Ash: Some people listening might be thinking I’m trying to fish to see if you are doing all the best practices like Netflix.

But what I’m trying to say is you are a data-driven organization. You are thinking about these things, but these are things that have been thought through before.

Like you said before, you’re standing on the shoulders of giants. And it’s a good idea to see what practices they’re doing and bring it in, and it makes your life easier.

Pablo: This is the point, I think.

Not reinventing the wheel constantly, but thinking about what other companies did in the past and then get that insight from them, because there are lessons that that company learned.

And I think this is something really amazing for our industry that we have people that explain how they failed in their companies. Think about that.

It is something amazing that, for example, a bank could share their internal information in that way.

There is no other industry that shares those kinds of things, but we have people sharing, “Okay, we have that and that, and we create that internal tool for doing things. And then, you know, we will open source it and then you could use it if you want.”

Because we are really good doing that and then we want to get things working in a good way. This is something amazing in our industry including things like Site Reliability Engineering.

And then you get a lot of tools that are based on that.

They share the ideas and then other companies get ideas and think, “Okay, how could I implement that and support other companies to get that running?”

What are the benefits that they will get with that?

And that proposition is something impossible to think of in other industries: that a company shares a lot of that kind of information with others and then leads that movement of, “Okay, this is a bad way to work. And we want to, as an industry, we want to work in a better way.”

And then you get DevOps, you get Agile, you’ve got continuous integration, continuous deployment. You got those kinds of things in some way for free from your competitors, you know.

It’s something that sometimes blows your mind because it’s like you are getting so much information.

But yeah, we move in that way that you go to an event, start talking about the way that other people work and then get ideas and go deep with people.

For instance, I remember at the speaker dinner of the London SRE event, I talked with people in charge of DORA metrics inside Google.

Without any problem, they shared the information they have with me. But my company is using Amazon Web Services, and they didn’t care about that.

It’s “Oh, you’re using that, oh yeah, and, what service are you using? Oh, that one or that other. Oh, nice, nice.”

And this is something. We are a big community of people that work in different companies, but we share a lot of ideas, a lot of knowledge.

This is amazing.

Ash: There is, in my opinion, no other part of an organization that really shares this much information with outsiders openly and doesn’t get all worried about it, which is such a beautiful thing.

Speaking of organizations, setting up something like reliability in an organization is a challenge.

I know this from helping so many people try and do it. What are some of the challenges that you personally faced in your organization and trying to get ideas around reliability accepted?

Pablo: I think that we are moving in the right direction because we are trying to improve our systems using that idea that comes from outside and then getting inside our mindset or our tools and so on.

This is something that I think is the best way to change things: inspire people to do things, not say, “Okay, we want to set that goal and then we want to become a DevOps company.”

That for some people makes no sense.

And then the first thing that they will struggle with is “What does Site Reliability mean for me?”

Does it mean that my position in the company is in danger?

I have to take some consideration to maintain my role, position, hierarchy, whatever. That creates a lot of friction.

But if you instead say something like, “Okay, we want to improve that SLA. And we think, for that, why not use that thing that comes from Site Reliability Engineering?”

But you don’t say specifically that we want to become a Site Reliability Engineering company.

We are using ideas that come from there in our context without giving too much consideration for where the ideas come from and instead say, “Okay, it’s something that we want to try”.

This is important when you are changing the way that your company works.

One of the things that is really important when you are managing a change in a company: don’t force, don’t push so hard, the thing you are trying to introduce, but get the space for trying things for proving hypotheses, for experiments.

It’s something like connecting dots. You want to get the final thing, but it’s not a linear or straight line.

If you said something like, “Well, we want to use Splunk for logs.” An overly strong manager might say, “This is my idea, do it”.

But it’s better to say, “Okay, wait, figure out how that tool works and how we feel using that tool and get to the next level.”

It’s something like connecting dots. It’s more like, “Okay, maybe you were able to connect dots and then see how that works”.

And if something is not working, “Okay, we could connect to a different dot”.

And with that, we will finally get our main objective, that is we want to get a tool to manage our internal data in a better way than we have now.

We have, for instance, a mess of different sources and so on.

And then why not try to do things, do experiments with data, proofs of concept, and then move on from there? Not trying to impose things and then say, “Oh, this is a tool that we have to use. And no other options”.

Okay. Maybe you could decide to start with something because you have in mind that this is a tool that will work well, but also leave space to people to manage in a different way to get there.

Maybe it’s another tool that works better.

And with that, you are managing a change in that way saying, “Okay, why not try that thing? And if it’s working, continue with that. If it’s not working, maybe we have to rethink about that. But why not try that?”

Ash: Change is such a big thing. It’s so complex. It can show up a few antipatterns.

If you’re trying to bring in a change and then you start seeing things that are going wrong.

What is the biggest antipattern that you’ve seen in the broader field of reliability in software?

Pablo: I think the main antipattern is having a culture that does not allow anyone to fail.

You are not able to change anything because people will be staying in the same point they are or they were for the last 10 years or 20 years. They want to be there and not move anything.

You have to create the idea that failure is an opportunity to learn. This is the first thing. If you are not able to work in that line, then I think this is the antipattern, trying to introduce something in a context, in an environment, in a company, that does not allow any failure.

Ash: That kind of situation is like trying to move Everest. It would feel impossible.

Pablo: But if you are able to fail in a company and you introduce experiments, you will get new things, and at least you will learn.

The best is that you will improve, but at least you will learn things and then you will improve in the next step.

Ash: You have a lot of wisdom to share with the world of SRE. So what advice would you give to new SRE managers?

And I know you don’t have that title, but you have a broader title of engineering manager.

So what kind of advice would you give to engineering managers or SRE managers trying to get into thinking about reliability?

Pablo: I think that the main thing is to create a good environment for people. And it’s not something naive. It’s, “Okay, create a good environment in which people try to do their best and try to get ideas on the table, discuss and get working from there.”

Try not to be the smartest guy in the room, “Okay, I know everything. This is the way that we have to use the tool that we have to use. This is the mindset you have to have.”

“You have to think in that way like me!”

Instead, create a space that feels more like, “Okay, I have some experience in that. Why not write that thing or we want to become better in specific things? And we think that this is a way. What do you think about that?”

Another thing is to also support people to go deep in that.

You have knowledge to support them to understand better the way that you want to work or the company wants to work. I think the important thing is that all the technical aspects could be covered.

The important thing is that people feel comfortable to say out loud and then others will hear and then compare with their own ideas and then get something that works for everyone.

Ash: Speaking of people, I’m interested to know what your people are like. What does your team look like right now?

Pablo: We try to have small teams.

We’re trying to get diverse people in teams, in all aspects, not only in the technical part, and that creates a space for talking from different points of view.

This is something that works really well when you are trying to work in complex systems because you get more points of view that get you more options and with that in mind, with that on the table, you are able to get the best solution for things.

Ash: How do you manage their ongoing performance?

Pablo: This is something that sometimes is hard, and takes a lot of time because you have to keep different things in mind. Not only the typical metrics for measuring developers but also measuring outcomes and measuring the way that the team is working.

I got an idea from Martin Fowler, that the minimum measure of a team is a team.

With that in mind, you can measure how the individuals work in that team because you will get from the team itself the way they are working.

When you measure the team, for sure, following DORA metrics or a STAR matrix will get you deeper context from the individuals, but then you will be blind to other things.

For me, it’s really important how our team is working as a whole, as a group, and also with other groups.

Because maybe you have a team that works really, really well, but overtakes all the ideas of all the other teams or gets all the responsibilities and then creates some kind of bottleneck for everything.

This is not something that works well in a company because you need that team for everything.

It’s better for the system, it’s better for the environment that the team is able to work with other teams and then trust in other teams and then create good synergies between them.

Ash: This was very insightful because you’re saying you can have people bring in technical skills, but that doesn’t mean much if you don’t organize it well.

Yeah. If you don’t have the right culture, if you don’t have the right practices in place and the right type of manager skills, which you obviously have.

Pablo, I really appreciate you coming on and giving your insights to people who are learning about the space and more broadly learning about being a better engineering manager. So thank you.

Pablo: No. Thank you – for inviting me.