Renaming “post-mortems” of software outages for psychological safety

As a generative leader and mental health advocate, I am wary of seeing such a morbid term being thrown around for what should be a learning experience that advances culture.

This post will differ from my usual positive posts about Site Reliability Engineering (SRE). Please bear with me; I’m otherwise a forward thinker.

Two issues I have with the term post-mortem:

  1. It compromises the psychological safety of novice SREs
  2. It risks your job security in pathological organizations

Let’s unpack this.

Imagine being a new SRE and hearing all these fascinating terms like SLOs, observability, APM, Chaos Engineering, etc.

Then a term — typically reserved for gritty crime dramas — makes its way into the SRE lingo.

“We are doing a post-mortem on yesterday’s outage.”

A what? You and I know what it means: figuring out what went wrong after an outage or performance degradation event in the production software system.

But let’s consider others for a second.

It’s a ghastly connotation for people who are averse to negative metaphors, and even more so for those mentally scarred from seeing post-mortem scenes on TV.

Yes, they exist, but many will not be vocal about it.

I can understand the term’s origins. A lot of my friends are pure engineers and many of them have a sense of dark humor that they use to shock and delight each other.

But Site Reliability Engineering spans well beyond the figurative IT basement. It has begun to draw in diverse — in particular, neurodiverse — talent. Do we need to have lingo like this?

The other issue I have with this term is that it can put your job at risk in companies that don’t practice, and likely never will, a Westrum generative culture as outlined by Forsgren, Humble, and Kim in their book, Accelerate.

A generative culture of accepting failure will not translate well to companies where I’ve seen managers chastise people for unavoidable mistakes.

Sonja Blignaut is a complexity science thought leader and has written about the use of dark metaphors in organizations. Dark metaphors are words loaded with some other meaning that is likely to cause friction in organizational dynamics.

Here’s an excerpt from Sonja’s writing:

“… we examine the nuts and bolts of what makes a powerful learning experience.”

Nuts & Bolts … so are learning experiences like a machine?

and another one:

Looking “under the hood” to understand culture; Firing up, fixing or fine-tuning your culture. Creating a culture change “dashboard”.

So is culture like a car that can be taken apart, fixed and tuned? Again the metaphor implies predictability and mechanistic certainty.

So what will happen when a Site Reliability Engineer talks with a sociopathic manager who takes metaphors literally? “We’re doing a post-mortem on the outage,” the SRE says.

The manager will think, “Post-mortem? Sounds like something went bad, and someone must have messed it up big time,” and then say, “Okay, so who’s involved in that?”

In an act of self-preservation, many such managers will create a scapegoat for a revenue-losing or bad-PR downtime incident.

In essence, the word post-mortem is a trigger word that implies malevolent intent.

The institutional meaning will evolve into something that takes us away from what we want in all organizations — even the non-tech ones. We want a blame-free environment that lets us learn and improve systems.

On that note, I propose “Retrospective” for describing post-incident analyses.

In a world already deeply fatigued by negativity, do we need to be reminded of it every time we want to work to improve our systems?
