As a generative leader and mental health advocate, I am wary of seeing such a morbid term being thrown around for what should be a learning experience that advances culture.
This post will be different from my usual positive posts about the field of Site Reliability Engineering (SRE).
Please bear with me on this one; I'm otherwise a forward thinker looking toward a bright future for this space.
Two issues I have with the term post-mortem:
- It compromises the psychological safety of novice SREs
- It risks your job security in pathological organisations
Let’s unpack this.
Imagine being a new SRE and hearing all these fascinating terms like SLOs, observability, APM, Chaos Engineering etc.
Then a term — typically reserved for gritty crime dramas — makes its way into the SRE lingo.
“We are doing a post-mortem on yesterday’s outage”.
A what? You and I know what it means: figuring out what went wrong after an outage or performance degradation event in the production software system.
But let’s consider others for a second.
It carries a ghastly connotation for people who are averse to negative metaphors in general, and even more so for those who are mentally scarred from seeing post-mortem scenes on TV.
Yes, they exist, but many will not be vocal about it.
I can understand the term’s origins. A lot of my friends are pure engineers and many of them have a sense of dark humour that they use to shock and delight each other.
But Site Reliability Engineering spans well beyond the figurative IT basement. It has begun to draw in diverse, and in particular neurodiverse, talent. Do we need lingo like this?
The other issue I have with this term is that it can put your job at risk in companies that don’t practice, and likely never will, a Westrum generative culture as outlined by Forsgren, Humble and Kim in their book, Accelerate.
A generative culture of accepting failure will not translate well to companies where I’ve seen managers chastise people for unavoidable mistakes.
Sonja Blignaut is a complexity science thought leader and has written about the use of dark metaphors in organisations. Dark metaphors are words that are loaded with some other meaning that is likely to cause friction in organisational dynamics.
Here’s an excerpt from Sonja’s writing:
“… we examine the nuts and bolts of what makes a powerful learning experience.”
Nuts & Bolts … so are learning experiences like a machine?
and another one:
Looking “under the hood” to understand culture; Firing up, fixing or fine-tuning your culture. Creating a culture change “dashboard”.
So is culture like a car that can be taken apart, fixed and tuned? Again the metaphor implies predictability and mechanistic certainty.
So what will happen when a Site Reliability Engineer talks with a manager who takes metaphors literally?
“We’re doing a post-mortem on the outage,” the SRE says.
The manager will think, “Post-mortem? Sounds like something went really badly, and someone must have messed up big time,” and then say, “Okay, so who’s involved in that?”
In an act of self-preservation, many such managers will put up a scapegoat for a revenue-losing or bad-PR downtime incident.
In essence, the word post-mortem is a trigger word that suggests malicious intent.
The institutional meaning will evolve into something that takes us away from what we want in all organisations — even the non-tech ones. We want a blame-free environment that lets us learn and improve systems.
On that note, I propose a term that’s more descriptive but not provocative: “Post-incident analysis (PIA)”. Yes, another acronym for the SRE jargon bank.
In a world already deeply fatigued by negativity, do we need to be reminded of it every time we want to work to improve our systems?