Why heroism is bad, and what we can do to stop it

What is heroism?

Heroism is when there's a systemic problem or gap in a system, and an individual decides to fill that gap themselves.

General pattern

We have a system: a service, a team, a process, a piece of automation, some combination of the above...

We want the system to have some nice property: often, but not always, an SLO of some kind.

The system does not have this property.

The Hero decides that, despite this, they will uphold this property, no matter what.

No matter what

No matter what includes:

No matter how many hours they need to work.
No matter that they need to work evenings and weekends.

Often also:

No matter what they're told about not doing this.
No matter whether this property is actually important or not.
No matter whether the team has "officially" decided to uphold this property.

Example #1

Team has a ticket queue, and a ticket rotation, with one person assigned to tickets every week.

They have a ticket SLO: all tickets will be handled within 24 hours.

Ticket arrival rate is so high that, most days, the tickets arriving in one day cannot be handled in a normal 8-hour working day.

Our Hero is on the case, though: They start working 12-, 14-, 16-hour days to make sure the ticket SLO is upheld.
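The arithmetic behind this example can be sketched in a few lines. This is a minimal illustration, not a real capacity model: the function names, the arrival rate of 30 tickets per day, and the 25 minutes per ticket are all hypothetical numbers chosen for the sketch, not figures from the example.

```python
def daily_ticket_hours(tickets_per_day: float, minutes_per_ticket: float) -> float:
    """Hours of ticket work that arrive in the queue each day."""
    return tickets_per_day * minutes_per_ticket / 60.0


def slo_is_sustainable(
    tickets_per_day: float,
    minutes_per_ticket: float,
    people_on_rotation: int = 1,
    workday_hours: float = 8.0,
) -> bool:
    """True if the daily ticket load fits into normal working hours."""
    needed = daily_ticket_hours(tickets_per_day, minutes_per_ticket)
    return needed <= people_on_rotation * workday_hours


# Hypothetical load: 30 tickets a day at 25 minutes each is 12.5 hours of work.
needed_hours = daily_ticket_hours(30, 25)

# One person on rotation cannot absorb that in an 8-hour day...
one_person_ok = slo_is_sustainable(30, 25)

# ...but two people could (assumed staffing change, for illustration only).
two_people_ok = slo_is_sustainable(30, 25, people_on_rotation=2)
```

If the check fails with normal working hours, the gap is a property of the system (arrival rate, staffing, or the SLO itself), not something one person's overtime should be hiding.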

Example #2

Team has a production service and associated automation, tooling, and on-call rotation.

They have an SLO for the service: 99.99% availability.

But the service effectively does not run at 99.99% availability, due to inherent properties of the service, how it degrades, the team's release process, alerting thresholds, etc.

Enter the Hero: Every day (including weekends), they look at the monitoring; they spot-check graphs. They roll back bad canaries by hand, they spot memory-use increases before alerts fire and submit CLs to fix the issues, and they spot outlier tasks and restart them to limit the damage, etc., etc.

Example #3

Team has a launch process: Follow checklists, calculate resource requirements, submit changes, run automation to push out changes.

There's explicit pressure to make launches happen quickly — after all, the launch is good for users, or might increase revenue. We don't want SREs slowing the launches down.

But the process is full of manual work, and is very time consuming. The automation is flaky and needs hand-holding.

Have no fear, Hero is here: They'll work 12-hour days to get everything in place, and then hand-hold the automation over the weekend to make sure we're ready to ramp up Monday morning.

But wait... heroism has saved us before!

Sometimes acts of heroism really are appreciated and necessary — we've all run into situations where we had a blind spot when planning, or experienced a major outage, or something weird and unexpected happened.

But heroism shouldn't be something you plan for or rely on in the long term. If you're committing to an SLO, you should be able to meet the SLO without any heroes.

Is firefighting heroism?

Not necessarily. Heroism often comes in the form of firefighting or toil. However, if you have a proper, structured approach to firefighting, with realistic expectations, no heroism is required.

On-call, when done right, does not require constant heroism.

Heroism is bad for the system

Because heroism masks systemic problems, the systemic problems are never fixed:

The team never has proper discussions about a realistic SLO.
The team doesn't realize that it needs to work on long-term systemic fixes that enable the system to behave the way the team wants it to.
The system is broken, and because the team doesn't realize that it's broken, the system never improves.

Heroism is bad for the team

Heroism cultivates a culture that reinforces unrealistic expectations about the amount of work and the type of work expected from team members, and about the behavior expected from our systems.

Heroism is bad for the individual

The individual performs vast quantities of uninteresting, low-impact work.

They are likely to burn out.

No matter what volume of rote work an engineer does, it's unlikely they will get promoted.

Why is it hard to discourage?

Heroism is low risk, and easy to do. A hero knows they can do the work and see it through to completion. (This can be a huge temptation for anyone with any form of impostor syndrome.)

As a hero, you may receive immediate rewards: You complete a task, you feel that it's important, you often get praise from other members of your team or of a team that you helped, and you get peer bonuses.

It "feels" really important, and after all, aren't we supposed to own and care deeply about the systems we run?

How to discourage heroism

Explain why the behavior is bad for the individual, the team, and the system.
Tell the teammates engaging in heroism to let the system break.

Explain why the behavior isn't constructive

Help the Hero figure out what they should do instead. Often, they have a really good feel for where the pain points and problems in a system are, and have good but "raw" ideas about how to fix them. Push them to work on the long-term fixes.

Let it break

This sounds scary, but normally it isn't. Some common patterns here:

The Hero is convinced that the consequences of not saving the system will be disastrous, but they usually aren't.
Sometimes the Hero is upholding a property that isn't actually important, and the team (along with Tech Leads, Managers, client teams, etc.) has decided that the property doesn't need to be upheld because it isn't worth the cost. But the Hero won't let it go.

If we have SLOs, we also have an error budget.

A signal that a system is broken is a really valuable signal. It's worth spending some error budget to obtain that signal, and it's worth pointing this out (especially if the Hero thinks the world will end if they stop being a hero, but nobody else does).
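To make "spending some error budget" concrete, here is a rough sketch of how an availability SLO translates into allowed downtime. It assumes a simple time-based budget over a 30-day window; the function names and the window length are assumptions for illustration, and real SLOs are often defined over requests rather than minutes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Budget left after the downtime already observed in the window."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


# A 99.99% SLO allows roughly 4.32 minutes of downtime in 30 days.
four_nines = error_budget_minutes(0.9999)

# A 99.9% SLO allows roughly 43.2 minutes in the same window.
three_nines = error_budget_minutes(0.999)

# Hypothetical case: 10 minutes of downtime already spent against a 99.9% SLO.
left = budget_remaining(0.999, downtime_minutes=10)
```

Seen this way, letting the system break without a Hero for a short, bounded period is a measurable cost that can be weighed against the value of learning that the system is broken.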
