Example #2
Team has a production service and associated automation, tooling, and on-call rotation.
They have an SLO for the service: 99.99% availability.
But the service effectively does not run at 99.99% availability, due to inherent properties of the service, how it degrades, the team's release process, alerting thresholds, etc.
Enter the Hero: Every day (including weekends), they look at the monitoring; they spot-check graphs. They roll back bad canaries by hand, they spot memory-use increases before alerts fire and submit CLs to fix the issues, and they spot outlier tasks and restart them to limit the damage, etc., etc.