John Allspaw on incidents, teams and learning organizations

28. 07. 2020

John Allspaw is the founder of Adaptive Capacity Labs and one of the people who drove the DevOps movement from its beginnings. His (and Paul Hammond’s) famous “10+ deploys per day at Flickr” talk at the Velocity conference in 2009 showed us that there is a better way of collaborating and delivering software. Today, John focuses on helping organizations do better at incident management and learn from their own mistakes.

I talked with John about incident management, learning organizations, and how the world has changed since 2009.

Ivan: John, why do we as an industry suck at incident management?

John: If we think about all the incidents that could be taking place but are not, it means that incidents are, in a real sense, extremely rare compared to the time when there are no incidents.

So if you’re asking the question of what makes us good at preventing incidents, well, to some extent it’s our experience with past incidents that allows us to anticipate scenarios in the future.

We are predominantly focused on fixing: here is an incident, that was terrible, we don’t want that to happen again… naturally, let’s figure out what we can DO. Doing a thing, fixing, coming up with specific things you can do, action items, follow-ups, has some emotional value to it, because it can give us a feeling (whether it’s an illusion or not) that we have done something, that we can have control over our future. Incidents represent these surprises; they come out of nowhere. At the beginning of an incident it’s sometimes quite difficult to figure out whether it’s just a bad afternoon or the big one, the one that sinks our company. Those are really uncomfortable feelings. The way we cope with that in an engineering sense is… we want to fix and improve things.

The disconnect is the idea that we can come up with things to fix without first understanding, in a real sense, where this came from, how we were surprised, and what’s difficult about this.

I would take the stance, as would many other people, and it sounds really obvious, that if you have a better and richer understanding of the incident, you will have more productive things to do about it in the future.

Why do we suck at incident management? Largely it’s just immaturity; it’s an evolution… We have to understand that doing this well is possible, and you just have to practice and develop some skills that you otherwise wouldn’t have…

Watch our whole conversation and John’s hints for better incident management.