It’s an unavoidable reality that bad things happen. I work a lot with software engineering and systems operations teams, and there have been too many times when someone has detected a systems failure or a potential security breach.
In these situations, the first action is naturally to stabilise the situation and fix it. That’s where the initial energy goes, and if it’s something non-trivial then it’s often a lot of effort. But once things are back to normal, there’s an important question I always ask (if no-one has asked it already), whether or not we’re doing a formal post mortem: “How can we stop this happening again?”
There are always ways to improve. Things like checklists and playbooks can help resolve problems more quickly—when the pressure’s on, possibly at 3am, it’s easier to follow instructions than to start reasoning for yourself.
But the big improvements come from reducing the chances of a similar problem happening in the first place. Sometimes this requires changing parts of the system (processes or software): adding a review step to changes, scripting often-manual tasks, or addressing some tech debt. The greatest improvements often come from designing risk out of the system entirely. In the software infrastructure world, two major improvements in recent years have come from the idea of infrastructure as code and from treating servers as cattle, not pets. Those industry-changing ideas come about rarely, but we should still look to see if we can change our own systems to prevent a repeat of our most recent problem.
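The shared mechanism behind both of those ideas is worth making concrete: the desired state of the infrastructure lives in version-controlled data, and a tool reconciles reality against it—rebuilding drifted servers rather than hand-patching them. Here’s a toy, purely illustrative Python sketch of that reconciliation loop (all names and the three-server inventory are hypothetical; real tools like Terraform or Ansible do this at far greater scale and fidelity):

```python
# Desired state: the source of truth, reviewable and diffable like any code.
desired = {
    "web-1": {"size": "small", "role": "web"},
    "web-2": {"size": "small", "role": "web"},
    "db-1": {"size": "large", "role": "db"},
}

# Current state: what actually exists right now.
current = {
    "web-1": {"size": "small", "role": "web"},
    "db-1": {"size": "medium", "role": "db"},  # has drifted from desired
}

def reconcile(desired, current):
    """Return the actions needed to make `current` match `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            # Cattle, not pets: replace the drifted server, don't patch it.
            actions.append(("replace", name))
    for name in current:
        if name not in desired:
            actions.append(("destroy", name))
    return actions

print(reconcile(desired, current))
# → [('create', 'web-2'), ('replace', 'db-1')]
```

Because the desired state is just data in version control, changes to it can go through exactly the kind of review step mentioned above—which is how this design removes whole classes of manual-change risk rather than merely mitigating them.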
We all wish bad things didn’t happen. But when they do, we can—and should—take it as an opportunity to improve.