“How can we stop this happening again?”

It’s an unavoidable reality that bad things happen. I work a lot with software engineering and systems operations teams, and there have been too many times when someone has detected a systems failure or a potential security breach.

In these situations, naturally the first action is to stabilise the situation and fix it. That’s where the initial energy goes, and if it’s something non-trivial then it’s often a lot of effort. But once things are back to normal, there’s an important question I always ask, if no-one’s asked it already, and whether or not we’re doing a formal post mortem: “How can we stop this happening again?”

There are always ways to improve. Things like checklists and playbooks can help resolve problems more quickly: when the pressure’s on, possibly at 3am, it’s easier to follow instructions than to start reasoning from scratch.

But the big improvements come from reducing the chances of a similar problem happening in the first place. Sometimes this requires changing parts of the system (processes or software), such as adding a review step to changes, scripting tasks that are often done manually, or addressing some tech debt. The very greatest improvements often come from designing risk out of the system entirely. In the software infrastructure world, two major improvements in recent years have come from the idea of infrastructure as code and from treating servers as cattle, not pets. Those industry-changing ideas come about rarely, but we should still look to see if we can change our own systems to prevent a repeat of our most recent problem.

We all wish bad things didn’t happen. But when they do, we can—and should—take it as an opportunity to improve.

Photo by Amish Patel

One thought on “How can we stop this happening again?”

  1. Is Root Cause Analysis routinely performed – every time?
    What is the Root Cause of a defect?

    Cause:
    The error that caused the defect
    Root Cause:
    What caused us to make the error that caused the defect

    Without proper Root Cause Analysis,
    we’re doomed to repeat the same errors
