Many, many years ago I held the pager for my company. This was long before cloud computing, long before devops, and long before infrastructure as code. Mostly the pager fired when a disk filled up and I needed to log in and delete some files. But one weekend it fired for another reason: one of our client’s webservers wasn’t responding. I tried to log into the server, but got rejected. After noticing failed logins on several other servers as well, I called our ISP, but the phone number was wrong. When I eventually found the right number, I was told we didn’t pay for the plan that offered the support I needed. Everything I tried had a problem. Every alternative came up short. In summary, it was a very, very, very long weekend.
The actual explanation for all of this was nothing malicious: it was simply a coincidence of two or three failures compounded by a deep weakness in our recovery mechanisms.
Today I’m a strong advocate of rehearsing infrequently used procedures like this. And as you can tell from that story, it’s not because I expect the main procedure to be wrong; it’s because there might be problems in any part of the surrounding system. It might be a problem with access to the documentation, access to the relevant part of the password manager, getting the right telephone number, a phone having enough battery… anything.
The nature of uncertainty is not just that things won’t go to plan; it’s that we can’t even plan the way in which they won’t go to plan.