I had glorious plans for the day: get up early, drive up to Fresno, meet three beautiful ladies all dressed in pirate outfits, photograph them all over the renaissance faire, then make the long drive home overnight. It started off well and early. Fresno was a four hour drive from my house, so I had to get up before the sun was up, jump in the car with my cameras and costumes, and make the long drive. But it was the first real day off I’d had in months, and I was looking forward to it.
The sun was just coming up and I was enjoying the view of the countryside, when I noticed I was slightly lost. I didn’t have a GPS (this was before they were standard equipment on all cars) and it took me a while to figure out where I was and how to get to the faire. During that time my cell phone rang.
I cursed silently to myself. Even though I was technically off, I was still on call, and my phone ringing at 7am on a Sunday morning was not going to be good news.
I reluctantly answered and found myself talking to John, a consultant who monitored and helped manage the multi-million-dollar computer systems for our company. After some pleasantries, he told me he had received a report that the power was out in our area. We had a generator so this shouldn’t have been a problem, but John was a very thorough fellow and went the extra mile to make sure everything was fine in the computer room.
John mentioned that his board (the monitoring software that ensured our systems were up and running) couldn’t contact our computers. He thought this might be because of the power failure, but he wanted me to know just in case there was a problem.
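At its core, a monitoring board like John's is doing reachability checks: it periodically tries to contact each server and flags the ones that don't respond. This is a minimal sketch of that idea, not his actual tooling, and the hostnames and ports are hypothetical:

```python
import socket

# Hypothetical host list -- the original story doesn't name the real systems.
HOSTS = [("db01.example.com", 22), ("app01.example.com", 22)]

def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unresolvable.
        return False

def find_unreachable(hosts):
    """Return the subset of (host, port) pairs that did not respond."""
    return [(h, p) for h, p in hosts if not is_reachable(h, p)]
```

In a real setup, a check like this runs on a schedule and alerts a human when `find_unreachable` comes back non-empty, which is essentially the call I got.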
My stomach knotted up and I felt myself getting tense. This didn’t feel right.
Less than a minute later I received a call from one of the in-house help desk people who told me everything was down. The building, which was supposed to be protected by a huge 500 kVA generator, was completely dark except for emergency lighting, and the computer room was unearthly quiet. I could hear the fear in his voice. This couldn’t be good.
I needed information and started calling my own team without success. Not a single person answered their phone. Getting hold of anyone early on a Sunday was always a nightmare… so I asked John to mobilize his team and get them into the computer room to see what they could find out.
During this time, I managed to get to the faire and in between phone calls told my three lady friends they would have to do without me for the day. Believe me, these ladies were gorgeous, and I was cursing the bad luck of a disaster happening at precisely the wrong time. And little did I know, the timing couldn’t have been worse.
John arrived at the computer room and quickly called me back. He told me the generator was working fine, roaring like a champ, but the transfer switch, the hardware that switches from street power to generator power in the event of a power failure, hadn’t switched over. This meant the computer room lost power.
But the news got worse. The equipment was connected to two large UPS’s (Uninterruptible Power Supplies), which should have allowed all the computers to shut down gracefully.
One of the UPS’s failed. Since we had half our equipment plugged into one UPS and the other half into the other, it was, well, suffice it to say it was bad.
And, as John investigated, it got worse. By that time power had been restored and he was poking around to determine what had happened and what could be done. The third failure: the batteries in the SAN (an array of about 500 disk drives) had failed. This was extremely bad, because the whole purpose of those batteries was to give the disks enough time to write out whatever was in memory in the event of a power failure. Since the batteries failed, the memory was never written to disk.
This meant all of our data was so much garbage.
And it got worse. The communication line to the disaster site went down the day before, which meant the data out there was not going to be in a useful state. And just to make a bad day really bad, the backup tapes from the night before had failed.
So we had a thoroughly crashed computer room with destroyed data drives, and our only recovery option was backups more than two days old. This was really, really bad.
John, who is one of the smartest people I know, told me to hold on, he had an idea, and he’d call me back in a few. To make a long story short, he figured out how to rebuild the disk array and get the data back online. To this day I don’t know exactly what he did (he said I didn’t really want to know), but he restored the data on the backup disks to a usable state.
Now, we were not out of the woods by any means. We had our data, but it was not in the database. It was in a backup format on a set of disks. So I called my Database Administrator and learned that he was in an airplane over the Mojave Desert, about to jump out of it.
Fortunately I had a consultant available who was able to drive into our office and restore the data.
The long and short of it all: we were down for a total of 18 hours. We came within a whisker of losing everything, but we recovered.
I learned a lot because of this disaster.
- Our disaster planning was very good. We had disk-to-disk backups (which is what we recovered from), tape backups, a disaster site, a generator, UPS’s and so forth. Even so, a real, honest to god disaster sliced through everything and brought us to the brink of a huge cliff. This caused me and my team to completely reevaluate our plans and come up with better solutions.
- The generator had never been tested in a live situation. That was added to our plans, after the transfer switch was replaced.
- Always have several options available. Multiple backups, a disaster site, tapes or whatever. It’s not overkill.
- Test. Test. Test. Perhaps our biggest problem was finding the time to test. Not just our own time, but windows when we could take the computer systems offline to test.
- After this disaster, I worked with John to create an independent NOC (we called it a Tactical Operations Center) which monitored everything twenty-four hours a day, seven days a week.
There were dozens and dozens of additional lessons learned. But the main lesson was we needed to plan for disaster, to test our plans, and to retest constantly.