Posted on 06/16/2011 11:45:11 AM PDT by ShadowAce
Get enough people of any profession in one room and the conversation drifts inexorably towards horror stories. Everyone loves a good "and then it all went horribly sideways" yarn, and we all have more than one.
The IT profession is flush with tales of woe. There are no upper boundaries on ignorance or inability.
From common facepalms through to the failure of the backup plan's backup plan, here are the top ten disaster recovery failures I have seen.
About once a quarter someone walks into my office and says: "You know how to do data recovery, right?"
Inevitably they carry an external USB RAID 0 hard drive upon which rested all the critical data for the entire company.
While I can probably get those images off that SD card you formatted, RAID 0 with a disaster recovery plan of "I heard Trevor can do data recovery" is doomed to failure.
Losing one's keys is a normal part of life. You keep a spare set at work or with a trusted friend. When dealing with mission-critical computing, however, plans need to be more robust.
My favourite equivalent of losing the keys is firing the sysadmin before realising that only he has the passwords to vital pieces of equipment whose manufacturer has gone out of business.
Disaster recovery plans that rely on "the manufacturer will help us reset the password" are iffy at best.
Dead tree backups lose their charm when a corrupted financials database is combined with reliance on a data storage medium requiring a meat-based search engine.
Always be prepared for the auditors. They strike without warning and they have no mercy.
Not everybody's definition of mission-critical is 24/7/365. For small organisations, a cold spare requiring an on-site visit to power up may be adequate.
The plan, however, should take into consideration that the individual responsible for switching on the backup must be capable of making it through the snowstorm that took out the power lines.
Pay attention to log files. More than once I have seen perfectly planned and executed offsite failovers felled because nobody realised the cleaner at the backup site was liable to unplug the servers, for example to charge an iPod. This is not an urban legend.
The more important the data, the more likely it is to go missing. The older the data, the more likely it is that at least one copy is corrupt.
Inevitably, some bit of data will be missing from both the primary and the backup live servers. It happens to everyone and it is why we have tape.
Tapes are attached to a backup program of some kind, which keeps a catalogue of tapes and the files they contain. Life becomes interesting when the file that's missing belongs to someone making an order of magnitude more money than you, and the file that's corrupted is the backup catalogue.
Thirty-two hours into rebuilding the catalogue one tape at a time, you discover that one of the tapes is unreadable. Murphy's Law, of course, stipulates that it is the tape with the necessary information.
The lesson is simple: test your backups and the catalogues too.
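Testing a backup means more than checking that the files exist. A minimal sketch of the idea, in Python with hypothetical `verify_backup` and catalogue-format names (this is illustrative, not the interface of any real backup product): walk the catalogue, confirm each file is present, and compare a stored checksum against the file on disk.

```python
import hashlib
import json
from pathlib import Path

def checksum(path: Path, algo: str = "sha256") -> str:
    """Hash a file in chunks so large backup files don't exhaust memory."""
    h = hashlib.new(algo)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_backup(catalogue_path: Path, backup_root: Path) -> list[str]:
    """Return a list of problems; an empty list means the backup checks out.

    Assumes a hypothetical JSON catalogue of the form:
    {"files": [{"path": "relative/name", "sha256": "..."}, ...]}
    """
    problems = []
    catalogue = json.loads(catalogue_path.read_text())
    for entry in catalogue["files"]:
        target = backup_root / entry["path"]
        if not target.exists():
            problems.append(f"missing: {entry['path']}")
        elif checksum(target) != entry["sha256"]:
            problems.append(f"corrupt: {entry['path']}")
    return problems
```

Run on a schedule against the restore target, not the source: the point is to prove the catalogue and the media can actually reproduce the data.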
Databases are the lifeblood of many applications, which in turn keep companies alive. Redundancy is key, and so we turn to database synchronisation.
Some aspects of database synchronisation are old hat by now: the primary can sync live to the secondary and life is good so long as both are up. The primary server fails and the backup server absorbs the load exactly as planned: so far, so good.
Where it all goes horribly wrong is when the primary is returned to service without being informed that it is no longer the primary database. After being brought online it instantly overwrites all the data on the backup server with stale information.
This niggle in the recovery process really should have been practised more.
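The failback trap above is a classic split-brain problem, and one common guard against it is an epoch (or generation) number that increments on every promotion. The sketch below is a minimal illustration of that idea, with hypothetical `NodeState` and `rejoin` names; real replication systems implement this with considerably more machinery.

```python
from dataclasses import dataclass

@dataclass
class NodeState:
    node_id: str
    epoch: int   # incremented each time a failover promotes a new primary
    role: str    # "primary" or "replica"

def rejoin(recovered: NodeState, cluster_epoch: int) -> NodeState:
    """A node returning from an outage must never assume it is still primary.

    If the cluster moved to a newer epoch while the node was down, the node
    demotes itself and resynchronises from the current primary, instead of
    pushing its stale data over the live dataset.
    """
    if recovered.epoch < cluster_epoch:
        return NodeState(recovered.node_id, cluster_epoch, "replica")
    return recovered
```

The key design point is that the check happens before replication resumes: a recovered node with a stale epoch becomes a replica first and only then reconnects.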
In an ideal world, your primary and backup servers are identical systems supplemented by an identical test system. This exists so you can experiment with new configurations, settings and patches.
A critical lesson that others have learned so you don't have to: never, ever patch the primary and the backup clusters at the same time.
One beautiful illustration of this comes in the form of an unanticipated incompatibility between a software patch and a very specific combination of hardware present in both the primary and backup systems.
The testing system, identical except for a motherboard one revision newer, did not exhibit the issue. The patch was released via automated patching software, and the primary and backup servers were felled simultaneously.
An oilfield company is doing some deep field drilling. There are several constraints regarding the amount of equipment it can bring.
The drilling requires real-time analysis of sensor results and the decision is made to farm that out over communications links to a more established camp nearby.
Data connectivity being so critical, there were three redundant links: a satellite hook-up, a (very flaky) 3G repeater and a small prototype UAV comms blimp which served as a WiMax bridge between the drilling team and the camp.
Predictably, the satellite connection failed, and the 3G repeater never really worked at all. The drilling team was forced to use the largely untested UAV, which unfortunately began to stray out of range.
The on-site tech tried to connect to the blimp, only to discover that the firewall configuration prevented access from the network interface facing the drilling site.
The connection was so flaky that the team couldn't bounce a connection off a server located on the other side of the network. Thus the UAV drifted entirely out of range and half a province away before it was recovered. The drilling operation was a bust.
Moral: cloud computing absolutely requires multiple tested redundant network links.
Two companies merge and are in the process of consolidating their two data centres. About 80 per cent of the way through the power-up of the new systems, there is a loud snap and all electrical power is dead.
The electrician's post mortem is succinct: the electrical panels were from the 1940s. To get 30-Amp lines for the UPSes, a previous electrician had simply "bridged" two 15-Amp breakers.
When enough systems were powered up, the total cumulative load on the panels blew the panels without tripping more than a handful of frankenbreakers.
When the first panel blew, affected systems switched to backup power supplies, blowing the second panel, until all seven panels in the building were wrecked. Thanks to 70 years of evolutionary wiring, five of those panels were located in parts of the building not leased by either company.
The disaster recovery plan was focused entirely on layers of backup power provisioning: mains, UPSes and a generator. Offsite backups weren't a consideration.
With the distribution panels fried, generator power couldn't get to the UPSes and sysadmins had only enough time to shut down the systems cleanly before battery power failed. The downtime cost the company more than it would have spent on building an offsite secondary data centre.
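The failure mode above comes down to simple arithmetic that nobody did: summing the connected load against the panel's real capacity. A minimal sketch, using the common convention that continuous loads should stay under roughly 80 per cent of a breaker's rating (the function name and figures here are illustrative, not from the article):

```python
def panel_overloaded(loads_amps: list[float],
                     breaker_rating_amps: float,
                     safety_factor: float = 0.8) -> bool:
    """Return True if the cumulative continuous load exceeds the safe
    capacity of the breaker (conventionally 80% of its rating)."""
    return sum(loads_amps) > breaker_rating_amps * safety_factor
```

Two "bridged" 15-Amp breakers do not make a sound 30-Amp circuit, but even on a genuine 30-Amp line, three 10-Amp servers would already be over the 24-Amp continuous limit this rule of thumb implies.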
Knowledge can be acquired from an adequate collection of textbooks, but true experience requires walking the minefield.
Please share your IT horror stories either in the comments section or by clicking the "mail the author" link above. I'll collect the best and publish them as a warning to all: here be monsters. ®
"Every software engineer should have the heart of a code writer.
In a jar on his desk."
I have seen a lot of companies, including global banks, that claim they need five nines and an RTO and RPO of zero. That is, until they see the price tag. Then things get trimmed back very quickly.
If you have a client that is truly that demanding, you have both a blessing and a curse. And you will likely see things at that account you will not see ever again. Besides, it makes a good resume builder.
Oracle v. SAP. $1.4B please. Thank you very much.
Back in the day, no, the day before that, the difference between taking a full Friday night backup and conducting a full system restore was a sleepy operator typing either a "1" or a "2" at 04:00 Saturday morning. You guessed it. That operator mounted tape after tape, blindly following the system prompts until better than half of the production data on our System 370 had been overwritten with the previous week's data. The boss, myself and a couple of my cohorts spent the next 55 hours in the DC straightening out that mess.
Again...a blessing and a curse.
I have nothing to do with that account. But I know people that do.
I don't have to fix their stuff but then again, I don't get to take credit as the Miracle Worker.
It does, but at the same time, you have to remember that they once lost a probe to Mars because they had two different groups working on it: one used imperial measurements, the other used metric.
Years ago, I was working in a datacenter that had about 13 HP-3000s with about 120 or so big washtub disk drives strung out the back on the floor. We also had a water-cooled IBM 3090, and a couple of miscellaneous DEC 11/780s. With all these systems up and running, this was a loud room to work in.
One day we had some fine fellows hanging wallpaper in the corridor that connected the computer room from the secure area you had to go through to enter.
There was a Big Red Button on the wall with the words "EMERGENCY POWER CUT OFF" printed on it in large red letters. This button also had a cover over it that had to be pulled up in order to press the button, so no one could bump it by accident.
You remember those fine fellows hanging wallpaper? Well, they had to take that cover off the BRB so they could hang their wallpaper.
It was about 4:30 or so and we were right in the middle of shift change. A bunch of us were standing around talking and passing on info about what had happened the previous shift and what was coming up. Suddenly, we heard a huge BANG and it went dark as we heard the slow wind-down of all the fans, drives, and computers.
It became really quiet in that room. A kind of quiet you seldom hear, as our ears were so accustomed to the drone of the fans and drives, the lack thereof was even more profound than it might otherwise have been. We all kind of looked at each other and then looked out to the corridor, and saw where one of the fine fellows hanging wallpaper had accidentally brushed up against that Big Red Button.
The aftermath was kind of interesting. We went through, hit the individual power switches on all the disk drives and other peripherals, to turn them off, then brought power back on.
Once we were sure power was stable, we started turning on the drives to the HP-3000 boxes. Out of the 12 systems (that had thousands of users), 9 of them just kinda sat there and waited until each of their drives was fully up and connected, then they started executing the next instruction in their stack. I didn't know it at the time, but apparently they contained their own battery backup in the chassis, and just waited until they could continue right on with their work! To the thousands of users connected to these systems, their terminals just froze for a while, then continued right where they had left off. Total downtime for most of these systems was 15-30 minutes. (This represented thousands of hours of cumulative user downtime.)
The 3090 didn't fare so well. Apparently, they don't particularly like it when someone just yanks power from underneath them. Took about 13 or so hours to get it fully operational again.
I'll never forget what the Sound of Silence is really like.
My source (the prof) was one of the people who worked on the system, so while it might be a tall tale, it doesn't qualify as urban legend. It was being passed on as a "this is what I saw" not "this is what some other guy saw".
Basically, the point was that the guys in charge of calculating weight (for fuel, etc.) literally had no concept of a system component that had no weight. So they're acting like "I don't care how insignificant you think it is, I've got to know exactly!" -- thinking that the software developers were just being stubborn.
Wow. Just wow.
I worked in a company once where they hired a Russian engineer, and he designs this frame weldment, dimensions it all in cm, with no notes to that effect, then is stunned when it comes back built to his dimensions — in inches! He couldn’t even figure out what the problem was at first. Not the sharpest tool in the shed. He had this big story of how he quasi-escaped from the USSR, but after I worked with him for a while, I decided his presence was (whether knowingly or not on his part) a plan by the Sovs to destroy US manufacturing.
But how could any space-program ME be so inane? How was he smart enough to find the shop every morning if he couldn’t reason any better than that? Would you trust a guy like that to engineer a spacecraft?
I dunno. My guess it was some low-level flunkie who was given a list of system components and told to go find out what the weight of each one was.
Probably right.
After he left the boss turned to us, who were choking back tears, and said, "That won't work on me."
People get tunnel vision. When they're working in one section of the project that requires one type of thinking, even the smart ones can have a hard time adjusting to a different section where that type of thinking just doesn't apply. I've lost track of how many GOOD project managers have asked me the "how many tests does QA get done in a day" question; they are quantitative people used to dealing with definable work units, and the "things just get done" world of QA makes no sense to them. That's why test cases were invented: they're a convenient fiction no QA department actually pays attention to, but one that gives project managers a handle for their Gantt charts.
I was installing a medium-sized cluster for a client. The VP of my company had decided to tag along to see what we did in the field. I didn't mind, as I was pretty good at my job, and I had been to this client several times previously. We all got along.
So we arrive on site and start assembling this thing (I typically started from boxes, and ended up with a usable machine), and it took a few days. In the meantime, I had noticed that the power being supplied to me was insufficient, and pointed it out to the client. He got facilities there, and they changed out the power connectors for me while I continued working on the system.
I plugged everything in, and turned on the system.
10 minutes later the power to four buildings (including the one I was in) went down hard. Turns out that the power panel we were connected to was already slightly overloaded before we got there. Adding this power drain was just too much.
Wasted a day and a half to get things right again. :)
...And the VP finally got to understand why I claimed that something always goes wrong at these installs--and never the same thing twice.
It really does happen. At the MN Supercomputer Center we spent months trying to figure why the Crays were crashing inexplicably. As the lead software tech I was huddled at the console at 3am with a bunch of CRI engineers while the janitor was in the opposite end of the room with a power sweeper. We noticed that he brushed a rack on the far end of the room at exactly the moment that the Cray crashed. We had him do it again. The Cray crashed again.
What happened is that the racks were bolted to the metal grid of the raised floor panels. An electrical spark ran from his sweeper through the floor grid and into the heaviest ground wire in the room, which was connected to the Cray. So basically any spark or electrical glitch anywhere in the room was being funneled into the poor Cray.