A sysadmin's top ten tales of woe
Posted on 06/16/2011 11:45:11 AM PDT by ShadowAce
Get enough people of any profession in one room and the conversation drifts inexorably towards horror stories. Everyone loves a good "and then it all went horribly sideways" yarn, and we all have more than one.
The IT profession is flush with tales of woe. There are no upper boundaries on ignorance or inability.
From common facepalms through to the failure of the backup plan's backup plan, here are the top ten disaster recovery failures I have seen.
About once a quarter someone walks into my office and says: "You know how to do data recovery, right?"
Inevitably they carry an external USB RAID 0 hard drive upon which rested all the critical data for the entire company.
While I can probably get those images off that SD card you formatted, RAID 0 with a disaster recovery plan of "I heard Trevor can do data recovery" is doomed to failure.
Losing one's keys is a normal part of life. You keep a spare set at work or with a trusted friend. When dealing with mission-critical computing, however, plans need to be more robust.
My favourite equivalent of losing the keys is firing the sysadmin before realising that only he has the passwords to vital pieces of equipment whose manufacturer has gone out of business.
Disaster recovery plans that rely on "the manufacturer will help us reset the password" are iffy at best.
Dead tree backups lose their charm when a corrupted financials database is combined with reliance on a data storage medium requiring a meat-based search engine.
Always be prepared for the auditors. They strike without warning and they have no mercy.
Not everybody's definition of mission critical is 24/7/365. For small organisations, a cold spare requiring an on-site visit to power up may be adequate.
The plan, however, should take into consideration that the individual responsible for switching on the backup must be capable of making it through the snowstorm that took out the power lines.
Pay attention to log files. More than once I have seen perfectly planned and executed offsite failovers felled because nobody realised the cleaner at the backup site was liable to unplug the servers, for example to charge an iPod. This is not an urban legend.
The more important the data, the more likely it is to go missing. The older the data, the more likely it is that at least one copy is corrupt.
Inevitably, some bit of data will be missing from both the primary and the backup live servers. It happens to everyone and it is why we have tape.
Tapes are attached to a backup program of some kind, which keeps a catalogue of tapes and the files they contain. Life becomes interesting when the file that's missing belongs to someone making an order of magnitude more money than you, and the file that's corrupted is the backup catalogue.
Thirty-two hours into rebuilding the catalogue one tape at a time, you discover that one of the tapes is unreadable. Murphy's Law, of course, stipulates that it is the tape with the necessary information.
The lesson is simple: test your backups and the catalogues too.
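Routine verification is the only way to find a rotten catalogue before a restore forces the issue. Below is a minimal sketch of the idea, not any particular backup product's API: it recomputes checksums for catalogued files and flags anything missing or silently corrupted. The catalogue layout (a mapping of path to expected SHA-256) is an assumption for illustration.

```python
import hashlib
import os


def sha256sum(path):
    """Stream a file through SHA-256 so large backups don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_catalogue(catalogue):
    """Check each catalogued file against its recorded checksum.

    catalogue: dict mapping file path -> expected SHA-256 hex digest.
    Returns a list of (path, problem) tuples; an empty list means the
    sample verified cleanly.
    """
    problems = []
    for path, expected in catalogue.items():
        if not os.path.exists(path):
            problems.append((path, "missing"))
        elif sha256sum(path) != expected:
            problems.append((path, "checksum mismatch"))
    return problems
```

Run something like this on a schedule against a random sample of the catalogue; a dry spell of zero problems is the boring output you want.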
Databases are the lifeblood of many applications, which in turn keep companies alive. Redundancy is key, and so we turn to database synchronisation.
Some aspects of database synchronisation are old hat by now: the primary can sync live to the secondary and life is good so long as both are up. The primary server fails and the backup server absorbs the load exactly as planned: so far, so good.
Where it all goes horribly wrong is when the primary is returned to service without being informed that it is no longer the primary database. After being brought online it instantly overwrites all the data on the backup server with stale information.
This niggle in the recovery process really should have been practised more.
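The guard against a stale ex-primary clobbering good data is an explicit role check: a returning node must ask some shared, durable authority who the primary is now, and come back as a replica if the answer isn't "you". The sketch below is a toy illustration of that check, with the role store and node names invented; in practice the authority would be a coordination service or the replication manager itself.

```python
class ReplicaSet:
    """Tracks which node currently holds the primary role.

    In a real deployment this record lives in durable shared storage
    (e.g. a coordination service), not in process memory.
    """

    def __init__(self, primary):
        self.primary = primary

    def fail_over(self, new_primary):
        self.primary = new_primary


def safe_start(node, replica_set):
    """Decide what mode a (re)starting node should come up in.

    A node that believes it was primary but finds the cluster has moved
    on must start as a replica and resync from the current primary --
    it must never push its stale data outward.
    """
    if replica_set.primary == node:
        return "primary"
    return "replica"  # stale ex-primary: pull, don't push
```

Had the repaired primary in the story run this check on boot, it would have come back as a replica and resynchronised instead of overwriting the live data with stale information.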
In an ideal world, your primary and backup servers are identical systems supplemented by an identical test system. This exists so you can experiment with new configurations, settings and patches.
A critical lesson that others have learned so you don't have to is never, ever patch the primary and the backup clusters at the same time.
One beautiful illustration of this comes in the form of an unanticipated incompatibility between a software patch and a very specific combination of hardware present in both the primary and backup systems.
The testing system, identical except for a motherboard one revision newer, did not exhibit the issue. When the patch was released via automated patching software, the primary and backup servers were felled simultaneously.
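One way to enforce "never patch both sides at once" is to make the rollout tooling refuse to do it: patch one cluster, let it soak, health-check it, and only then touch the next. The sketch below is a hypothetical scheduler written for illustration; the apply and health-check callbacks stand in for whatever patch automation is actually in use.

```python
import time


def staggered_rollout(patch, clusters, apply_fn, healthy_fn, soak_seconds=0):
    """Apply `patch` to one cluster at a time, stopping on the first failure.

    apply_fn(cluster, patch) performs the patch; healthy_fn(cluster)
    reports whether the cluster survived it. Cluster N+1 is never touched
    until cluster N has soaked and passed its health check.

    Returns (patched_clusters, failed_cluster_or_None).
    """
    done = []
    for cluster in clusters:
        apply_fn(cluster, patch)
        time.sleep(soak_seconds)  # let latent incompatibilities surface
        if not healthy_fn(cluster):
            return done, cluster  # abort: remaining clusters stay unpatched
        done.append(cluster)
    return done, None
```

With the backup cluster ordered first, the bad patch in the story would have felled only the backup, and the rollout would have stopped before reaching the primary.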
An oilfield company is doing some deep field drilling. There are several constraints regarding the amount of equipment it can bring.
The drilling requires real-time analysis of sensor results and the decision is made to farm that out over communications links to a more established camp nearby.
Data connectivity being so critical, there were three redundant links: a satellite hook-up, a (very flaky) 3G repeater and a small prototype UAV comms blimp which served as a WiMax bridge between the drilling team and the camp.
Predictably, the satellite connection failed, and the 3G repeater never really worked at all. The drilling team was forced to use the largely untested UAV, which unfortunately began to stray out of range.
The on-site tech tried to connect to the blimp, only to discover that the firewall configuration prevented access from the network interface facing the drilling site.
The connection was so flaky that the team couldn't bounce a connection off a server located on the other side of the network. Thus the UAV drifted entirely out of range and half a province away before it was recovered. The drilling operation was a bust.
Moral: cloud computing absolutely requires multiple tested redundant network links.
Two companies merge and are in the process of consolidating their two data centres. About 80 per cent of the way through the power-up of the new systems, there is a loud snap and all electrical power is dead.
The electrician's post mortem is succinct: the electrical panels were from the 1940s. To get 30-Amp lines for the UPSes, a previous electrician had simply "bridged" two 15-Amp breakers.
When enough systems were powered up, the total cumulative load on the panels blew the panels without tripping more than a handful of frankenbreakers.
When the first panel blew, affected systems switched to backup power supplies, blowing the second panel, until all seven panels in the building were wrecked. Thanks to 70 years of evolutionary wiring, five of those panels were located in parts of the building not leased by either company.
The disaster recovery plan was focused entirely on layers of backup power provisioning: mains, UPSes and a generator. Offsite backups weren't a consideration.
With the distribution panels fried, generator power couldn't get to the UPSes and sysadmins had only enough time to shut down the systems cleanly before battery power failed. The downtime cost the company more than it would have spent on building an offsite secondary data centre.
Knowledge can be acquired from an adequate collection of textbooks, but true experience requires walking the minefield.
Please share your IT horror stories either in the comments section or by clicking the "mail the author" link above. I'll collect the best and publish them as a warning to all: here be monsters. ®
Uh... okay; now, in English, please. ;-)
Trust me—it’s pretty funny if you are a sysadmin. :)
Which reminds me of the time I strolled into work at 5:00 AM and found the system administrator with her head down in tears on the keyboard and a front office executive standing over her. A system admin might just, possibly, be in at 5:00 AM, but a front office type, never. Seems she installed an update to the Solaris operating system on one unit, did some “tests”, decided that everything worked OK and proceeded to install it on the other. As rosy-fingered dawn broke over Ontario, the high bay became crowded with engineers and programmers who were “on the clock” with nothing to do but cheer good old Stella on.
My favorite story, from ten years ago, involves an application program that destroyed the operating system. It was a Solaris 8 env. The first time we ran the program in production, it overwrote the root filesystem, making our powerful Sun box with 28 processors and 28 gigs of memory worthless.
We immediately went into disaster recovery mode, and brought up production on the UAT server. Of course, the first thing they did was run the same program, which wiped out that machine as well.
Trader Joe’s has a completely mirrored datacenter in a different geographic location.
I don’t know how they handle data replication but they evidently understand the importance of redundancy.
That mirroring stuff is designed with the idea that one data center will lose power or be destroyed by terrorists.
However, if the database becomes corrupt, and you are using physical mirroring, you now have two copies of a corrupt database in two data centers. And we have found that bugs in the software are far more likely to happen than losing a data center.
I don’t know if I’ve told this story before, but at a very large and well known financial company, the testing lab signed off on a new image that was to be pushed out to company desktops.
For whatever reason, the company (which I won’t name) pushing out the image added a piece of software to the image that was pushed out.
Blew up 1/3 of the desktops. The only reason all of the desktops weren’t blown is that the image was only pushed out to 1/3 of the computers.
1,000 computers were put out of commission.
The company pushing out the image had added a virus protection program to an image that already had a virus protection program. The financial company got the computers back online by disabling all virus protection.
It creeps me out to even be near there. Bad karma, yo.
A company I worked at was expanding their datacenter--not the physical space, because they had/have plenty of room. No, they needed more clusters, so they bought 13 more 70-node clusters from their vendor.
Their cooling system couldn't handle it as I began turning them on. They couldn't get any more big chillers that quickly, so they ended up renting a "portable" chiller for several months, just to keep this (very large) data center semi-cool.
Is it just me or did you leave some sentences out of this story?
Don't forget the “infinite troubleshooting” scenario - customer has a problem. System taken offline to troubleshoot and repair. 12 hrs later, having still not reached a fix ... the executive finally made the call to execute DR for that system. RTO and RPO (recovery time and recovery point objectives) were both less than 2 hrs.
Not quite as bad as some of these, but one of my clients spent a great deal on “customized software”, when comparable (probably better) software was available from a major vendor. They neglected to force the developers to provide documentation of any sort. The software they chose always had problems, and less than a year after the project was completed, the company that developed it went out of business and the developers scattered to the four corners of the earth.
I understood completely .. so, I guess it's official ... I'm a geek.
One of my profs at Carnegie Mellon relayed a story from the space shuttle program (well, several stories, but this one is germane) — all systems on board had to be heavily redundant, so they installed four identical copies of the control software.
Of course, when the first one fails and rolls over to the second identical copy, what do you expect it’s going to do with the same bad data? And the third, and the fourth?
An unrelated story had to do with the mechanical folks trying to figure out how much the software weighed...