
A sysadmin's top ten tales of woe
The Register ^ | 14 June 2011 | Trevor Pott

Posted on 06/16/2011 11:45:11 AM PDT by ShadowAce

Get enough people of any profession in one room and the conversation drifts inexorably towards horror stories. Everyone loves a good “…and then it all went horribly sideways” yarn, and we all have more than one.

The IT profession is flush with tales of woe. There are no upper boundaries on ignorance or inability.

From common facepalms through to the failure of the backup plan’s backup plan, here are the top ten disaster recovery failures I have seen.

10. Raid 0 isn’t Raid

About once a quarter someone walks into my office and says: “You know how to do data recovery, right?”

Inevitably they carry an external USB Raid 0 hard drive upon which rests all the critical data for the entire company.

While I can probably get those images off that SD card you formatted, Raid 0 with a disaster recovery plan of “I heard Trevor can do data recovery” is doomed to failure.
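
For readers outside the trade: Raid 0 stripes data across disks with no redundancy, so the loss of any single disk loses the entire volume, and the odds compound with every disk added. A rough back-of-the-envelope sketch, using an assumed (hypothetical) 3 per cent annual failure rate per disk:

```python
# Hypothetical illustration: chance that a Raid 0 stripe loses everything in a year,
# assuming independent disks that each fail with 3% probability per year.
annual_failure_rate = 0.03

for disks in (1, 2, 4):
    p_total_loss = 1 - (1 - annual_failure_rate) ** disks
    print(f"{disks} disk(s): {p_total_loss:.1%} chance of losing the whole volume")
# Prints roughly: 3.0%, 5.9%, 11.5% -- striping multiplies exposure instead of reducing it.
```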

9. Call a locksmith

Losing one’s keys is a normal part of life. You keep a spare set at work or with a trusted friend. When dealing with mission-critical computing, however, plans need to be more robust.

My favourite equivalent of losing the keys is firing the sysadmin before realising that only he has the passwords to vital pieces of equipment whose manufacturer has gone out of business.

Disaster recovery plans that rely on “the manufacturer will help us reset the password” are iffy at best.

8. My backup is paper

Dead tree backups lose their charm when a corrupted financials database is combined with reliance on a data storage medium requiring a meat-based search engine.

Always be prepared for the auditors. They strike without warning and they have no mercy.

7. You have to be there

Not everybody’s definition of “mission critical” is 24/7/365. For small organisations, a cold spare requiring an on-site visit to power up may be adequate.

The plan, however, should take into consideration that the individual responsible for switching on the backup must be capable of making it through the snowstorm that took out the power lines.

6. The cleaner unplugged it

Pay attention to log files. More than once I have seen perfectly planned and executed offsite failovers felled because nobody realised the cleaner at the backup site was liable to unplug the servers, for example to charge an iPod. This is not an urban legend.

5. Tapes and Murphy's Law

The more important the data, the more likely it is to go missing. The older the data, the more likely it is that at least one copy is corrupt.

Inevitably, some bit of data will be missing from both the primary and the backup live servers. It happens to everyone and it is why we have tape.

Tapes are attached to a backup program of some kind, which keeps a catalogue of tapes and the files they contain. Life becomes interesting when the file that’s missing belongs to someone making an order of magnitude more money than you, and the file that’s corrupted is the backup catalogue.

Thirty-two hours into rebuilding the catalogue one tape at a time, you discover that one of the tapes is unreadable. Murphy’s Law, of course, stipulates that it is the tape with the necessary information.

The lesson is simple: test your backups – and the catalogues too.
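
The article doesn’t prescribe how to test; as one illustration only (not the author’s procedure), a periodic trial restore to scratch space can be spot-checked automatically. The paths, sample size and restore mechanism below are hypothetical:

```python
import hashlib
import random
from pathlib import Path

def sha256(path: Path) -> str:
    """Checksum a file in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(live_root: Path, restored_root: Path, sample_size: int = 20) -> bool:
    """Spot-check a trial restore by comparing a random sample of files
    against the live copies. Paths and sample size are hypothetical."""
    live_files = [p for p in live_root.rglob("*") if p.is_file()]
    ok = True
    for live in random.sample(live_files, min(sample_size, len(live_files))):
        restored = restored_root / live.relative_to(live_root)
        if not restored.exists():
            print(f"MISSING in restore: {live}")
            ok = False
        elif sha256(live) != sha256(restored):
            print(f"CHECKSUM MISMATCH: {live}")
            ok = False
    return ok

if __name__ == "__main__":
    # Point these at a real share and the scratch area a trial restore was written to.
    if verify_restore(Path("/srv/data"), Path("/tmp/trial_restore")):
        print("Sampled files restored cleanly")
```

Run against the backup catalogue as well as the data, this kind of check surfaces a corrupt catalogue before it is needed in anger.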

4. Master and commander

Databases are the lifeblood of many applications, which in turn keep companies alive. Redundancy is key, and so we turn to database synchronisation.

Some aspects of database synchronisation are old hat by now: the primary can sync live to the secondary and life is good so long as both are up. The primary server fails and the backup server absorbs the load exactly as planned: so far, so good.

Where it all goes horribly wrong is when the primary is returned to service without being informed that it is no longer the primary database. After being brought online it instantly overwrites all the data on the backup server with stale information.

This step in the recovery process really should have been rehearsed more.
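
The article doesn’t name the database involved; a minimal, deliberately toy sketch of the missing safeguard, a failover “generation” counter that a returning node must check before acting as primary, might look like this (the node names, dict-based “databases” and counter are all invented for illustration):

```python
# Toy model (not the article's setup) of the "returning primary overwrites fresher
# data" failure and the guard that prevents it.

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.generation = 0  # how many failovers this copy has lived through

def failover(old_primary, new_primary):
    """Promote the secondary and bump the generation so stale nodes can tell."""
    new_primary.generation = old_primary.generation + 1

def rejoin(node, cluster_generation, current_primary):
    """A recovered node must compare generations before it pushes data anywhere."""
    if node.generation < cluster_generation:
        # Stale copy: resync FROM the current primary instead of overwriting it.
        node.data = dict(current_primary.data)
        node.generation = cluster_generation
        return "replica"
    return "primary"

a, b = Node("A"), Node("B")
a.data["order"] = "old value"
b.data = dict(a.data)            # B is a synced secondary
failover(a, b)                   # A dies; B is promoted (generation 1)
b.data["order"] = "new value"    # fresh writes land on B while A is down
role = rejoin(a, cluster_generation=b.generation, current_primary=b)
print(role, a.data["order"])     # -> "replica new value": A resyncs instead of clobbering B
```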

3. Death by patch

In an ideal world, your primary and backup servers are identical systems supplemented by an identical test system. This exists so you can experiment with new configurations, settings and patches.

A critical lesson that others have learned so you don’t have to is never, ever patch the primary and the backup clusters at the same time.

One beautiful illustration of this comes in the form of an unanticipated incompatibility between a software patch and a very specific combination of hardware present in both the primary and backup systems.

The testing system – identical except for a motherboard one revision newer – did not exhibit the issue. When the patch was released via automated patch management, the primary and backup servers were felled simultaneously.
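
One way to encode the “never patch both at once” rule in an automated rollout is to patch the clusters serially, with a soak period and a health check between them. This is only a sketch; the hook functions, names and timings are placeholders, not any particular patch tool’s API:

```python
import time

# Placeholder hooks: in practice these would call your patch tool and your monitoring.
def apply_patch(cluster: str) -> None:
    print(f"patching {cluster}")

def healthy(cluster: str) -> bool:
    return True  # replace with real health checks / canary queries

def staggered_patch(clusters: list[str], soak_seconds: int) -> None:
    """Patch one cluster at a time and let it soak before touching the next,
    so an unforeseen patch/hardware interaction can only take down one of them."""
    for cluster in clusters:
        apply_patch(cluster)
        time.sleep(soak_seconds)  # observation window before moving on
        if not healthy(cluster):
            raise RuntimeError(f"{cluster} unhealthy after patch; halting rollout")

# A day-long soak in production; a few seconds here for demonstration.
staggered_patch(["primary-cluster", "backup-cluster"], soak_seconds=5)
```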

2. Untried and untested

An oilfield company is doing some deep field drilling. There are several constraints regarding the amount of equipment it can bring.

The drilling requires real-time analysis of sensor results and the decision is made to farm that out over communications links to a more established camp nearby.

Data connectivity being so critical, there were three redundant links: a satellite hook-up, a (very flaky) 3G repeater and a small prototype UAV comms blimp which served as a WiMax bridge between the drilling team and the camp.

Predictably, the satellite connection failed, and the 3G repeater never really worked at all. The drilling team was forced to use the largely untested UAV, which unfortunately began to stray out of range.

The on-site tech tried to connect to the blimp, only to discover that the firewall configuration prevented access from the network interface facing the drilling site.

The connection was so flaky that the team couldn't bounce a connection off a server located on the other side of the network. Thus the UAV drifted entirely out of range and half a province away before it was recovered. The drilling operation was a bust.

Moral: cloud computing absolutely requires multiple tested redundant network links.

1. The power cascade

Two companies merge and are in the process of consolidating their two data centres. About 80 per cent of the way through the power-up of the new systems, there is a loud snap and all electrical power is dead.

The electrician’s post mortem is succinct: the electrical panels were from the 1940s. To get 30-Amp lines for the UPSes, a previous electrician had simply "bridged" two 15-Amp breakers.

When enough systems were powered up, the cumulative load blew the first panel without tripping more than a handful of frankenbreakers.

When the first panel blew, the affected systems switched over to their redundant power supplies, shifting the load and blowing the second panel, and so on until all seven panels in the building were wrecked. Thanks to 70 years of evolutionary wiring, five of those panels were located in parts of the building not leased by either company.

The disaster recovery plan was focused entirely on layers of backup power provisioning: mains, UPSes and a generator. Offsite backups weren’t a consideration.

With the distribution panels fried, generator power couldn't get to the UPSes and sysadmins had only enough time to shut down the systems cleanly before battery power failed. The downtime cost the company more than it would have spent on building an offsite secondary data centre.

It all goes to show…

…that knowledge can be acquired from an adequate collection of textbooks, but true experience requires walking the minefield.

Please share your IT horror stories either in the comments section or by clicking the “mail the author” link above. I’ll collect the best and publish them as a warning to all: here be monsters. ®


TOPICS: Computers/Internet
KEYWORDS: sysadmin; troubles

1 posted on 06/16/2011 11:45:14 AM PDT by ShadowAce

To: rdb3; Calvinist_Dark_Lord; GodGunsandGuts; CyberCowboy777; Salo; Bobsat; JosephW; ...

2 posted on 06/16/2011 11:46:18 AM PDT by ShadowAce (Linux -- The Ultimate Windows Service Pack)

To: ShadowAce

Uh... okay; now, in English, please. ;-)


3 posted on 06/16/2011 11:56:56 AM PDT by Jack Hammer

To: Jack Hammer

Trust me—it’s pretty funny if you are a sysadmin. :)


4 posted on 06/16/2011 11:58:31 AM PDT by ShadowAce (Linux -- The Ultimate Windows Service Pack)

To: ShadowAce

Which reminds me of the time I strolled into work at 5:00 AM and found the system administrator with her head down on the keyboard, in tears, and a front office executive standing over her. A system admin might just, possibly, be in at 5:00 AM, but a front office type, never. Seems she installed an update to the Solaris operating system on one unit, did some “tests”, decided that everything worked OK and proceeded to install it on the other. As rosy-fingered dawn broke over Ontario, the high bay became crowded with engineers and programmers who were “on the clock” with nothing to do but cheer good old Stella on.


5 posted on 06/16/2011 11:59:45 AM PDT by Lonesome in Massachussets (Somewhere in Kenya a village is missing its idiot)

To: ShadowAce

My favorite story, from ten years ago, involves an application program that destroyed the operating system. It was a Solaris 8 env. The first time we ran the program in production, it overwrote the root filesystem, making our powerful Sun box with 28 processors and 28 gigs of memory worthless.

We immediately went into disaster recovery mode, and brought up production on the UAT server. Of course, the first thing they did was run the same program, which wiped out that machine as well.


6 posted on 06/16/2011 12:02:26 PM PDT by proxy_user

To: proxy_user

LOL!


7 posted on 06/16/2011 12:04:59 PM PDT by ShadowAce (Linux -- The Ultimate Windows Service Pack)

To: ShadowAce

Trader Joe’s has a completely mirrored datacenter in a different geographic location.

I don’t know how they handle data replication but they evidently understand the importance of redundancy.


8 posted on 06/16/2011 12:06:40 PM PDT by stylin_geek (Never underestimate the power of government to distort markets)

To: stylin_geek

That mirroring stuff is designed with the idea that one data center will lose power or be destroyed by terrorists.

However, if the database becomes corrupt, and you are using physical mirroring, you now have two copies of a corrupt database in two data centers. And we have found that bugs in the software are far more likely to happen than losing a data center.


9 posted on 06/16/2011 12:09:30 PM PDT by proxy_user

To: ShadowAce

I don’t know if I’ve told this story before, but at a very large and well known financial company, the testing lab signed off on a new image that was to be pushed out to company desktops.

For whatever reason, the company pushing out the image (which I won’t name) added an extra piece of software to it.

Blew up 1/3 of the desktops. The only reason all of the desktops weren’t blown is that the image was only pushed out to 1/3 of the computers.

1,000 computers were put out of commission.

The company pushing out the image had added a virus protection program to an image that already had a virus protection program. The financial company got the computers back online by disabling all virus protection.


10 posted on 06/16/2011 12:15:17 PM PDT by stylin_geek (Never underestimate the power of government to distort markets)

To: proxy_user
The DR center for my company shares space with Enron's hardware that was maintained by court order.

It creeps me out to even be near there. Bad karma, yo.

11 posted on 06/16/2011 12:19:01 PM PDT by I Buried My Guns (You bring the traitor. I'll bring the rope.)

To: stylin_geek
I've got a couple myself--

A company I worked at was expanding their datacenter--not the physical space, because they had/have plenty of room. No, they needed more clusters, so they bought 13 more 70-node clusters from their vendor.

Their cooling system couldn't handle it as I began turning them on. They couldn't get any more big chillers that quickly, so they ended up renting a "portable" chiller for several months, just to keep this (very large) data center semi-cool.

12 posted on 06/16/2011 12:20:42 PM PDT by ShadowAce (Linux -- The Ultimate Windows Service Pack)

bflr


13 posted on 06/16/2011 12:21:28 PM PDT by absolootezer0 (2x divorced tattooed pierced harley hatin meghan mccain luvin' REAL beer drinkin' smoker ..what?)

To: Lonesome in Massachussets
Which reminds me of the time I strolled into work at 5:00 AM and found the system administrator with her head down on the keyboard, in tears, and a front office executive standing over her. A system admin might just, possibly, be in at 5:00 AM, but a front office type, never. Seems she installed an update to the Solaris operating system on one unit, did some “tests”, decided that everything worked OK and proceeded to install it on the other. As rosy-fingered dawn broke over Ontario, the high bay became crowded with engineers and programmers who were “on the clock” with nothing to do but cheer good old Stella on.

Is it just me or did you leave some sentences out of this story?

14 posted on 06/16/2011 12:22:05 PM PDT by BRK

To: ShadowAce

Don’t forget the “Infinite troubleshooting” - customer has a problem. System taken offline to troubleshoot and repair. 12 hrs later, having still not reached a fix ... the executive finally made the call to execute DR for that system. RTO and RPO were both less than 2 hrs.


16 posted on 06/16/2011 12:27:03 PM PDT by taxcontrol

To: ShadowAce

bkmk


17 posted on 06/16/2011 12:31:40 PM PDT by Sergio (An object at rest cannot be stopped! - The Evil Midnight Bomber What Bombs at Midnight)

To: ShadowAce

Not quite as bad as some of these, but one of my clients spent a great deal on “customized software”, when comparable (probably better) software was available from a major vendor. They neglected to force the developers to provide documentation of any sort. The software they chose always had problems, and less than a year after the project was completed, the company that developed it went out of business and the developers scattered to the four corners of the earth.


18 posted on 06/16/2011 12:32:08 PM PDT by The Sons of Liberty (Psalm 109:8 Let his days be few and let another take his office. - Mene, Mene, Tekel, Upharsin)

To: Jack Hammer
Uh... okay; now, in English, please. ;-)

I understood completely .. so, I guess it's official ... I'm a geek.

19 posted on 06/16/2011 12:34:27 PM PDT by tx_eggman (Liberalism is only possible in that moment when a man chooses Barabas over Christ.)

To: proxy_user

One of my profs at Carnegie Mellon relayed a story from the space shuttle program (well, several stories, but this one is germane) — all systems on board had to be heavily redundant, so they installed four identical copies of the control software.

Of course, when the first one fails and rolls over to the second identical copy, what do you expect it’s going to do with the same bad data? And the third, and the fourth?

An unrelated story had to do with the mechanical folks trying to figure out how much the software weighed...


20 posted on 06/16/2011 12:36:42 PM PDT by kevkrom (Imagine if the media spent 1/10 the effort vetting Obama as they've used against Palin.)

To: kevkrom
An unrelated story had to do with the mechanical folks trying to figure out how much the software weighed...

LOL! That's just awesome.

21 posted on 06/16/2011 12:38:31 PM PDT by ShadowAce (Linux -- The Ultimate Windows Service Pack)

To: proxy_user; roamer_1; ShadowAce
The first time we ran the program in production, it overwrote the root filesystem, making our powerful Sun box with 28 processors and 28 gigs of memory worthless.

Was the application running under an account with root privileges, or was the root file system open to accounts with non-root privileges?

We immediately went into disaster recovery mode, and brought up production on the UAT server. Of course, the first thing they did was run the same program, which wiped out that machine as well.

ROFL

22 posted on 06/16/2011 12:48:02 PM PDT by rabscuttle385 (Live Free or Die)

To: proxy_user

What, you didn’t get the memo from management? The software testing budget has been cut by 50%; blame the programmers.


23 posted on 06/16/2011 12:58:32 PM PDT by ImJustAnotherOkie (zerogottago)

To: ShadowAce
"We just acquired an implementation company that also did installs of our largest competitor's software....

...

"...what do you mean the acquired company's guys were using their inside knowledge to access our competitor's confidential information??"

(actually, this is now more of a tale of woe for Legal)

24 posted on 06/16/2011 1:00:34 PM PDT by martin_fierro (< |:)~)

To: ShadowAce
It just makes my stomach hurt and I found myself reaching for aspirin, Rolaids, and whiskey.

/johnny

25 posted on 06/16/2011 1:09:53 PM PDT by JRandomFreeper (Gone Galt)

To: ShadowAce

A buddy and his programming staff were given two weeks’ notice after meeting a deadline with code that met the specification. They had been expecting an ‘atta boy’ or a ‘congrats’, not a pink slip. Management takes the code to customers, who love the new software but ask, ‘Could you make it do X & Z too!?’

Management goes to my buddy with the request and he tells them it would take little effort to make those enhancements, but all the new programmers they would have to hire would need a few months to get up to speed on the code before they could tackle the changes. ‘What about you and your staff?’ ‘Sorry, but we all have new jobs and all of us leave tomorrow. Why did you get rid of us all?’

They confessed that they wanted to get rid of all those expensive programmers to save money and look smart to their managers.


26 posted on 06/16/2011 1:29:31 PM PDT by pikachu (After Monday and Tuesday, even the calender goes W T F !)

To: rabscuttle385

No, it was a bug in Solaris. It just produced a core dump bigger than 2 gigs when it failed, so Solaris interpreted the size as a negative number and wrote it backwards in the filesystem.

The Sun guys said oh, you should have installed the OS patch for that.


27 posted on 06/16/2011 1:37:29 PM PDT by proxy_user

To: ShadowAce

bookmark


28 posted on 06/16/2011 1:53:21 PM PDT by FourPeas ("Maladjusted and wigging out is no way to go through life, son." -hg)

To: ShadowAce

I heard of a case back in the bad old days of removable platter drives where the admin got a call at home in the middle of the night to inform him that the primary copy of their data had failed. Not too concerned, he asked if they had mounted the backup copy, and was told that they had done so, only to find out that the problem was in the drive, when it destroyed the backup. Don’t know if they also had tape for a second layer of backup.


29 posted on 06/16/2011 2:00:46 PM PDT by Still Thinking (Freedom is NOT a loophole!)

To: kevkrom
An unrelated story had to do with the mechanical folks trying to figure out how much the software weighed...

That one smells like an urban legend to me!

30 posted on 06/16/2011 2:04:52 PM PDT by Still Thinking (Freedom is NOT a loophole!)

To: ShadowAce

Electrical company comes in to swing over facility power from an aging UPS to a new 50kW unit, big monster. This is in the middle of June in west central Florida a few years ago, so inevitably, the afternoon thunderstorms start to pop up around 6 PM and last sometimes late into the night, depending on the atmosphere.

Well apparently the electrical company tech didn’t want to be working on an indoor UPS, unplugged, at 10 PM during a lightning storm, and one Hell of a lightning storm it was! I lost power at my home, and I’m 25 miles from the DC. My pager starts going off at 11 PM, and I call in to our DR incident command center to find out our entire DC is black.

I rush over in the pouring rain to find out that the electrical “engineer” left the neutral and the ground unconnected on the new UPS and when a lightning strike hit directly to our ground looped rod, millions of volts of electricity streamed through the live wire, blew up 51 batteries in UPSes chained to the new one, and melted every single transformer in the building.

Needless to say we stopped doing business with that company. It took us 28 hours to bring the entire DC back online, and we found out that not only were our tape backups not functioning due to magnetic interference during the storm, but many of the servers deployed for our finance department were on RAID 0 and over 8 years old (talking dishwasher Compaq 5500s here); you know the rest of the story.

Good list.


31 posted on 06/16/2011 2:08:36 PM PDT by rarestia (It's time to water the Tree of Liberty.)

To: ShadowAce

Reminds me of the time our data center overheated at a company I worked for. Everything shut down at once. No warning, nothing.

Of course, the datacenter overheated after hours. Yours truly was on call.

The company was too cheap to install temperature sensors and alarms.

After the data center was down for over 24 hours, the company decided to spend a few dollars on appropriate monitoring and alarms.

Not only that, but the company had also fired the company that maintained and monitored our mainframe shortly before the disaster.

$200 million a year company and they wouldn’t spend a few thousand to maintain data integrity.

I left as soon as I could. I got tired of dealing with that kind of crap.


32 posted on 06/16/2011 2:12:32 PM PDT by stylin_geek (Never underestimate the power of government to distort markets)

To: Jack Hammer

Most of these boil down to the software equivalent of a spare tire nobody’s checked the air level on since ever. Backup systems are nice, but you need to make sure they actually work. The other 10% are about making sure that whatever killed the primary doesn’t daisy chain to the backup, basically don’t change tires in the middle of the patch of stuff that popped the first one.


33 posted on 06/16/2011 2:12:47 PM PDT by discostu (Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn)

To: ShadowAce

Forgot to add, I warned the company there were issues with various systems, but they wouldn’t listen.

I got tired of getting blamed for stuff I’d warned them about.


34 posted on 06/16/2011 2:18:15 PM PDT by stylin_geek (Never underestimate the power of government to distort markets)

To: ShadowAce
The cleaner unplugged it

Pay attention to log files. More than once I have seen perfectly planned and executed offsite failovers felled because nobody realised the cleaner at the backup site was liable to unplug the servers, for example to charge an iPod. This is not an urban legend.

Then there was the manager of the building containing the mission-critical mainframe processing real time test data. He conducted a tour of his facility for some visitors and at one point in the tour he pointed out the main power switch to the mainframe - and cycled the switch off and back on!!! Scratch one expensive test, and scratch (quite literally) all the big, expensive hard disks supporting the operation.

Sigh . . .

On a smaller scale, there was the large computer which would go crazy every now and then.

Who knew that the steel wool pad on the floor cleaning machine would put iron filings in the air, or that they would randomly short out whatever printed circuit they settled on? Certainly not the janitor!


35 posted on 06/16/2011 2:53:01 PM PDT by conservatism_IS_compassion (DRAFT PALIN)

To: ShadowAce

Hurricane Katrina made landfall directly over our manufacturing plant along the Gulf Coast in Mississippi. The computer center there was flooded to almost ceiling level. Our Dell storage array network with all the local servers, disk drives, etc. was completely submerged in a stinking, muddy mess.

Come to find out, our fancy ‘distributed’ document management system is a combination of central and local storage. Whenever someone ‘local’ would access a blueprint file in edit mode, the system would move the file from ‘central’ storage to ‘local’ storage - this to improve the speed of accessing the file.

The ‘local’ files were in a Raid5 configuration with weekly full and nightly incremental disk-to-disk backup... plus tape backups stored in the datacenter - now all ruined.

The off-site, month-old backup was in a local bank deposit box. But the bank did not open for about a month after Katrina. The bank-located backups recovered just fine to a sister plant located in TN. But within the lost month, some of the company’s blueprint files had been moved to local storage. In all, a few dozen critical blueprints from across the company existed only on the muck encrusted data disks.

Luckily, a company specializing in recovering data from damaged disks was able to retrieve all the lost engineering files. But not before several weeks had passed and over $100k was spent...


36 posted on 06/16/2011 4:11:39 PM PDT by cheee (Good, Fast, Cheap ... you can only pick two...)

To: ShadowAce

Between stupid users and self-inflicted pain, my horror stories are so numerous I just don’t know where to begin. Lol


37 posted on 06/16/2011 4:17:48 PM PDT by KoRn (Department of Homeland Security, Certified - "Right Wing Extremist")

To: Lonesome in Massachussets; Matchett-PI; MinuteGal; mcmuffin; Bob Ireland
I feel your pain, pal.
I owned nine computer systems in my Corporation before those PEOPLE talked me into buying one of these evil things ...

It's been a long road, bud, but I made some good friends

38 posted on 06/16/2011 4:44:40 PM PDT by gonzo ( Buy more ammo, dammit! You should already have the firearms .................. FRegards)

To: proxy_user
However, if the database becomes corrupt, and you are using physical mirroring, you now have two copies of a corrupt database in two data centers.

That depends on the software and array you are using. The company I work for uses proprietary protocols for several mirroring software systems that prevent that sort of thing. Well, not so much prevent as make it easily recoverable to a point in time just before the corruption happened.

39 posted on 06/16/2011 7:05:00 PM PDT by Bloody Sam Roberts (If you think it's time to bury your weapons.....it's time to dig them up.)

To: taxcontrol
RTO and RPO were both less than 2 hrs.

The company I work for has a customer who doesn't care what the cost is to protect the data for their enterprise. Their RTO and RPO are both zero. They demand it. They'll want an operational overview, but the details aren't relevant and neither is the cost. Whatever it takes to accomplish that, do it. Any number above zero costs them millions per minute.

This account has some very stressed but very wealthy Account Sales Reps.

40 posted on 06/16/2011 7:14:16 PM PDT by Bloody Sam Roberts (If you think it's time to bury your weapons.....it's time to dig them up.)

To: ImJustAnotherOkie
I keep a sign tacked over my desk:

"Every software engineer should have the heart of a code writer.
In a jar on his desk."

41 posted on 06/16/2011 7:16:58 PM PDT by Bloody Sam Roberts (If you think it's time to bury your weapons.....it's time to dig them up.)

To: Bloody Sam Roberts

I have seen a lot of companies, including global banks, that claim they need 5 9’s and RTO and RPO of zero. That is, till they see the price tag. Then things get trimmed back very quickly.

If you have a client that is truly that demanding, you have both a blessing and a curse. And you will likely see things at that account you will not see ever again. Besides, it makes a good resume builder.


42 posted on 06/16/2011 9:33:37 PM PDT by taxcontrol

To: martin_fierro
(actually, this is now more of a tale of woe for Legal)

Oracle v. SAP. $1.4B please. Thank you very much.

43 posted on 06/17/2011 5:44:07 AM PDT by Ol' Sox

To: ShadowAce

Back in the day, no, the day before that, the difference between taking a full Friday night backup and conducting a full system restore was a sleepy operator typing either a “1” or a “2” at 04:00 Saturday morning. You guessed it. That operator mounted tape after tape, blindly following the system prompts until better than half of the production data on our System 370 had been overwritten with the previous week’s data. The boss, myself and a couple of my cohorts spent the next 55 hours in the DC straightening out that mess.


44 posted on 06/17/2011 6:11:04 AM PDT by Ol' Sox

To: taxcontrol
Besides, it makes a good resume builder.

Again...a blessing and a curse.
I have nothing to do with that account. But I know people that do.
I don't have to fix their stuff but then again, I don't get to take credit as the Miracle Worker.

45 posted on 06/17/2011 7:48:28 AM PDT by Bloody Sam Roberts (If you think it's time to bury your weapons.....it's time to dig them up.)

To: Still Thinking
That one smells like an urban legend to me!

it does, but at the same time, you have to remember that they once lost a probe to Mars because they had two different groups working on it, one that used imperial measurements and the other metric.

46 posted on 06/17/2011 9:12:59 AM PDT by zeugma (The only thing in the social security trust fund is your children and grandchildren's sweat.)

To: ShadowAce
Here's my fun little horror story for this thread:

Years ago, I was working in a datacenter that had about 13 HP-3000s with about 120 or so big washtub disk drives strung out the back on the floor. We also had a water-cooled IBM 3090, and a couple of miscellaneous DEC 11/780s. With all these systems up and running, this was a loud room to work in.

One day we had some fine fellows hanging wallpaper in the corridor that connected the computer room to the secure area you had to go through to enter.

There was a Big Red Button on the wall that had the words "EMERGENCY POWER CUT OFF" printed on it in large red letters. This button also had a cover over it that had to be pulled up in order to press the button, so no one could bump it by accident.

You remember those fine fellows hanging wallpaper? Well, they had to take that cover off the BRB so they could hang their wallpaper.

It was about 4:30 or so and we were right in the middle of shift change. A bunch of us were standing around talking and passing on info about what had happened the previous shift and what was coming up. Suddenly, we heard a huge BANG and it went dark as we heard the slow wind-down of all the fans, drives, and computers.

It became really quiet in that room. A kind of quiet you seldom hear, as our ears were so accustomed to the drone of the fans and drives that the lack thereof was even more profound than it might otherwise have been. We all kind of looked at each other and then looked out to the corridor, and saw where one of the fine fellows hanging wallpaper had accidentally brushed up against that Big Red Button.

The aftermath was kind of interesting. We went through and hit the individual power switches on all the disk drives and other peripherals to turn them off, then brought power back on.

Once we were sure power was stable, we started turning on the drives to the HP-3000 boxes. Out of the 12 systems (which had thousands of users), 9 of them just kinda sat there and waited until each of their drives was fully up and connected, then they started executing the next instruction in their stack. I didn't know it at the time, but apparently they contained their own battery backup in the chassis, and just waited until they could continue right on with their work! To the thousands of users connected to these systems, their terminals just froze for a while, then continued right where they had left off. Total downtime for most of these systems was 15-30 minutes. (This represented thousands of hours of cumulative user downtime.)

The 3090 didn't fare so well. Apparently, they don't particularly like it when someone just yanks power from underneath them. Took about 13 or so hours to get it fully operational again.

I'll never forget what the Sound of Silence is really like.

47 posted on 06/17/2011 9:48:10 AM PDT by zeugma (The only thing in the social security trust fund is your children and grandchildren's sweat.)

To: Still Thinking
That one smells like an urban legend to me!

My source (the prof) was one of the people who worked on the system, so while it might be a tall tale, it doesn't qualify as urban legend. It was being passed on as a "this is what I saw" not "this is what some other guy saw".

Basically, the point was that the guys in charge of calculating weight (for fuel, etc.) literally had no concept of a system component that had no weight. So they're acting like "I don't care how insignificant you think it is, I've got to know exactly!" -- thinking that the software developers were just being stubborn.

48 posted on 06/17/2011 10:03:40 AM PDT by kevkrom (Imagine if the media spent 1/10 the effort vetting Obama as they've used against Palin.)

To: kevkrom

Wow. Just wow.


49 posted on 06/17/2011 10:17:45 AM PDT by Still Thinking (Freedom is NOT a loophole!)

To: zeugma

I worked in a company once where they hired a Russian engineer, and he designs this frame weldment, dimensions it all in cm, with no notes to that effect, then is stunned when it comes back built to his dimensions — in inches! He couldn’t even figure out what the problem was at first. Not the sharpest tool in the shed. He had this big story of how he quasi-escaped from the USSR, but after I worked with him for a while, I decided his presence was (whether knowingly or not on his part) a plan by the Sovs to destroy US manufacturing.


50 posted on 06/17/2011 10:22:27 AM PDT by Still Thinking (Freedom is NOT a loophole!)

