[NTLUG:Discuss] Disaster recovery

Sat Aug 13 22:32:13 CDT 2005

Leroy Tennison wrote:
...
> Chris, Robert,
> 
> You make very good points about the meaning of DR, this is where I 
> assumed the "religious war" would break out.
> 
> Chris - what is the issue with 'LIKE hardware'?  Are you thinking about 
> time to recover or what?  I'm well aware of the tape issues (Backed up 
> by one unit, can't be read by another.  If you want to get a tape 
> hardware sales representative to choke just ask them "What if the tape I 
> backed up with your hardware can't be read on another of your units, 
> what will you do for me?" - from personal experience at a trade show... 
> In talking to the data recovery people I found that they wanted to know 
> the brand, version and revisions of the tape backup software as well as 
> the model and firmware revision of the tape unit itself.)

If your sites do not use the EXACT same hardware (including mods made
over time), you will have created a huge variable for recovery.  Many
companies perform fake DR scenarios every year or two, which helps
catch these kinds of problems.  But if the hardware is too different,
it can be an absolute nightmare, usually resulting in the replacement
of hardware after the DR scenario is over.
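
One way to catch drift between sites before a DR test is to diff a
recorded hardware inventory for each site.  This is only a rough sketch:
the inventory dictionaries, their keys and their values below are
hypothetical, and how you collect them is up to you.

```python
def diff_inventory(primary, standby):
    """Return {component: (primary_value, standby_value)} for every mismatch.

    A component missing from one site shows up with None on that side.
    """
    components = sorted(set(primary) | set(standby))
    return {
        c: (primary.get(c), standby.get(c))
        for c in components
        if primary.get(c) != standby.get(c)
    }


# Hypothetical inventories, one per site (illustrative values only).
primary = {"cpu": "Xeon 32-bit", "nic": "model-A", "raid_firmware": "2.1"}
standby = {"cpu": "Xeon 32-bit HT", "nic": "model-A", "tape_drive": "unit-B"}

for component, (a, b) in diff_inventory(primary, standby).items():
    print(f"{component}: primary={a!r} standby={b!r}")
```

Anything this reports is a variable you will meet during recovery, so
it is cheaper to find it here than during the DR scenario.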

> 
> If there's more to it than that, what is it?  I would hope that IDE and 
> SCSI/RAID devices would be sufficiently abstracted that a tape restore 
> wouldn't care as long as the target device had adequate capacity and had 
> the same filesystem.  I'm also assuming that ALL of the business data is 
> on a separate partition than where the OS resides.  I guess I should ask
> about video and input device issues as well.

Certainly you should always strive to minimize the differences by using
devices that are well abstracted.  Things that differ broadly include
CPUs, motherboard chipsets, ethernet NICs, RAID controllers, SCSI
HBAs, Fibre Channel HBAs... many of those things are considered to be
"well abstracted" components.  Alas, it's not that simple.

Even the difference between a 32-bit Xeon and a 32-bit Xeon with
HT can be an issue... and it is certainly a problem if the other
processor is EM64T or AMD (look out!).

Chipsets can mean the difference between a working DMA and a
non-working DMA... or the difference between seeing
a hard drive and not seeing one.

Yes.. most of this has to do with time.  So if time is infinite
after a disaster (and it usually takes longer than mgmt realizes...
they tend to assume that after Dallas is taken out by an H-bomb,
their employees will pick right back up and get the business
running by the end of the week... HAH!)... then the incompatibilities
are simply part of the process.  You deal with it.

> 
> All,
> 
> Along the lines of the meaning of DR, one issue that no one raised is a 
> "partial disaster" - the hardware is intact but the system won't boot or 
> the service won't start.  In this case what would you document?  What 
> came to mind for me was menu.lst/grub.conf/lilo.conf and the critical 
> service configuration files.
> 
> Another issue I didn't see (or maybe this is the reason for the "all of 
> /etc" statements) is special features in use such as quotas or Extended 
> ACLs.  Would a tape restore adequately handle these or is documentation 
> called for?

ACLs are still very non-standard across *ix... but if we're talking just
Linux, then you just have to make sure to use a backup format
that can handle ACLs.  I know that s-tar handles them (though it had
some serious bugs until very recently)... not sure whether GNU tar has
added adequate support or not.  My solution... kill the extended ACL
support; it's still baking.
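
If you do keep ACLs, it is worth checking that a given archive actually
carries them rather than trusting the tool.  The sketch below assumes the
archiver writes a pax-format archive and records ACLs in extended headers
under the keys SCHILY.acl.access and SCHILY.acl.default.  That is an
assumption to confirm for your tool and version, and the demo archive and
file names are made up for illustration.

```python
import io
import tarfile

# Assumption: ACL-aware archivers store POSIX ACLs in pax extended
# headers under these keys; verify for the tool you actually use.
ACL_KEYS = ("SCHILY.acl.access", "SCHILY.acl.default")


def members_with_acls(archive):
    """Return the names of archive members that carry an ACL pax header."""
    return [
        m.name
        for m in archive.getmembers()
        if any(key in m.pax_headers for key in ACL_KEYS)
    ]


def build_demo_archive():
    """Build a tiny in-memory pax archive: one member with an ACL header,
    one without (hypothetical names and ACL text)."""
    entries = (
        ("data/report.txt", {"SCHILY.acl.access": "user::rw-,group::r--,other::---"}),
        ("data/plain.txt", {}),
    )
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        for name, headers in entries:
            info = tarfile.TarInfo(name)
            info.pax_headers = headers
            tar.addfile(info, io.BytesIO(b""))
    buf.seek(0)
    return buf


with tarfile.open(fileobj=build_demo_archive()) as tar:
    print(members_with_acls(tar))
```

Point members_with_acls at a real backup opened with tarfile.open() and
an empty list tells you the ACLs never made it onto the tape image.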