[NTLUG:Discuss] Disaster recovery

Sun Aug 14 02:30:44 CDT 2005

Robert Pearson wrote:

>On 8/12/05, Leroy Tennison <leroy_tennison at prodigy.net> wrote:
>  
>
>>At the risk of starting a "religious" war, what would you consider to be
>>the crucial things to document about a Linux system for the purpose of
>>disaster recovery?
>>    
>>
>
>It sounds like you are asking about quickly, or even at all,
>rebuilding a server that is lost completely. That is one common type
>of Disaster Recovery (DR). IMHO Disaster Recovery (DR) is more of a
>process than a series of discrete steps. If you start laying out the
>process the discrete steps fall right into place. This is the best
>reason to test DR before you really need it. Did you get the process
>right?
>
>You could think of Disaster Recovery as one of two types:
>(1) Local Disaster Recovery - loss of a server, a cluster of servers,
>a group of servers, a roomful of servers, a part of a floor of servers
>or the entire floor, or a site building housing some of the servers.
>
>You can group these by:
>Performance Hit - loss of a server, a cluster of servers
>Revenue Hit - LOB (Line or Lines of Business) Hit - a group of
>servers, a roomful of servers, a part of a floor of servers or the
>entire floor, or a site building housing some of the servers.
>
>(2) Site Disaster Recovery - the passive mirror site, the active
>mirror site, the site (building) all the Production servers are at (or
>in) or the "Total (Nuclear) Meltdown".
>Total (Nuclear) Meltdown Definition - loss of the Production site, the
>mirror site (or sites), the Disaster Recovery site and all the
>recovery material stored offsite and/or loss of all key Disaster
>Recovery personnel. The knowledge left with the building.
>
>Both of the above are server scenarios. 
>
>There is actually a third type of Disaster Recovery that has to do with Storage.
>(1) Revenue Hit - you lose the storage for the Information that
>generates 80% of your revenue
>(2) Out of Business Hit - you lose the storage for the "key"
>Information that keeps you in business
>
>Since most people do not know what that Information is for (1) and (2),
>they back up everything and plan to recover everything in case of a
>Disaster. Content Management can be a big help here.
>
>In the past this was not a big issue because the storage was Direct
>Attached to the servers. Since NAS and SAN and Virtualization have
>become real this has become an issue. On the one hand if your servers
>and storage are Geographically Dispersed from each other you are more
>secure from Disasters. On the other hand you are not.
>
>The Electronic Inventory and Imaging Server solutions address the
>original question more directly and will be in a separate email. The
>Electronic Inventory is similar to what "cfengine" or "servdoc" or any
>good Configuration Management software does. It gets a "point-in-time
>snapshot" of the configuration to give you "Configuration at a Glance"
>for really quick recovery. An Imaging Server is just a bunch of "dd"
>image files for quick recovery of servers. It can be more, much more.
>If you have the time...
>
>Thanks,  Robert
>
I want to thank everyone (once, rather than replying to each message) 
for their input.  Some good points surfaced.

Now for the questions:

David, Thomas,

I guess I should qualify what I mean by "Disaster Recovery".  I'm 
assuming that there is a data backup of the system.  My concern is what 
I need to know to get to the point where that data restore is usable 
(a functioning system with a configuration that will accept the 
restore).  In this situation I don't see a need for making a copy of 
all of /etc, but I could be missing something.  If I am, please educate 
me.
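
To make "what I need to know" a little more concrete, here is a rough 
sketch of the kind of minimum record I have in mind.  The function name 
and the choice of fields are my own assumptions, not a tested 
procedure; it only uses Python's standard library.

```python
import platform
import socket


def system_summary():
    """Return basic facts needed to rebuild a host that can accept a restore."""
    return {
        "hostname": socket.gethostname(),
        "os": platform.system(),       # e.g. "Linux"
        "kernel": platform.release(),  # kernel release string
        "arch": platform.machine(),    # hardware architecture
    }


# Print the record so it can be pasted into the DR documentation.
for key, value in system_summary().items():
    print(f"{key}: {value}")
```

A real record would presumably also need the distribution release and 
the installed package list, which this sketch leaves out.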

Chris, Robert,

You make very good points about the meaning of DR.  This is where I 
assumed the "religious war" would break out.

Chris - what is the issue with 'LIKE hardware'?  Are you thinking about 
time to recover, or something else?  I'm well aware of the tape issues: 
a tape backed up by one unit may not be readable by another.  (If you 
want to get a tape hardware sales representative to choke, just ask 
them "What if the tape I backed up with your hardware can't be read on 
another of your units, what will you do for me?" - from personal 
experience at a trade show...)  In talking to the data recovery people 
I found that they wanted to know the brand, version and revision of the 
tape backup software as well as the model and firmware revision of the 
tape unit itself.

If there's more to it than that, what is it?  I would hope that IDE and 
SCSI/RAID devices would be sufficiently abstracted that a tape restore 
wouldn't care, as long as the target device had adequate capacity and 
the same filesystem.  I'm also assuming that ALL of the business data is 
on a separate partition from the one where the OS resides.  I guess I 
should ask about video and input device issues as well.
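
Since the restore target needs matching filesystems and a known 
partition layout, one thing I could record is which device is mounted 
where and with what filesystem type.  A minimal sketch follows, 
assuming the input looks like /proc/mounts (device, mount point, type, 
options, ...).  The function name and the sample data are made up for 
illustration.

```python
def parse_mounts(text):
    """Parse /proc/mounts-style text into device/mount/fstype records."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and pseudo-filesystems such as proc or tmpfs.
        if len(fields) < 3 or not fields[0].startswith("/dev/"):
            continue
        records.append(
            {"device": fields[0], "mount": fields[1], "fstype": fields[2]}
        )
    return records


# Hypothetical sample; on a live system read the text from /proc/mounts.
sample = """\
/dev/sda1 / ext3 rw 0 0
proc /proc proc rw 0 0
/dev/sdb1 /srv/data ext3 rw,usrquota 0 0
"""

for rec in parse_mounts(sample):
    print(rec["device"], rec["mount"], rec["fstype"])
```

This records layout only.  Partition sizes would have to come from 
somewhere else, such as the partitioning tool's own listing.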

All,

Along the lines of the meaning of DR, one issue that no one raised is a 
"partial disaster" - the hardware is intact but the system won't boot or 
the service won't start.  In this case what would you document?  What 
came to mind for me was menu.lst/grub.conf/lilo.conf and the critical 
service configuration files.
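
For the boot loader files, something like the following could keep 
copies alongside the documentation.  This is only a sketch: the paths 
in BOOT_FILES are my guesses at common locations and will differ by 
distribution, and collect_files is a name I made up.  Files that are 
absent or unreadable are reported rather than treated as errors.

```python
import shutil
import tempfile
from pathlib import Path


def collect_files(paths, dest):
    """Copy each existing file under dest, preserving its directory layout.

    Returns (copied, missing) as lists of path strings.
    """
    dest = Path(dest)
    copied, missing = [], []
    for path in map(Path, paths):
        try:
            if not path.is_file():
                raise FileNotFoundError(path)
            # /boot/grub/menu.lst becomes <dest>/boot/grub/menu.lst
            target = dest / path.relative_to(path.anchor)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied.append(str(path))
        except OSError:
            missing.append(str(path))
    return copied, missing


# Assumed locations; adjust for the distribution in use.
BOOT_FILES = ["/boot/grub/menu.lst", "/boot/grub/grub.conf", "/etc/lilo.conf"]

dest = tempfile.mkdtemp(prefix="dr-docs-")
copied, missing = collect_files(BOOT_FILES, dest)
print("copied:", copied)
print("not found:", missing)
```

The same function would work for the critical service configuration 
files by passing a different list of paths.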

Another issue I didn't see (or maybe this is the reason for the "all of 
/etc" statements) is special features in use such as quotas or Extended 
ACLs.  Would a tape restore adequately handle these or is documentation 
called for?
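
Whatever the answer on the restore side, I could at least record which 
filesystems have these features turned on.  Below is a minimal sketch 
that scans fstab-style text for the usrquota, grpquota, acl and 
user_xattr mount options.  The function name and sample entries are 
hypothetical, and this says nothing about whether a given backup tool 
preserves the quota files or the ACLs themselves.

```python
FEATURE_OPTIONS = {"usrquota", "grpquota", "acl", "user_xattr"}


def special_features(fstab_text):
    """Map each mount point to the quota/ACL mount options it uses."""
    found = {}
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0]  # drop comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        options = set(fields[3].split(","))
        used = sorted(options & FEATURE_OPTIONS)
        if used:
            found[fields[1]] = used
    return found


# Hypothetical sample; on a live system read the text from /etc/fstab.
fstab_sample = """\
# device   mount  type  options                 dump pass
/dev/sda1  /      ext3  defaults                1    1
/dev/sdb1  /home  ext3  defaults,usrquota,acl   1    2
"""

for mount, options in special_features(fstab_sample).items():
    print(mount, ",".join(options))
```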