[NTLUG:Discuss] Disaster recovery

Sun Aug 14 02:54:24 CDT 2005

On 8/14/05, Leroy Tennison <leroy_tennison at prodigy.net> wrote:
> I guess I should qualify what I mean by "Disaster Recovery", I'm
> assuming that there is a data backup of the system.  My concern is what
> I need to know to get to a point where that data restore is usable
> (functioning system with a configuration that will accept the restore).
>  In this situation I don't see a need for making a copy of all of /etc
> but I could be missing something.  If I am please educate me.

There are two "hard" definitions in Disaster Recovery. These are:
(1) RPO - Recovery Point Objective (how much recent data you can afford to lose)
(2) RTO - Recovery Time Objective (how long you can afford to be down)

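The "what I need to know to get to a point" above is really about RTO:
everything you must do before the restore is usable counts against the clock.
If you have a one hour RTO then your recovery better be almost instantaneous,
which in practice also means an RPO close to zero.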
There is no time to find, load and spin tapes. 
Even a four hour RTO does not allow tape spinning if you have a lot of data.
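
As a rough illustration of the tape arithmetic, here is a minimal Python
sketch. The function names and every number in it (data size, sustained
restore throughput, fixed time to find and load tapes) are assumptions made
up for the example, not measurements from any real site.

```python
def restore_hours(data_gb, throughput_mb_s, fixed_overhead_hours):
    """Estimate total restore time in hours.

    data_gb              -- amount of data to restore, in GB (assumed)
    throughput_mb_s      -- sustained restore rate, in MB/s (assumed)
    fixed_overhead_hours -- time to find, load and position tapes (assumed)
    """
    transfer_seconds = data_gb * 1024 / throughput_mb_s
    return fixed_overhead_hours + transfer_seconds / 3600


def meets_rto(data_gb, throughput_mb_s, fixed_overhead_hours, rto_hours):
    """Return True if the estimated restore fits inside the RTO."""
    estimate = restore_hours(data_gb, throughput_mb_s, fixed_overhead_hours)
    return estimate <= rto_hours


if __name__ == "__main__":
    # Hypothetical figures: 30 MB/s from tape, 1 hour to find and load tapes.
    for size_gb in (100, 2000):
        hours = restore_hours(size_gb, 30, 1)
        verdict = "fits" if meets_rto(size_gb, 30, 1, 4) else "misses"
        print(f"{size_gb} GB: about {hours:.1f} h, {verdict} a 4 hour RTO")
```

With these made-up figures a small restore fits inside four hours and a
large one does not come close. Plug in your own numbers before drawing
any conclusions.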

Most Financial Services companies have a four hour RTO if they do any
open market trading.
If they go down and are not completely back up in four hours they are
out of business.
This means the main goal is to never go down. 
They all have very good Disaster Recovery plans and test them
regularly. These are all mainframe based. Open Systems do not have the
high hardware reliability of mainframes. This is why Open Systems are
cheaper.
The reason Financial Services companies are out of business with a
four hour outage has to do with the nature of the business. They are
constantly trading in the market. These trades must be recorded and
reconciled within a matter of minutes or the firm gets overwhelmed and
can never catch up on recording and reconciling them. After four hours
they are hopelessly behind.

If you feel really adventurous and either own your own company or are
married to the owner's daughter, you can simulate this environment by
simply pulling the plug on the email service for four hours in the
middle of the day. Maybe less. No guts, no glory.

> Along the lines of the meaning of DR, one issue that no one raised is a
> "partial disaster" - the hardware is intact but the system won't boot or
> the service won't start.  In this case what would you document?  What
> came to mind for me was menu.lst/grub.conf/lilo.conf and the critical
> service configuration files.

IMHO this is not Disaster Recovery. These are typical operational problems,
similar to the hard disk failures we all know and love. If you are
lucky, smoke will pour out of the box and not set off the Halon. This
will mean the disks are OK but the motherboard is fried. The silent,
smokeless failures are the most sinister.

We all love LiveCDs and now carry as many as we can fit in our
pockets. It used to be screwdrivers and other gadgets like the Dr. Who
sonic screwdriver and Captain Kirk's communicator. We used to make the
newbies train in Disaster Recovery on a system with no floppy or CD.
How did they load the OS in those days? You can now hook your
BlackBerry or cell phone up to the serial port and get an ASCII
console running, if you have the right cable.

One of the most common problems, not involving hardware, is database
corruption. This is not a Disaster Recovery scenario. It is a test of
good backup/recovery procedures and DBA knowledge and skills.
Typically you have one hour to recover.

Good questions. You are thinking along the right lines.
Disaster Recovery is like virtue. It is its own reward and it's a darn
good thing.

Thanks,  Robert