
3.6 Identify the purpose and characteristics of disaster recovery.

RAID is only part of a network’s fault tolerance. Another important aspect is disaster recovery. A crashed hard drive or a virus is not the only thing that can wipe out your data. There are also physical disasters such as theft, vandalism (physical and virtual), floods, fires, and hurricanes. Your network’s ability to recover from disasters is an important part of its fault tolerance plan.

Disaster is an occurrence causing widespread destruction and distress; a catastrophe or a grave misfortune. - Source: dictionary.com.

Guidelines

A disaster can be as catastrophic as a tornado destroying the primary operation site or as mundane as the accidental loss of critical data. How a network administrator copes with an inevitable disaster depends on the type of disaster and on how much disaster-recovery planning was done beforehand.

Sometimes disaster recovery is as simple as restoring a backup or rebooting a server. Other disasters require advance planning, such as keeping tested spares of all critical replaceable hardware components in stock in case a server fails.

Always have a documented disaster recovery plan. Update your documentation and test it regularly. Periodically ensure that procedures are current and accurate.

While disaster recovery can be a complicated process, there are several basic guidelines for larger enterprises:

Always keep a set of the data OFFSITE.
Establish an alternative site (cold site; offsite new building; maybe even a different state).
Prepare a special group of people to work at your alternative site and devise a plan to get them to the new site from the disaster site. Consider a rotating schedule of different technical staff members.
Decide what products are needed to support the recovery process, acquire them, and train in their use.
Some disaster recovery scenarios even include having a complete duplicate of your server standing by, in case of disaster.
Simple items such as an uninterruptible power supply (UPS) can save you a lot of headaches, for example during power failures.
Disaster-recovery management should always begin with a planning meeting.

See the Cramsession article on how to test and replace batteries in the APC UPS 1000.
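An offsite copy of the data, as the first guideline above calls for, only helps if it is actually identical to the original. The following is a minimal sketch of one way to check that, by comparing SHA-256 checksums of a file and its copy. The function names and the idea of comparing file by file are assumptions made for illustration, not part of any particular backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in fixed-size chunks so large backups do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original: Path, offsite_copy: Path) -> bool:
    """True only if both files exist and have identical contents."""
    if not (original.is_file() and offsite_copy.is_file()):
        return False
    return sha256_of(original) == sha256_of(offsite_copy)
```

A check like this could be run as part of the regular testing of the recovery plan, so that a damaged or missing copy is discovered before it is needed.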

Questions to ask when developing a disaster recovery plan:

What will the company need if disaster strikes?
Which department(s) have priority for getting back online first?
How much of the data is at risk?
What is the minimum and maximum downtime you can afford?
What is your cost per minute if your system is down?
Are there redundant networks that can replace your failed system?
Would a clustered environment minimize your risk?
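The downtime and cost-per-minute questions above come down to simple arithmetic. The sketch below shows that arithmetic under the assumption that the cost per minute is constant for the whole outage. The function names and the dollar figures are hypothetical and are used only for illustration.

```python
def downtime_cost(cost_per_minute: float, minutes_down: float) -> float:
    """Total loss for an outage, assuming a constant cost per minute."""
    if cost_per_minute < 0 or minutes_down < 0:
        raise ValueError("cost and downtime must be non-negative")
    return cost_per_minute * minutes_down


def max_affordable_downtime(cost_per_minute: float, loss_budget: float) -> float:
    """Minutes of downtime before losses reach the given budget."""
    if cost_per_minute <= 0:
        raise ValueError("cost per minute must be positive")
    return loss_budget / cost_per_minute


# Hypothetical figures: $500 lost per minute, a 90-minute outage,
# and a maximum tolerable loss of $10,000.
print(downtime_cost(500.0, 90.0))               # 45000.0
print(max_affordable_downtime(500.0, 10000.0))  # 20.0
```

Numbers like these help decide how much to spend on standby hardware, redundant networks, or clustering.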


The Three Hots (Spare, Plug, and Swap) and Fail Over

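Hot Spare – A standby drive that is already installed and powered up in the server, ready to take over when an active drive fails. This refers to the drive itself. A spare drive that is merely kept on hand must be hot swapped/plugged into the server before it can be used.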
Hot Plug/Swap – Replacing a hard drive, CD-ROM drive, power supply, or other device with a similar device without shutting down the server. Hot plugging is supported by Universal Serial Bus (USB), IEEE 1394, and PCMCIA. Caution: Do not confuse this Hot Plug with Hot Plug PCI, which is the ability to plug a device into a PCI slot while the PCI bus remains online.
Fail Over – When one device, database, server, or network fails, a standby automatically takes its place. This is an important fault tolerance function in a mission-critical environment where constant accessibility is a must.
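The fail over idea can be sketched in a few lines: keep the devices in priority order and use the first one that passes a health check. This is only an illustration of the selection logic. The names `pick_active` and `is_healthy`, and the server labels, are hypothetical, and a real fail over system would also have to detect failures and redirect traffic.

```python
from typing import Callable, Optional, Sequence


def pick_active(
    servers: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy server in priority order, or None."""
    for server in servers:
        if is_healthy(server):
            return server
    return None


# Hypothetical example: the primary has failed, so the standby takes over.
status = {"primary": False, "standby": True}
active = pick_active(["primary", "standby"], lambda name: status[name])
print(active)  # standby
```

If every server fails the check, the function returns None, which corresponds to an outage that fail over alone cannot cover.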