3.6 Identify the purpose and characteristics of disaster recovery.
RAID is only part of a network’s fault tolerance. Another
important aspect is disaster recovery. Data is not always lost to a crashed
hard drive or a virus. There are also physical threats, such as theft,
vandalism (physical and virtual), floods, fires, and hurricanes. Your
network’s ability to recover from disasters is an important part of its fault
tolerance plan.
Disaster is an occurrence causing widespread
destruction and distress; a catastrophe or a grave misfortune. - Source:
dictionary.com.
Guidelines
A disaster can be as catastrophic as a tornado destroying the
primary operation site or as mundane as the accidental loss of critical data.
How a network administrator copes with inevitable disasters depends on
the type of disaster and on how much disaster recovery planning is already in
place.
Sometimes disaster recovery is as simple as restoring a backup
or rebooting a server. Other disasters require planning, such as keeping
tested spares of all critical replaceable hardware components in stock in case
of a server failure.
Always have a documented disaster recovery plan. Update
your documentation and test it regularly. Periodically ensure that procedures
are current and accurate.
While disaster recovery can be a complicated process, there are
several basic guidelines for larger enterprises:
- Always keep a set of the data OFFSITE.
- Establish an alternative site (cold site; offsite new building; maybe even
a different state).
- Prepare a special group of people to work at your alternative site and
devise a plan to get them to the new site from the disaster site. Consider a
rotating schedule of different technical staff members.
- Decide what products are needed to support the recovery process, acquire
them, and train in their use.
- Some disaster recovery scenarios even include having a complete duplicate
of your server standing by, in case of disaster.
- Simple items such as an uninterruptible power supply (UPS) can save you a
lot of headaches, for example during power failures.
- Disaster-recovery management should always begin with a planning meeting.
See the Cramsession article on how to test and replace batteries
in the APC UPS 1000.
Questions to ask when developing a disaster recovery plan:
- What will the company need if disaster strikes?
- Which department(s) have priority for getting back online first?
- How much of the data is at risk?
- What is the minimum and maximum downtime you can afford?
- What is your cost per minute if your system is down?
- Are there redundant networks that can replace your failed system?
- Would a clustered environment minimize your risk?
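The downtime and cost-per-minute questions above come down to simple arithmetic. The following is a minimal Python sketch of that calculation. The function names and the dollar figures are hypothetical and chosen only for illustration; substitute numbers from your own organization.

```python
def downtime_cost(cost_per_minute: float, minutes_down: float) -> float:
    """Estimated loss for an outage of the given length."""
    if cost_per_minute < 0 or minutes_down < 0:
        raise ValueError("cost and duration must be non-negative")
    return cost_per_minute * minutes_down


def max_affordable_downtime(cost_per_minute: float, loss_budget: float) -> float:
    """Longest outage, in minutes, that stays within a loss budget."""
    if cost_per_minute <= 0:
        raise ValueError("cost per minute must be positive")
    return loss_budget / cost_per_minute


if __name__ == "__main__":
    # Hypothetical figures: $250 lost per minute of downtime.
    print(downtime_cost(250.0, 90))                 # 22500.0
    print(max_affordable_downtime(250.0, 10000.0))  # 40.0
```

Estimates like these can help decide whether redundant networks or a clustered environment are worth their cost.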
The Three Hots (Spare, Plug, and Swap) and Fail Over
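- Hot Spare – A standby drive kept ready to replace a failed drive in a
server. In a RAID array it is typically installed and powered on so that it
can take over automatically. This refers to the drive itself. If the
server supports it, the spare can be hot swapped or hot plugged into place.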
- Hot Plug/Swap – Replacing a hard drive, CD-ROM drive, power
supply, or other device with a similar device without shutting down
the server. Hot plugging is supported by Universal Serial Bus (USB),
IEEE 1394, and PCMCIA.
Caution: Do not confuse this Hot Plug with Hot Plug PCI, which is the
ability to plug a device into a PCI slot while the PCI bus remains online.
- Fail Over – When one device, database, server, or network fails,
a standby automatically takes its place. This is an important fault
tolerance function in a mission-critical environment where constant
accessibility is a must.
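The fail over idea can be sketched in a few lines of Python. This is an illustrative model only, not how any particular cluster product works: the Node class, its healthy flag, and the active_node function are hypothetical names, and a real system would detect failures with heartbeats or health checks rather than a manually set flag.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A server in the cluster (hypothetical model for illustration)."""
    name: str
    healthy: bool = True


def active_node(nodes: List[Node]) -> Optional[Node]:
    """Return the first healthy node in priority order, or None if all are down."""
    for node in nodes:
        if node.healthy:
            return node
    return None


if __name__ == "__main__":
    primary = Node("primary")
    standby = Node("standby")
    cluster = [primary, standby]  # listed in priority order

    print(active_node(cluster).name)  # primary
    primary.healthy = False           # simulate a failure of the primary
    print(active_node(cluster).name)  # standby
```

Because the standby is selected as soon as the primary is marked unhealthy, clients that always ask for the active node keep working without manual intervention.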