Disaster Recovery: Backing Up and Restoring
Every MIS or network administrator has a horror story
to tell about backing up and restoring systems or data.
One organization, where we now manage more than a dozen
backup servers, has data processing centers spread all over
the United States, and all are inter-connected via a large pri-
vate wide area network. In mid-1999, a valuable remote
Microsoft SQL Server machine just dropped dead. The IT
doctor said it had died of exhaustion . . . five years of faithful
service and never a day’s vacation. After trying everything
to revive it, we instructed the data center’s staff to ship the
server back to HQ for repairs.
The first thing we asked the IT people at the remote office
was: “You’ve been doing your backups every day, right?” “Sure
thing,” they replied. “Every day for the past five years.” They
sounded so proud, we were overjoyed. “Good, we will have to
rebuild your server from those tapes, so send them all to us
with the server.” To cut a frustrating story short: The five
years’ worth of tapes had nada on them, not a bit nor a byte.
Zilch. We spent two weeks trying to make sense of what was
on that SQL Server computer and to rebuild it. We refuse to
even guess the cost of that loss.
We have another horror story we will later relate, but this
example should make it clear to you that backup administra-
tion, a function of disaster recovery, is one of the most impor-
tant IT functions you will have the fortune to be charged with.
Backup administrators need to be trained, responsible, and
cool people. They need to be constantly revising and refining
their practice and strategy; their companies depend on them.
CHAPTER 17
In This Chapter
✦ Understanding Backup Practice and Procedure
✦ Removable Storage and Media Pools
✦ Using the Backup Tools that Come with Windows 2000
4667-8 ch17.f.qc 5/15/00 2:07 PM Page 643
644
Part V ✦ Availability Management
This chapter serves as an introduction to disaster recovery/backup-restore proce-
dures on Windows 2000 networks, the Backup-Restore utility that ships with the
Windows 2000 operating system, and the Windows 2000 Removable Storage Manager.
Before we get into this chapter, we should consider several angles on the backup/
restore functions expected of administrators.
Why Back Up Data?
You back up for two reasons, and even Windows 2000, with its fancy tools, rarely
highlights the differences:
✦ Record-keeping (such as annual backups performed every month)
✦ Disaster Recovery (DR) or System Recovery
You should decide when a file is no longer needed for disaster recovery; at that
point, it should be archived for record-keeping. Depending on your company’s needs,
this period may range from a week or two to a month or two, or even years.
There is no point buying media for
annual backups for a site you know is due to close in six months.
What To Back Up
Often, administrators back up every file on a machine or network and dump the
whole pile into a single backup strategy. Instead, they should be splitting up their
files into two distinct groups: System and Data.
✦ System files comprise files that do not change between versions of the applica-
tions and operating systems.
✦ Data files comprise all the files that change every day, such as word-processing
files, database files, spreadsheet files, media files, graphics files, and configuration
files (like the registry, DHCP, WINS, DNS, and the Active Directory databases).
Depending on your business, data files can change from 2 percent a day
on the low side to 80 percent a day on the high side. The average in many of the
businesses for which we have consulted is around 20 percent of the files chang-
ing every day. And, you must also consider the new files that arrive.
Understanding the requirements will make your life in the admin seat easier,
because this is one of the most critical of all IT or network admin jobs. One per-
son’s slip-up can cause millions of dollars in data loss. How often have you backed
up an entire system that was lost for some reason, only to find that to restore it,
you had to reinstall from scratch? “So why was I backing up the system,” you might
have asked yourself. And how often have you restored a file for a user who then
complained he or she lost five days’ worth of work on the file because the restore
was so outdated? It’s happened to us on many occasions and is very disheartening
if you are trying so hard to keep your people productive.
There is nothing worse than trying to recover lost data, knowing that all on Mahogany
Row are sitting idle, with the IT director standing behind you in the server room, and
discovering you cannot recover. The thought of your employment record being pulled
should make you realize how important it is to pay attention to this function.
We will delve into these two subjects in more depth in this chapter and explore how
Windows 2000 helps us better manage our recovery and record-keeping processes.
We will start by focusing on the data side of the backup equation and finally lead
this discussion into system backup/restore.
Understanding Backup
Before you can get started using Windows 2000 Backup, or any other backup pro-
gram, you need to know how backing up works and have a basic backup strategy
in mind.
Archive Bits
The archive bit is a flag, or a unit of data, indicating that the file has been modified.
When we refer to the setting of the archive bit, we mean that we have turned it on,
or we have set it to “1.” Turning it off means we set it to zero or “0.” If the archive
bit is on, the file has been modified since it was last backed up.
Trusting the state of the archive bit, however, is not an exact science by any means,
because it is not unusual for other applications (and developers) and processes to
mess with the archive bit. This is the reason we recommend that a full backup be
performed on all data at least once a week.
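As a rough illustration of the on/off convention described above, the following sketch treats a file’s attributes as an integer bit field. The value 0x20 is the Win32 archive attribute flag (FILE_ATTRIBUTE_ARCHIVE); the three helper names are our own hypothetical ones and are not part of Backup or any other tool.

```python
# Win32 value of the archive attribute (FILE_ATTRIBUTE_ARCHIVE).
FILE_ATTRIBUTE_ARCHIVE = 0x20


def archive_bit_is_set(attributes: int) -> bool:
    """Return True if the archive bit is on ("1") in an attribute bit field."""
    return bool(attributes & FILE_ATTRIBUTE_ARCHIVE)


def set_archive_bit(attributes: int) -> int:
    """Turn the archive bit on, as happens when a file is modified."""
    return attributes | FILE_ATTRIBUTE_ARCHIVE


def clear_archive_bit(attributes: int) -> int:
    """Turn the archive bit off, leaving every other attribute untouched."""
    return attributes & ~FILE_ATTRIBUTE_ARCHIVE
```

On a Windows machine, os.stat(path).st_file_attributes should supply a bit field of this kind. Because anything that can write attributes can also flip this bit, a check like this is a hint and not proof that a file is unchanged.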
What Is a Backup?
A backup is an exact copy of a file (including documentation) that is stored on a
storage medium (usually in a compressed state) and kept in a safe place (usually at a
remote location) for use in the event the working copy is destroyed. Notice that we
placed emphasis on “including documentation,” because with every media holding
backups, you need to maintain a history or documentation of the files on the media.
This is usually in the form of labels and identification data on the media itself, on
the outside casing, and in spreadsheets, hard catalogs, or data ledgers in some
form or another. Without history data, the restore software will be unable to locate
your files and the backup will be useless. This is why it is possible to prepare a tape
for overwriting by merely formatting the label so that the backup software treats the
media as blank.
There are various types of backups, depending on what you back up and how often
you back it up:
✦ Archived backup: A backup that documents (in header files, labels, and
backup records) the state of the archive bit at the time of copy. The state (on-
off) of the bit indicates to the backup software that the file has been changed
since the last backup. When Windows 2000 Backup does an archived backup,
it sets the archive bit accordingly.
✦ Copy backup: An ad-hoc “raw” copy that ignores the archive bit state. It also
does not set the archive bit after the copy. A copy backup is useful for quick
copies between DR processes and rotations, or to pull an “annual” during the
monthly rotation (we discuss this later).
✦ Daily backup: This does not form part of any rotation scheme (in our book
anyway). It is just a backup of files that have been changed on the day of the
backup. We question the usefulness of the daily backup in Backup, because
mission-critical DR practice dictates the deployment of a manual or auto-
mated rotation scheme (described later). Also, Backup does not offer a sum-
mary or history of the files that have changed during the day. If you were
responsible for backing up a couple of million files a day . . . well, this just
would not fly.
✦ Normal backup: A complete backup of all files (that can be backed up), period.
The term “normal” is more a Windows 2000 term because this backup is more
commonly called a “full” backup in DR circles. The full backup copies all files
and then clears the archive bit (turns it off) to indicate to Backup that the files have been
backed up. You would do a full backup at the start of any backup scheme. You
would also have to do a full backup after making changes to any scheme. A full
backup, and documentation or history drawn from it, is the only means of per-
forming later incremental backups. Otherwise, the system would not know
what has or has not changed since the last backup.
✦ Incremental backup: A backup of all files that have changed since the last
full or incremental backup. The backup software clears the archive bit, which
denotes that the files have been backed up. Under a rotation scheme,
a full restore would require you to have all the incremental media used in
the media pool, all the way back to the first media, which contains the full
backup. You would then have the media containing all the files that have
changed (and versions thereof) at the time of the last backup.
✦ Differential backup: This works exactly as the incremental, except that it
does not do anything to the archive bit. In other words, it does not mark the
files as having been backed up. When the system comes around to do a differ-
ential backup, it will rely on comparison of the files to be backed up with the
original catalog. Differential backups are best done on a weekly basis, along
with a full or normal backup, so as to keep differentials comparing against
recently backed up files.
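The differences between these backup types are easiest to see in a small simulation. The sketch below is ours and makes two assumptions: the archive bit is on for any file modified since it was last backed up, and normal and incremental backups turn the bit off while differential and copy backups leave it alone. The differential here selects files by archive bit, not by catalog comparison, which is a simplification. All class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    name: str
    archive_bit: bool = True  # on: modified since the last backup


def normal_backup(files):
    """Copy every file, then turn each archive bit off."""
    copied = [f.name for f in files]
    for f in files:
        f.archive_bit = False
    return copied


def incremental_backup(files):
    """Copy files whose archive bit is on, then turn those bits off."""
    copied = [f.name for f in files if f.archive_bit]
    for f in files:
        f.archive_bit = False
    return copied


def differential_backup(files):
    """Copy files whose archive bit is on, leaving the bits untouched."""
    return [f.name for f in files if f.archive_bit]


def copy_backup(files):
    """Copy every file and ignore the archive bit entirely."""
    return [f.name for f in files]


def modify(entry):
    """Simulate an edit: the archive bit is turned on."""
    entry.archive_bit = True
```

In this model, each differential grows to include everything changed since the last normal or incremental backup, whereas each incremental holds only what changed since the previous run. That is why a restore from incrementals needs every medium in the chain, while a restore from differentials needs only the full backup and the latest differential.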
What Is a Restore?
A restore is the procedure you perform to replace a working copy of a file or collec-
tion of files to a computer’s hard disks in the event they are lost or destroyed. You
will often perform a restore for no reason other than to return files to a former state
(such as when a file gets mangled, truncated, corrupted, or infected with a virus).
Restore management is crucial in the DR process. If you lose a hard disk or the
entire machine (for example, it is trashed, stolen, lost, or fried in a fire), you will
need to rebuild the machine and have it running in almost the same state (if not
exactly) as its predecessor was in at the time of the loss. How you manage your DR
process will determine how much downtime you experience or the missing genera-
tion of information between the last backup and the disaster — a period we call void
recovery time.
Understanding How Backup Works
A collection of media, such as tapes or disks, is known as a backup set (this is differ-
ent from a media pool, which we will discuss in a bit). The backup set is the backup
media containing all the files that were backed up during the backup operation.
Backup uses the name and date of the backup set as the default set name. Backup
allows you to either append to a backup set in future operations or replace or over-
write the files in the media set. It allows you to name your backup set according to
your scheme or regimen.
Backup also completes a summary or history catalog of the backed-up files, which
is called a backup set catalog. If your backup set contains several media, then the
catalog is stored on the last medium in the set, at the end of the file backup. The
backup catalog is loaded when you begin a restore operation. You will be able to
select the files and folders you need to restore from the backup catalog.
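To picture how a catalog stored on the last medium serves a restore, consider the following sketch. It is a toy model under our own assumptions (a fixed number of files per medium, and enough media labels supplied to hold every file); the names Medium, write_backup_set, and find_medium are hypothetical and do not reflect how Backup lays out its media.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Medium:
    label: str
    files: list = field(default_factory=list)
    catalog: Optional[list] = None  # only the last medium carries the catalog


def write_backup_set(file_names, labels, per_medium):
    """Spread files over media, then store the catalog on the last medium."""
    media = []
    for i, label in enumerate(labels):
        chunk = file_names[i * per_medium:(i + 1) * per_medium]
        media.append(Medium(label, chunk))
    # The catalog records which medium holds each file.
    media[-1].catalog = [(name, m.label) for m in media for name in m.files]
    return media


def find_medium(media, wanted):
    """Load the catalog from the last medium and look up where a file lives."""
    for name, label in media[-1].catalog:
        if name == wanted:
            return label
    return None
```

The practical consequence is that the last medium of a set must be on hand before a restore can even show you what the set contains.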
Removable Storage and Media Pools
Removable Storage (RS) is a new service in Windows 2000 that takes away a lot of
the complexity of managing backup systems. This service also brings network sup-
port to Windows for a wider range of backup and storage devices.
Microsoft took the responsibility of setting up backup devices and management of
media away from the old Backup application and created a central authority for
such tasks. This central authority is known as Removable Storage and is one of
the largest and most sophisticated additions to the operating system, worth the
price of the OS license alone, and a welcome member on any network. If you are
not ready to convert to a Windows 2000 network, you might consider raising a
Windows 2000 “Backup” server just to obtain the services of Removable Storage.
But Removable Storage is like an iceberg. In this chapter and in other parts of the
book, we can only show you the tip. Exposing the rest of this monster service, and
everything you can do with it, is beyond the scope of this treatise, and a full treat-
ment of the subject would run into several chapters. To fully appreciate this service,
and if you need to get into some serious disaster recovery strategies, possibly even
custom backup and media handling algorithms, you should refer to the Microsoft
documentation covering both the Removable Storage Service and its API and the
Tape/Disk API. A good starting place is the Windows 2000 Server Operations Guide,
which is part of the Resource Kit, discussed in Appendix B. We will, however, pro-
vide you with an introduction to the service, coming up next.
The Removable Storage Service
Removable Storage comprises several components. But the central nervous system
of this technology is the Removable Storage Service and the Win32 Tape/Disk API.
These two components, respectively, expose two application programming inter-
faces (APIs) that any third party can access to obtain removable storage functional-
ity and gain access to removable storage media and devices. The Backup program
that ships with the OS makes use of both APIs to provide a usable, but not too
sophisticated, backup service.
By using the two services, applications do not need to concern themselves with the
specifics of media management, such as identifying cartridges, changing them in
backup devices, cataloging, numbering, and so on. This is all left to the Removable
Storage Service. All the application requires is access to a media pool created and
managed by Removable Storage. The backup application’s responsibility is identify-
ing what needs to be backed up or restored, and the source and destination of data;
Removable Storage’s responsibility deals with where to store it, what to store it on,
and how to retrieve it. Essentially, the marriage of backup-restore applications and
Removable Storage has been consummated along client/server principles.
The Removable Storage Service can be accessed directly by programming against the
API. You can also work with it interactively (albeit not as completely as programming
against the API) in the Removable Storage node found in the Computer Management
snap-in (compmgmt.msc). The Removable Storage node is also present in the Remote
Storage snap-in discussed in Chapter 21. Before we begin with any hard-core backup
practice, let’s look at Removable Storage and how it relates to backup and disaster
recovery. Removable Storage is also briefly discussed in Chapter 16.
Figure 17-1: The Removable Storage Snap-in
The service provides the following functionality to backup applications, also known
as backup or data moving and fetching clients:
✦ Management of hardware, such as drive operations, drive health and status,
and drive head cleaning
✦ Mounting and dismounting of cartridges and disks (media)
✦ Media inventory
✦ Library inventory
✦ Access to media and their properties
Access to the actual hardware is hidden from client applications. But the central
component exposed to all clients is the media pool. To better understand the media
pool concept in Removable Storage, let’s first discuss media.
Backup media ranges from traditional tape cartridges (discussed at the end of this
chapter) to magnetic disk, optical disk, CD-ROM, DVD, and so on. More types of
media are becoming available, such as “sticks” and “cards” that you can pop into
cameras and pocket-sized PCs, but these are not traditional backup media formats,
nor can they hold the amount of data you would wish to store. DVD, a digital video
standard, however, is a good choice for backing up data because so much can be
stored on a single DVD disk.
Like the dynamic disk management technology discussed in Chapter 16, Removable
Storage hides the physical media from the clients. Instead, media is presented as a
logical unit, which is assigned a logical identifier or ID. When a client needs to store
or retrieve data from media, it does not deal with the physical media, but rather
with that media’s logical ID. The logical ID can thus encapsulate any physical media,
the format of which is of no concern to the client application.
Although the client need not be concerned about the actual media, you, the
backup administrator, have the power, by configuring media pools, to dictate onto
which format or media type your backups should be placed. If this is confusing to
you, it will become clearer when you understand media pools, discussed shortly.
The benefit of the logical ID is patent, but a good example of its application is that
the service is able to move data, represented by its logical ID, from one physical
medium to another. This would be desirable if media is approaching the end of its
life and the data needs to be moved to new cartridges.
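The logical ID idea can be sketched in a few lines. This is only an illustration of the indirection, assuming a simple one-to-one mapping; the class and method names are hypothetical and are not the Removable Storage API.

```python
class LogicalMediaMap:
    """Maps logical media IDs to whichever physical cartridge holds the data."""

    def __init__(self):
        self._physical = {}  # logical ID -> physical cartridge label

    def register(self, logical_id, cartridge):
        """Record which cartridge currently backs a logical ID."""
        self._physical[logical_id] = cartridge

    def locate(self, logical_id):
        """Return the cartridge that currently backs a logical ID."""
        return self._physical[logical_id]

    def migrate(self, logical_id, new_cartridge):
        """Re-home the data on a new cartridge; the logical ID is unchanged."""
        old = self._physical[logical_id]
        self._physical[logical_id] = new_cartridge
        return old
```

A client that stored data under a logical ID keeps using the same ID after a migration, which is exactly why aging cartridges can be retired without the client application noticing.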
Media formats can be extremely complex. Some media allow you to write to and read
from both sides; others allow access to only one side. How media is written to and
read from differs from format to format. Removable Storage handles all those pecu-
liarities for you. Just like the Print Spooler service, which can expose the various
features of thousands of different print devices, so can Removable Storage identify
many storage devices and expose their capabilities to you and the application. (The
pros and cons of each of the popular backup media formats are discussed at the
end of this chapter.)
Finally, and most important from the cost/benefit aspect, Removable Storage allows
media to be shared by various applications. This ensures maximum use of your
media asset.
The Removable Storage Database
Removable Storage stores all the information it needs about the hardware, media
pools, work lists, and more in its own database. This database is not accessible to
clients and is not a catalog of which files have been backed up and when. Everything
that Removable Storage is asked to do, or does, is automatically saved in this
database.
Physical Locations
Removable Storage also completely handles the burden of managing media location,
a chore once shared between the client applications and the administrator. But the
physical location service deals with more than knowing in which cupboard, shoebox,
vault, or offsite dungeon you prefer to store your media; it is also responsible
for the physical attributes of the hardware devices used for backing up and restoring
data. It is worthwhile to understand this section, because you will need such knowl-
edge to perform high-end backup services that protect a company’s data.
Removable Storage splits the location services into two tiers: libraries and offline
locations. If media is online, it is inside a library device of some kind that can
at any time be fired up to allow data to be accessed or backed up. If media is offline,
it means that you have taken it out of its drive or slot and sent it somewhere.
Note: As soon as you remove media from a device, Removable Storage makes a note in
its database that the media is offline.
Libraries can be single tape drives or highly sophisticated and very expensive
robotic storage silos comprising hundreds of drive bays. A CD-R/W tower, with 12
drives, is also an example of a library. Media in these devices or so-called libraries
are always considered online, and are marked as such in the database. Removable
Storage also understands the physical components that make up these devices.
Library components comprise the following:
✦ Drives: All backup devices are equipped with drives. The drive machinery
consists of the recording heads, drums, motors, and other electronics. To
qualify as a library, a device requires at least one drive.
✦ Slots: Slots are pigeonholes, pits, or holding pens in which online media is
held. When media is needed for a backup, a restore, or a
read, the cartridge or disk is pulled out of the slot and inserted into the drive.
When the media is no longer needed, the cartridge is removed from the drive
and returned to its slot. The average tape drive does not come equipped with a
slot, but all high-end, multi-drive robotic systems do. The basic slot-equipped
machine typically comes equipped with two drives and 15 slots. Slots are typi-
cally grouped into collections called magazines. Each magazine holds about
five cartridges, and one magazine maintains a cleaning cartridge in one of the
slots. You typically have access to magazines so that you can populate them
with the cartridges you fetched from offline locations.
✦ Transports: These are the robotic machines in high-end libraries that move
cartridges and disks from slots to drives and back again.
✦ Bar Code Readers: Bar coding is discussed later in this chapter. It is a means
by which the cartridges can be identified in their slots. You do not require a
bar code reader-equipped system to use a multi-drive or multi-slot system
because media identifiers can also be written to the media. But bar code read-
ing allows for much faster access to the cartridges, because the system does
not need to read information off the actual media, which requires every car-
tridge to be pulled from a slot and inserted into a drive, a process that could
take as long as five minutes for every cartridge.
✦ Doors: Doors differ from device to device and from library system to library
system. In some cases, the door looks like the door to a safe, which is released
by Removable Storage when you need to gain access to slots or magazines.
Many systems have doors that only authorized users can access. Some doors
are built so strong that you would need a blowtorch to open them. On many
cheaper devices, especially single drive-no slot hardware, the door is a small
lever that Removable Storage will release so that you can extract the cartridge.
Other devices have no doors at all, but when Removable Storage sends an
“open sesame” command to the “door,” the cartridge is ejected out of the
drive bay.
✦ Insert/Eject Ports: The IE ports are not supported on all devices. IE ports pro-
vide a high degree of controlled access to the unit in a multi-slot library sys-
tem. In other words, you insert media into the port, and the transport goes
and finds a free slot for it. Another way to comprehend the IE port function is
to compare it to a valet service. You hand your car keys to the valet, and he or
she goes and finds a free parking space for you.
If the hardware you attach supports any or all of these sophisticated features,
Removable Storage will be able to “discover” them and use them appropriately.
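The relationship between slots, drives, the transport, and the insert/eject port described above can be modeled roughly as follows. This is a toy of our own devising; the Library class and its methods are hypothetical and bear no relation to how any real changer is driven.

```python
class Library:
    """A toy library: numbered slots, numbered drives, and a transport."""

    def __init__(self, slot_count, drive_count):
        self.slots = [None] * slot_count    # cartridge label or None
        self.drives = [None] * drive_count  # cartridge label or None

    def insert(self, cartridge):
        """Insert/eject port: place the cartridge in the first free slot."""
        index = self.slots.index(None)  # raises ValueError if the library is full
        self.slots[index] = cartridge
        return index

    def mount(self, slot, drive):
        """Transport: move a cartridge from a slot into an empty drive."""
        if self.slots[slot] is None or self.drives[drive] is not None:
            raise RuntimeError("slot is empty or drive is busy")
        self.drives[drive], self.slots[slot] = self.slots[slot], None

    def dismount(self, drive, slot):
        """Transport: return a cartridge from a drive to an empty slot."""
        if self.drives[drive] is None or self.slots[slot] is not None:
            raise RuntimeError("drive is empty or slot is occupied")
        self.slots[slot], self.drives[drive] = self.drives[drive], None
```

The insert method plays the part of the valet: the caller hands over a cartridge and is told which slot it ended up in, without choosing the slot.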
There are dozens, if not hundreds, of devices from which to choose for backing up
and storing data. Removable Storage, as we discussed, can handle not only tradi-
tional tape backup systems, but also CD silos, changers, and huge multi-disk read-
ers. If you wish to check if Removable Storage supports a particular device, follow
the steps to create a media pool discussed in the section “Performing a Backup”
later in this chapter.
Media Pools
A new term in the Windows operating system is the media pool. If you are planning
to do a lot of backing up or have been delegated the job of backup operator or
administrator, you will have a lot to do with media pools in your future backup-
restore career.
A media pool in the general sense of the term is a collection of media organized as
a logical unit. Conceptually speaking, the media pool contains media that belong to
any defined storage or backup device, format, or technology assigned to your hardware,
be it a server in the office or one located out on the WAN somewhere, thousands of
miles away. However, each media pool can represent media of only one type. You
cannot have a media pool that combines DVD, DAT, and ZIP technology. But you can
back up your data to multiple media pools of different types if the client application
or function so requires it.
It may be easier to think of the media pool in terms of the hardware devices that are
available to your system (such as a CD-R/W or a DLT tape drive). You should strive
not to work with media pools from dissimilar devices, especially when backing up
zillions of files. For example, you should stay away from creating media pools that
consist of Zip drives, DLT tape drives, and a CD-R/W changer. It would make managing
your media, such as offsite storage, boxing, and labeling, very difficult, much
like wearing a sneaker on one foot and a hiking boot on the other and then justify-
ing walking with both at the same time because they both represent “pools” of
walking attire.
Removable Storage separates media pools into two classes: system pools and appli-
cation pools. The Removable Storage Service creates system pools when it is first
installed. By default, the Removable Storage Service is enabled and starts up when
you boot your system. If you disable it or remove it from installation, any devices
installed in your servers — or attached on external busses — will be ignored by
Windows 2000, as if they did not exist. When Removable Storage is activated, it will
detect your equipment and, if the equipment is compliant, use it in media pools
automatically created by the service or applications.
System pools
System pools hold the media that are not being used by any application. When you
install new media into your system, the first action that Removable Storage takes is
to place the media into a pool for unrecognized media. Then, when you have identi-
fied the media, you can make it available to applications by moving it to the free
pools group. The system pools are built according to the following groups:
✦ Free pools: Free pools hold media that are available to any application.
Applications can draw on the media in these pools when they need to back up
data. When the media are no longer required, they can be returned to this group.
✦ Unrecognized pools: Media in these pools are not known to Removable
Storage. If the service cannot read information on a cartridge, or if the cartridge
is blank, the cartridge is placed into this group.
✦ Import pools: This group is for media pools that were used in other Removable
Storage systems, on other servers, or by applications that are compatible with
Removable Storage or that can be read by Removable Storage. Media written
in the Microsoft Tape Format (MTF) can thus be imported into the local
Removable Storage system.
Application pools
When an application is given access to a free media pool, either it will create a spe-
cial pool into which the media can be placed or you can create pools manually for
the application using the Removable Storage snap-in, illustrated in Figure 17-1.
A very useful and highly sought after feature of Windows 2000 media pools is that
permissions can be assigned to pools to allow other applications to use the pools
or to protect the pools in their own sets.
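The movement of media between system pools and application pools can be sketched as a small bookkeeping class. The rules encoded here are our reading of the descriptions above, and the readable_foreign flag is our own assumption for deciding between the import and unrecognized pools. None of these names come from Removable Storage itself.

```python
SYSTEM_POOLS = ("unrecognized", "import", "free")


class PoolManager:
    """Tracks which pool each piece of media currently belongs to."""

    def __init__(self):
        self.pool_of = {}  # media label -> pool name

    def add_new_media(self, label, readable_foreign=False):
        """New media start in a system pool: import if readable, else unrecognized."""
        self.pool_of[label] = "import" if readable_foreign else "unrecognized"

    def release_to_free(self, label):
        """The operator, having identified the media, makes it generally available."""
        self.pool_of[label] = "free"

    def allocate(self, label, application_pool):
        """An application draws a free medium into its own application pool."""
        if application_pool in SYSTEM_POOLS:
            raise ValueError("target must be an application pool")
        if self.pool_of.get(label) != "free":
            raise ValueError("only media in the free pool can be allocated")
        self.pool_of[label] = application_pool

    def deallocate(self, label):
        """Return a medium that is no longer needed to the free pool."""
        self.pool_of[label] = "free"
```

The point of the sketch is the gate in allocate: an application never takes media straight from the unrecognized or import pools, so nothing gets overwritten before someone has identified it.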
Multi-level media pools
It might astonish you to find out that media pools can be organized into hierarchies
or nests. In other words, you can create media pools that hold several other media
pools. An application can then use the root media pool and gain access to the dif-
ferent data storage formats in the nested media pools. Expect to see sophisticated
document storage, backup, and management applications using such media pools.
An example of using such a hierarchy of media pools can be drawn from a near dis-
aster that was averted during the writing of this chapter. One of our 15-tape DLT
changers went nuts and began reporting that our tapes were not really DLT tapes
but alien devices it was unable to identify. The only way to continue backing up our
server farm was to enlist every SCSI tape and disk device on the network into one
large pool. Once the DLT library recovered, we could go back to business as usual.
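A hierarchy of this kind is essentially a tree in which each leaf pool holds media of one type and a root pool groups them. The following sketch shows the idea; the MediaPool class is hypothetical and is meant only to illustrate how an application could walk a root pool to reach media of several formats.

```python
class MediaPool:
    """A pool that holds media of one type, other pools, or both."""

    def __init__(self, name, media_type=None):
        self.name = name
        self.media_type = media_type  # None for a pool that only groups other pools
        self.media = []
        self.children = []

    def add_child(self, pool):
        """Nest another pool under this one and return it for convenience."""
        self.children.append(pool)
        return pool

    def all_media(self):
        """Yield (media_type, label) for this pool and every nested pool."""
        for label in self.media:
            yield (self.media_type, label)
        for child in self.children:
            yield from child.all_media()
```

In the emergency described above, the root would be the one large pool and each enlisted tape or disk device would contribute a child pool of its own media type.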
Work Queue and Operator Requests
You will notice nodes for both Work Queue and Operator Requests in the
Removable Storage tree. These nodes provide a communications and information
exchange function between Removable Storage and the operator (the backup operator,
the administrator, or the backup operator group).
Work queue
Working backup applications and the HRS/RSS service post work requests to the
Removable Storage service. To manage the multitude of requests that can come
from applications and services, each request for work from the Removable Storage
service is placed into the work queue. The work queue is very similar in concept to
a print queue discussed in Chapter 23.
The work queue provides information on queue states on a continual basis, and
these are reported to the details pane in the Work Queue node. For example, if an
application is busy backing up data, an “In Process” state will be posted to the
details pane identifying the work request and the state it is in. Table 17-1 describes
the work queue states reported to the Work Queue details pane.
Table 17-1: Work Queue States

State        Explanation
Queued       The work item has been queued. It is waiting for the RS service to examine the request.
In Process   RS is working on the work item.
Waiting      The request is waiting for a resource, currently being used by another service, before work on the item can continue.
Completed    RS has handled the work item successfully. The request has been satisfied.
Failed       RS has failed to complete the work item. The request did not obtain the desired service.
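The work queue states lend themselves to a small state machine. The five state names below come from Table 17-1, but the table lists only the states and not the moves between them, so the transitions in ALLOWED are our assumption, and the advance helper is hypothetical.

```python
from enum import Enum


class WorkState(Enum):
    QUEUED = "Queued"
    IN_PROCESS = "In Process"
    WAITING = "Waiting"
    COMPLETED = "Completed"
    FAILED = "Failed"


# Assumed transitions; the table names the states but not the moves between them.
ALLOWED = {
    WorkState.QUEUED: {WorkState.IN_PROCESS},
    WorkState.IN_PROCESS: {WorkState.WAITING, WorkState.COMPLETED, WorkState.FAILED},
    WorkState.WAITING: {WorkState.IN_PROCESS, WorkState.FAILED},
    WorkState.COMPLETED: set(),
    WorkState.FAILED: set(),
}


def advance(current, target):
    """Return the new state, or raise if the move is not in ALLOWED."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target
```

Treating Completed and Failed as terminal matches the way a finished print job behaves in a print queue: a failed request is resubmitted as a new work item, not revived.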
Operator requests
No matter how sophisticated Removable Storage is, there are some things it just
will not do. These items will be marked for the “human” work queue. For example,
Removable Storage will not go and fetch cartridges from the cabinet or the store-
room. This is something you have to do. The details pane in the Operator Requests
node is where Removable Storage posts its request states for you, the operator.
Removable Storage can also send you a message via the messenger service or the
system tray, just in case you have the habit of pretending the Operator Requests
node does not exist. Table 17-2 lists the possible Operator Request States.
Table 17-2: Operator Request States

State        Explanation
Submitted    The described request has been submitted, and the system is waiting for the operator’s input.
Refused      The operator has refused to perform the described request.
Completed    The operator has complied and has completed the described request.
Labeling Media
Removable Storage can read data written to the labels on the actual tape or mag-
netic disk as well as external information supplied in bar code format. The identifi-
cation service is robust and highly sophisticated and will ensure that your media
does not get overwritten or modified by other applications.
You need to provide names for your media pools, and you should also, if you can
afford a bar code reader, organize them according to serial numbers (represented
as bar codes) for more accurate handling. If you are planning to install a library
system, make sure you get one that can read the bar codes from the physical labels
on the cartridge casing. This information will be critical when it comes to locating
a few files that need restoring from five million files stored on 120 30GB tapes
(the bigger the enterprise, the more complex the backup and restore regimen
and management).
Another reason we prefer a numbering or bar code scheme for identifying media, as
opposed to labeling it according to the day of the week, is that often a cartridge can
get inadvertently written to on the wrong day. If that happens, you may have a cart
named Wednesday, but with Tuesday data on it, which can get confusing and create
unnecessary concern. With a bar code or serial number, you can simply make sure
that the cart gets put back into the Wednesday box without having to scratch out
or change the label.
Practicing Scratch and Save
Although Windows 2000 does not cater to the concept of scratch and save sets, it
is worth a mention because you should understand the terms for more advanced
backup procedures. Simply put, a save set is a set of media in the media pool that
cannot be overwritten for a certain period of time. A scratch set is a set of media
that is safe to overwrite. A backup set should be stored and cataloged in a save set
for any period of time during which the media should not be used for backup. You
can create your own spreadsheet or table of media rotating in and out of scratch
and save sets.
The principle behind scratch and save is to protect data from being overwritten
for predetermined periods. We have included a scratch and save utility on the CD
accompanying this book; it is called Scratch n’ Save and can be found in the SNS
folder. Although this little utility does not prevent you from overwriting data, it
will assist you in organizing your media pools.
For example, a monthly save set is saved for a month, while a yearly is saved for
a year. After a “safe” period of time has elapsed, you can move the save set to the
scratch set. In other words, once a set is moved out of the save status into the
scratch status, you are tacitly allowing the files on it to be destroyed. A save set
becomes a scratch set when you are sure, through proper media pool management,
that other media in the pool contain both full and modified, and current and past
files of your data, and that it is safe to destroy the data on the scratch media.
It is important to fully understand the concept of save and scratch sets because it
is the only way you will be able to ensure your media can be safely recycled. The
alternative is to make every set a save set, which means you never recycle the
tapes . . . making your DR project a very costly venture. The opposite extreme,
recycling the same few tapes over and over, is risky because tapes that are being
constantly used will stretch and wear out sooner.
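The bookkeeping behind scratch and save sets can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Scratch n’ Save utility from the CD: the Cartridge class, its fields, the serial numbers, and the retention periods (30 days for a monthly set, 365 days for a yearly set) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Cartridge:
    serial: str          # bar code or serial number on the casing
    written_on: date     # date the backup set was written
    retention_days: int  # assumed policy: how long the data must be kept

    def save_until(self) -> date:
        """Last day on which the cartridge still belongs to the save set."""
        return self.written_on + timedelta(days=self.retention_days)

    def status(self, today: date) -> str:
        """'save' while the retention period runs, 'scratch' afterwards."""
        return "save" if today <= self.save_until() else "scratch"


def scratch_candidates(pool, today):
    """Serial numbers of cartridges whose retention period has elapsed."""
    return [c.serial for c in pool if c.status(today) == "scratch"]


# Hypothetical pool: one monthly and one yearly save set, same write date.
pool = [
    Cartridge("000101", date(2000, 1, 3), retention_days=30),
    Cartridge("000102", date(2000, 1, 3), retention_days=365),
]
print(scratch_candidates(pool, date(2000, 3, 1)))
```

An elapsed retention period only makes a cartridge a candidate for the scratch set. Before overwriting it, you still need to confirm that other media in the pool hold the full and modified, current and past copies of the data.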
Establishing Quality of Support Baselines for
Data Backup/Restore
Windows 2000 provides the administrator with backup and recovery tools previously
seen only on midrange and mainframe technology (such as the ability to mark
files for archiving). For the first time, Windows network administrators are in a
position to commit to service level agreements and quality of service or support
levels. Unfortunately, the new tools and technologies bring a higher and more
critical administrative burden: responsibility for the service level shifts to the
Windows administrator, where it was usually the domain of the midrange,
UNIX, or mainframe administrative team. Let’s consider some of the abstract
issues related to backups before we get into procedures.
No matter how regularly you back up the data on your network, you can only restore
up to the point of your last complete backup. Unless you are backing up every second
of the day, which is highly unlikely and impractical, you can never fully recover the
latest data up to the point of meltdown (unless the crash happens immediately after
a backup completes). You need to decide whether your business can afford to lose
even one hour of data. For many companies, any loss could mean a serious setback
and a costly recovery, with effects that often last long after the disaster occurs.
It is important, therefore, that you consider the numerous alternatives for backup
procedures and various strategies if out-of-date data is considered inadequate
recovery. You need to decide on a baseline for backup/restores: What is the least
acceptable recovery situation? You will also need to take into account the quality of
support promised to staff and the departments and divisions that depend on your
systems, and the service level agreements (SLA) in place with the customers.
Note: Service level and quality of support are discussed fully in Chapters 1, 4, and 5.
First, before we consider other factors, let’s decide what we would consider ade-
quate in terms of the currency of backed-up data. Then, once we have established
our tolerance level, we need to work out how to cater to it, and at what cost.
Starting with currency, consider this list:
1. Data restored is one month or more old.
2. Data restored is between one and four weeks old.
3. Data restored is between four and seven days old.
4. Data restored is between one and three days old.
5. Data restored is between six and twelve hours old.
6. Data restored is between two and five hours old.
7. Data restored is between one and sixty minutes old.
Now, depending on how the backups were done and the nature of your backup
technology, just starting up the recovery process (reading the catalog, for example)
could take anywhere up to ten minutes. So, level 7 would be out of the picture for
you as a tape backup proposition. In cases where backup media is off-site, you would
need to take into consideration how long it takes after placing a call to the backup
bank for the media to arrive at the data center. This could be anything from
30 minutes to 6 hours. And you may be charged for “rush” delivery.
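The arithmetic in the preceding paragraph can be written down as a small Python sketch. The function name and parameters are hypothetical, and the model is deliberately crude: it simply adds the start-up time to the off-site delivery time, using the figures quoted in the text, and ignores the time taken to copy the data itself.

```python
def minutes_before_restore_starts(catalog_minutes, delivery_minutes=0):
    """Delay between the failure and the moment the restore can begin.

    catalog_minutes  -- time to start the recovery (e.g. reading the catalog)
    delivery_minutes -- time for off-site media to arrive (0 if on-site)
    """
    return catalog_minutes + delivery_minutes


# Figures quoted in the text: up to 10 minutes of start-up time,
# and 30 minutes to 6 hours for off-site media to arrive.
on_site = minutes_before_restore_starts(10)
off_site_best = minutes_before_restore_starts(10, delivery_minutes=30)
off_site_worst = minutes_before_restore_starts(10, delivery_minutes=6 * 60)

print(on_site, off_site_best, off_site_worst)  # 10 40 370
```

Even in the best off-site case, 40 minutes pass before any data starts coming back, which is worth keeping in mind when you pick a currency level from the list.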
Now look back at the list and consider your options. How important (mission-critical)
is it that data is restored, if not in real-time, almost in real-time? There are many situa-
tions requiring immediate restoration of data. Many applications in banking, finance,
business, science, engineering, medicine, and so on require real-time recovery of data
in the event of a crash, corruption of data, deleted data, and so on.
You could and should be exploring or installing clustered systems, mirrors, replica-
tion sets, and RAID-5 level and higher storage arrays, as described in the previous
chapter. But these so-called fault-tolerant and redundant systems typically share a
common hard-disk array or a central storage facility. Loss of data is thus system-
wide and mirrored across the entire array. A mirror is a reflection: no more, no less.
This brings us to another factor to consider: the flawed backup. You bring this fac-
tor into consideration if your data is continuously changing. The question to ask
is, “How soon after the update of data should I make a backup?” You may decide,
based on the previous list, that data even five minutes out of date is damaging to
system integrity or the business objectives. A good example is online real-time
order or delivery tracking. But backing up data with such narrow intervals between
versions brings us to the subject of quality and integrity of backed-up data. (Later
in this chapter, we will discuss versioning and how new technology in Windows
2000 facilitates it.) What if the file that just got hit by a killer virus is quarantined
and you go to the backup only to find it is also infected or corrupt? What if all the
previous files are infected, and now just opening the file renders it useless? It’s
something to think about.
Earlier this year, we rushed to the aid of our main SQL Server group, which had lost
a valuable database on the customer ordering system (on our extranet). Every hour
offline was costing the company six figures as customers went elsewhere to place
their orders. Four-letter words were flying around the server room. We had to go
back three days to find a clean backup of the database that showed no evidence of
corrupt metadata.
Figure 17-2 illustrates data backed up on a daily basis, and in this case, bad data is
backed up for three days in a row. Backup data is bound to have all the flaws of its
source (corruption, viruses, lack of integrity, and so forth). You may still consider
some of the gray area in the figure as safe if you have other means of assuring
quality or data integrity. Such assurances may be provided by means of highly
sophisticated anti-virus software, quality of data routines and algorithms,
versioning, and just making sure people check their data themselves. Backing up
bad data every ten minutes may be a futile exercise, depending on the tools you
have to recover or rebuild the integrity of the data.
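Walking back through the backups until a clean one turns up, as we had to do with the SQL Server database, can be sketched as follows. This is a minimal Python illustration: the function name, the (label, payload) layout, and the dates are hypothetical, and the is_clean check stands in for whatever virus scan or consistency check you actually have available.

```python
def latest_clean_backup(backups, is_clean):
    """Return the label of the newest backup that passes the check, or None.

    backups  -- iterable of (label, payload) pairs; labels must sort
                chronologically (for example, ISO date strings)
    is_clean -- caller-supplied check (virus scan, consistency check, ...)
    """
    # Examine the newest backup first and step back one version at a time.
    for label, payload in sorted(backups, key=lambda b: b[0], reverse=True):
        if is_clean(payload):
            return label
    return None


# Hypothetical daily backups: the last three days captured bad data.
daily = [
    ("2000-05-10", "ok"),
    ("2000-05-11", "ok"),
    ("2000-05-12", "corrupt"),
    ("2000-05-13", "corrupt"),
    ("2000-05-14", "corrupt"),
]
print(latest_clean_backup(daily, lambda data: data == "ok"))  # 2000-05-11
```

The sketch only works if older versions are still in the save set. If every tape has already been overwritten with bad data, the function returns None, and so will your restore.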
Most companies back up data to a tape drive (we discuss the formats later). The
initial cost is really insignificant in relation to the benefit: the ability to back up and
recover large amounts of data. A good tape drive can run anywhere from $500 for
good Quarter-inch Cartridge (QIC) systems to $3,000 to $4,000 for the high-speed,
high-capacity Digital Linear Tape (DLT) systems, and a robotic library system can
cost as much as $30,000. Let’s now consider minimum restore levels, keeping the
quality of backup factors described earlier in mind:
1. Restore is required in real-time (now) or close to it. Data must be no more
than a few seconds old and immediately accessible by users and systems
even in the event the primary source is offline. In the case of industrial or
medical systems, the secondary source of data must be up-to-date, and the
latency might be measured in milliseconds and not seconds. Your SLAs may
dictate that 24-7 customers can fine you if data is offline longer than x seconds
or minutes. Let’s call this the critical restore level.
2. Restore is required within ten minutes of the primary source going offline.
Let’s call this emergency restore.
3. Restore is required within one hour of the primary source going offline. Let’s
call this urgent restore.
4. Restore is required within one to four hours of the primary source going
offline. Let’s call this important restore.
5. All other restores that can occur later than the previous can be considered
casual restores.
Figure 17-3 shows this in a visual hierarchy.
Figure 17-2: The narrower the interval between backups, the more
chance that backed up data is also corrupted, infected, or lacks integrity.
[Figure 17-2 panels: (A) backing up once a day, days 1 to 6; (B) backing up at
10-minute intervals, minutes 10 to 60. Shading indicates data integrity.]
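The five restore levels can be expressed as a simple lookup, sketched here in Python. The 10-minute, 1-hour, and 4-hour limits come from the numbered list; the 5-second limit for the critical level is our assumed stand-in for "real-time or close to it," and the names RESTORE_LEVELS and restore_level are hypothetical.

```python
# Upper bound, in seconds, on how long after the primary source goes
# offline each restore level allows. The 5-second limit for "critical"
# is an assumption; the text only says "real-time (now) or close to it".
RESTORE_LEVELS = [
    (5, "critical"),
    (10 * 60, "emergency"),
    (60 * 60, "urgent"),
    (4 * 60 * 60, "important"),
]


def restore_level(max_outage_seconds):
    """Name the restore level needed to meet a tolerated outage."""
    for limit, name in RESTORE_LEVELS:
        if max_outage_seconds <= limit:
            return name
    # Anything that can wait longer than four hours is a casual restore.
    return "casual"


print(restore_level(2))          # critical
print(restore_level(8 * 60))     # emergency
print(restore_level(45 * 60))    # urgent
print(restore_level(3 * 3600))   # important
print(restore_level(24 * 3600))  # casual
```

Feeding the outage limits written into your SLAs through a table like this is a quick way to see which systems tape backup alone can serve and which need something faster.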
Figure 17-3: The data restoration pyramid
The pyramid in Figure 17-3 illustrates that the faster the response to a restore or
recall of data request, the higher the chance of retrieving poor data. Each layer of
the pyramid covers the critical level of the restore request. This does not mean that
critical restores are always going to be a risk and that the restored data is flawed.
It means that the data backed up closest to the point of failure is more likely to be
at risk compared to data that was backed up hours or even days before the failure.
If a hard disk crashes, the data on the backup tapes is probably sound, but if the
crash is due to corrupt data or virus infection, the likelihood of recent data being
infected is high.
Another factor to consider is that often you’ll find that the “cleanest” backup data
is the furthest away from the point of restoration, or the most out-of-date.
If the level of restore you need is not as critical or the quality of the backup not
too important, you could consider a tape drive system attached either to a backup
server or locally to the hosting machine. You could then set up a scheme of
continuous or hourly backup routines. In the event data is lost (usually because
someone deletes a file or folder), you would be able to restore the file. The
worst-case scenario is that the data restored is one hour out of date, and at such a
wide interval, replacing a corrupt file with another corrupt file is unlikely.
Consider the following anecdote: We recently lost a very important Exchange-based
e-mail system. Many accounts on the server could be considered extremely mission
critical. Thousands of dollars were lost every minute the server was down. (The
fallout from downed systems compounds damages at an incredible rate. The longer
a system is down, the worse it becomes.)
[Figure 17-3 pyramid layers, as labeled: Critical, Emergency, Urgent, Important,
Casual. Shading indicates data integrity.]