Some Random Musings
Before we try to optimize our search, let us define some terms. There are two basic
categories of storage devices, distinguished by the access they allow to individual
records. The first type is sequential access;
in order to read record 1000 from a
sequential device, we must read records 1 through 999 first, or at least skip over
them. The second type is direct access; on a direct access device, we can read
record 1000 without going past all of the previous records. However, only some
direct access devices allow nonsequential accesses without a significant time
penalty; these are called random access devices. Unfortunately, disk drives are
direct access devices, but not random access ones. The amount of time it takes to
get to a particular data record depends on how close the read/write head is to the
desired position; in fact, sequential reading of data may be more than ten times as
fast as random access.
Is there a way to find a record in a large file with an average of about one
nonsequential access? Yes; in fact, there are several such methods, varying in
complexity. They are all variations on hash coding, or address calculation; as you
will see, such methods actually can be implemented quite simply, although for
some reason they have acquired a reputation for mystery.
Hashing It Out
Let's start by considering a linear or sequential search. That is, we start at the
beginning of the file and read each record in the file until we find the one we want
(because its key is the same as the key we are looking for). If we get to the end of
the file without finding a record with the key we are looking for, the record isn't in
the file. This is certainly a simple method, and indeed is perfectly acceptable for a
very small file, but it has one major drawback: the average time it takes to find a
given record increases every time we add another record. If the file gets twice as
big, it takes twice as long to find a record, on average. For a large file, then, this method is impractical.
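
Here is a minimal sketch of such a linear search; the fixed-size Record layout and the seven-digit telephone-number key are assumptions made purely for illustration:

#include <cstring>

// A hypothetical fixed-size record: a 7-digit phone number plus a name.
struct Record {
    char key[8];   // e.g. "9876541", NUL-terminated
    char name[20];
};

// Scan records[0..count) until we find a matching key.
// Returns the index of the match, or -1 if the key is not present.
// The average cost grows linearly with count, as described above.
int linear_search(const Record records[], int count, const char *key)
{
    for (int i = 0; i < count; i++)
        if (std::strcmp(records[i].key, key) == 0)
            return i;
    return -1; // reached the end of the file without a match
}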
Divide and Conquer
But what if, instead of having one big file, we had many little files, each with only a few records in it? Of course, we would need to know which of the little files to look in, or we wouldn't have gained anything. Is there any way to know that?
Let's see if we can find a way. Suppose that we have 1000 records to search
through, keyed by telephone number. To speed up the lookup, we have divided the
records into 100 subfiles, averaging 10 numbers each. We can use the last two
digits of the telephone number to decide which subfile to look in (or to put a new
record in), and then we have to search through only the records in that subfile. If
we get to the end of the subfile without finding the record we are looking for, it's
not in the file. That's the basic idea of hash coding.
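
As code, this hash function is nothing more than the key taken modulo 100. A minimal sketch (the function name is an illustrative choice):

// Select a subfile by the last two digits of the phone number,
// i.e., the key taken modulo 100. With 1000 records spread over
// 100 subfiles, a lookup examines only the handful of records
// in one subfile.
int subfile_number(long phone_number)
{
    return static_cast<int>(phone_number % 100);
}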
But why did we use the last two digits, rather than the first two? Because they will
probably be more evenly distributed than the first two digits. Most of the telephone
numbers on your list probably fall within a few telephone exchanges near where
you live (or work). For example, suppose my local telephone book contained a lot
of 758 and 985 numbers and very few numbers from other exchanges. Therefore, if
I were to use the first two digits for this hash coding scheme, I would end up with
two big subfiles (numbers 75 and 98) and 98 smaller ones, thus negating most of
the benefit of dividing the file. You see, even though the average subfile size
would still be 10, about 90% of the records would be in the two big subfiles, which
would have perhaps 450 records each. Therefore, the average search for 90% of the
records would require reading 225 records, rather than the five we were planning
on. That is why it is so important to get a reasonably even distribution of the data
records in a hash-coded file.
Unite and Rule
It is inconvenient to have 100 little files lying around, and the time required to
open and close each one as we need it makes this implementation inefficient. But
there's no reason we couldn't combine all of these little files into one big one and
use the hash code to tell us where we should start looking in the big file. That is, if
we have a capacity of 1000 records, we could use the last two digits of the
telephone number to tell us which "subfile" we need of the 100 "subfiles" in the
big file (records 0-9, 10-19, ..., 980-989, 990-999). To help visualize this, let's look at a smaller example: 10 subfiles having a capacity of four telephone numbers each and a hash code consisting of just the last digit of the telephone number (Figure initfile).
Hashing with subfiles, initialized file (Figure initfile)

Subfile #
   +-----------+-----------+-----------+-----------+
 0 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 1 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 2 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 3 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 4 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 5 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 6 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 7 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 8 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 9 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
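
Since each subfile in this smaller example holds four records, subfile n occupies records 4*n through 4*n+3 of the big file, and the address calculation is a single multiplication. A sketch, with names chosen for illustration:

const int SUBFILE_CAPACITY = 4; // records per subfile in this example

// Hash on the last digit of the phone number, then convert that
// subfile number into the record offset where the subfile starts in
// the single big file: subfile 0 holds records 0-3, subfile 1 holds
// records 4-7, and so on.
int subfile_start(long phone_number)
{
    int subfile = static_cast<int>(phone_number % 10);
    return subfile * SUBFILE_CAPACITY;
}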
In order to use a big file rather than a number of small ones, we have to make some
changes to our algorithm. When using many small files, we had the end-of-file
indicator to tell us where to add records and where to stop looking for a record;
with one big file subdivided into small subfiles, we have to find another way to handle these tasks.
Knowing When to Stop
One way is to add a "valid-data" flag to every entry in the file. This flag is initialized to "I" (for invalid), as in the entries of Figure initfile, and set to "V" (for valid) in each entry as we store data in it. Then if we reach an invalid entry while looking up a record in the file, we know that we are at the end of the subfile and therefore that the record is not in the file (Figure distinct).
Hashing with distinct subfiles (Figure distinct)

Subfile #
   +-----------+-----------+-----------+-----------+
 0 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 1 | V 9876541 | V 2323231 | V 9898981 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 2 | V 2345432 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 3 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 4 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 5 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 6 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 7 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 8 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 9 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
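
As a sketch, such an entry and the initialization shown in Figure initfile might look like this (the field names and the in-memory array are assumptions for illustration; the real file would of course live on disk):

#include <cstring>

// One entry in the hash-coded file: a valid-data flag plus the key.
struct Entry {
    char flag;   // 'I' = invalid (never used), 'V' = valid
    char key[8]; // 7-digit phone number, NUL-terminated
};

// Set every entry in the file to the invalid state shown in
// Figure initfile, so that lookups can tell where each subfile ends.
void init_file(Entry file[], int capacity)
{
    for (int i = 0; i < capacity; i++) {
        file[i].flag = 'I';
        std::strcpy(file[i].key, "0000000");
    }
}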

For example, if we are looking for the number "9898981", we start at the beginning
of subfile 1 in Figure distinct (because the number ends in 1), and examine each
record from there on. The first two entries have the numbers "9876541" and
"2323231", which don't match, so we continue with the third one, which is the one
we are looking for. But what if we were looking for "9898971"? Then we would go
through the first three entries without finding a match. The fourth entry is "I
0000000", which is an invalid entry. This is the marker for the end of this subfile,
so we know the number we are looking for isn't in the file.
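Here is a sketch of that lookup, building on the Entry, SUBFILE_CAPACITY, and subfile_start definitions from the earlier sketches:

#include <cstdlib>
#include <cstring>

// Look up a phone number: start at the beginning of its subfile and
// scan forward. An 'I' (invalid) entry marks the end of the subfile,
// meaning the key is not in the file. Returns the record index of
// the match, or -1 if the key is not present.
int find_record(const Entry file[], const char *key)
{
    int start = subfile_start(std::atol(key));
    for (int i = start; i < start + SUBFILE_CAPACITY; i++) {
        if (file[i].flag == 'I')
            return -1;                          // end of subfile: not here
        if (std::strcmp(file[i].key, key) == 0)
            return i;                           // found it
    }
    return -1; // subfile full with no match; see the overflow discussion
}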
Now let's add another record to the file with the phone number "1212121". As
before, we start at the beginning of subfile 1, since the number ends in 1. Although
the first three records are already in use, the fourth (and last) record in the subfile is
available, so we store our new record there, resulting in the situation in Figure
merged.
Hashing with merged subfiles (Figure merged)

Subfile #
   +-----------+-----------+-----------+-----------+
 0 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 1 | V 9876541 | V 2323231 | V 9898981 | V 1212121 |
   +-----------+-----------+-----------+-----------+
 2 | V 2345432 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 3 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 4 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 5 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 6 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 7 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 8 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 9 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
However, what happens if we look for "9898971" in the above situation? We start
out the same way, looking at records with phone numbers "9876541", "2323231",
"9898981", and "1212121". But we haven't gotten to an invalid record yet. Can we
stop before we get to the first record of the next subfile?
Handling Subfile Overflow
To answer that question, we have to see what would happen if we had added
another record that belonged in subfile 1. There are a number of possible ways to
handle this situation, but most of them are appropriate only for memory-resident
files. As I mentioned above, reading the next record in a disk file is much faster
than reading a record at a different place in the file. Therefore, for disk-based data,
the most efficient place to put "overflow" records is in the next open place in the
file, which means that adding a record with phone number "1234321" to this file
would result in the arrangement in Figure overflow.

Hashing with overflow between subfiles (Figure overflow)

Subfile #
   +-----------+-----------+-----------+-----------+
 0 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 1 | V 9876541 | V 2323231 | V 9898981 | V 1212121 |
   +-----------+-----------+-----------+-----------+
 2 | V 2345432 | V 1234321 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 3 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 4 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 5 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 6 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 7 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 8 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
 9 | I 0000000 | I 0000000 | I 0000000 | I 0000000 |
   +-----------+-----------+-----------+-----------+
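
Here is a sketch of insertion under this overflow policy, again building on the earlier definitions. The wraparound to record 0 when the scan reaches the end of the file is an assumption; the text says only "the next open place in the file". Note also that once records can overflow into the following subfile, a lookup can no longer stop at the subfile boundary; it must keep scanning until it reaches an invalid entry, which is exactly the question raised above.

#include <cstdlib>
#include <cstring>

// Add a record, putting any overflow in the next open place in the
// file: scan forward from the start of the home subfile and claim
// the first invalid entry found, wrapping around at the end of the
// file (assumed). Returns the index used, or -1 if the file is full.
int add_record(Entry file[], int capacity, const char *key)
{
    int start = subfile_start(std::atol(key));
    for (int n = 0; n < capacity; n++) {
        int i = (start + n) % capacity;   // next slot, with wraparound
        if (file[i].flag == 'I') {
            file[i].flag = 'V';
            std::strcpy(file[i].key, key);
            return i;
        }
    }
    return -1; // no open slot anywhere: the file is completely full
}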