


extra space to prevent the hashing from getting too slow means that the file would
end up taking up about 437 Kbytes. For this application, disk storage space would
not be a problem; however, the techniques we will use to reduce the file size are
useful in many other applications as well. Also, searching a smaller file is likely to
be faster, because the heads have to move a shorter distance on the average to get
to the record where we are going to start our search.
If you look back at Figure initstruct, you will notice that the upc field is ten
characters long. Using the ASCII code for each digit, which is the usual
representation for character data, takes one byte per digit or 10 bytes in all. I
mentioned above that we would be using a limited character set to reduce the size
of the records. UPC codes are limited to the digits 0 through 9; if we pack two
digits into one byte, by using four bits to represent each digit, we can cut that down
to five bytes for each UPC value stored. Luckily, this is quite simple, as you will
see when we discuss the BCD (binary-coded decimal) conversion code below. The
other data compression method we will employ is to convert the item descriptions
from strings of ASCII characters, limited to the sets 0-9, A-Z, and the special
characters comma, period, minus, and space, to Radix40 representation,
mentioned in Chapter prologue.htm. The main difference between Radix40
conversions and those for BCD is that in the former case we need to represent 40
different characters, rather than just the 10 digits, and therefore the packing of data
must be done in a slightly more complicated way than just using four bits per
character.
The Code
Now that we have covered the optimizations that we will use in our price lookup
system, it's time to go through the code that implements these algorithms. This
specific implementation is set up to handle a maximum of FILE_CAPACITY
items, defined in superm.h (Figure superm.00a). Each of these items, as defined
in the ItemRecord structure in the same file, has a price, a description, and a
key, which is the UPC code. The key would be read in by a bar-code scanner in a
real system, although our test program will read it in from the keyboard.
Some User-Defined Types
Several of the fields in the ItemRecord structure definition require some
explanation, so let's take a closer look at that definition, shown in Figure
superm.00.
ItemRecord struct definition (from superm\superm.h) (Figure superm.00)
codelist/superm.00
The upc field is defined as a BCD (binary-coded decimal) value of
ASCII_KEY_SIZE digits (contained in BCD_KEY_SIZE bytes). The
description field is defined as a Radix40 field DESCRIPTION_WORDS in
size; each of these words contains three Radix40 characters.
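As a rough illustration of the layout just described, a record of this kind might be declared as follows. This is only a sketch: the key sizes follow the text (ten digits packed into five bytes), but the value of DESCRIPTION_WORDS, the DESCRIPTION_CHARS name, and the typedefs for BCD and Radix40 are assumptions made for this example, not the definitions in superm.h.

```cpp
#include <cstddef>
#include <cstdint>

// Key sizes follow the text: ten ASCII digits, two digits per BCD byte.
constexpr std::size_t ASCII_KEY_SIZE = 10;
constexpr std::size_t BCD_KEY_SIZE = (ASCII_KEY_SIZE + 1) / 2;

// ASSUMPTION: 7 words is a hypothetical value chosen for illustration.
constexpr std::size_t DESCRIPTION_WORDS = 7;
constexpr std::size_t DESCRIPTION_CHARS = DESCRIPTION_WORDS * 3;

// ASSUMPTION: these typedefs are illustrative stand-ins.
typedef unsigned char BCD;       // one byte holds two decimal digits
typedef std::uint16_t Radix40;   // one 16-bit word holds three characters

struct ItemRecord {
    BCD upc[BCD_KEY_SIZE];                   // packed UPC key
    Radix40 description[DESCRIPTION_WORDS];  // packed description
    unsigned price;                          // price as an unsigned value
};

static_assert(sizeof(Radix40) == 2, "a Radix40 word must be two bytes");
```

With these assumed sizes, the key and description occupy 5 + 14 = 19 bytes, compared with 10 + 21 = 31 bytes for the same fields stored as ASCII.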
A BCD value is stored as two digits per byte, each digit being represented by a
four-bit code between 0000 (0) and 1001 (9). Function ascii_to_BCD in
bcdconv.cpp (Figure bcdconv.00) converts a decimal number, stored as ASCII
digits, to a BCD value by extracting each digit from the input argument and
subtracting the code for '0' from the digit value; BCD_to_ascii (Figure
bcdconv.01) does the reverse.
ASCII to BCD conversion function (from superm\bcdconv.cpp) (Figure
bcdconv.00)
codelist/bcdconv.00
BCD to ASCII conversion function (from superm\bcdconv.cpp) (Figure
bcdconv.01)
codelist/bcdconv.01
A UPC code is a ten-digit number between 0000000000 and 9999999999, which
unfortunately is too large to fit in a long integer of 32 bits (an unsigned 32-bit
value cannot exceed 4,294,967,295). Of course, we could store it in ASCII, but
that would require 10 bytes per UPC code. So the five-byte BCD representation
saves five bytes per item compared to ASCII.
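The packing described above can be sketched in a few lines. The functions below are hypothetical helpers written for this example (pack_bcd and unpack_bcd are not the book's ascii_to_BCD and BCD_to_ascii, whose exact signatures are not shown here); they assume an even number of digits and put the first digit of each pair in the high four bits.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: packs digit_count ASCII digits into digit_count / 2
// bytes, first digit of each pair in the high nibble.
// Returns false if digit_count is odd or a non-digit character is found.
bool pack_bcd(const char* ascii, std::size_t digit_count, unsigned char* bcd) {
    if (digit_count % 2 != 0) return false;
    for (std::size_t i = 0; i < digit_count; i += 2) {
        const char high = ascii[i];
        if (high < '0' || high > '9') return false;  // also stops at '\0'
        const char low = ascii[i + 1];
        if (low < '0' || low > '9') return false;
        // Subtracting '0' turns the ASCII code into the digit value 0..9.
        bcd[i / 2] = static_cast<unsigned char>(((high - '0') << 4) | (low - '0'));
    }
    return true;
}

// Hypothetical helper: the reverse conversion, two digits per input byte.
std::string unpack_bcd(const unsigned char* bcd, std::size_t digit_count) {
    assert(digit_count % 2 == 0);
    std::string ascii;
    for (std::size_t i = 0; i < digit_count / 2; ++i) {
        ascii.push_back(static_cast<char>('0' + (bcd[i] >> 4)));
        ascii.push_back(static_cast<char>('0' + (bcd[i] & 0x0F)));
    }
    return ascii;
}
```

Under these assumptions, the ten-digit key "0123456789" packs into the five bytes 0x01 0x23 0x45 0x67 0x89.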
A Radix40 field, as mentioned above, stores three characters (from a limited set
of possibilities) in 16 bits. This algorithm (like some other data compression
techniques) takes advantage of the fact that the number of bits required to store a
character depends on the number of distinct characters to be represented. The BCD
functions described above are an example of this approach. In this case, however,
we need more than just the 10 digits. If our character set can be limited to 40
characters (think of a Radix40 value as a "number" in base 40), we can fit three
of them in 16 bits, because 40^3 (64,000) is less than 2^16 (65,536).
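The arithmetic behind that claim can be checked at compile time. The helper below, max_packed_value, is a hypothetical name introduced for this sketch; it computes the largest value that a given number of base-N "digits" can produce.

```cpp
// Hypothetical helper: largest value representable by `chars` digits in the
// given radix, i.e. radix^chars - 1, computed digit by digit.
constexpr unsigned long max_packed_value(unsigned long radix, unsigned chars) {
    return chars == 0 ? 0UL
                      : max_packed_value(radix, chars - 1) * radix + (radix - 1);
}

// Three base-40 characters need values 0..63999, which fit in 16 bits.
static_assert(max_packed_value(40, 3) == 63999UL, "40^3 - 1");
static_assert(max_packed_value(40, 3) <= 65535UL, "fits in 16 bits");

// A 41-character set would already overflow a 16-bit word (41^3 = 68921).
static_assert(max_packed_value(41, 3) > 65535UL, "41^3 - 1 does not fit");
```

In other words, 40 is the largest character-set size for which three characters fit in a two-byte word.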
Let's start by looking at the header file for the Radix40 conversion functions,
which is shown in Figure radix40.00a.
The header file for Radix40 conversion (superm\radix40.h) (Figure radix40.00a)
codelist/radix40.00a
The legal_chars array, shown in Figure radix40.00, defines the characters that
can be expressed in this implementation of Radix40. The variable weights
contains the multipliers to be used to construct a two-byte Radix40 value from
the three characters that we wish to store in it.
The legal_chars array (from superm\radix40.cpp) (Figure radix40.00)
codelist/radix40.00
As indicated in the comment at the beginning of the ascii_to_radix40
function (Figure radix40.01), the job of that function is to convert a null-terminated
ASCII character string to Radix40 representation. After some initialization and
error checking, the main loop begins by advancing the index of the output word
currently being constructed each time three characters have been translated. It
then translates the current ASCII character by indexing into the lookup_chars
array, which is shown in Figure radix40.02. Any character that translates to a
value with its high bit set is an illegal character and is converted to a hyphen; the
result flag is changed to S_ILLEGAL if this occurs.
The ascii_to_radix40 function (from superm\radix40.cpp) (Figure radix40.01)
codelist/radix40.01
The lookup_chars array (from superm\radix40.cpp) (Figure radix40.02)
codelist/radix40.02
In the line radix40_data[current_word_index] +=
weights[cycle] * j;, the character is added into the current output word
after being multiplied by the power of 40 that is appropriate to its position. The
first character in a word is represented by its position in the legal_chars string.
The second character is represented by 40 times that value and the third by 1600
times that value, as you would expect for a base-40 number.
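The accumulation step can be illustrated with a small self-contained sketch. Everything here is an assumption made for the example: the character ordering in kLegalChars, the weight order {1, 40, 1600} (chosen to match the description above), the use of a linear strchr search in place of the lookup_chars table, and the names radix40_code and encode_radix40.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// ASSUMPTION: a hypothetical 40-character table; a character's index is its
// Radix40 code. Space is code 0, so a partly filled word is space-padded.
static const char kLegalChars[] = " ,-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

// ASSUMPTION: first character has weight 1, as in the prose above.
static const unsigned kWeights[3] = {1, 40, 1600};

// Returns the code 0..39 for c, or -1 if c is not in the table.
int radix40_code(char c) {
    if (c == '\0') return -1;  // strchr would otherwise match the terminator
    const char* p = std::strchr(kLegalChars, c);
    return p ? static_cast<int>(p - kLegalChars) : -1;
}

// Packs a null-terminated string into 16-bit words, three characters each.
// Illegal characters become hyphens and set *had_illegal.
std::vector<std::uint16_t> encode_radix40(const char* text, bool* had_illegal) {
    std::vector<std::uint16_t> words;
    *had_illegal = false;
    for (std::size_t i = 0; text[i] != '\0'; ++i) {
        const std::size_t cycle = i % 3;
        if (cycle == 0) words.push_back(0);  // start a new output word
        int code = radix40_code(text[i]);
        if (code < 0) {
            code = radix40_code('-');
            *had_illegal = true;
        }
        words.back() = static_cast<std::uint16_t>(
            words.back() + kWeights[cycle] * static_cast<unsigned>(code));
    }
    return words;
}
```

With this particular table, 'A', 'B', and 'C' have codes 14, 15, and 16, so "ABC" packs into the single word 14 + 15 * 40 + 16 * 1600 = 26214.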
The complementary function radix40_to_ascii (Figure radix40.03) decodes
each character unambiguously. First, the current character is extracted from the
current word by dividing by the weight appropriate to its position; then the current
word is updated so the next character can be extracted. Finally, the ASCII value of
the character is looked up in the legal_chars array.
The radix40_to_ascii function (from superm\radix40.cpp) (Figure radix40.03)
codelist/radix40.03
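A decoding loop in the same spirit might look like the sketch below. It is not the book's radix40_to_ascii: the table ordering and the decode_radix40 name are assumptions, and because this sketch places the first character in the low-order position, it extracts each code with a remainder and then divides the word by 40 to expose the next character.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// ASSUMPTION: a hypothetical 40-character table; index = Radix40 code.
static const char kLegalChars[] = " ,-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

// Unpacks each 16-bit word into three characters, first character taken from
// the low-order base-40 digit (an assumption of this sketch).
std::string decode_radix40(const std::vector<std::uint16_t>& words) {
    std::string text;
    for (std::size_t w = 0; w < words.size(); ++w) {
        unsigned remaining = words[w];
        for (int position = 0; position < 3; ++position) {
            const unsigned code = remaining % 40;  // current character's code
            remaining /= 40;                       // update word for the next one
            text.push_back(kLegalChars[code]);     // code is always 0..39
        }
    }
    return text;
}
```

Each word always yields exactly three characters, so a description whose length is not a multiple of three comes back padded with trailing spaces under this table.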
Preparing to Access the Price File
Now that we have examined the user-defined types used in the ItemRecord
structure, we can go on to the PriceFile structure, which is used to keep track
of the data for a particular price file. The best way to learn about this structure is
to follow the program as it creates, initializes, and uses it. The function main,
which is shown in Figure superm.01, after checking that it was called with the
correct number of arguments, calls the initialize_price_file function
(Figure suplook.00) to set up the PriceFile structure.
The main function (from superm\superm.cpp) (Figure superm.01)
codelist/superm.01
The initialize_price_file function (from superm\suplook.cpp) (Figure
suplook.00)
codelist/suplook.00
The initialize_price_file function allocates storage for and initializes
the PriceFile structure, which is used to control access to the price file. This
structure contains pointers to the file, to the array of cached records that we have in
memory, and to the array of record numbers of those cached records. As we
discussed earlier, the use of a cache can reduce the amount of time spent reading
records from the disk by maintaining copies of a number of those records in
memory, in the hope that they will be needed again. Of course, we have to keep
track of which records we have cached, so that we can tell whether we have to read
a particular record from the disk or can retrieve a copy of it from the cache instead.
When execution starts, we don't have any records cached; therefore, we initialize
each entry in these arrays to an "invalid" state (the key is set to
INVALID_BCD_VALUE). If file_mode is set to CLEAR_FILE, we write such
an "invalid" record to every position in the price file as well, so that any old data
left over from a previous run is erased.
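The cache setup described above can be sketched roughly as follows. This is not the book's PriceFile or initialize_price_file: the PriceCache and CachedItem types, the CACHE_SIZE and NO_RECORD constants, the value 0xFF for INVALID_BCD_VALUE, and the use of std::vector instead of raw pointers are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

constexpr std::size_t BCD_KEY_SIZE = 5;
// ASSUMPTION: 0xFF is used here because neither nibble is a valid BCD digit.
constexpr unsigned char INVALID_BCD_VALUE = 0xFF;
constexpr std::size_t CACHE_SIZE = 8;  // hypothetical cache size
constexpr long NO_RECORD = -1;         // hypothetical "no record" marker

struct CachedItem {
    unsigned char upc[BCD_KEY_SIZE];
    unsigned price;
};

// Hypothetical stand-in for the control structure: the file, the cached
// records, and the file position of each cached record.
struct PriceCache {
    std::FILE* file;
    std::vector<CachedItem> records;
    std::vector<long> record_numbers;
};

// Builds a cache in which every slot is marked invalid.
PriceCache make_empty_cache(std::FILE* file) {
    PriceCache cache;
    cache.file = file;
    CachedItem invalid;
    std::memset(invalid.upc, INVALID_BCD_VALUE, sizeof invalid.upc);
    invalid.price = 0;
    cache.records.assign(CACHE_SIZE, invalid);
    cache.record_numbers.assign(CACHE_SIZE, NO_RECORD);
    return cache;
}

// A slot is empty if its key still holds the invalid marker.
bool slot_is_empty(const PriceCache& cache, std::size_t slot) {
    return cache.records[slot].upc[0] == INVALID_BCD_VALUE;
}
```

Marking empty slots with a key that can never match a real UPC code means a lookup needs no separate "in use" flag: comparing keys is enough to reject an empty slot.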
Now that access to the price file has been set up, we can call the process
function (Figure superm.02). This function allows us to enter items and/or look up
their prices and descriptions, depending on mode.
The process function (from superm\superm.cpp) (Figure superm.02)
codelist/superm.02
First, let's look at entering a new item (INPUT_MODE). We must get the UPC
code, the description, and the price of the item. The UPC code is converted to BCD,
the description to Radix40, and the price to unsigned. Then we call
write_record (Figure suplook.01) to add the record to the file.
The write_record function (from superm\suplook.cpp) (Figure suplook.01)
codelist/suplook.01
In order to write a record to the file, write_record calls
lookup_record_number (Figure suplook.02) to determine where the record
should be stored so that we can retrieve it quickly later. The
lookup_record_number function does almost the same thing as
lookup_record (Figure suplook.03), except that the latter returns a pointer to
the record rather than its number. Therefore, they are implemented as calls to a
common function: lookup_record_and_number (Figure suplook.04).
The lookup_record_number function (from superm\suplook.cpp) (Figure
suplook.02)
codelist/suplook.02
The lookup_record function (from superm\suplook.cpp) (Figure suplook.03)
codelist/suplook.03
The lookup_record_and_number function (from superm\suplook.cpp) (Figure
suplook.04)
codelist/suplook.04
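The pattern of two thin wrappers around one common search routine can be sketched as follows. The names (Item, find_record_and_number, find_record, find_record_number) are hypothetical, and the worker here is a plain linear search over an in-memory table, used only as a stand-in; it is not the hashed file lookup that the book's functions perform.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory record, with the key kept as text for simplicity.
struct Item {
    std::string upc;
    unsigned price;
};

constexpr long NOT_FOUND = -1;  // hypothetical "no such record" value

// Common worker: reports both the record and its number.
// ASSUMPTION: linear search stands in for the real hashed lookup.
bool find_record_and_number(std::vector<Item>& table, const std::string& upc,
                            Item** record, long* number) {
    for (std::size_t i = 0; i < table.size(); ++i) {
        if (table[i].upc == upc) {
            *record = &table[i];
            *number = static_cast<long>(i);
            return true;
        }
    }
    *record = nullptr;
    *number = NOT_FOUND;
    return false;
}

// Wrapper that keeps only the pointer to the record.
Item* find_record(std::vector<Item>& table, const std::string& upc) {
    Item* record = nullptr;
    long number = NOT_FOUND;
    find_record_and_number(table, upc, &record, &number);
    return record;
}

// Wrapper that keeps only the record number.
long find_record_number(std::vector<Item>& table, const std::string& upc) {
    Item* record = nullptr;
    long number = NOT_FOUND;
    find_record_and_number(table, upc, &record, &number);
    return number;
}
```

Keeping the search logic in one place means the two public entry points cannot drift apart: any change to how a record is located applies to both.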
After a bit of setup code, lookup_record_and_number determines whether
the record we want is already in the cache, in which case we don't have to search
the file for it. To do this, we call compute_cache_hash (Figure suplook.05),
which in turn calls compute_hash (Figure suplook.06) to do most of the work
of calculating the hash code.
The compute_cache_hash function (from superm\suplook.cpp) (Figure
suplook.05)
codelist/suplook.05
