
pass through the file, because until then we do not know the total sizes of all
records with a given key character.

But we are getting a little ahead of ourselves here. Before we can calculate the total
size of all the records for a given key character, we have to read all the records in
the file, so let's continue by looking at the loop that does that. This "endless" loop
starts by reading a line from the input file and checking whether it was successful.
If the status of the input file indicates that the read did not work, then we break out
of the loop. Otherwise, we increment the total number of keys read (for statistical
purposes), calculate the length of the record, and increment the total amount of
data read (also for statistical purposes). Next, we determine whether the record is
long enough for us to extract the character of the key that we need; if it is, we do
so, and otherwise we treat it as though it were 0 so that such a record will sort to the
beginning of the file.
Once we have found (or substituted for) the character on
which we are sorting, we add the length of the line (plus one for the new-line
character that the getline function discards) to the displacement for that
character. Then we continue with the next iteration of the loop.
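To make the shape of this counting pass concrete, here is a minimal sketch of it in C++; the function and variable names (counting_pass, displacement, key_pos, and so on) are illustrative rather than the identifiers used in the actual Zensort source, and the statistics-gathering (total keys and total bytes read) is omitted for brevity:

    #include <fstream>
    #include <string>

    // Counting pass (sketch): accumulate, for each possible key character,
    // the number of bytes occupied by records whose key character at
    // position key_pos has that value.
    void counting_pass(const char *file_name, size_t key_pos,
                       unsigned long displacement[256])
    {
        std::ifstream in(file_name);
        std::string line;

        while (std::getline(in, line))               // ends when a read fails
        {
            // +1 accounts for the newline that getline discards.
            unsigned long length = line.length() + 1;

            // Use a zero byte if the record is too short to contain this key
            // position, so that short records sort to the start of the file.
            unsigned char key_char = 0;
            if (line.length() > key_pos)
                key_char = (unsigned char)line[key_pos];

            displacement[key_char] += length;        // bytes owed to this character
        }
    }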
Once we get to the end of the input file, we close it. Then we compute the total
displacement value for each character by adding the previous character's
displacement value to the previous character's total displacement value; the result
for each character is the offset in the output file where its first record belongs
(see the sketch following this paragraph). At this
point, having read all of the data from the input file, we can display the statistics on
the total number of keys and total amount of data in the file, if this is the first pass.
This is also the point where we display the time taken to do the counting pass.
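In other words, the total displacement for each character is a running sum of the byte counts for all lower-valued characters. Continuing the hypothetical names from the sketch above, that computation might look like this:

    // Turn the per-character byte counts into starting offsets in the output
    // file: each character's records begin immediately after all records whose
    // key character has a lower value (a running sum of the counts).
    unsigned long total_displacement[256];
    total_displacement[0] = 0;
    for (int i = 1; i < 256; ++i)
        total_displacement[i] = total_displacement[i - 1] + displacement[i - 1];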
Now we're ready for the second, distribution, pass for this character position. This
is another "endless" loop, very similar to the previous one. As before, we read a
line from the input file and break out of the loop if the read fails. Next, we
concatenate a new-line character to the end of the input line. This is necessary
because the getline function discards that character from lines that it reads;
therefore, if we did not take this step, our output file would have no new-line
characters in it, which would undoubtedly be disconcerting to our users.
Next, we extract the current key character from the line, or substitute a null byte
for it if it is not present. The next operation is to calculate the current amount of
data in the buffer used to store data for this key character. Then we add the length
of the current line to the amount of existing data in the buffer. If adding the new
line to the buffer would cause it to overflow its bounds, then we have to write out
the current data and clear the buffer before storing our new data in it.
To do this, we seek to the position in the output file corresponding to the current
value of the total displacement array for the current character value. As we have
already seen, the initial value of the total displacement array entry for each
character is equal to the number of characters in the file for all records whose key
character precedes this character. For example, if the current key character is a
capital 'A', then element 65 in the total displacement array starts out at the
beginning of the distribution loop with the offset into the output file where we
want to write the first record whose key character is a capital 'A'. If this is the first
time that we are writing the buffer corresponding to the letter 'A', we need to
position the output file to the first place where records whose keys contain the key
character 'A' should be written, so the initial value of the total displacement array
element is what we need in this situation.
However, once we have written that first batch of records whose keys contain the
letter 'A', we have to update the total displacement element for that character so
that the next batch of records whose keys contain the letter 'A' will be written
immediately after the first batch. That's the purpose of the next statement in the
source code.
Now we have positioned the file properly and have updated the next output
position, so we write the data in the buffer to the file. Then we update the total
number of writes for statistical purposes, and clear the buffer in preparation for its
next use to hold more records with the corresponding key character.
At this point, we are ready to rejoin the regular flow of the program, where we

append the input line we have just read to the buffer that corresponds to its key
character. That's the end of the second "endless" loop, so we return to the top of
that loop to continue processing the rest of the lines in the file.
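Putting those pieces together, a simplified sketch of the distribution pass might look like the following; it assumes fixed-size, zero-initialized buffers, an output file that already exists and is large enough, and that no single record exceeds a buffer's capacity, and all names are again illustrative rather than copied from the Zensort source:

    #include <cstring>
    #include <fstream>
    #include <string>

    // Distribution pass (sketch): append each record to the buffer for its key
    // character, flushing a buffer to its proper place in the output file when
    // it would otherwise overflow.
    void distribution_pass(const char *in_name, const char *out_name,
                           size_t key_pos, unsigned long total_displacement[256],
                           char *buffer[256], size_t buffer_capacity[256],
                           unsigned long &total_writes)
    {
        std::ifstream in(in_name);
        std::fstream out(out_name,
                         std::ios::in | std::ios::out | std::ios::binary);
        std::string line;

        while (std::getline(in, line))                 // ends when a read fails
        {
            size_t original_length = line.length();
            line += '\n';                              // restore the discarded newline

            unsigned char key_char = 0;                // null byte for short records
            if (original_length > key_pos)
                key_char = (unsigned char)line[key_pos];

            size_t used = std::strlen(buffer[key_char]);
            if (used + line.length() >= buffer_capacity[key_char])
            {
                // The buffer would overflow: write its contents at the next free
                // offset for this key character, advance that offset so the next
                // batch lands right after this one, and empty the buffer.
                out.seekp(total_displacement[key_char]);
                out.write(buffer[key_char], used);
                total_displacement[key_char] += used;
                ++total_writes;
                buffer[key_char][0] = '\0';
            }
            std::strcat(buffer[key_char], line.c_str());   // append this record
        }
    }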
Once we've processed all the lines in the input file, there's one more task we have
to handle before finishing this pass through the data: writing out whatever remains
in the various buffers. This is the task of the for loop that follows the second
"endless" loop. Of course, there's no reason to write out data from a buffer that
doesn't contain anything, so we check whether the current length of the buffer is
greater than 0 before writing it out to the file.
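Continuing the same hypothetical names, that cleanup loop might be sketched as:

    // Final cleanup (sketch): write out whatever is still sitting in each
    // buffer, skipping buffers that hold no data.
    for (int i = 0; i < 256; ++i)
    {
        size_t used = std::strlen(buffer[i]);
        if (used > 0)
        {
            out.seekp(total_displacement[i]);
            out.write(buffer[i], used);
            total_displacement[i] += used;
            ++total_writes;
            buffer[i][0] = '\0';
        }
    }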
After displaying a message telling the user how long it took to do this distribution
pass, we return to the top of the outer loop and begin again with the next pass
through the file to handle the next character position in the key. When we get
through with all the characters in the key, we are finished with the processing, so
we display a final message indicating the total number of writes that we have
performed, free the memory for the buffers, terminate the timing routines, and exit.
Performance: Baseline
So how does this initial version of the program actually perform? While I was
working on the answer to this question, it occurred to me that perhaps it would be a
good idea to run tests on machines with various amounts of physical memory. After
all, even if this algorithm works well with limited physical memory, that doesn't
mean that having additional memory wouldn't help its performance. In particular,
when we are reading and writing a lot of data, the availability of memory to use as
a disk cache can make a lot of difference. Therefore, I ran the tests twice, once
with 64 MB of RAM in my machine and once with 192 MB. The amount of
available memory did make a substantial difference, as you'll see when we discuss
the various performance results.
Figure timings.01 illustrates how this initial version works with files of various
sizes, starting with 100,000 records of approximately 60 bytes apiece and ending
with one million similar records.


Performance of Zensort version 1 (Figure timings.01)
zensort/timings.01
According to these figures, this is in fact a linear sort, or close enough to make no
difference, at least on the larger machine. An n log n sort would take exactly 1.2
times as long per element when sorting one million records as when sorting
100,000 records, because log(1,000,000)/log(100,000) = 6/5 = 1.2. While this sort
takes almost exactly that much longer per element
for the one million record file on the smaller machine, the difference is only three
percent on the larger machine, so obviously the algorithm itself has the capability
of achieving linear scaling.
But linear performance only matters if the performance is good enough in the
region in which we are interested. Since this is a book on optimization, let's see if
we can speed this up significantly.
The Initial Improvements
One of the most obvious areas where we could improve the efficiency of this
algorithm is in the use of the buffer space. The particular input file that we are
sorting has keys that consist entirely of digits, which means that allocating 256
buffers of equal size, one for each possible ASCII character, is extremely wasteful,
because only 10 of those buffers will ever be used. Although not all keys consist
only of digits, that is a very common key composition; similarly, many keys
consist solely of alphabetic characters, and of course there are keys that combine
both. In any of these cases, we would do much better to allocate more memory to
buffers that are actually going to be used; in fact, we should not bother to allocate
any memory for buffers that are not used at all. Luckily, we can determine this on
the counting pass with very little additional effort, as you can see in Figure
zen02.cpp.
Zensort version 2 (Zensort\zen02.cpp) (Figure zen02.cpp)
zensort/zen02.cpp
The first change of any significance in this program is the addition of two new
arrays that we will use to keep track of the buffer size for each possible key
character and the total number of characters stored in each buffer. Of course,
because we are assigning memory to buffers in a dynamic fashion, we can't
allocate those buffers until we know how much memory we want to devote to each
buffer. Therefore, the allocation has to be inside the main loop rather than
preceding it. By the same token, we have to delete each buffer before the end of
the main loop so that they can be re-allocated for the next pass.
The next question, of course, is how we decide how much space to devote to each
buffer. It seemed to me that the best way to approach this would be to calculate the
proportion of the entire file that the records for each key character correspond to,
and allocate that proportion of the entire buffer space to the buffer for that key
character, so that's how I did it.
First, we add up all of the record length totals; then we compute the ratio of the
total amount of space available for buffers to the total amount of data in the file.
Then we step through all the different key characters and compute the appropriate
size of the buffer for each key character. If the result comes out to be zero, then we
don't allocate any space for that buffer; instead, we assign a null pointer to that
buffer address, as that is much more efficient than allocating a zero-length buffer.
However, if the buffer size comes out to be greater than zero, we allocate the
computed amount of space for that buffer, then clear it to zeros. Finally, we clear
the buffer character count for that buffer, as we haven't stored anything in it yet.
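A sketch of that proportional allocation, again using illustrative names (including the constant TOTAL_BUFFER_SPACE, which is not the identifier used in zen02.cpp), might look like this:

    #include <cstddef>
    #include <cstring>

    const unsigned long TOTAL_BUFFER_SPACE = 4 * 1024 * 1024;   // hypothetical pool size

    // Proportional allocation (sketch): give each key character a share of a
    // fixed pool in proportion to how much of the file its records occupy,
    // and allocate nothing at all for characters that never occur.
    void allocate_buffers(const unsigned long displacement[256],
                          char *buffer[256], size_t buffer_capacity[256],
                          size_t buffer_count[256])
    {
        unsigned long total_data = 0;
        for (int i = 0; i < 256; ++i)
            total_data += displacement[i];            // per-character byte counts

        double ratio = (double)TOTAL_BUFFER_SPACE / (double)total_data;

        for (int i = 0; i < 256; ++i)
        {
            buffer_capacity[i] = (size_t)(displacement[i] * ratio);
            if (buffer_capacity[i] == 0)
            {
                buffer[i] = 0;                        // unused character: no buffer
            }
            else
            {
                buffer[i] = new char[buffer_capacity[i]];
                std::memset(buffer[i], 0, buffer_capacity[i]);
            }
            buffer_count[i] = 0;                      // nothing stored in it yet
        }
    }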
When it's time to store some data in the buffer, we use the buffer character count
array rather than calling strlen to find out how much data is currently in the
buffer. I decided to track the count myself because when I first changed the buffer
allocation strategy from fixed to variable, the program ran much more slowly than
it had previously. This didn't make much sense to me at first, but upon reflection I
realized that the longer the buffers are, the longer it would take strlen to find the
end of each buffer. To prevent this undesirable effect, I decided to keep track of the
size of buffers myself rather than relying on strlen to do it. Of course, that
means that we have to add the length of each record to the total count for each
buffer as we add the record to the buffer, so I added a line to handle this task.
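The effect of that change on the inner loop of the distribution-pass sketch shown earlier might look something like this, with buffer_count standing in for the new character-count array:

    // Append one record to its buffer without calling strlen (sketch,
    // continuing the names used above): buffer_count[] remembers how many
    // bytes each buffer already holds, and memcpy appends at that offset.
    size_t used = buffer_count[key_char];             // no strlen scan needed
    if (used + line.length() >= buffer_capacity[key_char])
    {
        out.seekp(total_displacement[key_char]);      // flush as before
        out.write(buffer[key_char], used);
        total_displacement[key_char] += used;
        ++total_writes;
        buffer_count[key_char] = 0;                   // buffer is now logically empty
        used = 0;
    }
    std::memcpy(buffer[key_char] + used, line.c_str(), line.length());
    buffer_count[key_char] += line.length();          // maintain the count ourselves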

The Second Version
So how does this second version of the program actually perform? Figure
timings.02 illustrates how it works with files of various sizes.

Performance of Zensort version 2 (Zensort\timings.02) (Figure timings.02)
zensort/timings.02
If you compare the performance of this second version of the program to that of the
previous version on the small file, you'll notice that it is almost 7.5 times as fast
as the previous version on both the 64 MB machine and the 192 MB machine.
However, what is more important is how well it performs when we have a lot of
data to sort. While we haven't achieved quite as much of a speed-up on larger files
as on the smallest one, we have still sped up the one million record sort by a factor
of almost 4.5 to 1 when running the sort on a machine with "only" 64 MB of
RAM and more than 6.5 to 1 on the more generously equipped machine with 192
MB of RAM, which is not an insignificant improvement.
Is this the best we can
do? Not at all, as you'll see in the analysis of the other versions of the program.
Let's continue with a very simple change that provided some improvement without
any particular effort.
The Third Version
At this point in the development of the sorting algorithm, I decided that although
saving memory is nice, we don't have to go overboard. On the assumption that
anyone who wants to sort gigantic files has a reasonably capable computer, I
decided to increase the amount of memory allocated to the buffers from 4 MB to
16 MB. As you might imagine, this improved performance significantly on larger
files, although not nearly as much proportionally as our previous change did.
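In terms of the hypothetical constant used in the earlier allocation sketch, the entire change amounts to no more than this:

    // Version 3 (sketch): the only change is the size of the buffer pool.
    const unsigned long TOTAL_BUFFER_SPACE = 16 * 1024 * 1024;   // was 4 * 1024 * 1024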

