sorting function. In such a case, by far the greatest proportion of the time is spent
in sorting the keys, so that's where we'll focus our attention next.
Sorting Speed
Which sort should we use? Surely we can't improve on the standard sort supplied
with the C++ compiler we're using (typically Quicksort) or one of its relatives such
as heapsort. After all, the conventional wisdom is that even the fastest sorting
algorithm takes time proportional to n*log(n). That is, if sorting 1000 items takes 1
second, sorting 1,000,000 items would take at least 2000 seconds, using an optimal
sorting algorithm. Since these commonly used algorithms all have average times
proportional to this best case, it seems that all we have to do is select the one best
suited for our particular problem.
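As a quick check on the 2000-second figure above, assuming the running time is exactly proportional to n*log(n) (the base of the logarithm cancels in the ratio):

```latex
\frac{T(10^6)}{T(10^3)}
  = \frac{10^6 \log 10^6}{10^3 \log 10^3}
  = \frac{10^6 \cdot 6}{10^3 \cdot 3}
  = 2000
```

So if 1000 items take 1 second, 1,000,000 items take about 2000 seconds under this model.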
This is a common misconception: in fact, only sorts that require comparisons
among keys have a lower bound proportional to n*log(n). The distribution
counting sort we are going to use takes time proportional to n, not n*log(n), so that
if it takes 1 second to sort 1000 items, it would take 1000 seconds to sort 1,000,000
items, not 2000 seconds.[2] Moreover, this sort is quite competitive with the
commonly used sorts even for a few hundred items, and it is easy to code as well.
For some applications, however, its most valuable attribute is that its timing is data
independent. That is, the sort takes the same amount of time to execute whether the
data items are all the same, all different, already sorted, in reverse order, or
randomly shuffled. This is particularly valuable in real-time applications, where
knowing the average time is not sufficient.[3]
Actually, this sort takes time
proportional to the number of keys multiplied by the length of each key. The
reason that the length of the keys is important is that a distribution counting sort
actually treats each character position of the sort keys as a separate "key"; these
keys are used in order from the least to the most significant. Therefore, this method
actually takes time proportional to n*m, where n is the number of keys and m is the
length of each key. However, in most real-world applications of sorting, the
number of items to be sorted far exceeds the length of each item, and additions to
the list take the form of more items, not lengthening of each one. If we had a few
very long items to sort, this sort would not be as appropriate.
You're probably wondering how fast this distribution sort really is compared to
Quicksort. I tested both Microsoft's implementation of the Quicksort algorithm
(qsort) in version 5.0 of their Visual C++ compiler and my distribution counting
sort, which I call "Megasort", on several sets of between 23480 and 234801 records,
sorting on 5-digit ZIP codes. There's no contest.[4] The difference in performance
ranged from approximately 43 to 1 at the low end up to an astonishing 316 to 1 at
the high end! Figures promailx.00 and promail.00, near the end of this chapter,
show the times in seconds when processing these varying sets of records.
Now let's see how such increases in performance can be achieved with a simple
algorithm.
The Distribution Counting Sort
The basic method used is to make one pass through the keys for each character
position in the key, in order to discover how many keys have each possible ASCII
character in the character position that we are currently considering, and another
pass to actually rearrange the keys. As a simplified example, suppose that we have
ten keys to be sorted and we want to sort only on the first letter of each key.
The first pass consists of counting the number of keys that begin with each letter.
In the example in Figure unsorted, we have three keys that begin with the letter 'A',
five that begin with the letter 'B', and two that begin with the letter 'C'. Since 'A' is
the lowest character we have seen, the first key we encounter that starts with an 'A'
should be the first key in the result array, the second key that begins with an 'A'
should be the second key in the result array, and the third 'A' key should be the
third key in the result array, since all of the 'A' keys should precede all of the 'B'
keys.
Unsorted keys (Figure unsorted)
1 bicycle
2 airplane
3 anonymous
4 cashier
5 bottle
6 bongos
7 antacid
8 competent
9 bingo
10 bombardier
The next keys in sorted order will be the ones that start with 'B'; therefore, the first
key that we encounter that starts with a 'B' should be the fourth key in the result
array, the second through fifth 'B' keys should be the fifth through eighth keys in
the result array, and the 'C' keys should be numbers nine and ten in the result array,
since all of the 'B' keys should precede the 'C' keys. Figure counts.and.pointers
illustrates these relationships among the keys.
Counts and pointers (Figure counts.and.pointers)
                  A  B  C
Counts            3  5  2
Starting indexes  1  4  9

Key         Old Index  New Index  Explanation
Bicycle         1          4      The first B goes to position 4.
Airplane        2          1      The first A goes to position 1.
Anonymous       3          2      The second A goes after the first.
Cashier         4          9      The first C goes to position 9.
Bottle          5          5      The second B goes after the first.
Bongos          6          6      The third B goes after the second.
Antacid         7          3      The third A goes after the second.
Competent       8         10      The second C goes after the first.
Bingo           9          7      The fourth B goes after the third.
Bombardier     10          8      The fifth B goes after the fourth.
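The counts and starting indexes in this figure can be computed as in the following sketch. This is only an illustration, not the book's code: the names Table, count_chars, and starting_indexes are hypothetical, it assumes every key is longer than the position being examined, and it uses 0-based indexes, so each starting index is one less than the 1-based value in the figure.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One slot per possible value of an unsigned char (hypothetical type name).
using Table = std::array<std::size_t, 256>;

// Counting pass: how many keys have each character at position `pos`.
// Assumes every key is longer than `pos`.
Table count_chars(const std::vector<std::string>& keys, std::size_t pos) {
    Table counts{};  // value-initialized, so every count starts at zero
    for (const std::string& key : keys) {
        assert(pos < key.size());
        ++counts[static_cast<unsigned char>(key[pos])];
    }
    return counts;
}

// Convert counts to starting indexes: the first slot for a character is the
// number of keys whose character at that position is smaller.
// Indexes are 0-based here (the figure uses 1-based positions).
Table starting_indexes(const Table& counts) {
    Table start{};
    std::size_t next = 0;
    for (std::size_t c = 0; c < counts.size(); ++c) {
        start[c] = next;
        next += counts[c];
    }
    return start;
}
```

For the ten lowercase keys of Figure unsorted, examined at position 0, this gives counts of 3, 5, and 2 for 'a', 'b', and 'c', and 0-based starting indexes of 0, 3, and 8.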
If we rearrange the keys in the order indicated by their new indexes, we arrive at
the situation shown in Figure afterfirst:
After the sort on the first character (Figure afterfirst)
Index  Unsorted Keys  Sorted Keys
  1    Bicycle        Airplane
  2    Airplane       Anonymous
  3    Anonymous      Antacid
  4    Cashier        Bicycle
  5    Bottle         Bottle
  6    Bongos         Bongos
  7    Antacid        Bingo
  8    Competent      Bombardier
  9    Bingo          Cashier
 10    Bombardier     Competent
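The complete pass on one character position (count, compute starting indexes, then rearrange) might look like the sketch below. It is a minimal illustration under stated assumptions rather than the author's implementation: sort_on_position is a hypothetical name, every key is assumed to be longer than the position examined, and whole strings are copied for simplicity where a real implementation would more likely move pointers to the keys.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: one distribution counting pass on the character at
// position `pos`. Assumes every key is longer than `pos`.
std::vector<std::string> sort_on_position(const std::vector<std::string>& keys,
                                          std::size_t pos) {
    // First holds counts, then the next free output slot for each character.
    std::array<std::size_t, 256> next{};

    // Count the keys that have each character value at `pos`.
    for (const std::string& key : keys) {
        assert(pos < key.size());
        ++next[static_cast<unsigned char>(key[pos])];
    }

    // Replace each count with the 0-based starting index for that character.
    std::size_t total = 0;
    for (std::size_t& slot : next) {
        std::size_t count = slot;
        slot = total;
        total += count;
    }

    // Copy the keys in input order. Keys with the same character land in
    // consecutive slots, so they keep their relative order.
    std::vector<std::string> result(keys.size());
    for (const std::string& key : keys) {
        result[next[static_cast<unsigned char>(key[pos])]++] = key;
    }
    return result;
}
```

Called with the ten keys of the figure and position 0, this should return them in the order shown in the Sorted Keys column. The work is two passes over the keys plus one pass over the 256-entry table, whatever the initial order of the data.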

Multicharacter Sorting
Of course, we usually want to sort on more than the first character. As we noted
earlier, each character position is actually treated as a separate key; that is, the
pointers to the keys are rearranged to be in order by whatever character position we
are currently sorting on. With this in mind, let's take the case of a two-character
string; once we can handle that situation, the algorithm doesn't change significantly
when we add more character positions.
We know that the final result we want is for all the A's to come before all the B's,
which must precede all the C's, etc., but within the A's, the AA's must come before
the AB's, which have to precede the AC's, etc. Of course, the same applies to keys
starting with B and C: the BA's must come before the BB's, etc. How can we
manage this?
We already know that we are going to have to sort the data separately on each
character position, so let's work backward from what the second sort needs as
input. When we are getting ready to do the second (and final) sort, we need to
know that all the AA's precede all the AB's, which precede all the AC's, etc., and
that all the BA's precede the BB's, which precede the BC's. The same must be true
for the keys starting with C, D, and any other letter. The reason that the second sort
will preserve this organization of its input data is that it moves keys with the same
character at the current character position from input to output in order of their
previous position in the input data. Therefore, any two keys that have the same
character in the current position (both A's, B's, C's, etc.) will maintain their relative
order in the output. For example, if all of the AA's precede all of the AB's in the
input, they will also do so in the output, and similarly for the BA's and BB's. This
is exactly what we want. Notice that we don't care at this point whether the BA's
are behind or ahead of the AB's, as arranging the data according to the first
character is the job of the second sort (which we haven't done yet). But how can we
ensure that all the AA's precede the AB's, which precede the AC's, etc. in the
input? By sorting on the second character position first!
For example, suppose we are sorting the following keys: AB, CB, BA, BC, CA,
BA, BB, CC. We start the sort by counting the number of occurrences of each
character in the second position of each key (the less significant position). There
are three A's, three B's, and two C's. Since A is the character closest to the
beginning of the alphabet, the first key that has an A in the second position goes in
the first slot of the output. The second and third keys that have an A in the second
position follow the first one. Those that have a B in the second position are next, in
output positions 4, 5, and 6. The C's bring up the rear, producing the situation in
Figure lesser.char.
After this first sort (on the second character position), all of the keys that have an A
in the second position are ahead of all of the keys that have a B in the second
position, which precede all those that have a C in the second position. Therefore,
all AA keys precede all AB keys, which precede all AC keys, and the same is true
for BA, BB, BC and CA, CB, and CC as well. This is the exact arrangement of
input needed for the second sort.
Less significant character sort (Figure lesser.char)
Index  Input  Output
  1    AB     BA *
  2    CB     CA
  3    BA     BA *
  4    BC     AB
  5    CA     CB
  6    BA     BB *
  7    BB     BC *
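The two-pass procedure described above, sorting on the less significant character first and then on the more significant one, can be sketched as follows. The function names sort_on_position and lsd_sort are hypothetical and this is not the book's Megasort code; it assumes all keys have the same length, key_length, and copies strings rather than rearranging pointers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: one stable distribution counting pass on position `pos`.
std::vector<std::string> sort_on_position(const std::vector<std::string>& keys,
                                          std::size_t pos) {
    std::array<std::size_t, 256> next{};
    for (const std::string& key : keys) {
        assert(pos < key.size());
        ++next[static_cast<unsigned char>(key[pos])];
    }
    // Turn counts into 0-based starting indexes.
    std::size_t total = 0;
    for (std::size_t& slot : next) {
        std::size_t count = slot;
        slot = total;
        total += count;
    }
    // Copy in input order so that equal characters keep their relative order.
    std::vector<std::string> result(keys.size());
    for (const std::string& key : keys) {
        result[next[static_cast<unsigned char>(key[pos])]++] = key;
    }
    return result;
}

// Hypothetical driver: sort fixed-length keys by applying one pass per
// character position, from the least significant (last) to the most
// significant (first).
std::vector<std::string> lsd_sort(std::vector<std::string> keys,
                                  std::size_t key_length) {
    for (std::size_t pos = key_length; pos-- > 0;) {
        keys = sort_on_position(keys, pos);
    }
    return keys;
}
```

For the eight keys in the example, the pass on position 1 should yield BA, CA, BA, AB, CB, BB, BC, CC, and lsd_sort with a key length of 2 should yield AB, BA, BA, BB, BC, CA, CB, CC. The second pass can rely on the first only because each pass is stable.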

