
Algorithms (Part 18)


EXTERNAL SORTING
exactly after the sort phase is completed). The best choice between these two
alternatives, the lowest reasonable value of P and the highest reasonable
value of P, is obviously very dependent on many system parameters: both
alternatives (and some in between) should be considered.
Polyphase Merging
One problem with balanced multiway merging for tape sorting is that it
requires either an excessive number of tape units or excessive copying. For
P-way merging either we must use 2P tapes (P for input and P for output)
or we must copy almost all of the file from a single output tape to P input
tapes between merging passes, which effectively doubles the number of passes
to be about 2 log_P(N/2M). Several clever tape-sorting algorithms have been
invented which eliminate virtually all of this copying by changing the way in
which the small sorted blocks are merged together. The most prominent of
these methods is called polyphase merging.
The basic idea behind polyphase merging is to distribute the sorted blocks
produced by replacement selection somewhat unevenly among the available
tape units (leaving one empty) and then to apply a “merge until empty”
strategy, at which point the emptied input tape and the output tape switch
roles.
For example, suppose that we have just three tapes, and we start out
with the following initial configuration of sorted blocks on the tapes. (This
comes from applying replacement selection to our example file with an internal
memory that can only hold two records.)


Tape 1: A O R S T   I N   A G N   D E M R   G I N
Tape 2: E G X   A M P   E L
Tape 3:
After three 2-way merges from tapes 1 and 2 to tape 3, the second tape
becomes empty and we are left with the configuration:
Tape 1: D E M R   G I N
Tape 2:
Tape 3: A E G O R S T X   A I M N P   A E G L N
Then, after two 2-way merges from tapes 1 and 3 to tape 2, the first tape
becomes empty, leaving:
Tape 1:
Tape 2: A D E E G M O R R S T X   A G I I M N N P
Tape 3: A E G L N
The sort is completed in two more steps. First, a two-way merge from
tapes 2 and 3 to tape 1 leaves one file on tape 2 and one file on tape 1. Then a
two-way merge from tapes 1 and 2 to tape 3 leaves the entire sorted file on
tape 3.
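The three-tape example above can be replayed with a short Python sketch. This is an illustration rather than code from the text: the function name `polyphase_sort` is hypothetical, tapes are modeled as in-memory lists of runs instead of sequential devices, and the sketch assumes the runs are distributed so that exactly one tape becomes empty in each phase.

```python
from heapq import merge


def polyphase_sort(tapes):
    """Merge sorted runs spread over several tapes, one of which is empty.

    Each tape is a list of runs and each run is a sorted list.
    Returns (index of the tape holding the result, the single sorted run).
    """
    tapes = [list(t) for t in tapes]          # do not mutate the caller's lists
    while sum(len(t) for t in tapes) > 1:
        empty = [i for i, t in enumerate(tapes) if not t]
        inputs = [i for i, t in enumerate(tapes) if t]
        if not empty or len(inputs) < 2:
            raise ValueError("run counts do not fit a polyphase schedule")
        out = empty[0]                        # the empty tape receives output
        # "Merge until empty": stop as soon as any input tape runs out.
        while all(tapes[i] for i in inputs):
            runs = [tapes[i].pop(0) for i in inputs]
            tapes[out].append(list(merge(*runs)))
    final = next(i for i, t in enumerate(tapes) if t)
    return final, tapes[final][0]


# The initial configuration from the example: five runs, three runs, empty.
tape1 = [list(r) for r in ("AORST", "IN", "AGN", "DEMR", "GIN")]
tape2 = [list(r) for r in ("EGX", "AMP", "EL")]
where, result = polyphase_sort([tape1, tape2, []])
print("Tape", where + 1, ":", "".join(result))
```

The role switch is implicit here: whichever tape is empty at the start of a phase becomes the output tape for that phase.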
This “merge until empty” strategy can be extended to work for an arbitrary
number of tapes. For example, if we have four tape units T1, T2, T3, and T4
and we start out with T1 being the output tape, T2 having 13 initial runs,
T3 having 11 initial runs, and T4 having 7 initial runs, then after running
a 3-way “merge until empty,” we have T4 empty, T1 with 7 (long) runs, T2
with 6 runs, and T3 with 4 runs. At this point, we can rewind T1 and make it
an input tape, and rewind T4 and make it an output tape. Continuing in this
way, we eventually get the whole sorted file onto T1:
T1  T2  T3  T4
 0  13  11   7
 7   6   4   0
 3   2   0   4
 1   0   2   2
 0   1   1   1
 1   0   0   0
The merge is broken up into many phases which don’t involve all the data,
but no direct copying is involved.
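The table can be reproduced by tracking only the number of runs on each tape. The following Python sketch is an illustration, not code from the text; the name `polyphase_run_counts` is hypothetical, and the sketch assumes that exactly one tape is empty at the start of every phase, as in the table above.

```python
def polyphase_run_counts(counts):
    """Return the run counts on each tape after every phase.

    Assumes a distribution for which exactly one tape is empty at the
    start of each phase; any other distribution raises ValueError.
    """
    counts = list(counts)
    rows = [counts]
    while sum(counts) > 1:
        inputs = [c for c in counts if c > 0]
        if counts.count(0) != 1 or len(inputs) < 2:
            raise ValueError("not a perfect polyphase distribution")
        out = counts.index(0)         # the empty tape is the output tape
        merges = min(inputs)          # merge until the shortest input is empty
        counts = [merges if i == out else c - merges
                  for i, c in enumerate(counts)]
        rows.append(counts)
    return rows


for row in polyphase_run_counts([0, 13, 11, 7]):
    print(*row)
```

Each phase performs as many merges as there are runs on the shortest input tape, which is why that tape becomes the next output tape.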
The main difficulty in implementing a polyphase merge is to determine
how to distribute the initial runs. It is not difficult to see how to build the
table above by working backwards: take the largest number on each line, make
it zero, and add it to each of the other numbers to get the previous line. This
corresponds to defining the highest-order merge for the previous line which
could give the present line. This technique works for any number of tapes
(at least three): the numbers which arise are “generalized Fibonacci numbers”
which have many interesting properties. Of course, the number of initial runs
may not be known in advance, and it probably won’t be exactly a generalized
Fibonacci number. Thus a number of “dummy” runs must be added to make
the number of initial runs exactly what is needed for the table.
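The backwards construction described above is easy to mechanize. The Python sketch below is an illustration under stated assumptions: the names `perfect_distributions` and `initial_distribution` are hypothetical, ties for the largest entry are broken by taking the first one, and the sketch only reports how many dummy runs are needed in total. It does not decide how to spread the dummy runs among the tapes.

```python
def perfect_distributions(num_tapes):
    """Yield perfect run distributions, working backwards from the end.

    Starts from the final state (one run on the first tape) and applies
    the rule from the text: zero the largest entry and add it to the rest.
    """
    row = [1] + [0] * (num_tapes - 1)
    while True:
        yield row
        big = row.index(max(row))      # ties: take the first largest entry
        row = [0 if i == big else c + row[big] for i, c in enumerate(row)]


def initial_distribution(num_tapes, num_runs):
    """Return (smallest perfect distribution holding num_runs, dummy count)."""
    if num_tapes < 3:
        raise ValueError("polyphase merging needs at least three tapes")
    for row in perfect_distributions(num_tapes):
        if sum(row) >= num_runs:
            return row, sum(row) - num_runs


print(initial_distribution(4, 31))
print(initial_distribution(4, 25))
```

For four tapes the totals produced this way are 1, 3, 5, 9, 17, 31, and so on, so a file with 25 initial runs would be padded with 6 dummy runs to reach the 0, 13, 11, 7 distribution.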
The analysis of polyphase merging is complicated, interesting, and yields
surprising results. For example, it turns out that the very best method for
distributing dummy runs among the tapes involves using extra phases and
more dummy runs than would seem to be needed. The reason for this is that
some runs are used in merges much more often than others.
There are many other factors to be taken into consideration in implementing
a most efficient tape-sorting method. For example, a major factor which
we have not considered at all is the time that it takes to rewind a tape. This
subject has been studied extensively, and many fascinating methods have been
defined. However, as mentioned above, the savings achievable over the simple
multiway balanced merge are quite limited. Even polyphase merging is only
better than balanced merging for small P, and then not substantially. For
P > 8, balanced merging is likely to run faster than polyphase, and for smaller
P the effect of polyphase is basically to save two tapes (a balanced merge with
two extra tapes will run faster).
An Easier Way
Many modern computer systems provide a large virtual memory capability
which should not be overlooked in implementing a method for sorting very
large files. In a good virtual memory system, the programmer has the ability
to address a very large amount of data, leaving to the system the responsibility
of making sure that addressed data is transferred from external to internal
storage when needed. This strategy relies on the fact that many programs have
a relatively small “locality of reference”: each reference to memory is likely to
be to an area of memory that is relatively close to other recently referenced
areas. This implies that transfers from external to internal storage are needed
infrequently. An internal sorting method with a small locality of reference can
work very well on a virtual memory system. (For example, Quicksort has two
“localities”: most references are near one of the two partitioning pointers.)
But check with your systems programmer before trying it on a very large file:
a method such as radix sorting, which has no locality of reference whatsoever,
would be disastrous on a virtual memory system, and even Quicksort could
cause problems, depending on how well the available virtual memory system
is implemented. On the other hand, the strategy of using a simple internal
sorting method for sorting disk files deserves serious consideration in a good
virtual memory environment.
Exercises
1. Describe how you would do external selection: find the kth largest in a
file of N elements, where N is much too large for the file to fit in main
memory.
2. Implement the replacement selection algorithm, then use it to test the
claim that the runs produced are about twice the internal memory size.
3. What is the worst that can happen when replacement selection is used to
produce initial runs in a file of N records, using a priority queue of size
M, with M < N?
4. How would you sort the contents of a disk if no other storage (except
main memory) were available for use?
5. How would you sort the contents of a disk if only one tape (and main
memory) were available for use?
6. Compare the 4-tape and 6-tape multiway balanced merge to polyphase
merge with the same number of tapes, for 31 initial runs.
7. How many phases does 5-tape polyphase merge use when started up with
four tapes containing 26, 15, 22, 28 runs?
8. Suppose the 31 initial runs in a 4-tape polyphase merge are each one
record long (distributed 0, 13, 11, 7 initially). How many records are
there in each of the files involved in the last three-way merge?
9. How should small files be handled in a Quicksort implementation to be
run on a very large file within a virtual memory environment?
10. How would you organize an external priority queue? (Specifically, design
a way to support the insert and remove operations of Chapter 11, when
the number of elements in the priority queue could grow to be much too
large for the queue to fit in main memory.)
SOURCES for Sorting
The primary reference for this section is volume three of D. E. Knuth’s
series on sorting and searching. Further information on virtually every topic
that we’ve touched upon can be found in that book. In particular, the results
that we’ve quoted on performance characteristics of the various algorithms
are backed up by complete mathematical analyses in Knuth’s book.
There is a vast amount of literature on sorting. Knuth and Rivest’s
1973 bibliography contains hundreds of entries, and this doesn’t include the
treatment of sorting in countless books and articles on other subjects (not to
mention work since 1973).
For Quicksort, the best reference is Hoare’s original 1962 paper, which
suggests all the important variants, including the use for selection discussed
in Chapter 12. Many more details on the mathematical analysis and the
practical effects of many of the modifications and embellishments which have
been suggested over the years may be found in this author’s 1975 Ph.D. thesis.
A good example of an advanced priority queue structure, as mentioned in
Chapter 11, is J. Vuillemin’s “binomial queues” as implemented and analyzed
by M. R. Brown. This data structure supports all of the priority queue
operations in an elegant and efficient manner.
To get an impression of the myriad details of reducing algorithms like
those we have discussed to general-purpose practical implementations, a reader
would be advised to study the reference material for his particular computer
system’s sort utility. Such material necessarily deals primarily with formats of
keys, records and files as well as many other details, and it is often interesting
to identify how the algorithms themselves are brought into play.
M. R. Brown, “Implementation and analysis of binomial queue algorithms,”
SIAM Journal on Computing, 7, 3 (August, 1978).
C. A. R. Hoare, “Quicksort,” Computer Journal, 5, 1 (1962).
D. E. Knuth, The Art of Computer Programming. Volume 3: Sorting and
Searching, Addison-Wesley, Reading, MA, second printing, 1975.
R. L. Rivest and D. E. Knuth, “Bibliography 26: Computer Sorting,”
Computing Reviews, 13, 6 (June, 1972).
R. Sedgewick, Quicksort, Garland, New York, 1978. (Also appeared as the
author’s Ph.D. dissertation, Stanford University, 1975).
SEARCHING
14. Elementary Searching Methods
A fundamental operation intrinsic to a great many computational tasks
is searching: retrieving some particular information from a large amount
of previously stored information. Normally we think of the information as
divided up into records, each record having a key for use in searching. The
goal of the search is to find all records with keys matching a given search key.
The purpose of the search is usually to access information within the record
(not merely the key) for processing.
Two common terms often used to describe data structures for searching
are dictionaries and symbol tables. For example, in an English language
dictionary, the “keys” are the words and the “records” the entries associated with
the words which contain the definition, pronunciation, and other associated
information. (One can prepare for learning and appreciating searching methods
by thinking about how one would implement a system allowing access to an
English language dictionary.) A symbol table is the dictionary for a program:
the “keys” are the symbolic names used in the program, and the “records”
contain information describing the object named.
In searching (as in sorting) we have programs which are in widespread
use on a very frequent basis, so that it will be worthwhile to study a variety
of methods in some detail. As with sorting, we’ll begin by looking at some
elementary methods which are very useful for small tables and in other special
situations and illustrate fundamental techniques exploited by more advanced
methods. We’ll look at methods which store records in arrays which are either
searched with key comparisons or indexed by key value, and we’ll look at a
fundamental method which builds structures defined by the key values.
As with priority queues, it is best to think of search algorithms as belonging
to packages implementing a variety of generic operations which can be
separated from particular implementations, so that alternate implementations
could be substituted easily. The operations of interest include:
Initialize the data structure.
Search for a record (or records) having a given key.
Insert a new record.
Delete a specified record.
Join two dictionaries to make a large one.
Sort the dictionary; output all the records in sorted order.
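One way to picture such a package is as an abstract interface with interchangeable implementations. The Python sketch below is only an illustration: the class and method names (`Dictionary`, `ListDictionary`, `sorted_records`, and so on) are hypothetical and do not come from the text, and the unordered-list implementation is chosen for brevity, not efficiency.

```python
from abc import ABC, abstractmethod


class Dictionary(ABC):
    """Hypothetical interface for the generic operations listed above."""

    @abstractmethod
    def search(self, key):
        """Return all records having the given key."""

    @abstractmethod
    def insert(self, key, record):
        """Insert a new record."""

    @abstractmethod
    def delete(self, key, record):
        """Delete a specified record."""

    @abstractmethod
    def join(self, other):
        """Absorb all records of another dictionary."""

    @abstractmethod
    def sorted_records(self):
        """Return all (key, record) pairs in sorted key order."""


class ListDictionary(Dictionary):
    """Simplest possible implementation: an unordered list of pairs."""

    def __init__(self):                       # initialize
        self._items = []

    def search(self, key):
        return [rec for k, rec in self._items if k == key]

    def insert(self, key, record):
        self._items.append((key, record))

    def delete(self, key, record):
        self._items.remove((key, record))     # raises ValueError if absent

    def join(self, other):
        for key, record in other.sorted_records():
            self.insert(key, record)

    def sorted_records(self):
        return sorted(self._items, key=lambda item: item[0])
```

Client code written against `Dictionary` would not need to change if `ListDictionary` were later replaced by a faster structure.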

As with priority queues, it is sometimes convenient to combine some of these
operations. For example, a search and insert operation is often included for
efficiency in situations where records with duplicate keys are not to be kept
within the data structure. In many methods, once it has been determined
that a key does not appear in the data structure, then the internal state of
the search procedure contains precisely the information needed to insert a new
record with the given key.
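As a small illustration of this point, the following Python sketch combines search and insert on a sorted array of distinct keys. It assumes binary search via the standard `bisect` module, which is a technique swapped in for the example rather than one introduced in this passage; the function name `search_insert` and the `make_record` callback are hypothetical.

```python
from bisect import bisect_left


def search_insert(keys, records, key, make_record):
    """Return the record for key, inserting a new one if it is absent.

    keys is a sorted list of distinct keys; records is the parallel list.
    """
    pos = bisect_left(keys, key)              # the search
    if pos < len(keys) and keys[pos] == key:
        return records[pos]                   # found: nothing to insert
    # Not found: pos is exactly where the new key belongs, so the
    # unsuccessful search has already done the work of locating it.
    keys.insert(pos, key)
    records.insert(pos, make_record(key))
    return records[pos]
```

A separate search followed by a separate insert would locate the same position twice.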
Records with duplicate keys can be handled in one of several ways,
depending on the application. First, we could insist that the primary searching
data structure contain only records with distinct keys. Then each “record” in
this data structure might contain, for example, a link to a list of all records
having that key. This is the most convenient arrangement from the point
of view of the design of searching algorithms, and it is convenient in some
applications since all records with a given search key are returned with one
search.
The second possibility is to leave records with equal keys in the
primary searching data structure and return any record with the given key
for a search. This is simpler for applications that process one record at a
time, where the order in which records with duplicate keys are processed is
not important. It is inconvenient from the algorithm design point of view
because some mechanism for retrieving all records with a given key must still
be provided. A third possibility is to assume that each record has a unique
identifier (apart from the key), and require that a search find the record with
a given identifier, given the key. Or, some more complicated mechanism could
be used to distinguish among records with equal keys.
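The first and third possibilities can be sketched together in Python. This is an illustration under stated assumptions, not a method from the text: the class name `MultiRecordTable` is hypothetical, a built-in `defaultdict` stands in for the primary searching data structure, and each record is assumed to carry a unique identifier supplied by the caller.

```python
from collections import defaultdict


class MultiRecordTable:
    """Distinct keys in the primary structure, each linked to a record list."""

    def __init__(self):
        self._by_key = defaultdict(list)

    def insert(self, key, record_id, data):
        self._by_key[key].append((record_id, data))

    def search(self, key):
        """First approach: one search returns every record with this key."""
        return list(self._by_key.get(key, ()))

    def search_by_id(self, key, record_id):
        """Third approach: pick out one record by its unique identifier."""
        for rid, data in self._by_key.get(key, ()):
            if rid == record_id:
                return data
        return None


table = MultiRecordTable()
table.insert("smith", 1, "Ann")
table.insert("smith", 2, "Bob")
print(table.search("smith"))
print(table.search_by_id("smith", 2))
```

The second possibility, returning any one record with the given key, would correspond to returning only the first element of the list.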
Each of the fundamental operations listed above has important applications,
and quite a large number of basic organizations have been suggested to
support efficient use of various combinations of the operations. In this and the
next few chapters, we’ll concentrate on implementations of the fundamental
functions search and insert (and, of course, initialize), with some comment on
delete and sort when appropriate. As with priority queues, the join operation
normally requires advanced techniques which we won’t be able to consider
here.
Sequential Searching
The simplest method for searching is simply to store the records in an array,
