The Art of Computer Programming, Volume 3: Sorting and Searching (second edition), part 2


CHAPTER SIX

SEARCHING

Let's look at the record.
— AL SMITH (1928)

This chapter might have been given the more pretentious title "Storage and Retrieval of Information"; on the other hand, it might simply have been called "Table Look-Up." We are concerned with the process of collecting information in a computer's memory, in such a way that the information can subsequently be recovered as quickly as possible. Sometimes we are confronted with more data than we can really use, and it may be wisest to forget and to destroy most of it; but at other times it is important to retain and organize the given facts in such a way that fast retrieval is possible.

Most of this chapter is devoted to the study of a very simple search problem: how to find the data that has been stored with a given identification. For example, in a numerical application we might want to find f(x), given x and a table of the values of f; in a nonnumerical application, we might want to find the English translation of a given Russian word.
In general, we shall suppose that a set of N records has been stored, and the problem is to locate the appropriate one. As in the case of sorting, we assume that each record includes a special field called its key; this terminology is especially appropriate, because many people spend a great deal of time every day searching for their keys. We generally require the N keys to be distinct, so that each key uniquely identifies its record. The collection of all records is called a table or file, where the word "table" is usually used to indicate a small file, and "file" is usually used to indicate a large table. A large file or a group of files is frequently called a database.

Algorithms for searching are presented with a so-called argument, K, and the problem is to find which record has K as its key. After the search is complete, two possibilities can arise: Either the search was successful, having located the unique record containing K; or it was unsuccessful, having determined that K is nowhere to be found. After an unsuccessful search it is sometimes desirable to enter a new record, containing K, into the table; a method that does this is called a search-and-insertion algorithm. Some hardware devices known as associative memories solve the search problem automatically, in a way that might resemble the functioning of a human brain; but we shall study techniques for searching on a conventional general-purpose digital computer.

Although the goal of searching is to find the information stored in the record associated with K, the algorithms in this chapter generally ignore everything but the keys themselves. In practice we can find the associated data once we have located K; for example, if K appears in location TABLE + i, the associated data (or a pointer to it) might be in location TABLE + i + 1, or in DATA + i, etc. It is therefore convenient to gloss over the details of what should be done after K has been successfully found.
Searching is the most time-consuming part of many programs, and the substitution of a good search method for a bad one often leads to a substantial increase in speed. In fact we can often arrange the data or the data structure so that searching is eliminated entirely, by ensuring that we always know just where to find the information we need. Linked memory is a common way to achieve this; for example, a doubly linked list makes it unnecessary to search for the predecessor or successor of a given item. Another way to avoid searching occurs if we are allowed to choose the keys freely, since we might as well let them be the numbers {1, 2, ..., N}; then the record containing K can simply be placed in location TABLE + K. Both of these techniques were used to eliminate searching from the topological sorting algorithm discussed in Section 2.2.3. However, searches would have been necessary if the objects in the topological sorting algorithm had been given symbolic names instead of numbers. Efficient algorithms for searching turn out to be quite important in practice.

Search methods can be classified in several ways. We might divide them into internal versus external searching, just as we divided the sorting algorithms of Chapter 5 into internal versus external sorting. Or we might divide search methods into static versus dynamic searching, where "static" means that the contents of the table are essentially unchanging (so that it is important to minimize the search time without regard for the time required to set up the table), and "dynamic" means that the table is subject to frequent insertions and perhaps also deletions. A third possible scheme is to classify search methods according to whether they are based on comparisons between keys or on digital properties of the keys, analogous to the distinction between sorting by comparison and sorting by distribution. Finally we might divide searching into those methods that use the actual keys and those that work with transformed keys.
The organization of this chapter is essentially a combination of the latter two
modes of classification. Section 6.1 considers “brute force” sequential methods of
search, then Section 6.2 discusses the improvements that can be made based on
comparisons between keys, using alphabetic or numeric order to govern the decisions. Section 6.3 treats digital searching, and Section 6.4 discusses an important
class of methods called hashing techniques, based on arithmetic transformations
of the actual keys. Each of these sections treats both internal and external
searching, in both the static and the dynamic case; and each section points out
the relative advantages and disadvantages of the various algorithms.
Searching and sorting are often closely related to each other. For example, consider the following problem: Given two sets of numbers, A = {a_1, a_2, ..., a_m} and B = {b_1, b_2, ..., b_n}, determine whether or not A ⊆ B. Three solutions suggest themselves:



1. Compare each a_i sequentially with the b_j's until finding a match.

2. Sort the a's and b's, then make one sequential pass through both files, checking the appropriate condition.

3. Enter the b_j's in a table, then search for each of the a_i.

Each of these solutions is attractive for a different range of values of m and n. Solution 1 will take roughly c_1 m n units of time, for some constant c_1, and solution 2 will take about c_2 (m lg m + n lg n) units, for some (larger) constant c_2. With a suitable hashing method, solution 3 will take roughly c_3 m + c_4 n units of time, for some (still larger) constants c_3 and c_4. It follows that solution 1 is good for very small m and n, but solution 2 soon becomes better as m and n grow larger. Eventually solution 3 becomes preferable, until n exceeds the internal memory size; then solution 2 is usually again superior until n gets much larger still. Thus we have a situation where sorting is sometimes a good substitute for searching, and searching is sometimes a good substitute for sorting.

More complicated search problems can often be reduced to the simpler case


considered here. For example, suppose that the keys are words that might be slightly misspelled; we might want to find the correct record in spite of this error. If we make two copies of the file, one in which the keys are in normal lexicographic order and another in which they are ordered from right to left (as if the words were spelled backwards), a misspelled search argument will probably agree up to half or more of its length with an entry in one of these two files. The search methods of Sections 6.2 and 6.3 can therefore be adapted to find the key that was probably intended.

A related problem has received considerable attention in connection with airline reservation systems, and in other applications involving people's names, when there is a good chance that the name will be misspelled due to poor handwriting or voice transmission. The goal is to transform the argument into some code that tends to bring together all variants of the same name. The following contemporary form of the "Soundex" method, a technique that was originally developed by Margaret K. Odell and Robert C. Russell [see U.S. Patents 1261167 (1918), 1435663 (1922)], has often been used for encoding surnames:
1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions.

2. Assign the following numbers to the remaining letters after the first:

       b, f, p, v → 1                      l → 4
       c, g, j, k, q, s, x, z → 2          m, n → 5
       d, t → 3                            r → 6

3. If two or more letters with the same code were adjacent in the original name (before step 1), or adjacent except for intervening h's and w's, omit all but the first.

4. Convert to the form "letter, digit, digit, digit" by adding trailing zeros (if there are less than three digits), or by dropping rightmost digits (if there are more than three).

For example, the names Euler, Gauss, Hilbert, Knuth, Lloyd, Lukasiewicz, and Wachs have the respective codes E460, G200, H416, K530, L300, L222, W200.
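Rules 1-4 can be stated in executable form. The following Python function is a sketch of ours (the name `soundex` and the letter-table layout are not from the text); in rule 3, h and w are treated as transparent when testing adjacency, while the other dropped letters separate equal codes.

```python
def soundex(name):
    """Contemporary Soundex, following rules 1-4 above."""
    # Rule 2: digit assigned to each letter; a, e, h, i, o, u, w, y get none.
    table = {}
    for letters, digit in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                           ('l', '4'), ('mn', '5'), ('r', '6')]:
        for c in letters:
            table[c] = digit
    name = name.lower()
    code = name[0].upper()           # Rule 1: retain the first letter.
    prev = table.get(name[0], '')    # Code of the previous letter, for rule 3.
    for c in name[1:]:
        if c in 'hw':
            continue                 # Rule 3: h and w do not separate equal codes.
        digit = table.get(c, '')
        if digit and digit != prev:
            code += digit
        prev = digit                 # A vowel (empty code) resets adjacency.
    return (code + '000')[:4]        # Rule 4: pad or truncate to four characters.

print([soundex(n) for n in
       ['Euler', 'Gauss', 'Hilbert', 'Knuth', 'Lloyd', 'Lukasiewicz', 'Wachs']])
# → ['E460', 'G200', 'H416', 'K530', 'L300', 'L222', 'W200']
```

The output matches the seven codes listed above.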

Of course this system will bring together names that are somewhat different, as well as names that are similar; the same seven codes would be obtained for Ellery, Ghosh, Heilbronn, Kant, Liddy, Lissajous, and Waugh. And on the other hand a few related names like Rogers and Rodgers, or Sinclair and St. Clair, or Tchebysheff and Chebyshev, remain separate. But by and large the Soundex code greatly increases the chance of finding a name in one of its disguises. [For further information, see C. P. Bourne and D. F. Ford, JACM 8 (1961), 538-552; Leon Davidson, CACM 5 (1962), 169-171; Federal Population Censuses 1790-1890 (Washington, D.C.: National Archives, 1971), 90.]

When using a scheme like Soundex, we need not give up the assumption that all keys are distinct; we can make lists of all records with equivalent codes, treating each list as a unit.

Large databases tend to make the retrieval process more complex, since people often want to consider many different fields of each record as potential keys, with the ability to locate items when only part of the key information is specified. For example, given a large file about stage performers, a producer might wish to find all unemployed actresses between 25 and 30 with dancing talent and a French accent; given a large file of baseball statistics, a sportswriter may wish to determine the total number of runs scored by the Chicago White Sox in 1964, during the seventh inning of night games, against left-handed pitchers. Given a large file of data about anything, people like to ask arbitrarily complicated questions. Indeed, we might consider an entire library as a database, and a searcher may want to find everything that has been published about information retrieval. An introduction to the techniques for such secondary key (multi-attribute) retrieval problems appears below in Section 6.5.
Before entering into a detailed study of searching, it may be helpful to put things in historical perspective. During the pre-computer era, many books of logarithm tables, trigonometry tables, etc., were compiled, so that mathematical calculations could be replaced by searching. Eventually these tables were transferred to punched cards, and used for scientific problems in connection with collators, sorters, and duplicating punch machines. But when stored-program computers were introduced, it soon became apparent that it was now cheaper to recompute log x or cos x each time, instead of looking up the answer in a table.

Although the problem of sorting received considerable attention already in the earliest days of computers, comparatively little was done about algorithms for searching. With small internal memories, and with nothing but sequential media like tapes for storing large files, searching was either trivially easy or almost impossible.

But the development of larger and larger random-access memories during the 1950s eventually led to the recognition that searching was an interesting problem in its own right. After years of complaining about the limited amounts of space in the early machines, programmers were suddenly confronted with larger amounts of memory than they knew how to use efficiently.

The first surveys of the searching problem were published by A. I. Dumey, Computers & Automation 5, 12 (December 1956), 6-9; W. W. Peterson, IBM J. Research & Development 1 (1957), 130-146; A. D. Booth, Information and Control 1 (1958), 159-164; A. S. Douglas, Comp. J. 2 (1959), 1-9. More extensive treatments were given later by Kenneth E. Iverson, A Programming Language (New York: Wiley, 1962), 133-158, and by Werner Buchholz, IBM Systems J. 2 (1963), 86-111.

During the early 1960s, a number of interesting new search procedures based on tree structures were introduced, as we shall see; and research about searching is still actively continuing at the present time.

6.1.

SEQUENTIAL SEARCHING

"Begin at the beginning, and go on till you find the right key; then stop." This sequential procedure is the obvious way to search, and it makes a useful starting point for our discussion of searching because many of the more intricate algorithms are based on it. We shall see that sequential searching involves some very interesting ideas, in spite of its simplicity.

The algorithm might be formulated more precisely as follows:

Algorithm S (Sequential search). Given a table of records R_1, R_2, ..., R_N, whose respective keys are K_1, K_2, ..., K_N, this algorithm searches for a given argument K. We assume that N ≥ 1.

S1. [Initialize.] Set i ← 1.

S2. [Compare.] If K = K_i, the algorithm terminates successfully.

S3. [Advance.] Increase i by 1.

S4. [End of file?] If i ≤ N, go back to S2. Otherwise the algorithm terminates unsuccessfully. |

Notice that this algorithm can terminate in two different ways, successfully (having located the desired key) or unsuccessfully (having established that the given argument is not present in the table). The same will be true of most other algorithms in this chapter.

Fig. 1. Sequential or "house-to-house" search. [Flowchart with SUCCESS and FAILURE exits.]
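Rendered in a high-level language, Algorithm S is only a few lines. The sketch below is our own (Python, with 0-based indexing in place of the 1-based subscripts; like the algorithm, it assumes at least one record):

```python
def sequential_search(keys, target):
    """Algorithm S: return the index of target in keys, or None if absent."""
    i = 0                       # S1. [Initialize.]
    while True:
        if keys[i] == target:   # S2. [Compare.]
            return i            # Successful termination.
        i += 1                  # S3. [Advance.]
        if i >= len(keys):      # S4. [End of file?]
            return None         # Unsuccessful termination.

table = [61, 87, 15, 43, 92]
print(sequential_search(table, 43))   # → 3
print(sequential_search(table, 50))   # → None
```

Note that the inner loop tests two conditions per record examined, a point the text returns to below.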


A MIX program can be written down immediately.

Program S (Sequential search). Assume that K_i appears in location KEY + i, and that the remainder of record R_i appears in location INFO + i. The following program uses rA ≡ K, rI1 ≡ i − N.

    01  START    LDA   K          1     S1. Initialize.
    02           ENT1  1-N        1     i ← 1.
    03  2H       CMPA  KEY+N,1    C     S2. Compare.
    04           JE    SUCCESS    C     Exit if K = K_i.
    05           INC1  1          C-S   S3. Advance.
    06           J1NP  2B         C-S   S4. End of file?
    07  FAILURE  EQU   *          1-S   Exit if not in table.

At location SUCCESS, the instruction "LDA INFO+N,1" will now bring the desired information into rA. |

The analysis of this program is straightforward; it shows that the running time of Algorithm S depends on two things,

    C = the number of key comparisons;
    S = 1 if successful, 0 if unsuccessful.    (1)

Program S takes 5C − 2S + 3 units of time. If the search successfully finds K = K_i, we have C = i, S = 1; hence the total time is (5i + 1)u. On the other hand if the search is unsuccessful, we have C = N, S = 0, for a total time of (5N + 3)u. If every input key occurs with equal probability, the average value of C in a successful search will be

    (1 + 2 + ··· + N)/N = (N + 1)/2;    (2)

the standard deviation is, of course, rather large, about 0.289N (see exercise 1).

The algorithm above is surely familiar to all programmers. But too few people know that it is not always the right way to do a sequential search! A straightforward change makes the algorithm faster, unless the list of records is quite short:

Algorithm Q (Quick sequential search). This algorithm is the same as Algorithm S, except that it assumes the presence of a dummy record R_{N+1} at the end of the file.

Q1. [Initialize.] Set i ← 1, and set K_{N+1} ← K.

Q2. [Compare.] If K = K_i, go to Q4.

Q3. [Advance.] Increase i by 1 and return to Q2.

Q4. [End of file?] If i ≤ N, the algorithm terminates successfully; otherwise it terminates unsuccessfully (i = N + 1). |

Program Q (Quick sequential search). rA ≡ K, rI1 ≡ i − N.

    01  START    LDA   K          1      Q1. Initialize.
    02           STA   KEY+N+1    1      K_{N+1} ← K.
    03           ENT1  -N         1      i ← 0.
    04  2H       INC1  1          C+1-S  Q3. Advance.
    05           CMPA  KEY+N,1    C+1-S  Q2. Compare.
    06           JNE   2B         C+1-S  To Q3 if K_i ≠ K.
    07           J1NP  SUCCESS    1      Q4. End of file?
    08  FAILURE  EQU   *          1-S    Exit if not in table.

In terms of the quantities C and S in the analysis of Program S, the running time has decreased to (4C − 4S + 10)u; this is an improvement whenever C ≥ 6 in a successful search, and whenever N ≥ 8 in an unsuccessful search.

The transition from Algorithm S to Algorithm Q makes use of an important speed-up principle: When an inner loop of a program tests two or more conditions, we should try to reduce the testing to just one condition.

Another technique will make Program Q still faster.

Program Q' (Quicker sequential search). rA ≡ K, rI1 ≡ i − N.

    01  START    LDA   K            1             Q1. Initialize.
    02           STA   KEY+N+1      1             K_{N+1} ← K.
    03           ENT1  -1-N         1             i ← −1.
    04  3H       INC1  2            ⌊(C−S+2)/2⌋   Q3. Advance. (twice)
    05           CMPA  KEY+N,1      ⌊(C−S+2)/2⌋   Q2. Compare.
    06           JE    4F           ⌊(C−S+2)/2⌋   To Q4 if K = K_i.
    07           CMPA  KEY+N+1,1    ⌊(C−S+1)/2⌋   Q2. Compare. (next i)
    08           JNE   3B           ⌊(C−S+1)/2⌋   To Q3 if K ≠ K_{i+1}.
    09           INC1  1            (C−S) mod 2   Advance i.
    10  4H       J1NP  SUCCESS      1             Q4. End of file?
    11  FAILURE  EQU   *            1-S           Exit if not in table.

The inner loop has been duplicated; this avoids about half of the "i ← i + 1" instructions, so it reduces the running time to

    (3.5C − 3.5S + 10 + ((C − S) mod 2)/2) u.

We have saved 30 percent of the running time of Program S, when large tables are being searched; many existing programs can be improved in this way. The same ideas apply to programming in high-level languages. [See, for example, D. E. Knuth, Computing Surveys 6 (1974), 266-269.]
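The sentinel trick of Algorithm Q, and the speed-up principle behind it, carry over directly to high-level code. This Python sketch (ours, not from the text) appends the search argument as the dummy record K_{N+1}, so the inner loop tests a single condition:

```python
def quick_sequential_search(keys, target):
    """Algorithm Q: a sentinel removes the end-of-file test from the loop."""
    keys.append(target)          # Q1. K[N+1] <- K, the dummy record.
    i = 0
    while keys[i] != target:     # Q2/Q3: only one test in the inner loop.
        i += 1
    keys.pop()                   # Remove the sentinel again.
    return i if i < len(keys) else None   # Q4. [End of file?]

table = [61, 87, 15, 43, 92]
print(quick_sequential_search(table, 43))   # → 3
print(quick_sequential_search(table, 50))   # → None
```

The list is restored before returning; a production version might instead keep a permanent spare slot at the end of the table, as Program Q does with KEY+N+1.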

A slight variation of the algorithm is appropriate if we know that the keys are in increasing order:

Algorithm T (Sequential search in ordered table). Given a table of records R_1, R_2, ..., R_N whose keys are in increasing order K_1 < K_2 < ··· < K_N, this algorithm searches for a given argument K. For convenience and speed, the algorithm assumes that there is a dummy record R_{N+1} whose key value is K_{N+1} = ∞ > K.

T1. [Initialize.] Set i ← 1.

T2. [Compare.] If K ≤ K_i, go to T4.

T3. [Advance.] Increase i by 1 and return to T2.

T4. [Equality?] If K = K_i, the algorithm terminates successfully. Otherwise it terminates unsuccessfully. |

If all input keys are equally likely, this algorithm takes essentially the same average time as Algorithm Q, for a successful search. But unsuccessful searches are performed about twice as fast, since the absence of a record can be established more quickly.
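A Python rendering of Algorithm T (our own sketch; `math.inf` plays the role of the dummy key K_{N+1} = ∞) shows how an unsuccessful search can stop early:

```python
import math

def ordered_sequential_search(keys, target):
    """Algorithm T: keys must be in increasing order."""
    keys = keys + [math.inf]      # Dummy record with key infinity > target.
    i = 0                         # T1. [Initialize.]
    while target > keys[i]:       # T2. [Compare.] If K <= Ki, go to T4.
        i += 1                    # T3. [Advance.]
    return i if keys[i] == target else None   # T4. [Equality?]

print(ordered_sequential_search([15, 43, 61, 87, 92], 61))  # → 2
print(ordered_sequential_search([15, 43, 61, 87, 92], 50))  # → None
```

In the second call the loop stops as soon as it reaches 61 > 50, after only three comparisons, instead of scanning the whole table.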
Each of the algorithms above uses subscripts to denote the table entries. It is convenient to describe the methods in terms of these subscripts, but the same search procedures can be used for tables that have a linked representation, since the data is being traversed sequentially. (See exercises 2, 3, and 4.)

Frequency of access. So far

we have been assuming that every argument occurs as often as every other. This is not always a realistic assumption; in a general situation, key K_j will occur with probability p_j, where p_1 + p_2 + ··· + p_N = 1. The time required to do a successful search is essentially proportional to the number of comparisons, C, which now has the average value

    C̄_N = p_1 + 2p_2 + ··· + Np_N.    (3)

If we have the option of putting the records into the table in any desired order, this quantity C̄_N is smallest when

    p_1 ≥ p_2 ≥ ··· ≥ p_N,    (4)

that is, when the most frequently used records appear near the beginning.

Let's look at several probability distributions, in order to see how much of a saving is possible when the records are arranged in the optimal manner. If p_1 = p_2 = ··· = p_N = 1/N, formula (3) reduces to C̄_N = (N + 1)/2; we have already derived this in Eq. (2). Suppose, on the other hand, that

    p_1 = 1/2,  p_2 = 1/4,  ...,  p_{N−1} = 1/2^{N−1},  p_N = 1/2^{N−1}.    (5)

Then C̄_N = 2 − 2^{1−N}, by exercise 7; the average number of comparisons is less than two, for this distribution, if the records appear in the proper order within the table.

Another probability distribution that suggests itself is

    p_1 = Nc,  p_2 = (N − 1)c,  ...,  p_N = c,  where c = 2/(N(N + 1)).    (6)

This wedge-shaped distribution is not as dramatic a departure from uniformity as (5). In this case we find

    C̄_N = c Σ_{k=1}^{N} k(N + 1 − k) = (N + 2)/3;    (7)

the optimum arrangement saves about one-third of the search time that would have been obtained if the records had appeared in random order.
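Both (5) and (7) are easy to confirm numerically; the sketch below (our own, with N chosen arbitrarily) evaluates formula (3) directly:

```python
def avg_comparisons(p):
    """Formula (3): C_N = 1*p1 + 2*p2 + ... + N*pN."""
    return sum(j * pj for j, pj in enumerate(p, start=1))

N = 20

# Distribution (5): p_j = 2^-j, with the last probability doubled so the sum is 1.
p5 = [2.0**-j for j in range(1, N)] + [2.0**-(N - 1)]
print(avg_comparisons(p5), 2 - 2.0**(1 - N))    # both sides equal 2 - 2^(1-N)

# Wedge distribution (6): p_j = (N + 1 - j)*c with c = 2/(N(N+1)).
c = 2 / (N * (N + 1))
p6 = [(N + 1 - j) * c for j in range(1, N + 1)]
print(avg_comparisons(p6), (N + 2) / 3)         # both sides equal Eq. (7)
```

In each case the exact average agrees with the closed form, up to floating-point rounding.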


Of course the probability distributions in (5) and (6) are rather artificial, and they may never be a very good approximation to reality. A more typical sequence of probabilities, called "Zipf's law," has

    p_1 = c/1,  p_2 = c/2,  ...,  p_N = c/N,  where c = 1/H_N.    (8)

This distribution was popularized by G. K. Zipf, who observed that the nth most common word in natural language text seems to occur with a frequency approximately proportional to 1/n. [The Psycho-Biology of Language (Boston, Mass.: Houghton Mifflin, 1935); Human Behavior and the Principle of Least Effort (Reading, Mass.: Addison-Wesley, 1949).] He observed the same phenomenon in census tables, when metropolitan areas are ranked in order of decreasing population. If Zipf's law governs the frequency of the keys in a table, we have immediately

    C̄_N = N/H_N;    (9)

searching such a file is about (1/2) ln N times faster than searching the same file with randomly ordered records. [See A. D. Booth, L. Brandwood, and J. P. Cleave, Mechanical Resolution of Linguistic Problems (New York: Academic Press, 1958), 79.]

Another approximation to realistic distributions is the "80-20" rule of thumb that has commonly been observed in commercial applications [see, for example, W. P. Heising, IBM Systems J. 2 (1963), 114-115]. This rule states that 80 percent of the transactions deal with the most active 20 percent of a file; and the same rule applies in fractal fashion to the top 20 percent, so that 64 percent of the transactions deal with the most active 4 percent, etc. In other words,

    (p_1 + p_2 + ··· + p_{.20n}) / (p_1 + p_2 + p_3 + ··· + p_n) ≈ .80   for all n.    (10)

One distribution that satisfies this rule exactly whenever n is a multiple of 5 is

    p_1 = c,  p_2 = (2^θ − 1)c,  p_3 = (3^θ − 2^θ)c,  ...,  p_N = (N^θ − (N − 1)^θ)c,    (11)

where

    c = 1/N^θ,   θ = log .80 / log .20 ≈ 0.1386,    (12)

since p_1 + p_2 + ··· + p_n = cn^θ for all n in this case. It is not especially easy to work with the probabilities in (11); we have, however, n^θ − (n − 1)^θ = θn^{θ−1}(1 + O(1/n)), so there is a simpler distribution that approximately fulfills the 80-20 rule, namely

    p_1 = c/1^{1−θ},  p_2 = c/2^{1−θ},  ...,  p_N = c/N^{1−θ},  where c = 1/H_N^{(1−θ)}.    (13)

Here θ = log .80 / log .20 as before, and H_N^{(s)} is the Nth harmonic number of order s, namely 1^{−s} + 2^{−s} + ··· + N^{−s}. Notice that this probability distribution is very similar to that of Zipf's law (8); as θ varies from 1 to 0, the probabilities


vary from a uniform distribution to a Zipfian one. Applying (13) to (3) yields

    C̄_N = H_N^{(−θ)} / H_N^{(1−θ)} = θN/(θ + 1) + O(N^{1−θ}) ≈ 0.122N    (14)

as the mean number of comparisons for the 80-20 law (see exercise 8).

A study of word frequencies carried out by E. S. Schwartz [see the interesting graph on page 422 of JACM 10 (1963)] suggests that distribution (13) with a slightly negative value of θ gives a better fit to the data than Zipf's law (8). In this case the mean value C̄_N is substantially smaller than (9) as N → ∞.
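The closed forms (2), (9), and (14) can be checked numerically. The following Python sketch (our own, not part of the text) evaluates C̄_N = Σ j·p_j exactly for the uniform, Zipf (8), and approximate 80-20 (13) distributions; since the asymptotic value 0.122N in (14) is approached only slowly, the exact form H_N^{(−θ)}/H_N^{(1−θ)} is used for comparison:

```python
def avg_comparisons(p):
    """Formula (3): C_N = 1*p1 + 2*p2 + ... + N*pN, records in the given order."""
    return sum(j * pj for j, pj in enumerate(p, start=1))

N = 1000
theta = 0.1386                                          # log .80 / log .20, Eq. (12)

uniform = [1 / N] * N
H = sum(1 / j for j in range(1, N + 1))                 # H_N
zipf = [1 / (j * H) for j in range(1, N + 1)]           # Eq. (8)
Hs = sum(1 / j**(1 - theta) for j in range(1, N + 1))   # H_N^(1-theta)
law80_20 = [1 / (j**(1 - theta) * Hs) for j in range(1, N + 1)]  # Eq. (13)
Hminus = sum(j**theta for j in range(1, N + 1))         # H_N^(-theta)

print(avg_comparisons(uniform), (N + 1) / 2)      # Eq. (2): the two agree
print(avg_comparisons(zipf), N / H)               # Eq. (9): the two agree
print(avg_comparisons(law80_20), Hminus / Hs)     # Eq. (14), exact form
```

Both skewed distributions give substantially fewer comparisons than the uniform case, as the text predicts.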


Distributions like (11) and (13) were first studied by Vilfredo Pareto in connection with disparities of personal income and wealth [Cours d'Economie Politique 2 (Lausanne: Rouge, 1897), 304-312]. If p_k is proportional to the wealth of the kth richest individual, the probability that a person's wealth exceeds or equals x times the wealth of the poorest individual is k/N when x = p_k/p_N. Thus, when p_k = ck^{θ−1} and x = (k/N)^{θ−1}, the stated probability is x^{−1/(1−θ)}; this is now called a Pareto distribution with parameter 1/(1 − θ).

Curiously, Pareto didn't understand his own distribution; he believed that a value of θ near 0 would correspond to a more egalitarian society than a value near 1! His error was corrected by Corrado Gini [Atti della III Riunione della Societa Italiana per il Progresso delle Scienze (1910), reprinted in his Memorie di Metodologia Statistica 1 (Rome: 1955), 3-120], who was the first person to formulate and explain the significance of ratios like the 80-20 law (10). People still tend to misunderstand such distributions; they often speak about a "75-25 law" or a "90-10 law" as if an a-b law makes sense only when a + b = 100, while (12) shows that the sum 80 + 20 is quite irrelevant.
Another discrete distribution analogous to (11) and (13) was introduced by G. Udny Yule when he studied the increase in biological species as a function of time, assuming various models of evolution [Philos. Trans. B213 (1924), 21-87]. Yule's distribution applies when θ < 2:

    p_1 = c,  p_2 = c/(2 − θ),  p_3 = 2c/((3 − θ)(2 − θ)),  ...,
    p_N = (N − 1)! c / ((N − θ)(N − 1 − θ) ··· (2 − θ)) = c / (N−θ choose N−1);    (15)

    c = θ / ((1 − θ)((N−θ choose N)^{−1} − 1)).    (16)

The limiting value c = 1/H_N or c = 1/N is used when θ = 0 or θ = 1.

A "self-organizing" file. These calculations with probabilities are very nice, but in most cases we don't know what the probabilities are. We could keep a count in each record of how often it has been accessed, reallocating the records on the basis of those counts; the formulas derived above suggest that this procedure would often lead to a worthwhile savings. But we probably don't want to devote much memory space to the count fields, since we can make better use of that memory by using one of the nonsequential search techniques that are explained later in this chapter.

A simple scheme, which has been in use for many years although its origin is unknown, can be used to keep the records in a pretty good order without auxiliary count fields: Whenever a record has been successfully located, it is moved to the front of the table.

The idea behind this "self-organizing" technique is that the oft-used items will tend to be located fairly near the beginning of the table, when we need them. If we assume that the N keys occur with respective probabilities {p_1, p_2, ..., p_N}, with each search being completely independent of previous searches, it can be shown that the average number of comparisons needed to find an item in such a self-organizing table tends to the limiting value

    C̄_N = 1 + 2 Σ_{1≤i<j≤N} p_i p_j / (p_i + p_j).    (17)

For example, if p_j = 1/N for 1 ≤ j ≤ N, the self-organizing table is always in completely random order, and this formula reduces to the familiar expression (N + 1)/2 derived above. (See exercise 11.)

In general, the average number of comparisons (17) is always less than twice the optimal value (3), since p_i p_j/(p_i + p_j) ≤ p_j; hence

    C̄_N ≤ 1 + 2 Σ_{1≤i<j≤N} p_j = 1 + 2 Σ_{1≤j≤N} (j − 1) p_j = 2(p_1 + 2p_2 + ··· + Np_N) − 1.

In fact, it is always less than π/2 times the optimal value [Chung, Hajela, and Seymour, J. Comp. Syst. Sci. 36 (1988), 148-157]; this ratio is the best possible constant in general, since it is approached when p_j is proportional to 1/j².

Let us see how well the self-organizing procedure works when the key probabilities obey Zipf's law (8). We have

    C̄_N = 1 + 2 Σ_{1≤i<j≤N} (c/i)(c/j) / (c/i + c/j)
        = 1 + 2c Σ_{1≤i<j≤N} 1/(i + j)
        = 1 + 2c Σ_{1≤i≤N} (H_{N+i} − H_{2i})
        = ½ + c((2N + 1)H_{2N} − 2(N + 1)H_N)
        = ½ + c(N ln 4 − ln N + O(1)) ≈ 2N/lg N,    (18)

by Eqs. 1.2.7-(8) and 1.2.7-(3). This is substantially better than ½N, when N is reasonably large, and it is only about ln 4 ≈ 1.386 times as many comparisons as would be obtained in the optimum arrangement; see (9).

Computational experiments involving actual compiler symbol tables indicate that the self-organizing method works even better than our formulas predict, because successive searches are not independent (small groups of keys tend to occur in bunches).

This self-organizing scheme was first analyzed by John McCabe [Operations Research 13 (1965), 609-618], who established (17). McCabe also introduced



another interesting scheme, under which each successfully located key that is not already at the beginning of the table is simply interchanged with the preceding key, instead of being moved all the way to the front. He conjectured that the limiting average search time for this method, assuming independent searches, never exceeds (17). Several years later, Ronald L. Rivest proved in fact that the transposition method uses strictly fewer comparisons than the move-to-front method, in the long run, except of course when N ≤ 2 or when all the nonzero probabilities are equal [CACM 19 (1976), 63-67]. However, convergence to the asymptotic limit is much slower than for the move-to-front heuristic, so move-to-front is better unless the process is prolonged [J. R. Bitner, SICOMP 8 (1979), 82-110]. Moreover, J. L. Bentley, C. C. McGeoch, D. D. Sleator, and R. E. Tarjan have proved that the move-to-front method never makes more than four times the total number of memory accesses made by any algorithm on linear lists, given any sequence of accesses whatever to the data, even if the algorithm knows the future; the frequency-count and transposition methods do not have this property [CACM 28 (1985), 202-208, 404-411]. See SODA 8 (1997), 53-62, for an interesting empirical study of more than 40 heuristics for self-organizing lists, carried out by R. Bachrach and R. El-Yaniv.
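The move-to-front heuristic and its limit (17) are easy to simulate. This sketch is our own (the table size, trial count, and seed are arbitrary); it draws independent Zipf-distributed requests and compares the empirical average search cost with (17):

```python
import random

def mtf_limit(p):
    """Eq. (17): limiting average comparisons under move-to-front."""
    n = len(p)
    return 1 + 2 * sum(p[i] * p[j] / (p[i] + p[j])
                       for i in range(n) for j in range(i + 1, n))

def simulate_mtf(p, trials, rng):
    """Average cost of sequential search with move-to-front reordering."""
    table = list(range(len(p)))             # record numbers, initial order
    total = 0
    for _ in range(trials):
        k = rng.choices(range(len(p)), weights=p)[0]
        pos = table.index(k)                # pos + 1 key comparisons
        total += pos + 1
        table.insert(0, table.pop(pos))     # move the found record to the front
    return total / trials

H = sum(1 / j for j in range(1, 21))
zipf = [1 / (j * H) for j in range(1, 21)]  # Zipf's law (8) with N = 20
print(mtf_limit(zipf), simulate_mtf(zipf, 20000, random.Random(1)))
```

With uniform probabilities mtf_limit reduces to (N + 1)/2, as the text observes; for the Zipf case the simulated average settles near the limiting value.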



Tape searching

with unequal-length records. Now let’s give the problem
still another twist: Suppose the table
we are searching is stored on tape, and
the individual records have varying lengths. For
example, in an old-fashioned

operating system, the “system library tape” was such
a file; standard system
programs such as compilers, assemblers, loading routines, and
report generators
were the records on this tape, and most user jobs
would start by searching
down the tape until the appropriate routine had been input. This
setup makes
our previous analysis of Algorithm S inapplicable, since
step S3 takes a variable
amount of time each time we reach it. The number of comparisons
is therefore
not the only criterion of interest.
Let L, be the length of record R{ and let
p be the probability that this
record will be sought. The average running time of
the search method will now
be approximately proportional to
t

,

PiLi

+P 2 (L +L 2
1

)

-f








+p N (L +L 2 + L 3 +

~L n ).

1

1

(19)

Li = L 2 =
= L N = 1, this reduces to (3), the case already studied.
It seems logical to put the most frequently needed records at the beginning of the tape; but this is sometimes a bad idea! For example, assume that the tape contains just two programs, A and B, where A is needed twice as often as B but it is four times as long. Thus, N = 2 and

    p_A = 2/3,  L_A = 4;    p_B = 1/3,  L_B = 1.

If we place A first on tape, according to the "logical" principle stated above, the average running time is (2/3)*4 + (1/3)*5 = 13/3; but if we use an "illogical" idea, placing B first, the average running time is reduced to (1/3)*1 + (2/3)*5 = 11/3.

The optimum arrangement of programs on a library tape may be determined as follows.

SEARCHING

404

Theorem S. Let L_i and p_i be as defined above. The arrangement of records in the table is optimal if and only if

    p_1/L_1 ≥ p_2/L_2 ≥ ... ≥ p_N/L_N.    (20)

In other words, the minimum value of

    p_{a_1} L_{a_1} + p_{a_2}(L_{a_1} + L_{a_2}) + ... + p_{a_N}(L_{a_1} + ... + L_{a_N}),

over all permutations a_1 a_2 ... a_N of {1, 2, ..., N}, is equal to (19) if and only if (20) holds.

Proof. Suppose that R_i and R_{i+1} are interchanged on the tape; the cost (19) changes from

    ... + p_i(L_1 + ... + L_{i-1} + L_i) + p_{i+1}(L_1 + ... + L_{i+1}) + ...

to

    ... + p_{i+1}(L_1 + ... + L_{i-1} + L_{i+1}) + p_i(L_1 + ... + L_{i+1}) + ...,

a net change of p_i L_{i+1} - p_{i+1} L_i. Therefore if p_i/L_i < p_{i+1}/L_{i+1}, such an interchange will improve the average running time, and the given arrangement is not optimal. It follows that (20) holds in any optimal arrangement.

Conversely, assume that (20) holds; we need to prove that the arrangement is optimal. The argument just given shows that the arrangement is "locally optimal" in the sense that adjacent interchanges make no improvement; but there may conceivably be a long, complicated sequence of interchanges that leads to a better "global optimum." We shall consider two proofs, one that uses computer science and one that uses a mathematical trick.

First proof. Assume that (20) holds. We know that any permutation of the records can be sorted into the order R_1 R_2 ... R_N by using a sequence of interchanges of adjacent records. Each of these interchanges replaces ... R_j R_i ... by ... R_i R_j ... for some i < j, so it decreases the search time by the nonnegative amount p_i L_j - p_j L_i. Therefore the order R_1 R_2 ... R_N must have minimum search time.


Second proof. Replace each probability p_i by

    p_i(ε) = p_i + ε^i - (ε + ε^2 + ... + ε^N)/N,    (21)

where ε is an extremely small positive number. When ε is sufficiently small, we will never have x_1 p_1(ε) + ... + x_N p_N(ε) = y_1 p_1(ε) + ... + y_N p_N(ε) unless x_1 = y_1, ..., x_N = y_N; in particular, equality will not hold in (20). Consider now the N! permutations of the records; at least one of them is optimum, and we know that it satisfies (20). But only one permutation satisfies (20), because there are no equalities. Therefore (20) uniquely characterizes the optimum arrangement of records in the table for the probabilities p_i(ε), whenever ε is sufficiently small. By continuity, the same arrangement must also be optimum when ε is set equal to zero. (This "tie-breaking" type of proof is often useful in connection with combinatorial optimization.)  |



SEQUENTIAL SEARCHING

6.1

Theorem S

is

due to

E. Smith, Naval Research Logistics Quarterly 3
exercises below contain further results about optimum file

The

(1956), 59-66.

405

W.

arrangements.
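Theorem S gives an immediate recipe: sort the records by decreasing ratio p_i/L_i. A Python sketch (the function names are ours), checked against brute force over all permutations of a small example:

```python
from fractions import Fraction
from itertools import permutations

def cost(records):
    """Formula (19) for records given as (p, L) pairs in order."""
    total = prefix = 0
    for p, L in records:
        prefix += L
        total += p * prefix
    return total

def optimal_order(records):
    """Theorem S: arrange so that p_1/L_1 >= p_2/L_2 >= ... >= p_N/L_N."""
    return sorted(records, key=lambda r: r[0] / r[1], reverse=True)

recs = [(Fraction(1, 2), 3), (Fraction(1, 3), 1), (Fraction(1, 6), 2)]
best = min(cost(p) for p in permutations(recs))
assert cost(optimal_order(recs)) == best
```

Exhaustive checking like this is of course only feasible for tiny N; the theorem is what guarantees the sort is optimal in general.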

EXERCISES

1. [M20] When the search keys are equally probable, what is the standard deviation of the number of comparisons made in a successful sequential search through a table of N records?

2. [15] Restate the steps of Algorithm S, using linked-memory notation instead of subscript notation. (If P points to a record in the table, assume that KEY(P) is the key, INFO(P) is the associated information, and LINK(P) is a pointer to the next record. Assume also that FIRST points to the first record, and that the last record points to Λ.)
3. [16] Write a MIX program for the algorithm of exercise 2. What is the running time of your program, in terms of the quantities C and S in (1)?

4. [17] Does the idea of Algorithm Q carry over from subscript notation to linked-memory notation? (See exercise 2.)

5. [20] Program Q' is, of course, noticeably faster than Program Q, when C is large. But are there any small values of C and S for which Program Q' actually takes more time than Program Q?

6. [20] Add three more instructions to Program Q', reducing its running time to about (3.33C + constant)u.

7. [M20] Evaluate the average number of comparisons, (3), using the "binary" probability distribution (5).


8. [HM22] Find an asymptotic series for H_N^{(θ)} as N → ∞, when θ ≠ 1.

9. [HM28] The text observes that the probability distributions given by (11), (13), and (16) are roughly equivalent when 0 < θ < 1, and that the mean number of comparisons using (13) is (1-θ)N/(2-θ) + O(N^{1-θ}).
a) Is the mean number of comparisons equal to (1-θ)N/(2-θ) + O(N^{1-θ}) also when the probabilities of (11) are used?
b) What about (16)?
c) How do (11) and (16) compare to (13) when θ < 0?

10. [M20] The best arrangement of records in a sequential table is specified by (4); what is the worst arrangement? Show that the average number of comparisons in the worst arrangement has a simple relation to the average number of comparisons in the best arrangement.

11. [M30] The purpose of this exercise is to analyze the limiting behavior of a self-organizing file with the move-to-front heuristic. First we need to define some notation: Let f_m(x_1, x_2, ..., x_m) be the infinite sum of all distinct ordered products x_{t_1} x_{t_2} ... x_{t_k} such that 1 ≤ t_i ≤ m, where each of x_1, x_2, ..., x_m appears in every term. For example,

    f_2(x, y) = Σ_{j,k≥0} (x^{1+j} y (x+y)^k + y^{1+j} x (x+y)^k) = (xy/(1-x-y)) (1/(1-x) + 1/(1-y)).

Given a set of n variables {x_1, ..., x_n}, let

    P_{nm} = Σ f_m(x_{j_1}, ..., x_{j_m}),    Q_{nm} = Σ 1/(1 - x_{j_1} - ... - x_{j_m}),

where the sums are over all ways to choose m distinct indices j_1 < ... < j_m. For example, P_32 = f_2(x_1, x_2) + f_2(x_1, x_3) + f_2(x_2, x_3) and Q_32 = 1/(1 - x_1 - x_2) + 1/(1 - x_1 - x_3) + 1/(1 - x_2 - x_3). By convention we set P_{n0} = Q_{n0} = 1.


Assume that the text's self-organizing file has been servicing requests for item R_i with probability p_i.

a) After the system has been running a long time, show that R_i will be the mth item from the front with limiting probability p_i P_{(N-1)(m-1)}, where the set of variables is {p_1, ..., p_{i-1}, p_{i+1}, ..., p_N}.

b) By summing the result of (a) for m = 1, 2, ..., we obtain the identity P_{nn} + P_{n(n-1)} + ... + P_{n0} = Q_{nn}. Prove that, consequently,

    P_{nm} + (n-m+1 choose 1) P_{n(m-1)} + ... + (n choose m) P_{n0} = Q_{nm},
    Q_{nm} - (n-m+1 choose 1) Q_{n(m-1)} + ... + (-1)^m (n choose m) Q_{n0} = P_{nm}.

c) Compute the limiting average distance d_i = Σ_{m≥1} m p_i P_{(N-1)(m-1)} of R_i from the front of the list; then evaluate C_N = Σ_{i=1}^{N} p_i d_i.

12. [M23] Use (17) to evaluate the average number of comparisons needed to search the self-organizing file when the search keys have the binary probability distribution (5).

13. [M27] Use (17) to evaluate C_N for the wedge-shaped probability distribution (6).

14. [M21] Given two sequences (x_1, x_2, ..., x_n) and (y_1, y_2, ..., y_n) of real numbers, what permutation a_1 a_2 ... a_n of the subscripts will make Σ_i x_i y_{a_i} a maximum? What permutation will make it a minimum?

15. [M22] The text shows how to arrange programs optimally on a system library tape, when only one program is being sought. But another set of assumptions is more appropriate for a subroutine library tape, from which we may wish to load various subroutines called for in a user's program. For this case let us suppose that subroutine j is desired with probability P_j, independently of whether or not other subroutines are desired. Then, for example, the probability that no subroutines at all are needed is (1 - P_1)(1 - P_2) ... (1 - P_N); and the probability that the search will end just after loading the jth subroutine is P_j (1 - P_{j+1}) ... (1 - P_N). If L_j is the length of subroutine j, the average search time will therefore be essentially proportional to

    L_1 P_1 (1 - P_2) ... (1 - P_N) + (L_1 + L_2) P_2 (1 - P_3) ... (1 - P_N) + ... + (L_1 + ... + L_N) P_N.

What is the optimum arrangement of subroutines on the tape, under these assumptions?

16. [M22] (H. Riesel.) We often need to test whether or not n given conditions are all simultaneously true. (For example, we may want to test whether both x > 0 and y < z^2, and it is not immediately clear which condition should be tested first.) Suppose that the testing of condition j costs T_j units of time, and that the condition will be true with probability p_j, independent of the outcomes of all the other conditions. In what order should we make the tests?


Fig. 2. An "organ-pipe arrangement" of probabilities minimizes the average seek time in a catenated search.

17. [M23] (J. R. Jackson.) Suppose you have to do n jobs; the jth job takes T_j units of time, and it has a deadline D_j. In other words, the jth job is supposed to be finished after at most D_j units of time have elapsed. What schedule a_1 a_2 ... a_n for processing the jobs will minimize the maximum tardiness, namely

    max(T_{a_1} - D_{a_1}, T_{a_1} + T_{a_2} - D_{a_2}, ..., T_{a_1} + T_{a_2} + ... + T_{a_n} - D_{a_n})?

18. [M30] (Catenated search.) Suppose that N records are located in a linear array R_1 ... R_N, with probability p_j that record R_j will be sought. A search process is called "catenated" if each search begins where the last one left off. If consecutive searches are independent, the average time required will be Σ_{i,j} p_i p_j d(i,j), where d(i,j) represents the amount of time to do a search that starts at position i and ends at position j. This model can be applied, for example, to disk file seek time, if d(i,j) is the time needed to travel from cylinder i to cylinder j.

The object of this exercise is to characterize the optimum placement of records for catenated searches, whenever d(i,j) is an increasing function of |i - j|, that is, whenever we have d(i,j) = d_{|i-j|} for d_1 ≤ d_2 ≤ ... ≤ d_{N-1}. (The value of d_0 is irrelevant.) Prove that in this case the records are optimally placed, among all N! permutations, if and only if either p_1 ≤ p_N ≤ p_2 ≤ p_{N-1} ≤ ... ≤ p_{⌊N/2⌋+1} or p_N ≤ p_1 ≤ p_{N-1} ≤ p_2 ≤ ... ≤ p_{⌈N/2⌉}. (Thus, an "organ-pipe arrangement" of probabilities is best, as shown in Fig. 2.)

Hint: Consider any arrangement where the respective probabilities are q_1 q_2 ... q_k s r_k ... r_2 r_1 t_1 t_2 ... t_m, where m ≥ 0, k > 0, and N = 2k + m + 1. Show that the rearrangement q'_1 q'_2 ... q'_k s r'_k ... r'_2 r'_1 t_1 t_2 ... t_m, with q'_i = max(q_i, r_i) and r'_i = min(q_i, r_i) for all i, is better; the same is true when q'_1 = r_1 and r'_1 = q_1, and whether or not s is present and N = 2k + m.

19. [M20] Continuing exercise 18, what are the optimal arrangements for catenated searches when the function d(i,j) has the property that d(i,j) = a + b(L_{i+1} + ... + L_j) and d(j,i) = a + b(L_{j+1} + ... + L_N) + r + b(L_1 + ... + L_i) for all i < j? [This situation occurs, for example, on tapes without read-backwards capability, when we do not know the appropriate direction to search; r is the rewind time.]

20. [M28] Continuing exercise 18, what are the optimal arrangements for catenated searches when the function d(i,j) is min(d_{|i-j|}, d_{N-|i-j|}), for d_1 ≤ d_2 ≤ ... ? [This situation occurs, for example, in a two-way linked circular list, or in a two-way shift-register storage device.]


21. [M28] Consider an n-dimensional cube whose vertices have coordinates (d_1, ..., d_n) with d_j = 0 or 1; two vertices are called adjacent if they differ in exactly one coordinate. Suppose that a set of 2^n numbers x_0 ≤ x_1 ≤ ... ≤ x_{2^n - 1} is to be assigned to the 2^n vertices in such a way that Σ_{i,j} |x_i - x_j| is minimized, where the sum is over all i and j such that x_i and x_j have been assigned to adjacent vertices. Prove that this minimum will be achieved if, for all j, x_j is assigned to the vertex whose coordinates are the binary representation of j.

22. [20] Suppose you want to search a large file, not for equality but to find the 1000 records that are closest to a given key, in the sense that these 1000 records have the smallest values of d(K_j, K) for some given distance function d. What data structure is most appropriate for such a sequential search?

Attempt the end, and never stand to doubt;
Nothing's so hard, but search will find it out.

— ROBERT HERRICK,

Seeke and finde (1648)


6.2. SEARCHING BY COMPARISON OF KEYS

In this section we shall discuss search methods that are based on a linear ordering of the keys, such as alphabetic order or numeric order. After comparing the given argument K to a key K_i in the table, the search continues in three different ways, depending on whether K < K_i, K = K_i, or K > K_i. The sequential search methods of Section 6.1 were essentially limited to a two-way decision (K = K_i versus K ≠ K_i), but if we free ourselves from the restriction of sequential access we are able to make effective use of an order relation.

6.2.1. Searching an Ordered Table

What would you do if someone handed you a large telephone directory and told you to find the name of the person whose number is 795-6841? There is no better way to tackle this problem than to use the sequential methods of Section 6.1. (Well, you might try to dial the number and talk to the person who answers; or you might know how to obtain a special directory that is sorted by number instead of by name.) The point is that it is much easier to find an entry by the party's name, instead of by number, although the telephone directory contains all the information necessary in both cases. When a large file must be searched, sequential scanning is almost out of the question, but an ordering relation simplifies the job enormously.

With so many sorting methods at our disposal (Chapter 5), we will have little difficulty rearranging a file into order so that it may be searched conveniently. Of course, if we need to search the table only once, a sequential search would be faster than to do a complete sort of the file; but if we need to make repeated searches in the same file, we are better off having it in order. Therefore in this section we shall concentrate on methods that are appropriate for searching a table whose keys satisfy

    K_1 < K_2 < ... < K_N,

assuming that we can easily access the key in any given position. After comparing K to K_i in such a table, we have either

    K < K_i   [R_i, R_{i+1}, ..., R_N are eliminated from consideration];
    K = K_i   [the search is done];
    K > K_i   [R_1, R_2, ..., R_i are eliminated from consideration].

In each of these three cases, substantial progress has been made, unless i is near one of the ends of the table; this is why the ordering leads to an efficient algorithm.

Binary search. Perhaps the first such method that suggests itself is to start by comparing K to the middle key in the table; the result of this probe tells which half of the table should be searched next, and the same procedure can be used again, comparing K to the middle key of the selected half, etc. After at most about lg N comparisons, we will have found the key or we will have established that it is not present. This procedure is sometimes known as "logarithmic search" or "bisection," but it is most commonly called binary search.

Fig. 3. Binary search.

Although the basic idea of binary search is comparatively straightforward, the details can be surprisingly tricky, and many good programmers have done it wrong the first few times they tried. One of the most popular correct forms of the algorithm makes use of two pointers, l and u, that indicate the current lower and upper limits for the search, as follows:

Algorithm B (Binary search). Given a table of records R_1, R_2, ..., R_N whose keys are in increasing order K_1 < K_2 < ... < K_N, this algorithm searches for a given argument K.

B1. [Initialize.] Set l ← 1, u ← N.

B2. [Get midpoint.] (At this point we know that if K is in the table, it satisfies K_l ≤ K ≤ K_u. A more precise statement of the situation appears in exercise 1 below.) If u < l, the algorithm terminates unsuccessfully. Otherwise, set i ← ⌊(l + u)/2⌋, the approximate midpoint of the relevant table area.

B3. [Compare.] If K < K_i, go to B4; if K > K_i, go to B5; and if K = K_i, the algorithm terminates successfully.

B4. [Adjust u.] Set u ← i - 1 and return to B2.

B5. [Adjust l.] Set l ← i + 1 and return to B2.  |
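In a high-level language the same algorithm looks like this (a Python sketch with 0-origin indexing, so the subscripts are shifted by one from the text's):

```python
def binary_search(keys, K):
    """Algorithm B on a sorted list; returns an index i with
    keys[i] == K, or None for an unsuccessful search."""
    l, u = 0, len(keys) - 1            # B1. Initialize.
    while l <= u:                      # B2. Terminate if u < l.
        i = (l + u) // 2               # B2. Get midpoint.
        if K < keys[i]:                # B3. Compare.
            u = i - 1                  # B4. Adjust u.
        elif K > keys[i]:
            l = i + 1                  # B5. Adjust l.
        else:
            return i                   # Successful search.
    return None
```

On the 16-key table of Fig. 4, this routine finds 653 after probing 509, 677, 612, 653, and reports 400 absent after probing 509, 170, 426, 275: four comparisons in each case.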

Figure 4 illustrates two cases of this binary search algorithm: first to search for the argument 653, which is present in the table, and then to search for 400, which is absent. The brackets indicate l and u, and the underlined key represents K_i. In both examples the search terminates after making four comparisons.

a) Searching for 653:
[061 087 154 170 275 426 503 509 512 612 653 677 703 765 897 908]
061 087 154 170 275 426 503 509 [512 612 653 677 703 765 897 908]
061 087 154 170 275 426 503 509 [512 612 653] 677 703 765 897 908
061 087 154 170 275 426 503 509 512 612 [653] 677 703 765 897 908

b) Searching for 400:
[061 087 154 170 275 426 503 509 512 612 653 677 703 765 897 908]
[061 087 154 170 275 426 503] 509 512 612 653 677 703 765 897 908
061 087 154 170 [275 426 503] 509 512 612 653 677 703 765 897 908
061 087 154 170 [275] 426 503 509 512 612 653 677 703 765 897 908
061 087 154 170 275] [426 503 509 512 612 653 677 703 765 897 908

Fig. 4. Examples of binary search.

Program B (Binary search). As in the programs of Section 6.1, we assume here that K_i is a full-word key appearing in location KEY + i. The following code uses rI1 ≡ l, rI2 ≡ u, rI3 ≡ i.

01  START ENT1 1         1       B1. Initialize. l ← 1.
02        ENT2 N         1       u ← N.
03        JMP  2F        1       To B2.
04  5H    JE   SUCCESS   C1      Jump if K = K_i.
05        ENT1 1,3       C1-S    B5. Adjust l. l ← i + 1.
06  2H    ENTA 0,1       C+1-S   B2. Get midpoint.
07        INCA 0,2       C+1-S   rA ← l + u.
08        SRB  1         C+1-S   rA ← ⌊rA/2⌋. (rX changes too.)
09        STA  TEMP      C+1-S
10        CMP1 TEMP      C+1-S
11        JG   FAILURE   C+1-S   Jump if u < l.
12        LD3  TEMP      C       i ← midpoint.
13  3H    LDA  K         C
14        CMPA KEY,3     C       B3. Compare.
15        JGE  5B        C       Jump if K ≥ K_i.
16        ENT2 -1,3      C2      B4. Adjust u. u ← i - 1.
17        JMP  2B        C2      To B2.  |

This procedure doesn't blend with MIX quite as smoothly as the other algorithms we have seen, because MIX does not allow much arithmetic in index registers. The running time is (18C - 10S + 12)u, where C = C1 + C2 is the number of comparisons made (the number of times step B3 is performed), and S = [outcome is successful]. The operation on line 08 of this program is "shift right binary 1," which is legitimate only on binary versions of MIX; for general byte size, this instruction should be replaced by "MUL =1//2+1=", increasing the running time to (26C - 18S + 20)u.

A tree representation. In order to really understand what is happening in Algorithm B, our best bet is to think of the procedure as a binary decision tree, as shown in Fig. 5 for the case N = 16.

Fig. 5. A comparison tree that corresponds to binary search when N = 16.

When N is 16, the first comparison made by the algorithm is K : K_8; this is represented by the root node (8) in the figure. Then if K < K_8, the algorithm follows the left subtree, comparing K to K_4; similarly if K > K_8, the right subtree is used. An unsuccessful search will lead to one of the external square nodes numbered 0 through 16; for example, we reach node 6 if and only if K_6 < K < K_7.


The binary tree corresponding to a binary search on N records can be constructed as follows: If N = 0, the tree is simply a single external node. Otherwise the root node is ⌈N/2⌉, the left subtree is the corresponding binary tree with ⌈N/2⌉ - 1 nodes, and the right subtree is the corresponding binary tree with ⌊N/2⌋ nodes and with all node numbers increased by ⌈N/2⌉.

In an analogous fashion, any algorithm for searching an ordered table of length N by means of comparisons can be represented as an N-node binary tree in which the nodes are labeled with the numbers 1 to N (unless the algorithm makes redundant comparisons). Conversely, any binary tree corresponds to a valid method for searching an ordered table; we simply label the nodes

    [small binary tree diagram]    (1)

in symmetric order, from left to right.

If the search argument input to Algorithm B is K_10, the algorithm makes the comparisons K > K_8, K < K_12, K = K_10. This corresponds to the path from the root to (10) in Fig. 5. Similarly, the behavior of Algorithm B on other keys corresponds to the other paths leading from the root of the tree. The method of constructing the binary trees corresponding to Algorithm B therefore makes it easy to prove the following result by induction on N:
Theorem B. If 2^{k-1} ≤ N < 2^k, a successful search using Algorithm B requires (min 1, max k) comparisons. If N = 2^k - 1, an unsuccessful search requires k comparisons; and if 2^{k-1} ≤ N < 2^k - 1, an unsuccessful search requires either k - 1 or k comparisons.  |

Further analysis of binary search. (Nonmathematical readers should skip to Eq. (4).) The tree representation shows us also how to compute the average number of comparisons in a simple way. Let C_N be the average number of comparisons in a successful search, assuming that each of the N keys is an equally likely argument; and let C'_N be the average number of comparisons in an unsuccessful search, assuming that each of the N + 1 intervals between and outside the extreme values of the keys is equally likely. Then we have

    C_N = 1 + (internal path length of tree)/N,    C'_N = (external path length of tree)/(N + 1),

by the definition of internal and external path length. We saw in Eq. 2.3.4.5-(3) that the external path length is always 2N more than the internal path length. Hence there is a rather unexpected relationship between C_N and C'_N:

    C_N = (1 + 1/N) C'_N - 1.    (2)

This formula, which is due to T. N. Hibbard [JACM 9 (1962), 16-17], holds for all search methods that correspond to binary trees; in other words, it holds for all methods that are based on nonredundant comparisons. The variance of successful-search comparisons can also be expressed in terms of the corresponding variance for unsuccessful searches (see exercise 25).

From the formulas above we can see that the "best" way to search by comparisons is one whose tree has minimum external path length, over all binary trees with N internal nodes. Fortunately it can be proved that Algorithm B is optimum in this sense, for all N; for we have seen (exercise 5.3.1-20) that a binary tree has minimum path length if and only if its external nodes all occur on at most two adjacent levels. It follows that the external path length of the tree corresponding to Algorithm B is

    (N + 1)(⌊lg N⌋ + 2) - 2^{⌊lg N⌋+1}.    (3)

(See Eq. 5.3.1-(34).) From this formula and (2) we can compute the exact average number of comparisons, assuming that all search arguments are equally probable.

N    =    1      2      3      4      5      6      7      8
C_N  =    1    1 1/2  1 2/3    2    2 1/5  2 1/3  2 3/7  2 5/8
C'_N =    1    1 2/3    2    2 2/5  2 2/3  2 6/7    3    3 2/9

N    =    9     10     11     12     13     14     15     16
C_N  =  2 7/9  2 9/10   3    3 1/12 3 2/13 3 3/14 3 4/15  3 3/8
C'_N =  3 2/5  3 6/11  3 2/3 3 10/13 3 6/7 3 14/15   4   4 2/17

In general, if k = ⌊lg N⌋, we have

    C_N  = k + 1 - (2^{k+1} - k - 2)/N = lg N - 1 + ε + (k + 2)/N,
    C'_N = k + 2 - 2^{k+1}/(N + 1) = lg(N + 1) + ε',

where 0 ≤ ε, ε' < 0.0861; see Eq. 5.3.1-(35).


To summarize: Algorithm B never makes more than ⌊lg N⌋ + 1 comparisons, and it makes about lg N - 1 comparisons in an average successful search. No search method based on comparisons can do better than this. The average running time of Program B is approximately

    (18 lg N - 16)u for a successful search,
    (18 lg N + 12)u for an unsuccessful search,

if we assume that all outcomes of the search are equally likely.

An important variation. Instead of using three pointers l, i, and u in the search, it is tempting to use only two, namely the current position i and its rate of change, δ; after each unequal comparison, we could then set i ← i ± δ and δ ← δ/2 (approximately). It is possible to do this, but only if extreme care is paid to the details, as in the following algorithm. Simpler approaches are doomed to failure!

Algorithm U (Uniform binary search). Given a table of records R_1, R_2, ..., R_N whose keys are in increasing order K_1 < K_2 < ... < K_N, this algorithm searches for a given argument K. If N is even, the algorithm will sometimes refer to a dummy key K_0 that should be set to -∞ (or any value less than K). We assume that N ≥ 1.

U1. [Initialize.] Set i ← ⌈N/2⌉, m ← ⌊N/2⌋.

U2. [Compare.] If K < K_i, go to U3; if K > K_i, go to U4; and if K = K_i, the algorithm terminates successfully.

U3. [Decrease i.] (We have pinpointed the search to an interval that contains either m or m - 1 records; i points just to the right of this interval.) If m = 0, the algorithm terminates unsuccessfully. Otherwise set i ← i - ⌈m/2⌉; then set m ← ⌊m/2⌋ and return to U2.

U4. [Increase i.] (We have pinpointed the search to an interval that contains either m or m - 1 records; i points just to the left of this interval.) If m = 0, the algorithm terminates unsuccessfully. Otherwise set i ← i + ⌈m/2⌉; then set m ← ⌊m/2⌋ and return to U2.  |
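Algorithm U transcribes directly; here is a Python sketch (the function name is ours) using a 1-origin table whose slot 0 holds the dummy key K_0 = -∞, consulted only when N is even:

```python
import math

def uniform_search(keys, K):
    """Algorithm U on keys[1..N]; keys[0] must hold -infinity."""
    N = len(keys) - 1
    i, m = (N + 1) // 2, N // 2        # U1. Initialize.
    while True:
        if K < keys[i]:                # U2. Compare.
            if m == 0:
                return None            # U3. Unsuccessful.
            i -= (m + 1) // 2          # i <- i - ceil(m/2)
            m //= 2
        elif K > keys[i]:
            if m == 0:
                return None            # U4. Unsuccessful.
            i += (m + 1) // 2          # i <- i + ceil(m/2)
            m //= 2
        else:
            return i                   # Successful search.

keys = [-math.inf] + list(range(1, 11))    # N = 10, as in Fig. 6
```

Searching this N = 10 table for the absent argument 0 is exactly the case where the dummy key gets probed: the position walks 5, 2, 1, 0 before the search fails.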

Figure 6 shows the corresponding binary tree for the search, when N = 10. In an unsuccessful search, the algorithm may make a redundant comparison just before termination; those nodes are shaded in the figure. We may call the search process uniform because the difference between the number of a node on level l and the number of its ancestor on level l - 1 has a constant value δ for all nodes on level l.

Fig. 6. The comparison tree for a "uniform" binary search, when N = 10.

The theory underlying Algorithm U can be understood as follows: Suppose that we have an interval of length n - 1 to search; a comparison with the middle element (for n even) or with one of the two middle elements (for n odd) leaves us with two intervals of lengths ⌊n/2⌋ - 1 and ⌈n/2⌉ - 1. After repeating this process k times, we obtain 2^k intervals, of which the smallest has length ⌊n/2^k⌋ - 1 and the largest has length ⌈n/2^k⌉ - 1. Hence the lengths of two intervals at the same level differ by at most unity; this makes it possible to choose an appropriate "middle" element, without keeping track of the exact lengths.

The principal advantage of Algorithm U is that we need not maintain the value of m at all; we need only refer to a short table of the various δ to use at each level of the tree. Thus the algorithm reduces to the following procedure, which is equally good on binary or decimal computers:

Algorithm C (Uniform binary search). This algorithm is just like Algorithm U, but it uses an auxiliary table in place of the calculations involving m. The table entries are

    DELTA[j] = ⌊(N + 2^{j-1})/2^j⌋,  for 1 ≤ j ≤ ⌊lg N⌋ + 2.    (6)

C1. [Initialize.] Set i ← DELTA[1], j ← 2.

C2. [Compare.] If K < K_i, go to C3; if K > K_i, go to C4; and if K = K_i, the algorithm terminates successfully.

C3. [Decrease i.] If DELTA[j] = 0, the algorithm terminates unsuccessfully. Otherwise, set i ← i - DELTA[j], j ← j + 1, and go to C2.

C4. [Increase i.] If DELTA[j] = 0, the algorithm terminates unsuccessfully. Otherwise, set i ← i + DELTA[j], j ← j + 1, and go to C2.  |
Exercise 8 proves that this algorithm refers to the artificial key K_0 = -∞ only when N is even.

Program C (Uniform binary search). This program does the same job as Program B, using Algorithm C with rA ≡ K, rI1 ≡ i, rI2 ≡ j, rI3 ≡ DELTA[j].

01  START   ENT1 N+1/2     1       C1. Initialize. i ← (N + 1)/2.
02          ENT2 2         1       j ← 2.
03          LDA  K         1
04          JMP  2F        1
05  3H      JE   SUCCESS   C1      Jump if K = K_i.
06          J3Z  FAILURE   C1-S    Jump if DELTA[j] = 0.
07          DEC1 0,3       C1-S-A  C3. Decrease i.
08  5H      INC2 1         C-1     j ← j + 1.
09  2H      LD3  DELTA,2   C       rI3 ← DELTA[j].
10          CMPA KEY,1     C       C2. Compare.
11          JLE  3B        C       Jump if K ≤ K_i.
12          INC1 0,3       C2      C4. Increase i.
13          J3NZ 5B        C2      Jump if DELTA[j] ≠ 0.
14  FAILURE EQU  *         1-S     Exit if not in table.  |

In a successful search, this algorithm corresponds to a binary tree with the same internal path length as the tree of Algorithm B, so the average number of comparisons C_N is the same as before. In an unsuccessful search, Algorithm C always makes exactly ⌊lg N⌋ + 1 comparisons. The total running time of Program C is not quite symmetrical between left and right branches, since C1 is weighted more heavily than C2, but exercise 11 shows that we have K < K_i roughly as often as K > K_i; hence Program C takes approximately

    (8.5 lg N - 6)u for a successful search,
    (8.5 ⌊lg N⌋ + 12)u for an unsuccessful search.

This is more than twice as fast as Program B, without using any special properties of binary computers, even though the running times (5) for Program B assume that MIX has a "shift right binary" instruction.

Another modification of binary search, suggested in 1971 by L. E. Shar, will be still faster on some computers, because it is uniform after the first step, and it requires no table. The first step is to compare K with K_i, where i = 2^k, k = ⌊lg N⌋. If K < K_i, we use a uniform search with the δ's equal to 2^{k-1}, 2^{k-2}, ..., 1, 0. On the other hand, if K > K_i, we reset i to i' = N + 1 - 2^l, where l = ⌈lg(N - 2^k + 1)⌉, and pretend that the first comparison was actually K > K_{i'}, using a uniform search with the δ's equal to 2^{l-1}, 2^{l-2}, ..., 1, 0.

Shar's method is illustrated for N = 10 in Fig. 7. Like the previous algorithms, it never makes more than ⌊lg N⌋ + 1 comparisons; hence it makes at most one more than the minimum possible average number of comparisons, in spite of the fact that it occasionally goes through several redundant steps in succession (see exercise 12).
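A Python sketch of Shar's variant (the function name is ours; the two power-of-two step sequences follow the description above):

```python
def shar_search(keys, K):
    """Shar's binary search on keys[1..N] (keys[0] is unused).
    One irregular first probe at 2**k; every later step size is a
    fixed power of two, so no midpoint arithmetic is needed."""
    N = len(keys) - 1
    k = N.bit_length() - 1                 # k = floor(lg N)
    i = 2 ** k
    if K == keys[i]:
        return i
    if K < keys[i]:                        # deltas 2**(k-1), ..., 1, 0
        step, sign = (2 ** (k - 1) if k > 0 else 0), -1
    else:                                  # pretend first probe was at i'
        l = (N - 2 ** k).bit_length()      # l = ceil(lg(N - 2**k + 1))
        i = N + 1 - 2 ** l
        step, sign = (2 ** (l - 1) if l > 0 else 0), +1
    while True:
        if step == 0:
            return None                    # unsuccessful search
        i += sign * step                   # take the current delta
        step //= 2                         # deltas halve at every level
        if K == keys[i]:
            return i
        sign = -1 if K < keys[i] else +1
```

For N = 10 the first probe is at position 8; a search for 9 then jumps to i' = 7 and probes position 9 directly, just as in the description above.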


×