FILE COMPRESSION
the message: this means that we need to save the tree along with the message
in order to decode it. Fortunately, this does not present any real difficulty.
It is actually necessary only to store the code array, because the radix search
trie which results from inserting the entries from that array into an initially
empty tree is the decoding tree.
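To make this concrete, here is a small sketch (in Python, purely for illustration; the trie representation and the tiny three-character code are invented for the example) that rebuilds a decoding trie from the code and len arrays referred to in the text and then uses it to decode a string of bits:

def build_decoding_trie(code, length):
    # Insert each character's code, most significant bit first, into an
    # initially empty binary trie; each leaf holds the decoded character.
    root = {}
    for ch in code:
        node = root
        for i in range(length[ch] - 1, -1, -1):
            bit = (code[ch] >> i) & 1
            node = node.setdefault(bit, {})
        node['char'] = ch
    return root

def decode(bits, trie):
    # Follow bits down the trie; emit a character at each leaf and restart.
    out, node = [], trie
    for b in bits:
        node = node[int(b)]
        if 'char' in node:
            out.append(node['char'])
            node = trie
    return ''.join(out)

# A toy prefix-free code for three characters: A = 0, B = 10, blank = 11.
code = {'A': 0b0, 'B': 0b10, ' ': 0b11}
length = {'A': 1, 'B': 2, ' ': 2}
print(decode('010110', build_decoding_trie(code, length)))   # prints "AB A"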
Thus, the storage savings quoted above are not entirely accurate, because
the message can’t be decoded without the trie and we must take into account
the cost of storing the trie (i.e., the code array) along with the message.
Huffman encoding is therefore only effective for long files where the savings in
the message is enough to offset the cost, or in situations where the coding trie
can be precomputed and used for a large number of messages. For example, a
trie based on the frequencies of occurrence of letters in the English language
could be used for text documents. For that matter, a trie based on the
frequency of occurrence of characters in Pascal programs could be used for
encoding programs (for example,
“;”
is likely to be near the top of such a
trie). A Huffman encoding algorithm saves about 23% when run on the text
for this chapter.
As before, for truly random files, even this clever encoding scheme won’t
work because each character will occur approximately the same number of
times, which will lead to a fully balanced coding tree and an equal number of
bits per letter in the code.
Exercises

1. Implement compression and expansion procedures for the run-length encoding method for a fixed alphabet described in the text, using Q as the escape character.

2. Could “QQ” occur somewhere in a file compressed using the method described in the text? Could “QQ&” occur?

3. Implement compression and expansion procedures for the binary file encoding method described in the text.

4. The letter “q” given in the text can be processed as a sequence of five-bit characters. Discuss the pros and cons of doing so in order to use a character-based run-length encoding method.

5. Draw a Huffman coding tree for the string “ABRACADABRA.” How many bits does the encoded message require?

6. What is the Huffman code for a binary file? Give an example showing the maximum number of bits that could be used in a Huffman code for an N-character ternary (three-valued) file.

7. Suppose that the frequencies of the occurrence of all the characters to be encoded are different. Is the Huffman encoding tree unique?

8. Huffman coding could be extended in a straightforward way to encode in two-bit characters (using 4-way trees). What would be the main advantage and the main disadvantage of doing so?

9. What would be the result of breaking up a Huffman-encoded string into five-bit characters and Huffman encoding that string?

10. Implement a procedure to decode a Huffman-encoded string, given the code and len arrays.
23. Cryptology
In the previous chapter we looked at methods for encoding strings of
characters to save space. Of course, there is another very important
reason to encode strings of characters: to keep them secret.
Cryptology, the study of systems for secret communications, consists of
two competing fields of study: cryptography, the design of secret communica-
tions systems, and cryptanalysis, the study of ways to compromise secret com-
munications systems. The main application of cryptology has been in military
and diplomatic communications systems, but other significant applications
are becoming apparent. Two principal examples are computer file systems
(where each user would prefer to keep his files private) and “electronic funds
transfer” systems (where very large amounts of money are involved). A com-
puter user wants to keep his computer files just as private as papers in his
file cabinet, and a bank wants electronic funds transfer to be just as secure
as funds transfer by armored car.
Except for military applications, we assume that cryptographers are “good
guys” and cryptanalysts are “bad guys”:
our goal is to protect our computer
files and our bank accounts from criminals. If this point of view seems some-
what unfriendly, it must be noted (without being over-philosophical) that by
using cryptography one is assuming the existence of unfriendliness! Of course,
even “good guys” must know something about cryptanalysis, since the very
best way to be sure that a system is secure is to try to compromise it yourself.
(Also, there are several documented instances of wars being brought to an

end, and many lives saved, through successes in cryptanalysis.)
Cryptology has many close connections with computer science and al-
gorithms, especially the arithmetic and string-processing algorithms that we
have studied. Indeed, the art (science?) of cryptology has an intimate relation-
ship with computers and computer science that is only beginning to be fully
understood. Like algorithms, cryptosystems have been around far longer
than computers. Secrecy system design and algorithm design have a common
heritage, and the same people are attracted to both.
It is not entirely clear which branch of cryptology has been affected most
by the availability of computers. The cryptographer now has available a much
more powerful encryption machine than before, but this also gives him more
room to make a mistake. The cryptanalyst has much more powerful tools
for breaking codes, but the codes to be broken are more complicated than
ever before. Cryptanalysis can place an incredible strain on computational
resources; not only was it among the first application areas for computers,
but it still remains a principal application area for modern supercomputers.
More recently, the widespread use of computers has led to the emergence
of a variety of important new applications for cryptology, as mentioned above.
New cryptographic methods appropriate for such applications have recently
been developed, and these have led to the discovery of a fundamental relationship
between cryptology and an important area of theoretical computer science
that we’ll examine briefly in Chapter 40.
In this chapter, we’ll examine some of the basic characteristics of cryp-
tographic algorithms because of the importance of cryptography in modern
computer systems and because of close relationships with many of the algo-
rithms we have studied. We’ll refrain from delving into detailed implementa-
tions: cryptography is certainly a field that should be left to experts. While

it’s not difficult to “keep people honest” by encrypting things with a simple
cryptographic algorithm, it is dangerous to rely upon a method implemented
by a non-expert.
Rules of the Game
All the elements that go into providing a means for secure communications
between two individuals together are called a cryptosystem. The canonical
structure of a typical cryptosystem is diagramed below:
[Diagram: the sender S encrypts the plaintext “attack at dawn” under a key K; the ciphertext travels over an insecure line, in view of the cryptanalyst A, to the receiver R, who decrypts it with the same key.]
The sender (S) wishes to send a message (called the plaintext) to the
receiver (R). To do so, he transforms the plaintext into a secret form suitable
for transmission (called the ciphertext) using a cryptographic algorithm (the
encryption method) and some key (K) parameters. To read the message,
the receiver must have a matching cryptographic algorithm (the decryption
method) and the same key parameters, which he can use to transform the
ciphertext back into the plaintext, the message. It is usually assumed that
the ciphertext is sent over insecure communications lines and is available to
the cryptanalyst (A). It also is usually assumed that the encryption and
decryption methods are known to the cryptanalyst: his aim is to recover the
plaintext from the ciphertext without knowing the key parameters. Note that
the whole system depends on some separate prior method of communication
between the sender and receiver to agree on the key parameters. As a rule,
the more key parameters, the more secure the cryptosystem is but the more
inconvenient it is to use. This situation is akin to that for more conventional
security systems: a combination safe is more secure with more numbers on the
combination lock, but it is harder to remember the combination. The parallel
with conventional systems also serves as a reminder that any security system
is only as secure as the trustworthiness of the people that have the key.
It is important to remember that economic questions play a central role

in cryptosystems. There is an economic motivation to build simple encryption
and decryption devices (since many may need to be provided and complicated
devices cost more). Also, there is an economic motivation to reduce the
amount of key information that must be distributed (since a very secure and
expensive method of communications must be used). Balanced against the cost
of implementing cryptographic algorithms and distributing key information
is the amount of money the cryptanalyst would be willing to pay to break
the system. For most applications, it is the cryptographer’s aim to develop a
low-cost system with the property that it would cost the cryptanalyst much
more to read messages than he would be willing to pay. For a few applications,
a “provably secure” cryptosystem may be required: one for which it can be
ensured that the cryptanalyst can never read messages no matter what he is
willing to spend. (The very high stakes in some applications of cryptology
naturally imply that very large amounts of money are used for cryptanalysis.)
In algorithm design, we try to keep track of costs to help us choose the best
algorithms; in cryptology, costs play a central role in the design process.
Simple Methods
Among the simplest (and among the oldest) methods for encryption is the
Caesar cipher: if a letter in the plaintext is the Nth letter in the alphabet,
replace it by the (N + K)th letter in the alphabet, where K is some fixed
integer (Caesar used K = 3). For example, the table below shows how a
message is encrypted using this method with K = 1:
Plaintext: ATTACK AT DAWN

Ciphertext: BUUBDLABUAEBXO
This method is weak because the cryptanalyst has only to guess the value
of K: by trying each of the 26 choices, he can be sure that he will read the
message.
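The method is simple enough to capture in a few lines. The sketch below (Python, for illustration only) treats the blank as a character preceding A, as in the example above, so that the alphabet has 27 characters in all:

ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"   # blank plus the 26 letters

def caesar(text, k):
    out = ""
    for ch in text:
        i = ALPHABET.index(ch)                        # the letter's position
        out += ALPHABET[(i + k) % len(ALPHABET)]      # replace by the (i+K)th letter
    return out

print(caesar("ATTACK AT DAWN", 1))     # BUUBDLABUAEBXO
print(caesar("BUUBDLABUAEBXO", -1))    # decryption is just a shift by -K

The attack described above amounts to printing caesar(ciphertext, -k) for each possible value of k and seeing which result reads as English.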
A far better method is to use a general table to define the substitution
to be made: for each letter in the plaintext, the table tells which letter to put
in the ciphertext. For example, if the table gives the correspondence
 ABCDEFGHIJKLMNOPQRSTUVWXYZ
THE QUICKBROWNFXJMPDVRLAZYG
then the message is encrypted as follows:
Plaintext: ATTACK AT DAWN
Ciphertext: HVVH OTHVTQHAF
This is much more powerful than the simple Caesar cipher because the crypt-
analyst would have to try many more (about 27! > 10^28) tables to be sure
of reading the message. However, “simple substitution” ciphers like this are
easy to break because of letter frequencies inherent in the language. For ex-
ample, since E is the most frequent letter in English text, the cryptanalyst
could get a good start on reading the message by looking for the most frequent
letter in the ciphertext and assuming that it is to be replaced by E. While
this might not be the right choice, he certainly is better off than if he had to
try all 26 letters. He can do even better by looking at two-letter combinations
(“digrams”): certain digrams (such as QJ) never occur in English text while
others (such as ER) are very common. By examining frequencies of letters
and combinations of letters, a cryptanalyst can very easily break a simple
substitution cipher.
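The very first step of such an attack is easy to sketch (Python, for illustration; the short example ciphertext above is of course far too small for the frequencies to mean much):

from collections import Counter

def frequency_profile(ciphertext):
    letters = [c for c in ciphertext if c.isalpha()]
    singles = Counter(letters)                                      # single letters
    digrams = Counter(a + b for a, b in zip(letters, letters[1:]))  # two-letter combinations
    return singles.most_common(5), digrams.most_common(5)

print(frequency_profile("HVVH OTHVTQHAF"))

On a ciphertext of realistic length, the most frequent letter is a good first guess for E, and impossible digrams rule out many candidate substitutions.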
One way to make this type of attack more difficult is to use more than

one table. A simple example of this is an extension of the Caesar cipher called
the Vigenere cipher: a small repeated key is used to determine the value of K
for each letter. At each step, the key letter index is added to the plaintext
letter index to determine the ciphertext letter index. Our sample plaintext,
with the key ABC, is encrypted as follows:
Key: ABCABCABCABCAB
Plaintext: ATTACK AT DAWN
Ciphertext: BVWBENACWAFDXP
For example, the last letter of the ciphertext is P, the 16th letter of the
alphabet, because the corresponding plaintext letter is N (the 14th letter),
and the corresponding key letter is B (the 2nd letter).
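A sketch of this cipher (Python, illustration only) uses the same 27-character alphabet as before, with the blank as letter 0; taking the sum modulo 27 to handle wrap-around is an assumption, since the example above never needs it:

ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def vigenere(text, key, sign=1):
    out = ""
    for i, ch in enumerate(text):
        k = ALPHABET.index(key[i % len(key)])       # repeat the key as often as needed
        out += ALPHABET[(ALPHABET.index(ch) + sign * k) % len(ALPHABET)]
    return out

print(vigenere("ATTACK AT DAWN", "ABC"))        # BVWBENACWAFDXP
print(vigenere("BVWBENACWAFDXP", "ABC", -1))    # decrypt by subtracting the key letters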
The Vigenere cipher can obviously be made more complicated by using
different general tables for each letter of the plaintext (rather than simple
offsets). Also, it is obvious that the longer the key, the better. In fact, if the
key is as long as the plaintext, we have the Vernam cipher, more commonly
called the one-time pad. This is the only provably secure cryptosystem known,
and it is reportedly used for the Washington-Moscow hotline and other vital
applications. Since each key letter is used only once, the cryptanalyst can
do no better than try every possible key letter for every message position,
an obviously hopeless situation since this is as difficult as trying all possible
messages. However, using each key letter only once obviously leads to a severe
key distribution problem, and the one-time pad is only useful for relatively
short messages which are to be sent infrequently.
If the message and key are encoded in binary, a more common scheme
for position-by-position encryption is to use the “exclusive-or” function: to
encrypt the plaintext, “exclusive-or” it (bit by bit) with the key. An attractive

feature of this method is that decryption is the same operation as encryption:
the ciphertext is the exclusive-or of the plaintext and the key, but doing
another exclusive-or of the ciphertext and the key returns the plaintext.
Notice that the exclusive-or of the ciphertext and the plaintext is the key.
This seems surprising at first, but actually many cryptographic systems have
the property that the cryptanalyst can discover the key if he knows the
plaintext.
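In code this is a one-liner; the sketch below (Python, illustrative; the key bytes are arbitrary) shows both facts:

def xor_bytes(data, key):
    return bytes(d ^ k for d, k in zip(data, key))   # bit-by-bit exclusive-or

plaintext = b"ATTACK AT DAWN"
key = bytes([90, 19, 126, 1, 156, 34, 79, 104, 49, 7, 238, 85, 169, 16])  # one key byte per plaintext byte

ciphertext = xor_bytes(plaintext, key)
assert xor_bytes(ciphertext, key) == plaintext    # decryption is the same operation
assert xor_bytes(ciphertext, plaintext) == key    # known plaintext gives up the key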
Encryption/Decryption Machines
Many cryptographic applications (for example, voice systems for military
communications) involve the transmission of large amounts of data, and this
makes the one-time pad infeasible. What is needed is an approximation to
the one-time pad in which a large amount of “pseudo-key” can be generated
from a small amount of true key to be distributed.
The usual setup in such situations is as follows: an encryption machine
is fed some cryptovariables (true key) by the sender, which it uses to generate
a long stream of key bits (pseudo-key). The exclusive-or of these bits and
the plaintext forms the ciphertext. The receiver, having a similar machine
and the same cryptovariables, uses them to generate the same key stream to
exclusive-or against the ciphertext and to retrieve the plaintext.
Key generation in this context is obviously very much like random number
generation, and our random number generation methods are appropriate for
key generation (the cryptovariables are the initial seeds of the random number
generator). In fact, the linear feedback shift registers that we discussed in
Chapter 3 were first developed for use in encryption/decryption machines
such as described here. However, key generators have to be somewhat more
complicated than random number generators, because there are easy ways to
attack simple linear feedback shift registers. The problem is that it might be
easy for the cryptanalyst to get some plaintext (for example, silence in a voice

system), and therefore some key. If the cryptanalyst can get enough key that
he has the entire contents of the shift register, then he can get all the key
from that point on.
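A linear feedback shift register key generator of the kind being described might be sketched as follows (Python, illustration only; the register length and tap positions are made up, and a real key generator would be considerably more elaborate, as discussed next):

def lfsr_keystream(seed_bits, taps, n):
    # The cryptovariable is the initial fill of the register (the seed).
    reg = list(seed_bits)
    out = []
    for _ in range(n):
        out.append(reg[-1])               # output the rightmost bit
        feedback = 0
        for t in taps:
            feedback ^= reg[t]            # exclusive-or of the tapped positions
        reg = [feedback] + reg[:-1]       # shift right, feed back on the left
    return out

print(''.join(str(b) for b in lfsr_keystream([1, 0, 0, 1, 0, 1, 1], taps=[0, 6], n=40)))

The weakness mentioned above is plain from the code: once the cryptanalyst has recovered as many consecutive key bits as the register holds, he can run the same recurrence forward and produce all the key from that point on.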
Cryptographers have several ways to avoid such problems. One way is to
make the feedback function itself a cryptovariable. It is usually assumed that
the cryptanalyst knows everything about the structure of the machine (maybe
he stole one) except the cryptovariables, but if some of the cryptovariables are
used to “configure” the machine, he may have difficulty finding their values.
Another method commonly used to confuse the cryptanalyst is the product
cipher, where two different machines are combined to produce a complicated
key stream (or to drive each other). Another method is nonlinear substitution;
here the translation between plaintext and ciphertext is done in large chunks,
not bit-by-bit. The general problem with such complex methods is that they
can be too complicated for even the cryptographer to understand and that
there always is the possibility that things may degenerate badly for some
choices of the cryptovariables.
Public-Key Cryptosystems
In commercial applications such as electronic funds transfer and (real) com-
puter mail, the key distribution problem is even more onerous than in the
traditional applications of cryptography. The prospect of providing long
keys (which must be changed often) to every citizen, while still maintain-
ing both security and cost-effectiveness, certainly inhibits the development of
such systems. Methods have recently been developed, however, which promise
to eliminate the key distribution problem completely. Such systems, called
public-key cryptosystems, are likely to come into widespread use in the near
future. One of the most prominent of these systems is based on some of the
arithmetic algorithms that we have been studying, so we will take a close look
at how it works.
The idea in public-key cryptosystems is to use a “phone book” of encryption keys. Everyone’s encryption key (denoted by P) is public knowledge: a
person’s key could be listed, for example, next to his number in the telephone
book. Everyone also has a secret key used for decryption; this secret key
(denoted by S) is not known to anyone else. To transmit a message M, the
sender looks up the receiver’s public key, uses it to encrypt the message, and
then transmits the message. We’ll denote the encrypted message (ciphertext)
by C=P(M). The receiver uses his private decryption key to decrypt and read
the message. For this system to work we must have at least the following
properties:
(i) S(P(M)) = M for every message M.
(ii) All (S,P) pairs are distinct.
(iii) Deriving S from P is as hard as reading M.
(iv) Both S and P are easy to compute.
The first of these is a fundamental cryptographic property, the second two
provide the security, and the fourth makes the system feasible for use.
This general scheme was outlined by W. Diffie and M. Hellman in 1976,
but they had no method which satisfied all of these properties. Such a
method was discovered soon afterwards by R. Rivest, A. Shamir, and L.
Adleman. Their scheme, which has come to be known as the RSA public-
key cryptosystem, is based on arithmetic algorithms performed on very large
integers. The encryption key P is the integer pair (N,p) and the decryption
key S is the integer pair (N,s), where s is kept secret. These numbers are
intended to be very large (typically, N might be 200 digits and p and s might
be 100 digits). The encryption and decryption methods are then simple: first
the message is broken up into numbers less than N (for example, by taking
lg N bits at a time from the binary string corresponding to the character
encoding of the message). Then these numbers are independently raised to a
power modulo N: to encrypt a (piece of a) message M, compute C = P(M) =
M^p mod N, and to decrypt a ciphertext C, compute M = S(C) = C^s mod
N. This computation can be quickly and easily performed by modifying the
elementary exponentiation algorithm that we studied in Chapter 4 to take
the remainder when divided by N after each multiplication. (No more than
2 log N such operations are required for each piece of the message, so the
total number of operations (on 100-digit numbers!) required is linear in the number
of bits in the message.)
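A sketch of these operations with toy-sized numbers (Python; the key values below are illustrative and far smaller than the 200-digit and 100-digit sizes mentioned above) might look like this:

def mod_power(m, e, n):
    # Raise m to the e-th power modulo n by repeated squaring, taking the
    # remainder after each multiplication, as described in the text.
    result, m = 1, m % n
    while e > 0:
        if e & 1:
            result = (result * m) % n
        m = (m * m) % n
        e >>= 1
    return result

# Toy cryptovariables: x = 61 and y = 53 give N = 3233 and (x-1)(y-1) = 3120;
# p = 17 and s = 2753 satisfy p*s mod 3120 = 1.
N, p, s = 3233, 17, 2753

M = 65                            # a piece of the message, a number less than N
C = mod_power(M, p, N)            # encryption: C = M^p mod N
assert mod_power(C, s, N) == M    # decryption: C^s mod N recovers M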
Property (iv) above is therefore satisfied, and property (ii) can be easily
enforced. We still must make sure that the cryptovariables N, p, and s can be
chosen so as to satisfy properties (i) and (iii). To be convinced of these requires
an exposition of number theory which is beyond the scope of this book, but
we can outline the main ideas. First, it is necessary to generate three large
(approximately 100-digit) “random” prime numbers: the largest will be s and
we’ll call the other two x and y. Then N is chosen to be the product of x
and y, and p is chosen so that ps mod (x - 1)(y - 1) = 1. It is possible to prove
that, with N, p, and s chosen in this way, we have M^(ps) mod N = M for all
messages M.
More specifically, each large prime can be generated by generating a large
random number, then testing successive numbers starting at that point until
a prime is found. One simple method performs a calculation on a random
number that, with probability
1/2,
will “prove” that the number to be tested
is not prime. (A number which is not prime will survive 20 applications of
this test less than one time out of a million, 30 applications less than 1 time
out of a billion.) The last step is to compute p: it turns out that a variant of
Euclid’s algorithm (see Chapter 1) is just what is needed.
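The particular test used is not reproduced here, but a sketch along these lines (Python, illustration only; the Fermat test is one common test with the stated one-in-two property, and the sizes chosen are far smaller than 100 digits) shows how the pieces fit together:

import random

def probably_prime(n, trials=30):
    # Each trial "proves" a composite to be composite with probability
    # roughly 1/2 or better; a number surviving many trials is almost
    # certainly prime.
    if n < 4:
        return n in (2, 3)
    for _ in range(trials):
        a = random.randrange(2, n - 1)
        if pow(a, n - 1, n) != 1:     # Fermat's test
            return False
    return True

def next_prime(start):
    n = start | 1                     # start from an odd number
    while not probably_prime(n):
        n += 2
    return n

def inverse(s, m):
    # A variant of Euclid's algorithm: find p with p*s mod m = 1.
    a, b, x0, x1 = m, s, 0, 1
    while b:
        q, a, b = a // b, b, a % b
        x0, x1 = x1, x0 - q * x1
    return x0 % m

x = next_prime(random.randrange(10**9, 10**10))
y = next_prime(random.randrange(10**9, 10**10))
s = next_prime(random.randrange(10**10, 10**11))      # the largest of the three primes
N = x * y
# s is a large prime, so it is relatively prime to (x-1)(y-1) except with
# negligible probability, and the inverse p exists.
p = inverse(s, (x - 1) * (y - 1))
assert (p * s) % ((x - 1) * (y - 1)) == 1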
Furthermore, s seems to be difficult to compute from knowledge of p (and
N), though no one has been able to prove that to be the case. Apparently,
finding s from p requires knowledge of x and y, and apparently it is necessary
to factor N to calculate x and y. But factoring N is thought to be very
difficult: the best factoring algorithms known would take millions of years to

factor a 200-digit number, using current technology.
An attractive feature of the RSA system is that the complicated com-
putations involving N, p, and s are performed only once for each user who
subscribes to the system, while the much more frequent operations of encryp-
tion and decryption involve only breaking up the message and applying the
simple exponentiation procedure. This computational simplicity, combined
with all the convenience features provided by public-key cryptosystems, makes
this system quite attractive for secure communications, especially on computer
systems and networks.
The RSA method has its drawbacks: the exponentiation procedure is ac-
tually expensive by cryptographic standards, and, worse, there is the linger-
ing possibility that it might be possible to read messages encrypted using the
method. This is true with many cryptosystems: a cryptographic method must
withstand serious cryptanalytic attacks before it can be used with confidence.
Several other methods have been suggested for implementing public-key
cryptosystems. Some of the most interesting are linked to an important class
of problems which are generally thought to be very hard (though this is not
known for sure), which we’ll discuss in Chapter 40. These cryptosystems
have the interesting property that a successful attack could provide insight on
how to solve some well-known difficult unsolved problems (as with factoring
for the RSA method). This link between cryptology and fundamental topics
in computer science research, along with the potential for widespread use of
public-key cryptography, has made this an active area of current research.