
system that can affect the outcome. The more of them there are, the more likely it is
that, purely by happenstance, some particular, but actually meaningless, pattern will show
up. The number of variables in a data set, or the number of weights in a neural network,
all represent things that can change. So, yet again, high-dimensionality problems turn up,
this time expressed as degrees of freedom. Fortunately, for the purposes of data
preparation, a definition of degrees of freedom is not needed; in any case, this is a
problem previously encountered in many guises. Much of the discussion, particularly in this
chapter, has been about reducing the dimensionality/combinatorial explosion problem
(which is degrees of freedom in disguise) by reducing dimensionality. Nonetheless, a data
set always has some dimensionality, for if it does not, there is no data set! And having
some particular dimensionality, or number of degrees of freedom, implies some particular
chance that spurious patterns will turn up. It also has implications about how much data is
needed to ensure that any spurious patterns are swamped by valid, real-world patterns.
The difficulty is that the calculations are not exact because several needed measures,
such as the number of significant system states, while definable in theory, seem
impossible to pin down in practice. Also, each modeling tool introduces its own degrees of
freedom (weights in a neural network, for example), which may be unknown to the
miner.



The ideal, if the miner has access to software that can make the measurements (such as
data surveying software), requires use of a multivariable sample determined to be
representative to a suitable degree of confidence. Failing that, as a rule of thumb for the
minimum amount of data to accept, for mining (as opposed to data preparation), use at least
twice the number of instances required for a data preparation representative sample. The
key is to have enough representative instances of data to swamp the spurious patterns.
Each significant system state needs sufficient representation, and having a truly
representative sample of data is the best way to assure that.



10.7 Beyond Joint Distribution




So far, so good. Capturing the multidimensional distribution captures a representative
sample of data. What more is needed? On to modeling!




Unfortunately, things are not always quite so easy. Having a representative sample in
hand is a really good start, but it does not assure that the data set is modelable! Capturing
a representative sample is an essential minimum—that, and knowing what degree of
confidence is justified in believing the sample to be representative. However, the miner
needs a modelable representative sample, and the sample simply being representative of
the population may not be enough. How so?




Actually, there are any number of reasons, all of them domain specific, why the minimum
representative sample may not suffice—or indeed, why a nonrepresentative sample is
needed. (Heresy! All this trouble to ensure that a fully representative sample is collected,
and now we are off after a nonrepresentative sample. What goes on here?)





Suppose a marketing department needs to improve a direct-mail marketing campaign.
The normal response rate for the random mailings so far is 1.5%. Mailing rolls out, results
trickle in. A (neophyte) data miner is asked to improve response. “Aha!,” says the miner, “I
have just the thing. I’ll whip up a quick response model, infer who’s responding, and
redirect the mail to similar likely responders. All I need is a genuinely representative
sample, and I’ll be all set!” With this terrific idea, the miner applies the modeling tools, and
after furiously mining, the best prediction is that no one at all will respond! Panic sets in;
staring failure in the face, the neophyte miner begins the balding process by tearing out
hair in chunks while wondering what to do next.




Fleeing the direct marketers with a modicum of hair, the miner tries an industrial chemical
manufacturer. Some problem in the process occasionally curdles a production batch. The
exact nature of the process failure is not well understood, but the COO just read a
business magazine article extolling the miraculous virtues of data mining. Impressed by
the freshly minted data miner (who has a beautiful certificate attesting to skill in mining),
the COO decides that this is a solution to the problem. Copious quantities of data are
available, and plenty more if needed. The process is well instrumented, and continuous
chemical batches are being processed daily. Oodles of data representative of the process
are on hand. Wielding mining tools furiously, the miner conducts an onslaught designed to
wring every last confession of failure from the recalcitrant data. Using every art and
artifice, the miner furiously pursues the problem until, staring again at failure and with
desperation setting in, the miner is forced to fly from the scene, yet more tufts of hair
flying.





Why has the now mainly hairless miner been so frustrated? The short answer is that while
the data is representative of the population, it isn’t representative of the problem.
Consider the direct marketing problem. With a response rate of 1.5%, any predictive
system has an accuracy of 98.5% if it uniformly predicts “No response here!” Same thing
with the chemical batch processing—lots of data in general, little data about the failure
conditions.
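
To see the arithmetic concretely, the following short sketch (illustrative only; it is not part
of the demonstration code) shows how a model that never predicts a response still scores
roughly 98.5% accuracy while identifying no responders at all:

import numpy as np

# Simulate a mailing list with a 1.5% response rate (values assumed for illustration).
rng = np.random.default_rng(0)
responded = rng.random(100_000) < 0.015

# A "model" that uniformly predicts "No response here!"
predicted = np.zeros_like(responded)

accuracy = (predicted == responded).mean()
print(f"Accuracy of the do-nothing model: {accuracy:.2%}")              # roughly 98.5%
print(f"Responders actually identified: {int(predicted[responded].sum())}")  # 0

Accuracy alone, in other words, says nothing here about whether the feature of interest is
being captured; the data is representative of the population, not of the problem.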




Both of these examples are based on real applications, and in spite of the light manner of
introducing the issue, the problem is difficult to solve. The feature to be modeled is
insufficiently represented for modeling in a data set that is representative of the
population. Yet, if the mining results are to be valid, the data set mined must be
representative of the population or the results will be biased, and may well be useless in
practice. What to do?




10.7.1 Enhancing the Data Set




When the density of the feature to be modeled is very low, clearly the density of that
feature needs to be increased—but in a way that does least violence to the distribution of
the population as a whole. Using the direct marketing response model as an example,

simply increasing the proportion of responders in the sample may not help. It’s assumed
that there are some other features in the sample that actually do vary as response varies.
It’s just that they’re swamped by spurious patterns, but only because of their low density
in the sample. Enhancing the density of responders is intended to enhance the variability
of connected features. The hope is that when enhanced, these other features become
visible to the predictive mining tool and, thus, are useful in predicting likely responders.



These assumptions are to some extent true. Some performance improvement may be
obtained this way, though usually more by happenstance than by design. The problem is
that low-density features have more than just low-level interactions with other, potentially
predictive features. The instances with the low-density feature represent some small
proportion of the whole sample and form a subsample—the subsample containing only
those instances that have the required feature. Considered alone, because it is so small,
the subsample almost certainly does not represent the sample as a whole—let alone the
population. There is, therefore, a very high probability that the subsample contains much
noise and bias that are in fact totally unrelated to the feature itself, but are simply
concomitant to it in the sample taken for modeling.




Simply increasing the desired feature density also increases the noise and bias patterns
that the subsample carries with it—and those noise and bias patterns will then appear to
be predictive of the desired feature. Worse, the enhanced noise and bias patterns may
swamp any genuinely predictive feature that is present.





This is a tough nut to crack. It is very similar to any problem of extracting information from
noise, and that is the province of information theory, discussed briefly in Chapter 11 in the
context of the data survey. One of the purposes of the data survey is to understand the
informational structure of the data set, particularly in terms of any identified predictive
variables. However, a practical approach to solving the problem does not depend on the
insights of the data survey, helpful though they might be. The problem is to construct a
sample data set that represents the population as much as possible while enhancing
some particular feature.




Feature Enhancement with Plentiful Data




If there is plenty of data to draw upon, instances of data with the desired feature may also
be plentiful. This is the case in the first example above. The mailing campaign produces
many responses. The problem is their low density as a proportion of the sample. There
may be thousands or tens of thousands of responses, even though the response rate is
only 1.5%.




In such a circumstance, the shortage of instances with the desired feature is not the
problem, only their relative density in the mining sample. With plenty of data available, the
miner constructs two data sets, both fully internally representative of the

population—except for the desired feature. To do this, divide the source data set into two

subsets such that one subset has only instances that contain the feature of interest and
the other subset has no instances that contain the feature of interest. Use the already
described techniques (Chapter 5) to extract a representative sample from each subset,
ignoring the effect of the key feature. This results in two separate subsets, both similar to
each other and representative of the population as a whole when ignoring the effect of the
key feature. They are effectively identical except that one has the key feature and the
other does not.



Any difference in distribution between the two subsets is due to noise, to bias, or to the
effect of the key feature. Whatever differences there are should be investigated and
validated in any case, but this procedure minimizes noise and bias since both
data sets are representative of the population, save for the effect of the key feature.
Adding the two subsets together gives a composite data set that has an enhanced
presence of the desired feature, yet is as free from other bias and noise as possible.
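
As a rough sketch of the procedure just described, assuming the data is held in a pandas
DataFrame with a 0/1 column marking the key feature, and assuming the two sample sizes
have already been determined by the representativeness tests of Chapter 5, the
construction might look like this:

import pandas as pd

def build_enhanced_sample(df: pd.DataFrame, feature_col: str,
                          n_with: int, n_without: int,
                          seed: int = 42) -> pd.DataFrame:
    """Split on the key feature, draw a representative-sized sample from each
    subset, and recombine into a feature-enhanced composite data set."""
    with_feature = df[df[feature_col] == 1]
    without_feature = df[df[feature_col] == 0]

    # Each sample size should come from a representativeness test (Chapter 5);
    # fixed counts are used here only to keep the sketch self-contained.
    sample_with = with_feature.sample(n=min(n_with, len(with_feature)),
                                      random_state=seed)
    sample_without = without_feature.sample(n=min(n_without, len(without_feature)),
                                            random_state=seed)

    # The composite has an enhanced density of the key feature while each half
    # remains, on its own, representative of the population.
    return pd.concat([sample_with, sample_without]).sample(frac=1, random_state=seed)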




Feature Enhancement with Limited Data




Feature enhancement is more difficult when there is only limited data available. This is the
case in the second example of the chemical processor. The production staff bends every

effort to prevent the production batch from curdling, which only happens very infrequently.
The reasons for the batch failure are not well understood anyway (that is what is to be
investigated), so the failure may not be reliably reproducible. Reproducible or not, batch failure
is a highly expensive event, hitting directly at the bottom line, so deliberately introducing
failure is simply not an option management will countenance. The miner was constrained
to work with the small amount of failure data already collected.




Even where data in general is plentiful, small subsamples that have the feature of interest are very likely
to also carry much noise and bias. Since more data with the key feature is unavailable,
the miner is constrained to work with the data at hand. There are several modeling
techniques that are used to extract the maximum information from small subsamples,
such as multiway cross-validation on the small feature sample itself, and intersampling
and resampling techniques. These techniques do not affect data preparation since they
are only properly applied to already prepared data. However, there is one data
preparation technique used when data instances with a key feature are particularly low in
density: data multiplication.




The problem with low feature-containing instance counts is that the mining tool might
learn the specific pattern in each instance and take those specific patterns as predictive.
In other words, low key feature counts prevent some mining tools from generalizing from
the few instances available. Instead of generalizing, the mining tool learns the particular
instance configurations—which is particularizing rather than generalizing. Data
multiplication is the process of creating additional data instances that appear to have the
feature of interest. White (or colorless) noise is added to the key feature subset, producing

a second data subset. (See Chapter 9 for a discussion of noise and colored noise.) The

interesting thing about the second subset is that its variables all have the same mean
values, distributions and so on, as the original data set—yet no two instance values,
except by some small chance, are identical. Of course, the noise-added data set can be
made as large as the miner needs. If duplicates do exist, they should be removed.
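
A minimal sketch of data multiplication follows. It assumes numeric variables held in a
pandas DataFrame containing only the key-feature instances; the noise scale (a small
fraction of each variable's standard deviation) is an illustrative choice, not a prescription
from the demonstration code:

import numpy as np
import pandas as pd

def multiply_with_white_noise(feature_subset: pd.DataFrame,
                              copies: int = 5,
                              noise_scale: float = 0.05,
                              seed: int = 0) -> pd.DataFrame:
    """Create additional pseudo-instances of the rare-feature subset by adding
    zero-mean (white) noise to each numeric variable."""
    rng = np.random.default_rng(seed)
    numeric = feature_subset.select_dtypes(include="number")
    multiplied = []
    for _ in range(copies):
        noise = rng.normal(loc=0.0,
                           scale=noise_scale * numeric.std(ddof=0).to_numpy(),
                           size=numeric.shape)
        noisy_copy = feature_subset.copy()
        noisy_copy[numeric.columns] = numeric.to_numpy() + noise
        multiplied.append(noisy_copy)
    combined = pd.concat([feature_subset, *multiplied], ignore_index=True)
    return combined.drop_duplicates()   # remove any (unlikely) exact duplicates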



When added to the original data set, these now appear as more instances with the
feature, increasing the apparent count and increasing the feature density in the overall
data set. The added density means that mining tools will generalize their predictions from
the multiplied data set. A problem is that any noise or bias present will be multiplied too.
Can this be reduced? Maybe.




A technique called color matching helps. Adding white noise multiplies everything exactly
as it is, warts and all. Instead of white noise, specially constructed colored noise can be
added. The multidimensional distribution of a data sample representative of the
population determines the precise color. Color matching adds noise that matches the
multivariable distribution found in the representative sample (i.e., it is the same color, or
has the same spectrum). Any noise or bias present in the original key feature subsample
is still present, but color matching attempts to avoid duplicating the effect of the original
bias, even diluting it somewhat in the multiplication.
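
The full color-matching construction is beyond a short example, but the following simplified
sketch conveys the idea by matching only the covariance structure of a representative
sample (via a Cholesky factor). The book's technique matches the full multivariable
distribution, so treat this as an approximation under that assumption:

import numpy as np

def colored_noise_like(representative: np.ndarray, n_rows: int,
                       scale: float = 0.05, seed: int = 0) -> np.ndarray:
    """Return zero-mean noise whose covariance structure approximates that of
    the representative sample, shrunk by `scale` before being added."""
    rng = np.random.default_rng(seed)
    cov = np.cov(representative, rowvar=False)
    cov += 1e-9 * np.eye(cov.shape[0])     # small ridge keeps Cholesky stable
    chol = np.linalg.cholesky(cov)
    white = rng.standard_normal((n_rows, cov.shape[0]))
    return scale * (white @ chol.T)        # "colored" to match the sample

# Usage sketch: key_subset + colored_noise_like(representative_sample, len(key_subset))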





As always, whenever adding bias to a data set, the miner should put up mental warning
flags. Data multiplication and color matching add features to, or change features of, the
data set that simply are not present in the real world—or if present, not at the density
found after modification. Sometimes there is no choice but to modify the data set, and
frequently the results are excellent, robust, and applicable. Sometimes even good results
are achieved where none at all were possible without making modifications. Nonetheless,
biasing data calls for extreme caution, with much validation and verification of the results
before applying them.




10.7.2 Data Sets in Perspective




Constructing a composite data set enhances the visibility of some pertinent feature in the
data set that is of interest to the miner. Such a data set is no longer an unbiased sample,
even if the original source data allowed a truly unbiased sample to be taken in the first
place. Enhancing data makes it useful only from one particular point of view, or from a
particular perspective. While more useful in particular circumstances, it is nonetheless not
so useful in general. It has been biased, but with a purposeful bias deliberately
introduced. Such data has a perspective.




When mining perspectival data sets, it is very important to use nonperspectival test and

evaluation sets. With the best of intentions, the mining data has been distorted and, to at
least that extent, no longer accurately represents the population. The only place that the
inferences or predictions can be examined, to ensure that they do not carry an unacceptable
distortion through into the real world, is against test data that is as undistorted—that
is, as representative of the real world—as possible.


10.8 Implementation Notes




Of the four topics covered in this chapter, the demonstration code implements algorithms
for the problems that can be automatically adjusted without high risk of unintended data
set damage. Some of the problems discussed are only very rarely encountered or could
cause more damage than benefit to the data if applied without care. Where no preparation
code is available, this section includes pointers to procedures the miner can follow to
perform the particular preparation activity.




10.8.1 Collapsing Extremely Sparsely Populated Variables




The demonstration code has no explicit support for collapsing extremely sparsely
populated variables. It is usual to ignore such variables, and only in special circumstances

do they need to be collapsed. Recall that these variables are usually populated at levels
of small fractions of 1%, so well over 99% of the values are missing (or empty).




While the full tool from which the demonstration code was drawn will fully collapse such
variables if needed, it is easy to collapse them manually using the statistics file and the
complete-content file produced by the demonstration code, along with a commercial data
manipulation tool, say, an implementation of SQL. Most commercial statistical packages
also provide all of the necessary tools to discover the problem, manipulate the data, and
create the derived variables.




1. If using the demonstration code, start with the “stat” file.

2. Identify the population density for each variable.

3. Check the number of discrete values for each candidate sparse variable.

4. Look in the complete-content file, which lists all of the values for all of the variables.

5. Extract the lists for the sparse variables.

6. Access the sample data set with your tool of choice and search for, and list, those
cases where the sparse variables simultaneously have values. (This won’t happen
often, even in sparse data sets.)

7. Create unique labels for each specific present-value pattern (PVP).

8. Numerate the PVPs.




Now comes the only tricky part. Recall that the PVPs were built from the representative
sample. (It’s representative only to some selected degree of confidence.) The execution
data set may, and if large enough almost certainly will, contain a PVP that was not in the
sample data set. If important, and only the domain of the problem provides that answer,
create labels for all of the possible PVPs, and assign them appropriate values. That is a
judgment call. It may be that you can ignore any unrecognized PVPs, or more likely, flag
them if they are found.
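
As an illustration of steps 6 through 8 above, and of flagging unrecognized PVPs at
execution time, the following pandas sketch labels each instance by which sparse
variables are simultaneously present and gives each distinct pattern a code. The final
numeration of the PVPs would, of course, follow the numeration techniques already
described rather than the arbitrary integer codes used here:

import pandas as pd

def build_pvp(df: pd.DataFrame, sparse_cols: list) -> tuple:
    """Label each row by which sparse variables are simultaneously present,
    then map each distinct pattern to a placeholder code."""
    pattern = df[sparse_cols].notna().apply(
        lambda row: "+".join(col for col, present in row.items() if present) or "NONE",
        axis=1)
    mapping = {label: code for code, label in enumerate(sorted(pattern.unique()))}
    return pattern.map(mapping), mapping

def apply_pvp(df: pd.DataFrame, sparse_cols: list, mapping: dict) -> pd.Series:
    """At execution time, flag any pattern never seen in the sample (code -1)."""
    pattern = df[sparse_cols].notna().apply(
        lambda row: "+".join(col for col, present in row.items() if present) or "NONE",
        axis=1)
    return pattern.map(lambda lbl: mapping.get(lbl, -1))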




10.8.2 Reducing Excessive Dimensionality





Neural networks comprise a vast topic on their own. The brief introduction in this chapter
only touched the surface. In keeping with all of the other demonstration code segments,
the neural network design is intended mainly for humans to read and understand.
Obviously, it also has to be read (and executed) by computer systems, but the primary
focus is that the internal working of the code be as clearly readable as possible. Of all the
demonstration code, this requirement for clarity most affects the network code. The
network is not optimized for speed, performance, or efficiency. The sparsity mechanism is
modified random assignment without any dynamic interconnection. Compression factor
(hidden-node count) is discovered by random search.
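
Here is a compact, hedged sketch of the principle, not the demonstration code itself,
using scikit-learn's MLPRegressor as a stand-in autoassociative network: train the
network to reproduce its inputs through a single narrow hidden layer, and use the
reconstruction error to judge a given compression factor (hidden-node count):

import numpy as np
from sklearn.neural_network import MLPRegressor

def compression_error(X: np.ndarray, hidden_nodes: int, seed: int = 0) -> float:
    """Fit an autoassociative network (targets == inputs) and return the mean
    squared reconstruction error for this compression factor."""
    net = MLPRegressor(hidden_layer_sizes=(hidden_nodes,), activation="logistic",
                       max_iter=2000, random_state=seed)
    net.fit(X, X)                        # autoassociative: reproduce the inputs
    return float(np.mean((net.predict(X) - X) ** 2))

# Random search over the compression factor, as described above:
# candidates = np.random.default_rng(0).integers(2, X.shape[1], size=5)
# errors = {k: compression_error(X, int(k)) for k in candidates}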




The included code demonstrates the key principles involved and compresses information.
Code for a fully optimized autoassociative neural network, including dynamic connection
search with modified cascade hidden-layer optimization, is an impenetrable beast! The
full version, from which the demonstration is drawn, also includes many other obfuscating
(as far as clarity of reading goes) “bells and whistles.” For instance, it includes
modifications to allow maximum compression of information into the hidden layer, rather
than spreading it between hidden and output layers, as well as modifications to remove
linear relationships and represent those separately. While improving performance and
compression, such features completely obscure the underlying principles.




10.8.3 Measuring Variable Importance





Everything just said about neural networks for data compression applies when using the
demonstration code to measure variable importance. For explanatory ease, both data
compression and variable importance estimation use the same code segment. A network
optimized for importance search can, once again, improve performance, but the principles
are as well demonstrated by any SCANN-type BP-ANN.




10.8.4 Feature Enhancement




To enhance features in a data set, build multiple representative data subsets, as
described, and merge them.




Describing construction of colored noise for color matching, if needed, is unfortunately

outside the scope of the present book. It involves significant multivariable frequency
modeling to reproduce a characteristic noise pattern emulating the sample multivariable
distribution. Many statistical analysis software packages provide the basic tools for the
miner to characterize the distribution and develop the necessary noise generation

function.


10.9 Where Next?




A pause at this point. Data preparation, the focus of this book, is now complete. By
applying all of the insights and techniques so far covered, raw data in almost any form is
turned into clean prepared data ready for modeling. Many of the techniques are illustrated
with computer code on the accompanying CD-ROM, and so far as data preparation for
data mining is concerned, the journey ends here.




However, the data is still unmined. The ultimate purpose of preparing data is to gain
understanding of what the data “means” or predicts. The prepared data set still has to be
used. How is this data used? The last two chapters look not at preparing data, but at
surveying and using prepared data.



Chapter 11: The Data Survey





Overview




Suppose that three separate families are planning a vacation. The Abbott family really
enjoys lake sailing. Their ideal vacation includes an idyllic mountain lake, surrounded by
trees, with plenty of wildlife and perhaps a small town or two nearby in case supplies are
needed. They need only a place to park their car and boat trailer, a place to launch the
boat, and they are happy.




The Bennigans are amateur archeologists. There is nothing they like better than to find an
ancient encampment, or other site, and spend their time exploring for artifacts. Their
four-wheel-drive cruiser can manage most terrain and haul all they need to be entirely
self-sufficient for a couple of weeks exploring—and the farther from civilization, the better
they like it.




The Calloways like to stay in touch with their business, even while on vacation. Their ideal
is to find a luxury hotel in the sun, preferably near the beach but with nightlife. Not just any
nightlife; they really enjoy cabaret, and would like to find museums to explore and other
places of interest to fill their days.





These three families all have very different interests and desires for their perfect vacation.
Can they all be satisfied? Of course. The locations that each family would like to find and
enjoy exist in many places; their only problem is to find them and narrow down the
possibilities to a final choice. The obvious starting point is with a map. Any map of the
whole country indicates broad features—mountains, forests, deserts, lakes, cities, and
probably roads. The Abbotts will find, perhaps, the Finger Lakes in upstate New York a
place to focus their attention. The Bennigans may look at the deserts of the Southwest,
while the Calloways look to Florida. Given their different interests, each family starts by
narrowing down the area of search for their ideal vacation to those general areas of the
country that seem likely to meet their needs and interests.




Once they have selected a general area, a more detailed map of the particular territory
lets each family focus in more closely. Eventually, each family will decide on the best
choice they can find and leave for their various vacations. Each family explores its own
vacation site in detail. While the explorations do not seem to produce maps, they reveal
small details—the very details that the vacations are aimed at. The Abbotts find particular
lake coves, see particular trees, and watch specific birds and deer. The Bennigans find
individual artifacts in specific places. The Calloways enjoy particular cabaret performers
and see specific exhibits at particular museums. It is these detailed explorations that each
family feels to be the whole purpose for their vacations.




Each family started with a general search to find places likely to be of interest. Their initial

search was easy. The U.S. Geological Survey has already done the hard work for them.
Other organizations, some private survey companies, have embellished maps in
particular ways and for particular purposes—road maps, archeological surveys, sailing
maps (called “charts”), and so on. Eventually, the level of detail that each family needed
was more than a general map could provide. Then the families constructed their own
maps through detailed exploration.




What does this have to do with data mining? The whole purpose of the data survey is to help
the miner draw a high-level map of the territory. With this map, a data miner discovers the
general shape of the data, as well as areas of danger, of limitation, and of usefulness. With a
map, the Abbotts avoided having to explore Arizona to see if any lakes suitable for sailing
were there. With a data survey, a miner can avoid trying to predict the stock market from
meteorological data. “Everybody knows” that there are no lakes in Arizona. “Everybody
knows” that the weather doesn’t predict the stock market. But these “everybodies” only know
that through experience—mainly the experience of others who have been there first. Every
territory needed exploring by pioneers—people who entered the territory first to find out what
there was in general—blazing the trail for the detailed explorations to follow. The data survey
provides a miner with a map of the territory that guides further exploration and locates the
areas of particular interest, the areas suitable for mining. On the other hand, just as with
looking for lakes in Arizona, if there is no value to be found, that is well to know as early as
possible.


11.1 Introduction to the Data Survey





This chapter deals entirely with the data survey, a topic at least as large as data
preparation. The introduction to the use, purposes, and methods of data surveying in this
chapter discusses how prepared data is used during the survey. Most, if not all, of the
surveying techniques can be automated. Indeed, the full suite of programs from which the
data preparation demonstration code is drawn is a full data preparation and survey tool
set. This chapter touches only on the main topics of data surveying. It is an introduction to
the territory itself. The introduction starts with understanding the concept of “information.”




This book mentions “information” in several places. “Information is embedded in a data
set.” “The purpose of data preparation is to best expose information to a mining tool.”
“Information is contained in variability.” Information, information, information. Clearly,
“information” is a key feature of data preparation. In fact, information—its discovery,
exposure, and understanding—is what the whole preparation-survey-mining endeavor is
about. A data set may represent information in a form that is not easily, or even at all,
understandable by humans. When the data set is large, understanding significant and
salient points becomes even more difficult. Data mining is devised as a tool to transform
the impenetrable information embedded in a data set into understandable relationships or
predictions.




However, it is important to keep in mind that mining is not designed to extract information.
Data, or the data set, enfolds information. This information describes many and various
relationships that exist enfolded in the data. When mining, the information is being mined

for what it contains—an explanation or prediction based on the embedded relationships. It
is almost always an explanation or prediction of specific details that solves a problem, or
answers a question, within the domain of inquiry—very often a business problem. What is
required as the end result is human understanding (enabling, if necessary, some action).
Examining the nature of, and the relationships in, the information content of a data set is a
part of the task of the data survey. It prepares the path for the mining that follows.




Some information is always present in the data—understandable or not. Mining finds
relationships or predictions embedded in the information inherent in a data set. With luck,
they are not just the obvious relationships. With more luck, they are also useful. In
discovering and clarifying some novel and useful relationship embedded in data, data
mining has its greatest success. Nonetheless, the information exists prior to mining. The
data set enfolds it. It has a shape, a substance, a structure. In some places it is not well
defined; in others it is bright and clear. It addresses some topics well; others poorly. In
some places, the relationships are to be relied on; in others not. Finding the places,
defining the limits, and understanding the structures is the purpose of data surveying.




The fundamental question posed by the data survey is, “Just what information is in here
anyway?”


11.2 Information and Communication





Everything begins with information. The data set embeds it. The data survey surveys it.
Data mining translates it. But what exactly is information? The Oxford English Dictionary
begins its definition with “The act of informing, . . .” and continues in the same definition a
little later, “Communication of instructive knowledge.” The act referred to is clearly one
where this thing, “information,” is passed from one person to another. The latter part of the
definition explicates this by saying it is “communication.” It is in this sense of
communicating intelligence—transferring insight and understanding—that the term
“information” is used in data mining. Data possesses information only in its latent form.
Mining provides the mechanism by which any insight potentially present is explicated.
Since information is so important to this discussion, it is necessary to try to clarify, and if
possible quantify, the concept.




Because information enables the transferring of insight and understanding, there is a
sense in which quantity of information relates to the amount of insight and understanding
generated; that is, more information produces greater insight. But what is it that creates
greater insight?




A good mystery novel—say, a detective story—sets up a situation. The situation
described includes all of the necessary pieces to solve the mystery, but in a nonobvious

way. Insight comes when, at the end of the story, some key information throws all of the

established structure into a suddenly revealed, surprising new relationship. The larger
and more complex the situation that the author can create, the greater the insight when
the true situation is revealed. But in addition to the complexity of the situation, it seems to
be true that the more surprising or unexpected the solution, the greater the insight.



The detective story illustrates the two key ingredients for insight. The first is what for a
detective story is described as “the situation.” The situation comprises a number of
individual components and the relationship between the components. For a detective
story, these components are typically the characters, the attributes of characters, their
relationship to one another, and the revealed actions taken by each during the course of
the narrative. These various components, together with their relationships, form a
knowledge structure. The second ingredient is the communication of a key insight that
readjusts the knowledge structure, changing the relationship between the components.
The amount of insight seems intuitively related to how much readjustment of the
knowledge structure is needed to include the new insight, and the degree to which the
new information is unexpected.




As an example, would you be surprised if you learned that to the best of modern scientific
knowledge, the moon really is made of green cheese? Why? For a start, it is completely
unexpected. Can you honestly say that you have ever given the remotest credence to the
possibility that the moon might really be made of green cheese? If true, such a simple
communication carries an enormous amount of information. It would probably require you
to reconfigure a great deal of your knowledge of the world. After all, what sort of possible
rational explanation could be constructed to explain the existence of such a
phenomenon? In fact, it is so unlikely that it would almost certainly take much repetition of

the information in different contexts (more evidence) before you would accept this as
valid. (Speaking personally, it would take an enormous readjustment of my world view to
accept any rational explanation that includes several trillion tons of curdled milk products
hanging in the sky a quarter of a million miles distant!)




These two very fundamental points about information—how surprising the communication
is, and how much existing knowledge requires revision—both indicate something about
how much information is communicated. But these seem very subjective measures, and
indeed they are, which is partly why information is so difficult to define and to come to grips
with.




Claude E. Shannon did come to grips with the problem in 1948. In what has turned out to
be one of the seminal scientific papers of the twentieth century, “A Mathematical Theory
of Communication,” he grappled directly with the problem. This was published the next
year as a book and established a whole field of endeavor, now called “information theory.”
Shannon himself referred to it as “communication theory,” but its effects and applicability
have reached out into a vast number of areas, far beyond communications. In at least one
sense it is only about communications, because unless information is communicated, it
