Web Dragons
Inside the Myths of Search Engine Technology
Ian H. Witten
Marco Gori
Teresa Numerico
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann Publishers is an imprint of Elsevier
Publisher: Diane D. Cerra
Publishing Services Manager: George Morrison
Project Manager: Marilyn E. Rash
Assistant Editor: Asma Palmeiro
Cover Design: Yvo Riezebos Design
Text Design: Mark Bernard, Design on Time
Composition: CEPHA Imaging Pvt. Ltd.
Copyeditor: Carol Leyba
Proofreader: Daniel Stone
Indexer: Steve Rath
Interior Printer: Sheridan Books
Cover Printer: Phoenix Color Corp.
Morgan Kaufmann Publishers is an imprint of Elsevier.
500 Sansome Street, Suite 400, San Francisco, CA 94111
This book is printed on acid-free paper.
© 2007 by Elsevier Inc. All rights reserved.
Designations used by companies to distinguish their products are often claimed as trademarks or
registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim,
the product names appear in initial capital or all capital letters. Readers, however, should contact the
appropriate companies for more complete information regarding trademarks and registration.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means—electronic, mechanical, photocopying, scanning, or otherwise—without prior written
permission of the publisher.
Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford,
UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, E-mail: You may
also complete your request on-line via the Elsevier homepage (), by selecting “Support
& Contact” then “Copyright and Permission” and then “Obtaining Permissions.”
Library of Congress Cataloging-in-Publication Data
Witten, I. H. (Ian H.)
Web dragons: inside the myths of search engine technology / Ian H.
Witten, Marco Gori, Teresa Numerico.
p. cm. — (Morgan Kaufmann a series in multimedia and information systems)
Includes bibliographical references and index.
ISBN-13: 978-0-12-370609-6 (alk. paper)
ISBN-10: 0-12-370609-2 (alk. paper)
1. Search engines. 2. World Wide Web. 3. Electronic information resources literacy. I. Gori, Marco.
II. Numerico, Teresa. III. Title. IV. Title: Inside the myths of search engine technology.
TK5105.884.W55 2006
025.04 dc22 2006023512
For information on all Morgan Kaufmann Publishers visit our Web site
at www.books.elsevier.com
Printed in the United States of America
06 07 08 09 10   10 9 8 7 6 5 4 3 2 1

CONTENTS
List of Figures
List of Tables
Preface
About the Authors
1. SETTING THE SCENE
According to the Philosophers
Knowledge as Relations
Knowledge Communities
Knowledge as Language
Enter the Technologists
The Birth of Cybernetics
Information as Process
The Personal Library
The Human Use of Technology
The Information Revolution
Computers as Communication Tools
Time-Sharing and the Internet
Augmenting Human Intellect
The Emergence of Hypertext
And Now, the Web
The World Wide Web
A Universal Source of Answers?
What Users Know About Web Search
Searching and Serendipity
So What?
Notes and Sources
2. LITERATURE AND THE WEB
The Changing Face of Libraries
Beginnings
The Information Explosion
The Alexandrian Principle: Its Rise, Fall, and Re-Birth
The Beauty of Books
Metadata
The Library Catalog
The Dublin Core Metadata Standard
Digitizing Our Heritage
Project Gutenberg
Million Book Project
Internet Archive and the Bibliotheca Alexandrina
Amazon: A Bookstore
Google: A Search Engine
Open Content Alliance
New Models of Publishing
So What?
Notes and Sources
3. MEET THE WEB
Basic Concepts
HTTP: Hypertext Transfer Protocol
URI: Uniform Resource Identifier
Broken Links
HTML: Hypertext Markup Language
Crawling
Web Pages: Documents and Beyond
Static, Dynamic, and Active Pages
Avatars and Chatbots
Collaborative Environments
Enriching with Metatags
XML: Extensible Markup Language
Metrology and Scaling
Estimating the Web’s Size
Rate of Growth
Coverage, Freshness, and Coherence
Structure of the Web
Small Worlds
Scale-free Networks
Evolutionary Models
Bow Tie Architecture
Communities
Hierarchies
The Deep Web
So What?
Notes and Sources
4. HOW TO SEARCH
Searching Text
Full-text Indexes
Using the Index
What’s a Word?
Doing It Fast
Evaluating the Results
Searching in a Web
Determining What a Page Is About
Measuring Prestige
Hubs and Authorities
Bibliometrics
Learning to Rank
Distributing the Index
Developments in Web Search
Searching Blogs
Ajax Technology
The Semantic Web
Birth of the Dragons
The Womb Is Prepared
The Dragons Hatch
The Big Five
Inside the Dragon’s Lair
So What?
Notes and Sources
5. THE WEB WARS
Preserving the Ecosystem
Proxies
Crawlers
Parasites
Restricting Overuse
Resilience to Damage
Vulnerability to Attack
Viruses
Worms
Increasing Visibility: Tricks of the Trade
Term Boosting
Link Boosting
Content Hiding
Discussion
Business, Ethics, and Spam
The Ethics of Spam
Economic Issues
Search-Engine Advertising
Content-Targeted Advertising
The Bubble
Quality
The Anti-Spam War
The Weapons
The Dilemma of Secrecy
Tactics and Strategy
So What?
Notes and Sources
6. WHO CONTROLS INFORMATION?
The Violence of the Archive
Web Democracy
The Rich Get Richer
The Effect of Search Engines
Popularity Versus Authority
Privacy and Censorship
Privacy on the Web
Privacy and Web Dragons
Censorship on the Web
Copyright and the Public Domain
Copyright Law
The Public Domain
Relinquishing Copyright
Copyright on the Web
Web Searching and Archiving
The WIPO Treaty
The Business of Search
The Consequences of Commercialization
The Value of Diversity
Personalization and Profiling
So What?
Notes and Sources
7. THE DRAGONS EVOLVE
The Adventure of Search
Personalization in Practice
My Own Web
Analyzing Your Clickstream
Communities
Social Space or Objective Reality?
Searching within a Community Perspective
Defining Communities
Private Subnetworks
Peer-to-Peer Networks
A Reputation Society
The User as Librarian
The Act of Selection
Community Metadata
Digital Libraries
Your Computer and the Web
Personal File Spaces
From Filespace to the Web
Unification
The Global Office
So What?
Notes and Sources
REFERENCES
INDEX
LIST OF FIGURES
Figure 2.1: Rubbing from a stele in Xi’an.
Figure 2.2: A page of the original Trinity College Library catalog.
Figure 2.3: The Bibliothèque Nationale de France.
Figure 2.4: Part of a page from the Book of Kells.
Figure 2.5: Pages from a palm-leaf manuscript in Thanjavur, India.
Figure 2.6: Views of an electronic book.
Figure 3.1: Representation of a document.
Figure 3.2: Representation of a message in XML.
Figure 3.3: Distributions: (a) Gaussian and (b) power-law.
Figure 3.4: Chart of the web.
Figure 4.1: A concordance entry for the verb to search from the Greek New Testament.
Figure 4.2: Entries from an early computer-produced concordance of Matthew Arnold.
Figure 4.3: Making a full-text index.
Figure 4.4: A tangled web.
Figure 4.5: A comparison of search engines (early 2006).
Figure 5.1: The taxonomy of web spam.
Figure 5.2: Insights from link analysis.
Figure 5.3: A link farm.
Figure 5.4: A spam alliance in which two link farms jointly boost two target pages.
Figure 7.1: The Warburg Library.
LIST OF TABLES
Table 2.1: Spelling variants of the name Muammar Qaddafi
Table 2.2: Title pages of different editions of Hamlet
Table 2.3: The Dublin Core metadata standard
Table 3.1: Growth in websites
Table 4.1: Misspellings of Britney Spears typed into Google
PREFACE
In the eye-blink that has elapsed since the turn of the millennium, the lives of those
of us who work with information have been utterly transformed. Much—most—
perhaps even all—of what we need to know is on the World Wide Web; if not
today, then tomorrow. The web is where society keeps the sum total of human
knowledge. It’s where we learn and play, shop and do business, keep up with old
friends and meet new ones. And what has made all this possible is not just the fan-
tastic amount of information out there, it’s a fantastic new technology: search
engines. Efficient and effective ways of searching through immense tracts of text are
among the most striking technical advances of the last decade. And today search
engines do it for us. They weigh and measure every web page to determine whether
it matches our query. And they do it all for free. We call on them whenever we
want to find something that we need to know. To learn how they work, read on!
We refer to search engines as “web dragons” because they are the gatekeep-
ers of our society’s treasure trove of information. Dragons are all-powerful fig-
ures that stand guard over great hoards of treasure. The metaphor fits. Dragons
are mysterious: no one really knows what drives them. They’re mythical: the
subject of speculation, hype, legend, old wives’ tales, and fairy stories. In this
case, the immense treasure they guard is society’s repository of knowledge.
What could be more valuable than that? In oriental folklore, dragons not only
enjoy awesome grace and beauty, they are endowed with immense wisdom. But
in the West, they are often portrayed as evil—St. George vanquishes a fearsome
dragon, as does Beowulf—though sometimes they are friendly (Puff). In both
traditions, they are certainly magic, powerful, independent, and unpredictable.
The ambiguity suits our purpose well because, in addition to celebrating the joy
of being able to find stuff on the web, we want to make you feel uneasy about
how everyone has come to rely on search engines so utterly and completely.
The web is where we record our knowledge, and the dragons are how we
access it. This book examines their interplay from many points of view: the
philosophy of knowledge; the history of technology; the role of libraries, our
traditional knowledge repositories; how the web is organized; how it grows and
evolves; how search engines work; how people and companies try to take advan-
tage of them to promote their wares; how the dragons fight back; who controls
information on the web and how; and what we might see in the future.
We have laid out our story from beginning to end, starting with early
philosophers and finishing with visions of tomorrow. But you don’t have to
read this book that way: you can start in the middle. To find out how search
engines work, turn to Chapter 4. To learn about web spam, go to Chapter 5.
For social issues about web democracy and the control of information, head
straight for Chapter 6. To see how the web is organized and how its massively
linked structure grows, start at Chapter 3. To learn about libraries and how they
are finding their way onto the web, go to Chapter 2. For philosophical and
historical underpinnings, read Chapter 1. Most books you start at the
beginning and give up when you run out of time or have had enough; with this
one, we recommend that you consider starting in the middle and, if
you can, continuing right to the end. You don’t really need the early chapters
to understand the later parts, though they certainly provide context and add
depth. To help you chart a passage, here’s a brief account of what each chapter
has in store.
The information revolution is creating turmoil in our lives. For years it has
been opening up a wondrous panoply of exciting new opportunities and
simultaneously threatening to drown us in them, dragging us down, gasping,
into murky undercurrents of information overload. Feeling confused? We all
are. Chapter 1 sets the scene by placing things in a philosophical and historical
context. The web is central to our thinking, and the way it works resembles
the very way we think—by linking pieces of information together. Its growth
reflects the growth in the sum total of human knowledge. It’s not just a store-
house into which we drop nuggets of information or pearls of wisdom. It’s the
stuff out of which society’s knowledge is made, and how we use it determines
how humankind’s knowledge will grow. That’s why this is all so important.
How we access the web is central to the development of humanity.
The World Wide Web is becoming ever larger, qualitatively as well as quan-
titatively. It is slowly but surely beginning to subsume “the literature,” which
up to now has been locked away in libraries. Chapter 2 gives a bird’s-eye view
of the long history of libraries and then describes how today’s custodians are
busy putting their books on the web, and in their public-spirited way giving
as much free access to them as they can. Initiatives such as Project
Gutenberg, the Million Book Project (a joint United States, China, and India
effort), and the Open Content Alliance are striving to create open collections
of public domain material. Web bookstores such as Amazon present pages from
published works and let you sample them. Google is digitizing the collections of
major libraries and making them searchable worldwide. We are witnessing a
radical convergence of online and print information, and of commercial and
noncommercial information sources.
Chapter 3 paints a picture of the overall size, scale, construction, and
organization of the web, a big picture that transcends the details of all those
millions of websites and billions of web pages. How can you measure the size
of this beast? How fast is it growing? What about its connectivity: is it one net-
work, or does it drop into disconnected parts? What’s the likelihood of being
able to navigate through the links from one randomly chosen page to another?
You’ve probably heard that complete strangers are joined by astonishingly
short chains of acquaintanceship: one person knows someone who knows
someone who…through about six degrees of separation…knows the other.
How far apart are web pages? Does this affect the web’s robustness to random
failure—and to deliberate attack? And what about the deep web—those pages
that are generated dynamically in response to database queries—and sites that
require registration or otherwise limit access to their contents?
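
The question of how far apart two pages are can be made concrete with a small sketch. The following Python example is only an illustration under stated assumptions: the link graph `toy_web` and the function name `link_distance` are hypothetical and do not come from the book. It treats pages as nodes and hyperlinks as one-way edges, and uses breadth-first search to count the fewest clicks from one page to another.

```python
from collections import deque


def link_distance(links, start, goal):
    """Return the fewest clicks needed to reach goal from start, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, clicks = queue.popleft()
        if page == goal:
            return clicks
        # Follow every outgoing link that leads to a page not yet visited.
        for nxt in links.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, clicks + 1))
    return None  # goal cannot be reached by following links


# Hypothetical six-page web; each key links to the pages in its list.
toy_web = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
    "island": ["a"],
}

print(link_distance(toy_web, "a", "e"))  # 3
print(link_distance(toy_web, "e", "a"))  # None
```

Because links are one-way, page "e" can be reached from "a" but not the other way around, which is one reason a web can be a single network and still have parts that cannot be navigated to.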
Having surveyed the information landscape, Chapter 4 tackles the key
ideas behind full-text searching and web search engines, the Internet’s new
“killer app.” Despite the fact that search engines are intricate pieces of soft-
ware, the underlying ideas are simple, and we describe them in plain English.
Full-text search is an embodiment of the classical concordance, with the
advantage that, being computerized, it works for all documents, no matter
how banal—not just sacred texts and outstanding works of literature.
Multiword queries are answered by combining concordance entries and
ranking the results, weighing rare words more heavily than commonplace ones.
Web search services augment full-text search with the notion of the prestige of
a source, which they estimate by counting the web pages that cite the source,
and their prestige—in effect weighting popular works highly. This book
focuses exclusively on techniques for searching text, for even when we seek
pictures and movies, today’s search engines usually find them for us by analyzing
associated textual descriptions.
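
As a rough sketch of the full-text idea described above, the following Python example builds a tiny concordance-style index and ranks documents for a multiword query. Everything here is an assumption made for illustration: the four sample documents, the function names `build_index` and `search`, and the particular rarity weight (a logarithm of how few documents contain a word) are hypothetical choices, not the method of any specific search engine.

```python
import math
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, n_docs, query):
    """Rank documents by the summed rarity weight of the query words they contain."""
    scores = defaultdict(float)
    for word in query.lower().split():
        postings = index.get(word, set())
        if not postings:
            continue
        # Assumed weighting: words found in fewer documents count for more.
        weight = math.log(1 + n_docs / len(postings))
        for doc_id in postings:
            scores[doc_id] += weight
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical miniature collection.
docs = {
    1: "the dragon guards the treasure",
    2: "the web is a treasure trove",
    3: "the dragon sleeps",
    4: "the the the",
}
index = build_index(docs)
print(search(index, len(docs), "the dragon treasure"))  # [1, 2, 3, 4]
```

Document 4 matches the query only through the commonplace word "the", so it ranks last, while document 1 contains both of the rarer words and ranks first.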
Chapter 5 turns to the dark side. Once the precise recipe for attribution of
prestige is known, it can be circumvented, or “spammed,” by commercial interests
intent on artificially raising their profile. On the web, visibility is money. It’s
excellent publicity—better than advertising—and it’s free. We describe some
of the techniques of spamming, techniques that are no secret to the spammers,
but will come as a surprise to web users. Like e-mail spam, this is a scourge
that will pollute our lives. Search engine operators strive to root it out and
neutralize it in an escalating war against misuse of the web. And that’s not all.
Unscrupulous firms attack the advertising budget of rival companies by mind-
lessly clicking on their advertisements, for every referral costs money. Some see
click fraud as the dominant threat to the search engine business.
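
To see why a known prestige recipe invites manipulation, consider the following Python sketch. It is an illustration under stated assumptions, not the scoring used by any real search engine: the function `prestige` is a simplified PageRank-style iteration chosen here for concreteness, and the page names, damping value, and number of rounds are hypothetical. Each page shares its prestige among the pages it links to, so a cluster of otherwise worthless pages that all point at one target can raise that target's score.

```python
def prestige(links, damping=0.85, rounds=50):
    """Simplified PageRank-style prestige scores for a link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(rounds):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes an equal share of its prestige to each page it cites.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its prestige evenly.
                for t in pages:
                    new[t] += damping * rank[page] / n
        rank = new
    return rank


# Hypothetical site: nobody links to "target", so its prestige is minimal.
honest = {
    "home": ["news", "shop"],
    "news": ["home"],
    "shop": ["home"],
    "target": ["home"],
}

# The same site plus five made-up pages whose only purpose is to cite "target".
farmed = dict(honest)
for i in range(5):
    farmed[f"farm{i}"] = ["target"]

print(round(prestige(honest)["target"], 4))  # about 0.0375
print(round(prestige(farmed)["target"], 4))  # about 0.0875
```

In this toy setting the target's score more than doubles even though nothing about its content has changed.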
There’s another problem: access to information is controlled by a few com-
mercial enterprises that operate in secret. This raises ethical concerns that have
been concealed by the benign philosophy of today’s dominant players and the
exceptionally high utility of their product. Chapter 6 discusses the question of
democracy (or lack of it) in cyberspace. We also review the age-old system of
copyright—society’s way of controlling the flow of information to protect the
rights of authors. The fact that today’s web concentrates enormous power over
people’s information-seeking activities into a handful of major players has led
some to propose that the search business should be nationalized—or perhaps
“internationalized”—into public information utilities. But we disagree, for
two reasons. First, the apolitical nature of the web—it is often described as
anarchic—is one of its most alluring features. Second, today’s exceptionally
effective large-scale search engines could only have been forged through intense
commercial competition—particularly in a mere decade of development.
We believe that we stand on the threshold of a new era, and Chapter 7 pro-
vides a glimpse of what’s in store. Today’s search engines are just the first, most
obvious, step. While centralized indexes will continue to thrive, they will be
augmented—and for many purposes usurped—by local control and cus-
tomization. Search engine companies are already experimenting with person-
alization features, on the assumption that users will be prepared to sacrifice
some privacy and identify themselves if they thereby receive better service.
Localized rather than centralized control will make this more palatable and less
susceptible to corruption. Information gleaned from end users—searchers and
readers—will play a more prominent role in directing searches. The web drag-
ons are diversifying from search alone toward providing general information
processing services, which could generate a radically new computer ecosystem
based on central hosting services rather than personal workstations. Future
dragons will offer remote application software and file systems that will
augment or even replace your desktop computer. Does this presage a new
generation of operating systems?
We want you to get involved with this book. These are big issues. The nat-
ural reaction is to concede that they may be important in theory but to ques-
tion what difference they really make in practice—and anyway, what can you
do about them? To counter any feeling of helplessness, we’ve put a few activities
at the end of each chapter in gray boxes: things you can do to improve life for
yourself—perhaps for others too. If you like, peek ahead before reading each
chapter to get a feeling for what practical actions it might suggest.
ACKNOWLEDGMENTS
The seeds for this project were sown during a brief visit by Ian Witten to Italy,
sponsored by the Italian Artificial Intelligence Society, and the book was
conceived and begun during a more extended visit generously supported by the
University of Siena. We would all like to thank our home institutions for their
support for our work over the years: the University of Waikato in New Zealand,
and the Universities of Siena and Salerno in Italy. Most of Ian’s work on the
book was done during a sabbatical period while visiting the École Nationale
Supérieure des Télécommunications in Paris, Google in New York (he had to
promise not to learn anything there), and the University of Cape Town in
South Africa (where the book benefited from numerous discussions with Gary
Marsden); the generous support of these institutions is gratefully acknowl-
edged. Marco benefited from insightful discussions during a brief visit to the
Université de Montréal, and from collaboration with the Automated
Reasoning System division of IRST, Trento, Italy. Teresa would like to thank
the Leverhulme Foundation for its generous support and the Logic group at
the University of Rome, and in particular Jonathan Bowen, Roberto
Cordeschi, Marcello Frixione, and Sandro Nannini for their interesting, wise,
and stimulating comments.
In developing these ideas, we have all been strongly influenced by our stu-
dents and colleagues; they are far too numerous to mention individually but
gratefully acknowledged all the same. We particularly want to thank members
of our departments and research groups: the Computer Science Department
at Waikato, the Artificial Intelligence Research Group at Siena, and the
Department of Communication Sciences at Salerno. Parts of Chapter 2 are
adapted from How to Build a Digital Library by Witten and Bainbridge; parts
of Chapter 4 come from Managing Gigabytes by Witten, Moffat, and Bell.
We must thank the web dragons themselves, not just for providing such an
interesting topic for us to write about, but for all their help in ferreting out
facts and other information while writing this book. We may be critical, but
we are also grateful! In addition, we would like to thank all the authors in
the Wikipedia community for their fabulous contributions to the spread of
knowledge, from which we have benefited enormously.
The delightful cover illustration and chapter openers were drawn for us by
Lorenzo Menconi. He did it for fun, with no thought of compensation, the
only reward being to see his work in print. We thank him very deeply and
sincerely hope that this will boost his sideline in imaginative illustration.
We are extremely grateful to the reviewers of this book, who have helped us
focus our thoughts and correct and enrich the text: Rob Akscyn, Ed Fox,
Jonathan Grudin, Antonio Gulli, Gary Marchionini, Edie Rasmussen, and
Sarah Shieff.
We received sterling support from Diane Cerra and Asma Palmeiro at
Morgan Kaufmann while writing this book. Diane’s enthusiasm infected us
from the very beginning, when she managed to process our book proposal and
give us the go-ahead in record time. Marilyn Rash, our project manager, has
made the production process go very smoothly for us.
Finally, without the support of our families, none of our work would have
been possible. Thank you Agnese, Anna, Cecilia, Fabrizio, Irene, Nikki, and
Pam; this is your book too!
ABOUT THE AUTHORS
Ian H. Witten is professor of computer science at the University of Waikato
in New Zealand. He directs the New Zealand Digital Library research project.
His research interests include information retrieval, machine learning, text
compression, and programming by demonstration. He received an MA in
mathematics from Cambridge University in England, an MSc in computer science
from the University of Calgary in Canada, and a PhD in electrical engineering
from Essex University in England. Witten is a fellow of the ACM and
of the Royal Society of New Zealand. He has published widely on digital
libraries, machine learning, text compression, hypertext, speech synthesis and
signal processing, and computer typography. He has written several books, the
latest being How to Build a Digital Library (2002) and Data Mining, Second
Edition (2005), both published by Morgan Kaufmann.
Marco Gori is professor of computer science at the University of Siena, where
he is the leader of the artificial intelligence research group. His research inter-
ests are machine learning with applications to pattern recognition, web
mining, and game playing. He received a Laurea from the University of
Florence and a PhD from the University of Bologna. He is the chairman of the
Italian Chapter of the IEEE Computational Intelligence Society, a fellow of
the IEEE and of the ECCAI, and a former president of the Italian Association
for Artificial Intelligence.
Teresa Numerico teaches network theory and communication studies at the
University of Rome. She is also a researcher in the philosophy of science at
the University of Salerno (Italy). She earned her PhD in the history of science
and was a visiting researcher at London South Bank University in the
United Kingdom in 2004, having been awarded a Leverhulme Trust Research
Fellowship. She was formerly employed as a business development and market-
ing manager for several media companies, including the Italian branch of
Turner Broadcasting System (CNN and Cartoon Network).