
Search Engines: Information Retrieval in Practice
W. Bruce Croft, University of Massachusetts, Amherst
Donald Metzler, Yahoo! Research
Trevor Strohman, Google Inc.

Addison Wesley
Boston · Columbus · Indianapolis · New York · San Francisco · Upper Saddle River
Amsterdam · Cape Town · Dubai · London · Madrid · Milan · Munich · Paris · Montreal · Toronto
Delhi · Mexico City · São Paulo · Sydney · Hong Kong · Seoul · Singapore · Taipei · Tokyo
Editor-in-Chief: Michael Hirsch
Acquisitions Editor: Matt Goldstein
Editorial Assistant: Sarah Milmore
Managing Editor: Jeff Holcomb
Online Product Manager: Bethany Tidd
Director of Marketing: Margaret Waples
Marketing Manager: Erin Davis
Marketing Coordinator: Kathryn Ferranti
Senior Manufacturing Buyer: Carol Melville
Text Design, Composition, and Illustrations: W. Bruce Croft, Donald Metzler, and Trevor Strohman
Art Direction: Linda Knowles
Cover Design: Elena Sidorova
Cover Image: © Peter Gudella / Shutterstock
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial caps or all caps.

The programs and applications presented in this book have been included for their instructional value. They have been tested with care, but are not guaranteed for any particular purpose. The publisher does not offer any warranties or representations, nor does it accept any liabilities with respect to the programs or applications.
Library of Congress Cataloging-in-Publication Data available upon request
Copyright © 2010 Pearson Education, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America. For information on obtaining permission for use of material in this work, please submit a written request to Pearson Education, Inc., Rights and Contracts Department, 501 Boylston Street, Suite 900, Boston, MA 02116, fax (617) 671-3447, or online.

ISBN-13: 978-0-13-607224-9
ISBN-10: 0-13-607224-0

1 2 3 4 5 6 7 8 9 10-HP-13 12 11 10 09
Preface
This book provides an overview of the important issues in information retrieval, and how those issues affect the design and implementation of search engines. Not every topic is covered at the same level of detail. We focus instead on what we consider to be the most important alternatives for implementing search engine components and the information retrieval models underlying them. Web search engines are obviously a major topic, and we base our coverage primarily on the technology we all use on the Web,¹ but search engines are also used in many other applications. That is the reason for the strong emphasis on the information retrieval theories and concepts that underlie all search engines.

The target audience for the book is primarily undergraduates in computer science or computer engineering, but graduate students should also find this useful. We also consider the book to be suitable for most students in information science programs. Finally, practicing search engineers should benefit from the book, whatever their background. There is mathematics in the book, but nothing too esoteric. There are also code and programming exercises in the book, but nothing beyond the capabilities of someone who has taken some basic computer science and programming classes.
The exercises at the end of each chapter make extensive use of a Java™-based open source search engine called Galago. Galago was designed both for this book and to incorporate lessons learned from experience with the Lemur and Indri projects. In other words, this is a fully functional search engine that can be used to support real applications. Many of the programming exercises require the use, modification, and extension of Galago components.
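
As a small taste of what those exercises involve, here is a structured query that appears later in the book (Figure 5.23), written in the Galago query language. Roughly speaking, #od:1 matches its arguments as an exact ordered phrase and #combine merges the scores of its arguments; Chapter 7 explains the precise semantics.

    #combine( #od:1(tropical fish) #od:1(aquarium fish) fish )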
¹ In keeping with common usage, most uses of the word "web" in this book are not capitalized, except when we refer to the World Wide Web as a separate entity.
Contents
In the first chapter, we provide a high-level review of the field of information retrieval and its relationship to search engines. In the second chapter, we describe the architecture of a search engine. This is done to introduce the entire range of search engine components without getting stuck in the details of any particular aspect. In Chapter 3, we focus on crawling, document feeds, and other techniques for acquiring the information that will be searched. Chapter 4 describes the statistical nature of text and the techniques that are used to process it, recognize important features, and prepare it for indexing. Chapter 5 describes how to create indexes for efficient search and how those indexes are used to process queries. In Chapter 6, we describe the techniques that are used to process queries and transform them into better representations of the user's information need.
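
To give a concrete feel for the central data structure behind Chapters 5 and 6, the sketch below shows a minimal in-memory inverted index. This is our own illustrative Java code, not Galago's; a real index also stores counts and positions, compresses its posting lists, and lives on disk, as Chapter 5 describes.

    import java.util.*;

    // A minimal in-memory inverted index: each term maps to the list of
    // IDs of the documents that contain it (its "posting list").
    public class TinyInvertedIndex {
        private final Map<String, List<Integer>> postings = new HashMap<>();

        // Index a document: record docId in the posting list of each unique term.
        public void add(int docId, String text) {
            for (String term : new TreeSet<>(Arrays.asList(text.toLowerCase().split("\\W+")))) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
                }
            }
        }

        // A single-term query is answered by a simple posting-list lookup.
        public List<Integer> lookup(String term) {
            return postings.getOrDefault(term.toLowerCase(), Collections.emptyList());
        }

        public static void main(String[] args) {
            TinyInvertedIndex index = new TinyInvertedIndex();
            index.add(1, "Tropical fish include fish found in tropical environments");
            index.add(2, "Fish tanks for tropical fish");
            System.out.println(index.lookup("tropical")); // prints [1, 2]
        }
    }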
Ranking algorithms and the retrieval models they are based on are covered in Chapter 7. This chapter also includes an overview of machine learning techniques and how they relate to information retrieval and search engines. Chapter 8 describes the evaluation and performance metrics that are used to compare and tune search engines. Chapter 9 covers the important classes of techniques used for classification, filtering, clustering, and dealing with spam. Social search is a term used to describe search applications that involve communities of people in tagging content or answering questions. Search techniques for these applications and peer-to-peer search are described in Chapter 10. Finally, in Chapter 11, we give an overview of advanced techniques that capture more of the content of documents than simple word-based approaches. This includes techniques that use linguistic features, the document structure, and the content of nontextual media, such as images or music.

Information retrieval theory and the design, implementation, evaluation, and use of search engines cover too many topics to describe them all in depth in one book. We have tried to focus on the most important topics while giving some coverage to all aspects of this challenging and rewarding subject.
Supplements
A range of supplementary material is provided for the book. This material is designed both for those taking a course based on the book and for those giving the course. Specifically, this includes:

- Extensive lecture slides (in PDF and PPT format)
- Solutions to selected end-of-chapter problems (instructors only)
- Test collections for exercises
- Galago search engine

The supplements are available at www.search-engines-book.com, or at www.aw.com.
Acknowledgments
First and foremost, this book would not have happened without the tremendous support and encouragement from our wives, Pam Aselton, Anne-Marie Strohman, and Shelley Wang. The University of Massachusetts Amherst provided material support for the preparation of the book and awarded a Conti Faculty Fellowship to Croft, which sped up our progress significantly. The staff at the Center for Intelligent Information Retrieval (Jean Joyce, Kate Moruzzi, Glenn Stowell, and Andre Gauthier) made our lives easier in many ways, and our colleagues and students in the Center provided the stimulating environment that makes working in this area so rewarding. A number of people reviewed parts of the book and we appreciated their comments. Finally, we have to mention our children, Doug, Eric, Evan, and Natalie, or they would never forgive us.
BRUCE CROFT
DON METZLER
TREVOR STROHMAN
Contents
1 Search Engines and Information Retrieval
   1.1 What Is Information Retrieval?
   1.2 The Big Issues
   1.3 Search Engines
   1.4 Search Engineers

2 Architecture of a Search Engine
   2.1 What Is an Architecture?
   2.2 Basic Building Blocks
   2.3 Breaking It Down
      2.3.1 Text Acquisition
      2.3.2 Text Transformation
      2.3.3 Index Creation
      2.3.4 User Interaction
      2.3.5 Ranking
      2.3.6 Evaluation
   2.4 How Does It Really Work?

3 Crawls and Feeds
   3.1 Deciding What to Search
   3.2 Crawling the Web
      3.2.1 Retrieving Web Pages
      3.2.2 The Web Crawler
      3.2.3 Freshness
      3.2.4 Focused Crawling
      3.2.5 Deep Web
      3.2.6 Sitemaps
      3.2.7 Distributed Crawling
   3.3 Crawling Documents and Email
   3.4 Document Feeds
   3.5 The Conversion Problem
      3.5.1 Character Encodings
   3.6 Storing the Documents
      3.6.1 Using a Database System
      3.6.2 Random Access
      3.6.3 Compression and Large Files
      3.6.4 Update
      3.6.5 BigTable
   3.7 Detecting Duplicates
   3.8 Removing Noise

4 Processing Text
   4.1 From Words to Terms
   4.2 Text Statistics
      4.2.1 Vocabulary Growth
      4.2.2 Estimating Collection and Result Set Sizes
   4.3 Document Parsing
      4.3.1 Overview
      4.3.2 Tokenizing
      4.3.3 Stopping
      4.3.4 Stemming
      4.3.5 Phrases and N-grams
   4.4 Document Structure and Markup
   4.5 Link Analysis
      4.5.1 Anchor Text
      4.5.2 PageRank
      4.5.3 Link Quality
   4.6 Information Extraction
      4.6.1 Hidden Markov Models for Extraction
   4.7 Internationalization

5 Ranking with Indexes
   5.1 Overview
   5.2 Abstract Model of Ranking
   5.3 Inverted Indexes
      5.3.1 Documents
      5.3.2 Counts
      5.3.3 Positions
      5.3.4 Fields and Extents
      5.3.5 Scores
      5.3.6 Ordering
   5.4 Compression
      5.4.1 Entropy and Ambiguity
      5.4.2 Delta Encoding
      5.4.3 Bit-Aligned Codes
      5.4.4 Byte-Aligned Codes
      5.4.5 Compression in Practice
      5.4.6 Looking Ahead
      5.4.7 Skipping and Skip Pointers
   5.5 Auxiliary Structures
   5.6 Index Construction
      5.6.1 Simple Construction
      5.6.2 Merging
      5.6.3 Parallelism and Distribution
      5.6.4 Update
   5.7 Query Processing
      5.7.1 Document-at-a-time Evaluation
      5.7.2 Term-at-a-time Evaluation
      5.7.3 Optimization Techniques
      5.7.4 Structured Queries
      5.7.5 Distributed Evaluation
      5.7.6 Caching

6 Queries and Interfaces
   6.1 Information Needs and Queries
   6.2 Query Transformation and Refinement
      6.2.1 Stopping and Stemming Revisited
      6.2.2 Spell Checking and Suggestions
      6.2.3 Query Expansion
      6.2.4 Relevance Feedback
      6.2.5 Context and Personalization
   6.3 Showing the Results
      6.3.1 Result Pages and Snippets
      6.3.2 Advertising and Search
      6.3.3 Clustering the Results
   6.4 Cross-Language Search

7 Retrieval Models
   7.1 Overview of Retrieval Models
      7.1.1 Boolean Retrieval
      7.1.2 The Vector Space Model
   7.2 Probabilistic Models
      7.2.1 Information Retrieval as Classification
      7.2.2 The BM25 Ranking Algorithm
   7.3 Ranking Based on Language Models
      7.3.1 Query Likelihood Ranking
      7.3.2 Relevance Models and Pseudo-Relevance Feedback
   7.4 Complex Queries and Combining Evidence
      7.4.1 The Inference Network Model
      7.4.2 The Galago Query Language
   7.5 Web Search
   7.6 Machine Learning and Information Retrieval
      7.6.1 Learning to Rank
      7.6.2 Topic Models and Vocabulary Mismatch
   7.7 Application-Based Models

8 Evaluating Search Engines
   8.1 Why Evaluate?
   8.2 The Evaluation Corpus
   8.3 Logging
   8.4 Effectiveness Metrics
      8.4.1 Recall and Precision
      8.4.2 Averaging and Interpolation
      8.4.3 Focusing on the Top Documents
      8.4.4 Using Preferences
   8.5 Efficiency Metrics
   8.6 Training, Testing, and Statistics
      8.6.1 Significance Tests
      8.6.2 Setting Parameter Values
      8.6.3 Online Testing
   8.7 The Bottom Line

9 Classification and Clustering
   9.1 Classification and Categorization
      9.1.1 Naive Bayes
      9.1.2 Support Vector Machines
      9.1.3 Evaluation
      9.1.4 Classifier and Feature Selection
      9.1.5 Spam, Sentiment, and Online Advertising
   9.2 Clustering
      9.2.1 Hierarchical and K-Means Clustering
      9.2.2 K Nearest Neighbor Clustering
      9.2.3 Evaluation
      9.2.4 How to Choose K
      9.2.5 Clustering and Search

10 Social Search
   10.1 What Is Social Search?
   10.2 User Tags and Manual Indexing
      10.2.1 Searching Tags
      10.2.2 Inferring Missing Tags
      10.2.3 Browsing and Tag Clouds
   10.3 Searching with Communities
      10.3.1 What Is a Community?
      10.3.2 Finding Communities
      10.3.3 Community-Based Question Answering
      10.3.4 Collaborative Searching
   10.4 Filtering and Recommending
      10.4.1 Document Filtering
      10.4.2 Collaborative Filtering
   10.5 Peer-to-Peer and Metasearch
      10.5.1 Distributed Search
      10.5.2 P2P Networks

11 Beyond Bag of Words
   11.1 Overview
   11.2 Feature-Based Retrieval Models
   11.3 Term Dependence Models
   11.4 Structure Revisited
      11.4.1 XML Retrieval
      11.4.2 Entity Search
   11.5 Longer Questions, Better Answers
   11.6 Words, Pictures, and Music
   11.7 One Search Fits All?

References
Index
List of Figures

1.1 Search engine design and the core information retrieval issues
2.1 The indexing process
2.2 The query process
3.1 A uniform resource locator (URL), split into three parts
3.2 Crawling the Web. The web crawler connects to web servers to find pages. Pages may link to other pages on the same server or on different servers.
3.3 An example robots.txt file
3.4 A simple crawling thread implementation
3.5 An HTTP HEAD request and server response
3.6 Age and freshness of a single page over time
3.7 Expected age of a page with mean change frequency λ = 1/7 (one week)
3.8 An example sitemap file
3.9 An example RSS 2.0 feed
3.10 An example of text in the TREC Web compound document format
3.11 An example link with anchor text
3.12 BigTable stores data in a single logical table, which is split into many smaller tablets.
3.13 A BigTable row
3.14 Example of fingerprinting process
3.15 Example of simhash fingerprinting process
3.16 Main content block in a web page
3.17 Tag counts used to identify text blocks in a web page
3.18 Part of the DOM structure for the example web page
4.1 Rank versus probability of occurrence for words assuming Zipf's law (rank × probability = 0.1)
4.2 A log-log plot of Zipf's law compared to real data from AP89. The predicted relationship between probability of occurrence and rank breaks down badly at high ranks.
4.3 Vocabulary growth for the TREC AP89 collection compared to Heaps' law
4.4 Vocabulary growth for the TREC GOV2 collection compared to Heaps' law
4.5 Result size estimate for web search
4.6 Comparison of stemmer output for a TREC query. Stopwords have also been removed.
4.7 Output of a POS tagger for a TREC query
4.8 Part of a web page from Wikipedia
4.9 HTML source for example Wikipedia page
4.10 A sample "Internet" consisting of just three web pages. The arrows denote links between the pages.
4.11 Pseudocode for the iterative PageRank algorithm
4.12 Trackback links in blog postings
4.13 Text tagged by information extraction
4.14 Sentence model for statistical entity extractor
4.15 Chinese segmentation and bigrams
5.1 The components of the abstract model of ranking: documents, features, queries, the retrieval function, and document scores
5.2 A more concrete model of ranking. Notice how both the query and the document have feature functions in this model.
5.3 An inverted index for the documents (sentences) in Table 5.1
5.4 An inverted index, with word counts, for the documents in Table 5.1
5.5 An inverted index, with word positions, for the documents in Table 5.1
5.6 Aligning posting lists for "tropical" and "fish" to find the phrase "tropical fish"
5.7 Aligning posting lists for "fish" and title to find matches of the word "fish" in the title field of a document
5.8 Pseudocode for a simple indexer
5.9 An example of index merging. The first and second indexes are merged together to produce the combined index.
5.10 MapReduce
5.11 Mapper for a credit card summing algorithm
5.12 Reducer for a credit card summing algorithm
5.13 Mapper for documents
5.14 Reducer for word postings
5.15 Document-at-a-time query evaluation. The numbers (x:y) represent a document number (x) and a word count (y).
5.16 A simple document-at-a-time retrieval algorithm
5.17 Term-at-a-time query evaluation
5.18 A simple term-at-a-time retrieval algorithm
5.19 Skip pointers in an inverted list. The gray boxes show skip pointers, which point into the white boxes, which are inverted list postings.
5.20 A term-at-a-time retrieval algorithm with conjunctive processing
5.21 A document-at-a-time retrieval algorithm with conjunctive processing
5.22 MaxScore retrieval with the query "eucalyptus tree". The gray boxes indicate postings that can be safely ignored during scoring.
5.23 Evaluation tree for the structured query #combine(#od:1(tropical fish) #od:1(aquarium fish) fish)
6.1 Top ten results for the query "tropical fish"
6.2 Geographic representation of Cape Cod using bounding rectangles
6.3 Typical document summary for a web search
6.4 An example of a text span of words (w) bracketed by significant words (s) using Luhn's algorithm
6.5 Advertisements displayed by a search engine for the query "fish tanks"
6.6 Clusters formed by a search engine from top-ranked documents for the query "tropical fish". Numbers in brackets are the number of documents in the cluster.
6.7 Categories returned for the query "tropical fish" in a popular online retailer
6.8 Subcategories and facets for the "Home & Garden" category
6.9 Cross-language search
6.10 A French web page in the results list for the query "pecheur france"
7.1 Term-document matrix for a collection of four documents
7.2 Vector representation of documents and queries
7.3 Classifying a document as relevant or non-relevant
7.4 Example inference network model
7.5 Inference network with three nodes
7.6 Galago query for the dependence model
7.7 Galago query for web data
8.1 Example of a TREC topic
8.2 Recall and precision values for two rankings of six relevant documents
8.3 Recall and precision values for rankings from two different queries
8.4 Recall-precision graphs for two queries
8.5 Interpolated recall-precision graphs for two queries
8.6 Average recall-precision graph using standard recall levels
8.7 Typical recall-precision graph for 50 queries from TREC
8.8 Probability distribution for test statistic values assuming the null hypothesis. The shaded area is the region of rejection for a one-sided test.
8.9 Example distribution of query effectiveness improvements
9.1 Illustration of how documents are represented in the multiple-Bernoulli event space. In this example, there are 10 documents (each with a unique id), two classes (spam and not spam), and a vocabulary that consists of the terms "cheap", "buy", "banking", "dinner", and "the".
9.2 Illustration of how documents are represented in the multinomial event space. In this example, there are 10 documents (each with a unique id), two classes (spam and not spam), and a vocabulary that consists of the terms "cheap", "buy", "banking", "dinner", and "the".
9.3 Data set that consists of two classes (pluses and minuses). The data set on the left is linearly separable, whereas the one on the right is not.
9.4 Graphical illustration of Support Vector Machines for the linearly separable case. Here, the hyperplane defined by w is shown, as well as the margin, the decision regions, and the support vectors, which are indicated by circles.
9.5 Generative process used by the Naive Bayes model. First, a class is chosen according to P(c), and then a document is chosen according to P(d|c).
9.6 Example data set where non-parametric learning algorithms, such as a nearest neighbor classifier, may outperform parametric algorithms. The pluses and minuses indicate positive and negative training examples, respectively. The solid gray line shows the actual decision boundary, which is highly non-linear.
9.7 Example output of SpamAssassin email spam filter
9.8 Example of web page spam, showing the main page and some of the associated term and link spam
9.9 Example product review incorporating sentiment
9.10 Example semantic class match between a web page about rainbow fish (a type of tropical fish) and an advertisement for tropical fish food. The nodes "Aquariums", "Fish", and "Supplies" are example nodes within a semantic hierarchy. The web page is classified as "Aquariums - Fish" and the ad is classified as "Supplies - Fish". Here, "Aquariums" is the least common ancestor. Although the web page and ad do not share any terms in common, they can be matched because of their semantic similarity.
9.11 Example of divisive clustering with K = 4. The clustering proceeds from left to right and top to bottom, resulting in four clusters.
9.12 Example of agglomerative clustering with K = 4. The clustering proceeds from left to right and top to bottom, resulting in four clusters.
9.13 Dendrogram that illustrates the agglomerative clustering of the points from Figure 9.12
9.14 Examples of clusters in a graph formed by connecting nodes representing instances. A link represents a distance between the two instances that is less than some threshold value.
9.15 Illustration of how various clustering cost functions are computed
9.16 Example of overlapping clustering using nearest neighbor clustering with K = 5. The overlapping clusters for the black points (A, B, C, and D) are shown. The five nearest neighbors for each black point are shaded gray and labeled accordingly.
9.17 Example of overlapping clustering using Parzen windows. The clusters for the black points (A, B, C, and D) are shown. The shaded circles indicate the windows used to determine cluster membership. The neighbors for each black point are shaded gray and labeled accordingly.
9.18 Cluster hypothesis tests on two TREC collections. The top two compare the distributions of similarity values between relevant-relevant and relevant-nonrelevant pairs (light gray) of documents. The bottom two show the local precision of the relevant documents.
10.1 Search results used to enrich a tag representation. In this example, the tag being expanded is "tropical fish". The query "tropical fish" is run against a search engine, and the snippets returned are then used to generate a distribution over related terms.
10.2 Example of a tag cloud in the form of a weighted list. The tags are in alphabetical order and weighted according to some criteria, such as popularity.
10.3 Illustration of the HITS algorithm. Each row corresponds to a single iteration of the algorithm and each column corresponds to a specific step of the algorithm.
10.4 Example of how nodes within a directed graph can be represented as vectors. For a given node p, its vector representation has component q set to 1 if p links to q.
10.5 Overview of the two common collaborative search scenarios. On the left is co-located collaborative search, which involves multiple participants in the same location at the same time. On the right is remote collaborative search, where participants are in different locations and not necessarily all online and searching at the same time.
10.6 Example of a static filtering system. Documents arrive over time and are compared against each profile. Arrows from documents to profiles indicate the document matches the profile and is retrieved.
10.7 Example of an adaptive filtering system. Documents arrive over time and are compared against each profile. Arrows from documents to profiles indicate the document matches the profile and is retrieved. Unlike static filtering, where profiles are static over time, profiles are updated dynamically (e.g., when a new match occurs).
10.8 A set of users within a recommender system. Users and their ratings for some item are given. Users with question marks above their heads have not yet rated the item. It is the goal of the recommender system to fill in these question marks.
10.9 Illustration of collaborative filtering using clustering. Groups of similar users are outlined with dashed lines. Users and their ratings for some item are given. In each group, there is a single user who has not judged the item. For these users, the unjudged item is assigned an automatic rating based on the ratings of similar users.
10.10 Metasearch engine architecture. The query is broadcast to multiple web search engines and result lists are merged.
10.11 Network architectures for distributed search: (a) central hub; (b) pure P2P; and (c) hierarchical P2P. Dark circles are hub or superpeer nodes, gray circles are provider nodes, and white circles are consumer nodes.
10.12 Neighborhoods (Nj) of a hub node (H) in a hierarchical P2P network
11.1 Example Markov Random Field model assumptions, including full independence (top left), sequential dependence (top right), full dependence (bottom left), and general dependence (bottom right)
11.2 Graphical model representations of the relevance model technique (top) and latent concept expansion (bottom) used for pseudo-relevance feedback with the query "hubble telescope achievements"
11.3 Functions provided by a search engine interacting with a simple database system
11.4 Example of an entity search for organizations using the TREC Wall Street Journal 1987 Collection
11.5 Question answering system architecture
11.6 Examples of OCR errors
11.7 Examples of speech recognizer errors
11.8 Two images (a fish and a flower bed) with color histograms. The horizontal axis is hue value.
11.9 Three examples of content-based image retrieval. The collection for the first two consists of 1,560 images of cars, faces, apes, and other miscellaneous subjects. The last example is from a collection of 2,048 trademark images. In each case, the leftmost image is the query.
11.10 Key frames extracted from a TREC video clip
11.11 Examples of automatic text annotation of images
11.12 Three representations of Bach's "Fugue #10": audio, MIDI, and conventional music notation
List of Tables

1.1 Some dimensions of information retrieval
3.1 UTF-8 encoding
4.1 Statistics for the AP89 collection
4.2 Most frequent 50 words from AP89
4.3 Low-frequency words from AP89
4.4 Example word frequency ranking
4.5 Proportions of words occurring n times in 336,310 documents from the TREC Volume 3 corpus. The total vocabulary size (number of unique words) is 508,209.
4.6 Document frequencies and estimated frequencies for word combinations (assuming independence) in the GOV2 Web collection. Collection size (N) is 25,205,179.
4.7 Examples of errors made by the original Porter stemmer. False positives are pairs of words that have the same stem. False negatives are pairs that have different stems.
4.8 Examples of words with the Arabic root ktb
4.9 High-frequency noun phrases from a TREC collection and U.S. patents from 1996
4.10 Statistics for the Google n-gram sample
5.1 Four sentences from the Wikipedia entry for tropical fish
5.2 Elias-γ code examples
5.3 Elias-δ code examples
5.4 Space requirements for numbers encoded in v-byte
5.5 Sample encodings for v-byte
5.6 Skip lengths (k) and expected processing steps
6.1 Partial entry for the Medical Subject (MeSH) Heading "Neck Pain"
6.2 Term association measures
6.3 Most strongly associated words for "tropical" in a collection of TREC news stories. Co-occurrence counts are measured at the document level.
6.4 Most strongly associated words for "fish" in a collection of TREC news stories. Co-occurrence counts are measured at the document level.
6.5 Most strongly associated words for "fish" in a collection of TREC news stories. Co-occurrence counts are measured in windows of five words.
7.1 Contingency table of term occurrences for a particular query
7.2 BM25 scores for an example document
7.3 Query likelihood scores for an example document
7.4 Highest-probability terms from relevance model for four example queries (estimated using top 10 documents)
7.5 Highest-probability terms from relevance model for four example queries (estimated using top 50 documents)
7.6 Conditional probabilities for example network
7.7 Highest-probability terms from four topics in LDA model
8.1 Statistics for three example text collections. The average number of words per document is calculated without stemming.
8.2 Statistics for queries from example text collections
8.3 Sets of documents defined by a simple search with binary relevance
8.4 Precision values at standard recall levels calculated using interpolation
8.5 Definitions of some important efficiency metrics
8.6 Artificial effectiveness data for two retrieval algorithms (A and B) over 10 queries. The column B - A gives the difference in effectiveness.