
MANNING

Lucene in Action, Second Edition
Covers Apache Lucene 3.0

Michael McCandless
Erik Hatcher
Otis Gospodnetić

Foreword by Doug Cutting
Praise for the First Edition
This is definitely the book to have if you’re planning on using Lucene in your application, or are
interested in what Lucene can do for you.
—JavaLobby
Search powers the information age. This book is a gateway to this invaluable resource … It succeeds admirably in elucidating the application programming interface (API), with many code examples and cogent explanations, opening the door to a fine tool.
—Computing Reviews
A must-read for anyone who wants to learn about Lucene or is even considering embedding
search into their applications or just wants to learn about information retrieval in general.
Highly recommended!
—TheServerSide.com
Well thought-out … thoroughly edited … stands out clearly from the crowd … I enjoyed reading this book. If you have any text-searching needs, this book will be more than sufficient equipment to guide you to successful completion. Even if you are just looking to download a pre-written search engine, then this book will provide a good background to the nature of information retrieval in general and text indexing and searching specifically.
—Slashdot.org
The book is more like a crystal ball than ink on paper … I run into solutions to my most pressing problems as I read through it.
—Arman Anwar, Arman@Web
Provides a detailed blueprint for using and customizing Lucene … a thorough introduction to the inner workings of what’s arguably the most popular open source search engine … loaded with code examples and emphasizes a hands-on approach to learning.
—SearchEngineWatch.com
Hatcher and Gospodnetić bring their experience as two of Lucene’s core committers to author this excellently written book. This book helps any developer not familiar with Lucene or development of a search engine to get up to speed within minutes on the project and domain … I would recommend this book to anyone who is new to Lucene, anyone who needs powerful indexing and searching capabilities in their application, or anyone who needs a great reference for Lucene.
—Fort Worth Java Users Group
More Praise for the First Edition
Outstanding … comprehensive and up-to-date … grab this book and learn how to leverage Lucene’s potential.
—Val’s blog
… the code examples are useful and reusable.
—Scott Ganyo, Lucene Java Committer
… packed with examples and advice on how to effectively use this incredibly powerful tool.
—Brian Goetz, Quiotix Corporation
… it unlocked for me the amazing power of Lucene.
—Reece Wilton, Walt Disney Internet Group
… code samples as JUnit test cases are incredibly helpful.
—Norman Richards, co-author XDoclet in Action
A quick and easy guide to making Lucene work.
—Books-On-Line
A comprehensive guide … The authors of this book are experts in this field … they have unleashed the power of Lucene … the best guide to Lucene available so far.
—JavaReference.com
Lucene in Action
Second Edition
MICHAEL MCCANDLESS
ERIK HATCHER
OTIS GOSPODNETIĆ
MANNING
Greenwich
(74° w. long.)
For online information and ordering of this and other Manning books, please visit
www.manning.com. The publisher offers discounts on this book when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
180 Broad St.
Suite 1323
Stamford, CT 06901
Email:
©2010 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by means electronic, mechanical, photocopying, or otherwise, without prior written
permission of the publisher.
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.
Manning Publications Co.          Development editor: Sebastian Stirling
180 Broad St.                     Copyeditor: Liz Welch
Suite 1323                        Typesetter: Dottie Marsico
Stamford, CT 06901                Cover designer: Marija Tudor
ISBN 978-1-933988-17-7
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 15 14 13 12 11 10
brief contents

PART 1  CORE LUCENE  1
     1  Meet Lucene  3
     2  Building a search index  31
     3  Adding search to your application  74
     4  Lucene’s analysis process  110
     5  Advanced search techniques  152
     6  Extending search  204

PART 2  APPLIED LUCENE  233
     7  Extracting text with Tika  235
     8  Essential Lucene extensions  255
     9  Further Lucene extensions  288
    10  Using Lucene from other programming languages  325
    11  Lucene administration and performance tuning  345

PART 3  CASE STUDIES  381
    12  Case study 1: Krugle  383
    13  Case study 2: SIREn  394
    14  Case study 3: LinkedIn  409
contents
foreword xvii
preface xix
preface to the first edition xx
acknowledgments xxiii
about this book xxvi
JUnit primer xxxiv
about the authors xxxvii
PART 1  CORE LUCENE  1

1   Meet Lucene  3
    1.1   Dealing with information explosion  4
    1.2   What is Lucene?  6
          What Lucene can do 7 · History of Lucene 7
    1.3   Lucene and the components of a search application  9
          Components for indexing 11 · Components for searching 14 ·
          The rest of the search application 16 · Where Lucene fits into your application 18
    1.4   Lucene in action: a sample application  19
          Creating an index 19 · Searching an index 23
    1.5   Understanding the core indexing classes  25
          IndexWriter 26 · Directory 26 · Analyzer 26 · Document 27 · Field 27
    1.6   Understanding the core searching classes  28
          IndexSearcher 28 · Term 28 · Query 29 · TermQuery 29 · TopDocs 29
    1.7   Summary  29
2   Building a search index  31
    2.1   How Lucene models content  32
          Documents and fields 32 · Flexible schema 33 · Denormalization 34
    2.2   Understanding the indexing process  34
          Extracting text and creating the document 34 · Analysis 35 · Adding to the index 35
    2.3   Basic index operations  36
          Adding documents to an index 37 · Deleting documents from an index 39 ·
          Updating documents in the index 41
    2.4   Field options  43
          Field options for indexing 43 · Field options for storing fields 44 ·
          Field options for term vectors 44 · Reader, TokenStream, and byte[] field values 45 ·
          Field option combinations 46 · Field options for sorting 46 · Multivalued fields 47
    2.5   Boosting documents and fields  48
          Boosting documents 48 · Boosting fields 49 · Norms 50
    2.6   Indexing numbers, dates, and times  51
          Indexing numbers 51 · Indexing dates and times 52
    2.7   Field truncation  53
    2.8   Near-real-time search  54
    2.9   Optimizing an index  54
    2.10  Other directory implementations  56
    2.11  Concurrency, thread safety, and locking issues  58
          Thread and multi-JVM safety 58 · Accessing an index over a remote file system 59 ·
          Index locking 61
    2.12  Debugging indexing  63
    2.13  Advanced indexing concepts  64
          Deleting documents with IndexReader 65 · Reclaiming disk space used by deleted
          documents 66 · Buffering and flushing 66 · Index commits 67 · ACID transactions
          and index consistency 69 · Merging 70
    2.14  Summary  72
3   Adding search to your application  74
    3.1   Implementing a simple search feature  76
          Searching for a specific term 76 · Parsing a user-entered query expression:
          QueryParser 77
    3.2   Using IndexSearcher  80
          Creating an IndexSearcher 81 · Performing searches 82 · Working with TopDocs 82 ·
          Paging through results 84 · Near-real-time search 84
    3.3   Understanding Lucene scoring  86
          How Lucene scores 86 · Using explain() to understand hit scoring 88
    3.4   Lucene’s diverse queries  90
          Searching by term: TermQuery 90 · Searching within a term range:
          TermRangeQuery 91 · Searching within a numeric range: NumericRangeQuery 92 ·
          Searching on a string: PrefixQuery 93 · Combining queries: BooleanQuery 94 ·
          Searching by phrase: PhraseQuery 96 · Searching by wildcard: WildcardQuery 99 ·
          Searching for similar terms: FuzzyQuery 100 · Matching all documents:
          MatchAllDocsQuery 101
    3.5   Parsing query expressions: QueryParser  101
          Query.toString 102 · TermQuery 103 · Term range searches 103 · Numeric and date
          range searches 104 · Prefix and wildcard queries 104 · Boolean operators 105 ·
          Phrase queries 105 · Fuzzy queries 106 · MatchAllDocsQuery 107 · Grouping 107 ·
          Field selection 107 · Setting the boost for a subquery 108 ·
          To QueryParse or not to QueryParse? 108
    3.6   Summary  109
4   Lucene’s analysis process  110
    4.1   Using analyzers  111
          Indexing analysis 113 · QueryParser analysis 114 · Parsing vs. analysis:
          when an analyzer isn’t appropriate 114
    4.2   What’s inside an analyzer?  115
          What’s in a token? 116 · TokenStream uncensored 117 · Visualizing analyzers 120 ·
          TokenFilter order can be significant 125
    4.3   Using the built-in analyzers  127
          StopAnalyzer 127 · StandardAnalyzer 128 · Which core analyzer should you use? 128
    4.4   Sounds-like querying  129
    4.5   Synonyms, aliases, and words that mean the same  131
          Creating SynonymAnalyzer 132 · Visualizing token positions 137
    4.6   Stemming analysis  138
          StopFilter leaves holes 138 · Combining stemming and stop-word removal 139
    4.7   Field variations  140
          Analysis of multivalued fields 140 · Field-specific analysis 140 ·
          Searching on unanalyzed fields 141
    4.8   Language analysis issues  144
          Unicode and encodings 144 · Analyzing non-English languages 145 ·
          Character normalization 145 · Analyzing Asian languages 146 · Zaijian 148
    4.9   Nutch analysis  149
    4.10  Summary  151
5   Advanced search techniques  152
    5.1   Lucene’s field cache  153
          Loading field values for all documents 154 · Per-segment readers 155
    5.2   Sorting search results  155
          Sorting search results by field value 156 · Sorting by relevance 158 ·
          Sorting by index order 159 · Sorting by a field 160 · Reversing sort order 161 ·
          Sorting by multiple fields 161 · Selecting a sorting field type 163 ·
          Using a nondefault locale for sorting 163
    5.3   Using MultiPhraseQuery  163
    5.4   Querying on multiple fields at once  166
    5.5   Span queries  168
          Building block of spanning, SpanTermQuery 170 · Finding spans at the beginning
          of a field 172 · Spans near one another 173 · Excluding span overlap from
          matches 174 · SpanOrQuery 175 · SpanQuery and QueryParser 177
    5.6   Filtering a search  177
          TermRangeFilter 178 · NumericRangeFilter 179 · FieldCacheRangeFilter 179 ·
          Filtering by specific terms 180 · Using QueryWrapperFilter 180 ·
          Using SpanQueryFilter 181 · Security filters 181 · Using BooleanQuery for
          filtering 183 · PrefixFilter 183 · Caching filter results 184 · Wrapping a filter
          as a query 184 · Filtering a filter 184 · Beyond the built-in filters 185
    5.7   Custom scoring using function queries  185
          Function query classes 185 · Boosting recently modified documents using
          function queries 187
    5.8   Searching across multiple Lucene indexes  189
          Using MultiSearcher 189 · Multithreaded searching using ParallelMultiSearcher 191
    5.9   Leveraging term vectors  191
          Books like this 192 · What category? 195 · TermVectorMapper 198
    5.10  Loading fields with FieldSelector  200
    5.11  Stopping a slow search  201
    5.12  Summary  202
6   Extending search  204
    6.1   Using a custom sort method  205
          Indexing documents for geographic sorting 205 · Implementing custom geographic
          sort 206 · Accessing values used in custom sorting 209
    6.2   Developing a custom Collector  210
          The Collector base class 211 · Custom collector: BookLinkCollector 212 ·
          AllDocCollector 213
    6.3   Extending QueryParser  214
          Customizing QueryParser’s behavior 214 · Prohibiting fuzzy and wildcard
          queries 215 · Handling numeric field-range queries 216 · Handling date ranges 218 ·
          Allowing ordered phrase queries 220
    6.4   Custom filters  221
          Implementing a custom filter 221 · Using our custom filter during searching 223 ·
          An alternative: FilteredQuery 224
    6.5   Payloads  225
          Producing payloads during analysis 226 · Using payloads during searching 227 ·
          Payloads and SpanQuery 230 · Retrieving payloads via TermPositions 230
    6.6   Summary  231
PART 2  APPLIED LUCENE  233

7   Extracting text with Tika  235
    7.1   What is Tika?  236
    7.2   Tika’s logical design and API  238
    7.3   Installing Tika  240
    7.4   Tika’s built-in text extraction tool  240
    7.5   Extracting text programmatically  242
          Indexing a Lucene document 242 · The Tika utility class 245 ·
          Customizing parser selection 246
    7.6   Tika’s limitations  246
    7.7   Indexing custom XML  247
          Parsing using SAX 248 · Parsing and indexing using Apache Commons Digester 250
    7.8   Alternatives  253
    7.9   Summary  254
8   Essential Lucene extensions  255
    8.1   Luke, the Lucene Index Toolbox  256
          Overview: seeing the big picture 257 · Document browsing 257 ·
          Using QueryParser to search 260 · Files and plugins view 261
    8.2   Analyzers, tokenizers, and TokenFilters  262
          SnowballAnalyzer 264 · Ngram filters 265 · Shingle filters 267 ·
          Obtaining the contrib analyzers 267
    8.3   Highlighting query terms  268
          Highlighter components 268 · Standalone highlighter example 271 ·
          Highlighting with CSS 272 · Highlighting search results 273
    8.4   FastVectorHighlighter  275
    8.5   Spell checking  277
          Generating a suggestions list 278 · Selecting the best suggestion 280 ·
          Presenting the result to the user 281 · Some ideas to improve spell checking 281
    8.6   Fun and interesting Query extensions  283
          MoreLikeThis 283 · FuzzyLikeThisQuery 284 · BoostingQuery 284 · TermsFilter 284 ·
          DuplicateFilter 285 · RegexQuery 285
    8.7   Building contrib modules  286
          Get the sources 286 · Ant in the contrib directory 286
    8.8   Summary  287
9   Further Lucene extensions  288
    9.1   Chaining filters  289
    9.2   Storing an index in Berkeley DB  292
    9.3   Synonyms from WordNet  294
          Building the synonym index 295 · Tying WordNet synonyms into an analyzer 297
    9.4   Fast memory-based indices  298
    9.5   XML QueryParser: Beyond “one box” search interfaces  299
          Using XmlQueryParser 300 · Extending the XML query syntax 304
    9.6   Surround query language  306
    9.7   Spatial Lucene  308
          Indexing spatial data 308 · Searching spatial data 312 ·
          Performance characteristics of Spatial Lucene 314
    9.8   Searching multiple indexes remotely  316
    9.9   Flexible QueryParser  320
    9.10  Odds and ends  322
    9.11  Summary  323
10  Using Lucene from other programming languages  325
    10.1  Ports primer  326
          Trade-offs 327 · Choosing the right port 328
    10.2  CLucene (C++)  328
          Motivation 329 · API and index compatibility 330 · Supported platforms 332 ·
          Current and future work 332
    10.3  Lucene.Net (C# and other .NET languages)  332
          API compatibility 334 · Index compatibility 335
    10.4  KinoSearch and Lucy (Perl)  335
          KinoSearch 336 · Lucy 338 · Other Perl options 338
    10.5  Ferret (Ruby)  338
    10.6  PHP  340
          Zend Framework 340 · PHP Bridge 341
    10.7  PyLucene (Python)  341
          API compatibility 342 · Other Python options 343
    10.8  Solr (many programming languages)  343
    10.9  Summary  344
11  Lucene administration and performance tuning  345
    11.1  Performance tuning  346
          Simple performance-tuning steps 347 · Testing approach 348 · Tuning for
          index-to-search delay 349 · Tuning for indexing throughput 350 ·
          Tuning for search latency and throughput 354
    11.2  Threads and concurrency  356
          Using threads for indexing 357 · Using threads for searching 361
    11.3  Managing resource consumption  364
          Disk space 364 · File descriptors 367 · Memory 371
    11.4  Hot backups of the index  374
          Creating the backup 374 · Restoring the index 376
    11.5  Common errors  376
          Index corruption 377 · Repairing an index 378
    11.6  Summary  378
PART 3  CASE STUDIES  381

12  Case study 1: Krugle
    Krugle: Searching source code  383
    12.1  Introducing Krugle  384
    12.2  Appliance architecture  385
    12.3  Search performance  386
    12.4  Parsing source code  387
    12.5  Substring searching  388
    12.6  Query vs. search  391
    12.7  Future improvements  391
          FieldCache memory usage 392 · Combining indexes 392
    12.8  Summary  392
13  Case study 2: SIREn
    Searching semistructured documents with SIREn  394
    13.1  Introducing SIREn  395
    13.2  SIREn’s benefits  396
          Searching across all fields 398 · A single efficient lexicon 398 ·
          Flexible fields 398 · Efficient handling of multivalued fields 398
    13.3  Indexing entities with SIREn  399
          Data model 399 · Implementation issues 400 · Index schema 400 ·
          Data preparation before indexing 401
    13.4  Searching entities with SIREn  402
          Searching content 402 · Restricting search within a cell 403 ·
          Combining cells into tuples 404 · Querying an entity description 404
    13.5  Integrating SIREn in Solr  405
    13.6  Benchmark  405
    13.7  Summary  407
14  Case study 3: LinkedIn
    Adding facets and real-time search with Bobo Browse and Zoie  409
    14.1  Faceted search with Bobo Browse  410
          Bobo Browse design 410 · Beyond simple faceting 415
    14.2  Real-time search with Zoie  416
          Zoie architecture 418 · Real-time vs. near-real-time 421 · Documents and indexing
          requests 421 · Custom IndexReaders 423 · Comparison with Lucene near-real-time
          search 424 · Distributed search 425
    14.3  Summary  427
appendix a Installing Lucene 428
appendix b Lucene index format 433
appendix c Lucene/contrib benchmark 443
appendix d Resources 465
index 469
foreword
Lucene started as a self-serving project. In late 1997, my job uncertain, I sought something of my own to market. Java was the hot new programming language, and I
needed an excuse to learn it. I already knew how to write search software, and thought
I might fill a niche by writing search software in Java. So I wrote Lucene.
In 2000, I realized that I didn’t like to market stuff. I had no interest in negotiating
licenses and contracts, and I didn’t want to hire people and build a company. I liked
writing software, not selling it. So I tossed Lucene up on SourceForge, to see if open
source might let me keep doing what I liked.
A few folks started using Lucene right away. In 2001, folks at Apache offered to
adopt Lucene. The number of daily messages on the Lucene mailing lists grew
steadily. Code contributions started to trickle in. Most were additions around the
edges of Lucene: I was still the only active developer who fully grokked its core. Still, Lucene was on the road to becoming a real collaborative project.
Now, in 2010, Lucene has a pool of active developers with deep understanding of
its core. I’m no longer involved in day-to-day development; substantial additions and
improvements are regularly made by this strong team.
Through the years, Lucene has been translated into several other programming
languages, including C++, C#, Perl, and Python. In the original Java, and in these
other incarnations, Lucene is used much more widely than I ever would have
dreamed. It powers search in diverse applications like discussion groups at Fortune
100 companies, commercial bug trackers, email search supplied by Microsoft, and a
web search engine that scales to billions of pages. When, at industry events, I am introduced to someone as the “Lucene guy,” more often than not folks tell me how they’ve used Lucene in a project. I figure I’ve only heard about a small fraction of all Lucene
applications.
Lucene is much more widely used than it ever would have been if I had tried to sell
it. Application developers seem to prefer open source. Instead of having to contact
technical support when they have a problem (and then wait for an answer, hoping
they were correctly understood), they can frequently just look at the source code to
diagnose their problems. If that’s not enough, the free support provided by peers on
the mailing lists is better than most commercial support. A functioning open-source
project like Lucene makes application developers more efficient and productive.
Lucene, through open source, has become something much greater than I ever
imagined it would. I set it going, but it took the combined efforts of the Lucene community to make it thrive.
So what’s next for Lucene? I can’t predict the future. What I do know is that even
after over 10 years in existence, Lucene is still going strong, and its user and development communities are bigger and busier than ever, in part thanks to the first edition
of Lucene in Action making it easier for more people to get started with Lucene. With every new release Lucene is getting better, more mature, more feature-rich, and faster.
Since the first edition of Lucene in Action was published in 2004, Lucene internals
and its
API have gone through radical changes that called for more than just minor
book updates. In this totally revised second edition, the authors bring you up to speed
on the latest improvements and new
APIs in Lucene.
Armed with the second edition of Lucene in Action, you too are now a member of
the Lucene community, and it’s up to you to take Lucene to new places. Bon voyage!

DOUG CUTTING
FOUNDER OF LUCENE, NUTCH, AND HADOOP

preface
I first started with Lucene about a year after the first edition of Lucene in Action was
published. I already had experience building search engines, but didn’t know much
about Lucene in particular. So, I picked up a copy of Lucene in Action by Erik and Otis
and read it, cover to cover, and I was hooked!
As I used Lucene, I found small improvements here and there, so I started contributing small patches, updating javadocs, discussing topics on Lucene’s mailing lists, and so forth. I eventually became an active core committer and PMC member, committing many changes over the years.
It has now been five-and-a-half years since the first edition of Lucene in Action was published, which is practically an eternity in the fast-paced world of open source
development! Lucene has gone through two major releases, and now has all sorts of
new functionality such as numeric fields, the reusable analysis API, payloads, near-real-time search, and transactional APIs for indexing and searching, and so on.
When Manning first approached me, it was clear that a second edition was sorely
needed. Furthermore, as one of the active core committers largely responsible for
committing so many of these changes, I felt rather obligated to create the second edition. So I said yes, and then worked fiendishly to cover Lucene’s changes, and I’m
quite happy with the results. I hope this Second Edition of Lucene in Action will serve
you well as you create your search applications, and I look forward to seeing you on
the user and developer lists, asking your own interesting questions, and continuing to
drive Lucene's relentless growth!
MICHAEL MCCANDLESS

preface to the first edition
From Erik Hatcher
I’ve been intrigued with searching and indexing from the early days of the Internet. I
have fond memories (circa 1991) of managing an email list using majordomo,
MUSH
(Mail User’s Shell), and a handful of Perl, awk, and shell scripts. I implemented a CGI
web interface to allow users to search the list archives and other users’ profiles using
grep tricks under the covers. Then along came Yahoo!, AltaVista, and Excite, all which
I visited regularly.
After my first child, Jakob, was born, my digital photo archive began growing rapidly. I was intrigued with the idea of developing a system to manage the pictures so that I could attach meta-data to each picture, such as keywords and date taken, and, of course, locate the pictures easily in any dimension I chose. In the late 1990s, I prototyped a filesystem-based approach using Microsoft technologies, including Microsoft Index Server, Active Server Pages, and a third COM component for image manipulation. At the time, my professional life was consumed with these same technologies. I
was able to cobble together a compelling application in a couple of days of spare-time
hacking.
My professional life shifted toward Java technologies, and my computing life consisted of less and less Microsoft Windows. In an effort to reimplement my personal
photo archive and search engine in Java technologies in an operating system–agnostic
way, I came across Lucene. Lucene’s ease of use far exceeded my expectations—I had
experienced numerous other open-source libraries and tools that were far simpler
conceptually yet far more complex to use.
In 2001, Steve Loughran and I began writing Java Development with Ant (Manning).
We took the idea of an image search engine application and generalized it as a document search engine. This application example is used throughout the Ant book and
can be customized as an image search engine. The tie to Ant comes not only from a
simple compile-and-package build process but also from a custom Ant task, <index>,
we created that indexes files during the build process using Lucene. This Ant task now
lives in Lucene’s Sandbox and is described in section 8.4 of the first edition.
This Ant task is in production use for my custom blogging system, which I call BlogScene ( I run an Ant build process, after creating a blog entry, which indexes new entries and uploads them to my server. My blog server consists of a servlet, some Velocity templates, and a Lucene index, allowing for rich queries, even syndication of queries. Compared to other blogging systems, BlogScene is vastly inferior in features and finesse, but the full-text search capabilities are
very powerful.
I’m now working with the Applied Research in Patacriticism group at the University of Virginia (), where I’m putting my text analysis, indexing, and searching expertise to the test and stretching my mind with discussions of how quantum physics relates to literature. “Poets are the unacknowledged engineers of the world.”
From Otis Gospodnetić
My interest in and passion for information retrieval and management began during
my student years at Middlebury College. At that time, I discovered an immense source
of information known as the Web. Although the Web was still in its infancy, the long-term need for gathering, analyzing, indexing, and searching was evident. I became
obsessed with creating repositories of information pulled from the Web, began writing
web crawlers, and dreamed of ways to search the collected information. I viewed
search as the killer application in a largely uncharted territory. With that in the back
of my mind, I began the first in my series of projects that share a common denominator: gathering and searching information.
In 1995, fellow student Marshall Levin and I created WebPh, an open-source program used for collecting and retrieving personal contact information. In essence, it
was a simple electronic phone book with a web interface (CGI), one of the first of its
kind at that time. (In fact, it was cited as an example of prior art in a court case in the
late 1990s!) Universities and government institutions around the world have been the
primary adopters of this program, and many are still using it. In 1997, armed with my
WebPh experience, I proceeded to create Populus, a popular white pages at the time.
Even though the technology (similar to that of WebPh) was rudimentary, Populus carried its weight and was a comparable match to the big players such as WhoWhere, Bigfoot, and Infospace.
After two projects that focused on personal contact information, it was time to
explore new territory. I began my next venture, Infojump, which involved culling high-quality information from online newsletters, journals, newspapers, and magazines. In addition to my own software, which consisted of large sets of Perl modules
and scripts, Infojump utilized a web crawler called Webinator and a full-text search
product called Texis. The service provided by Infojump in 1998 was much like that of
FindArticles.com today.
Although WebPh, Populus, and Infojump served their purposes and were fully
functional, they all had technical limitations. The missing piece in each of them was a
powerful information-retrieval library that would allow full-text searches backed by
inverted indexes. Instead of trying to reinvent the wheel, I started looking for a solution that I suspected was out there. In early 2000, I found Lucene, the missing piece
I’d been looking for, and I fell in love with it.
I joined the Lucene project early on when it still lived at SourceForge and, later, at
the Apache Software Foundation when Lucene migrated there in 2002. My devotion
to Lucene stems from its being a core component of many ideas that had queued up
in my mind over the years. One of those ideas was Simpy, my latest pet project. Simpy
is a feature-rich personal web service that lets users tag, index, search, and share information found online. It makes heavy use of Lucene, with thousands of its indexes, and
is powered by Nutch, another project of Doug Cutting’s (see chapter 10 of the first
edition). My active participation in the Lucene project resulted in an offer from Manning to co-author Lucene in Action with Erik Hatcher.
Lucene in Action is the most comprehensive source of information about Lucene.
The information contained in the chapters encompasses all the knowledge you need
to create sophisticated applications built on top of Lucene. It’s the result of a very
smooth and agile collaboration process, much like that within the Lucene community.
Lucene and Lucene in Action exemplify what people can achieve when they have similar interests, the willingness to be flexible, and the desire to contribute to the global
knowledge pool, despite the fact that they have yet to meet in person.

acknowledgments
We are sincerely and humbly indebted to Doug Cutting. Without Doug’s generosity to
the world, there would be no Lucene. Without the other Lucene committers, Lucene
would have far fewer features, more bugs, and a much tougher time thriving with its
growing adoption. Many thanks to all the committers, past and present. Similarly, we
thank all those who contributed the case studies that appear in chapters 12, 13 and 14:
Michele Catasta, Renaud Delbru, Mikkel Kamstrup Erlandsen, Toke Eskildsen, Robert
Fuller, Grant Glouser, Ken Krugler, Jake Mannix, Nickolai Toupikov, Giovanni Tummarello, Mads Villadsen, and John Wang. We’d also like to thank Doug Cutting for
penning the foreword to the second edition.
Our thanks to the staff at Manning, including Marjan Bace, Jeff Bleiel, Sebastian Stirling, Karen Tegtmeyer, Liz Welch, Elizabeth Martin, Dottie Marsico, Mary Piergies,
and Marija Tudor. Manning rounded up a great set of reviewers, whom we thank for
improving our drafts into the book you now read. The reviewers include Chad Davis,
Dave Pawson, Rob Allen, Rick Wagner, Michele Galli, Robi Sen, Stuart Caborn, Jeremy
Flowers, Robert Hanson, Rodney Woodruff, Anton Mazkovoi, Ramarao Kanneganti,
Matt Payne, Curtis Miller, Nathan Levesque, Cos DiFazio, and Andy Dingley. Extra-special thanks go to Shai Erera for his technical editing. Thank you to all our MEAP readers who posted feedback on Manning’s forums.
Michael McCandless
Writing a book is not easy. Writing a book about something as technically rich as Lucene is especially challenging. Writing a book about a successful, active, and fast-moving open-source project is nearly impossible! Many things had to happen right for me to start and finish this book.
I would never have been part of this book without Doug having the initial itch,
technical strength, and generosity to open-source his idea, without a vibrant community relentlessly pushing Lucene forward, without a forward-looking IBM supporting
my involvement with Lucene and this book, and without Erik and Otis writing the
first edition.
My four kids—Mia, Kyra, Joel, Kyle—always inspire me, with everything they do.
Their boundless energy, free thinking, infinite series of insightful questions, amazing
happiness, insatiable curiosity, gentle persistence, free sense of humor, sheer passion,
temper tantrums, and sharp minds keep me very young at heart and inspire me to
tackle big projects like this. You should strive, always, to remain a child.
I thank my wife, Jane, for convincing me to pursue this when Manning came
knocking, and for her unmatched skills in efficiently running our busy family.
Remarkably, she has made lots of time for me to work, write this book and still pursue
all my crazy hobbies, and I can see that this ability is very rare.
My parents, all four of them, raised me with the courage to always stretch myself in
what I try to tackle, but also with the discipline and persistence to finish what I start.
They taught me integrity: if you commit to do something, you do it well. Always underpromise and overdeliver. They also led by example, showing me that individuals can
do big things when they work hard. More importantly, they taught me that you should
spend your life doing the things you love. Life is far too short to do otherwise.
Erik Hatcher
First, and really only, heartfelt thanks go to none other than Mike McCandless. He has
pretty much single-handedly revised this book from its 1.0 release to the current spiffy
“3.0” state. Mike approaches Lucene, this book, and life in general enthusiastically,
with eagerness to tackle any task at hand. The first edition acknowledgments also very
much apply here, as these influences are timelessly felt.
I personally thank Otis for his efforts with this book. Although we’ve yet to meet in
person, Otis has been a joy to work with. He and I have gotten along well and have agreed on the structure and content of this book throughout. Thanks to Java Java in
Charlottesville, Virginia, for keeping me wired and wireless; thanks, also, to Greenberry’s for staying open later than Java Java and keeping me out of trouble by not having internet access (update: they now have wi-fi, much to the dismay of my
productivity). The people I’ve surrounded myself with enrich my life more than anything. David Smith has been a life-long mentor, and his brilliance continues to challenge me; he gave me lots of food for thought regarding Lucene visualization (most of
which I’m still struggling to fully grasp, and I apologize that it didn’t make it into this
manuscript). Jay Zimmerman and the No Fluff, Just Stuff symposium circuit have
been dramatically influential for me. The regular
NFJS speakers, including Dave
Thomas, Stuart Halloway, James Duncan Davidson, Jason Hunter, Ted Neward, Ben