Manning lucene in action 2nd

Covers Apache Lucene 3.0

IN ACTION
SECOND EDITION

Michael McCandless
Erik Hatcher
Otis Gospodnetić
FOREWORD BY DOUG CUTTING

MANNING

Praise for the First Edition
This is definitely the book to have if you’re planning on using Lucene in your application, or are
interested in what Lucene can do for you.
—JavaLobby
Search powers the information age. This book is a gateway to this invaluable resource...It succeeds admirably in elucidating the application programming interface (API), with many code
examples and cogent explanations, opening the door to a fine tool.
—Computing Reviews
A must-read for anyone who wants to learn about Lucene or is even considering embedding
search into their applications or just wants to learn about information retrieval in general.
Highly recommended!
—TheServerSide.com
Well thought-out...thoroughly edited...stands out clearly from the crowd....I enjoyed reading this
book. If you have any text-searching needs, this book will be more than sufficient equipment to
guide you to successful completion. Even, if you are just looking to download a pre-written search
engine, then this book will provide a good background to the nature of information retrieval in
general and text indexing and searching specifically.
—Slashdot.org
The book is more like a crystal ball than ink on paper--I run into solutions to my most pressing
problems as I read through it.
—Arman Anwar, Arman@Web
Provides a detailed blueprint for using and customizing Lucene...a thorough introduction to the
inner workings of what’s arguably the most popular open source search engine...loaded with code
examples and emphasizes a hands-on approach to learning.
—SearchEngineWatch.com
Hatcher and Gospodnetić bring their experience as two of Lucene’s core committers to author this
excellently written book. This book helps any developer not familiar with Lucene or development
of a search engine to get up to speed within minutes on the project and domain....I would recommend this book to anyone who is new to Lucene, anyone who needs powerful indexing and
searching capabilities in their application, or anyone who needs a great reference for Lucene.
—Fort Worth Java Users Group

Licensed to theresa smith <>

More Praise for the First Edition
Outstanding...comprehensive and up-to-date ...grab this book and learn how to leverage
Lucene’s potential.
—Val’s blog
...the code examples are useful and reusable.
—Scott Ganyo, Lucene Java Committer
...packed with examples and advice on how to effectively use this incredibly powerful tool.
—Brian Goetz, Quiotix Corporation
...it unlocked for me the amazing power of Lucene.
—Reece Wilton, Walt Disney Internet Group
...code samples as JUnit test cases are incredibly helpful.
—Norman Richards, co-author XDoclet in Action
A quick and easy guide to making Lucene work.
—Books-On-Line
A comprehensive guide...The authors of this book are experts in this field...they have unleashed
the power of Lucene ...the best guide to Lucene available so far.
—JavaReference.com

Lucene in Action
Second Edition
MICHAEL MCCANDLESS
ERIK HATCHER
OTIS GOSPODNETIĆ

MANNING
Greenwich
(74° w. long.)

For online information and ordering of this and other Manning books, please visit
www.manning.com. The publisher offers discounts on this book when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
180 Broad St.
Suite 1323
Stamford, CT 06901
Email:

©2010 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by means electronic, mechanical, photocopying, or otherwise, without prior written
permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.

Manning Publications Co.
180 Broad St.
Suite 1323
Stamford, CT 06901

Development editor: Sebastian Stirling
Copyeditor: Liz Welch
Typesetter: Dottie Marsico
Cover designer: Marija Tudor

ISBN 978-1-933988-17-7
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 15 14 13 12 11 10

brief contents
PART 1 CORE LUCENE ............................................................. 1
1 ■ Meet Lucene 3
2 ■ Building a search index 31
3 ■ Adding search to your application 74
4 ■ Lucene’s analysis process 110
5 ■ Advanced search techniques 152
6 ■ Extending search 204

PART 2 APPLIED LUCENE...................................................... 233
7 ■ Extracting text with Tika 235
8 ■ Essential Lucene extensions 255
9 ■ Further Lucene extensions 288
10 ■ Using Lucene from other programming languages 325
11 ■ Lucene administration and performance tuning 345

PART 3 CASE STUDIES........................................................... 381
12 ■ Case study 1: Krugle 383
13 ■ Case study 2: SIREn 394
14 ■ Case study 3: LinkedIn 409

contents
foreword xvii
preface xix
preface to the first edition xx
acknowledgments xxiii
about this book xxvi
JUnit primer xxxiv
about the authors xxxvii

PART 1 CORE LUCENE......................................................1

1 Meet Lucene 3
1.1 Dealing with information explosion 4
1.2 What is Lucene? 6
    What Lucene can do 7 ■ History of Lucene 7
1.3 Lucene and the components of a search application 9
    Components for indexing 11 ■ Components for searching 14 ■
    The rest of the search application 16 ■ Where Lucene fits into your application 18
1.4 Lucene in action: a sample application 19
    Creating an index 19 ■ Searching an index 23
1.5 Understanding the core indexing classes 25
    IndexWriter 26 ■ Directory 26 ■ Analyzer 26 ■ Document 27 ■ Field 27
1.6 Understanding the core searching classes 28
    IndexSearcher 28 ■ Term 28 ■ Query 29 ■ TermQuery 29 ■ TopDocs 29
1.7 Summary 29

2 Building a search index 31
2.1 How Lucene models content 32
    Documents and fields 32 ■ Flexible schema 33 ■ Denormalization 34
2.2 Understanding the indexing process 34
    Extracting text and creating the document 34 ■ Analysis 35 ■ Adding to the index 35
2.3 Basic index operations 36
    Adding documents to an index 37 ■ Deleting documents from an index 39 ■
    Updating documents in the index 41
2.4 Field options 43
    Field options for indexing 43 ■ Field options for storing fields 44 ■
    Field options for term vectors 44 ■ Reader, TokenStream, and byte[] field values 45 ■
    Field option combinations 46 ■ Field options for sorting 46 ■ Multivalued fields 47
2.5 Boosting documents and fields 48
    Boosting documents 48 ■ Boosting fields 49 ■ Norms 50
2.6 Indexing numbers, dates, and times 51
    Indexing numbers 51 ■ Indexing dates and times 52
2.7 Field truncation 53
2.8 Near-real-time search 54
2.9 Optimizing an index 54
2.10 Other directory implementations 56
2.11 Concurrency, thread safety, and locking issues 58
    Thread and multi-JVM safety 58 ■ Accessing an index over a remote file system 59 ■
    Index locking 61
2.12 Debugging indexing 63
2.13 Advanced indexing concepts 64
    Deleting documents with IndexReader 65 ■ Reclaiming disk space used by deleted documents 66 ■
    Buffering and flushing 66 ■ Index commits 67 ■ ACID transactions and index consistency 69 ■
    Merging 70
2.14 Summary 72

3 Adding search to your application 74
3.1 Implementing a simple search feature 76
    Searching for a specific term 76 ■ Parsing a user-entered query expression: QueryParser 77
3.2 Using IndexSearcher 80
    Creating an IndexSearcher 81 ■ Performing searches 82 ■ Working with TopDocs 82 ■
    Paging through results 84 ■ Near-real-time search 84
3.3 Understanding Lucene scoring 86
    How Lucene scores 86 ■ Using explain() to understand hit scoring 88
3.4 Lucene’s diverse queries 90
    Searching by term: TermQuery 90 ■ Searching within a term range: TermRangeQuery 91 ■
    Searching within a numeric range: NumericRangeQuery 92 ■ Searching on a string: PrefixQuery 93 ■
    Combining queries: BooleanQuery 94 ■ Searching by phrase: PhraseQuery 96 ■
    Searching by wildcard: WildcardQuery 99 ■ Searching for similar terms: FuzzyQuery 100 ■
    Matching all documents: MatchAllDocsQuery 101
3.5 Parsing query expressions: QueryParser 101
    Query.toString 102 ■ TermQuery 103 ■ Term range searches 103 ■
    Numeric and date range searches 104 ■ Prefix and wildcard queries 104 ■ Boolean operators 105 ■
    Phrase queries 105 ■ Fuzzy queries 106 ■ MatchAllDocsQuery 107 ■ Grouping 107 ■
    Field selection 107 ■ Setting the boost for a subquery 108 ■
    To QueryParse or not to QueryParse? 108
3.6 Summary 109

4 Lucene’s analysis process 110
4.1 Using analyzers 111
    Indexing analysis 113 ■ QueryParser analysis 114 ■
    Parsing vs. analysis: when an analyzer isn’t appropriate 114
4.2 What’s inside an analyzer? 115
    What’s in a token? 116 ■ TokenStream uncensored 117 ■ Visualizing analyzers 120 ■
    TokenFilter order can be significant 125
4.3 Using the built-in analyzers 127
    StopAnalyzer 127 ■ StandardAnalyzer 128 ■ Which core analyzer should you use? 128
4.4 Sounds-like querying 129
4.5 Synonyms, aliases, and words that mean the same 131
    Creating SynonymAnalyzer 132 ■ Visualizing token positions 137
4.6 Stemming analysis 138
    StopFilter leaves holes 138 ■ Combining stemming and stop-word removal 139
4.7 Field variations 140
    Analysis of multivalued fields 140 ■ Field-specific analysis 140 ■
    Searching on unanalyzed fields 141
4.8 Language analysis issues 144
    Unicode and encodings 144 ■ Analyzing non-English languages 145 ■
    Character normalization 145 ■ Analyzing Asian languages 146 ■ Zaijian 148
4.9 Nutch analysis 149
4.10 Summary 151

5 Advanced search techniques 152
5.1 Lucene’s field cache 153
    Loading field values for all documents 154 ■ Per-segment readers 155
5.2 Sorting search results 155
    Sorting search results by field value 156 ■ Sorting by relevance 158 ■
    Sorting by index order 159 ■ Sorting by a field 160 ■ Reversing sort order 161 ■
    Sorting by multiple fields 161 ■ Selecting a sorting field type 163 ■
    Using a nondefault locale for sorting 163
5.3 Using MultiPhraseQuery 163
5.4 Querying on multiple fields at once 166
5.5 Span queries 168
    Building block of spanning, SpanTermQuery 170 ■ Finding spans at the beginning of a field 172 ■
    Spans near one another 173 ■ Excluding span overlap from matches 174 ■
    SpanOrQuery 175 ■ SpanQuery and QueryParser 177
5.6 Filtering a search 177
    TermRangeFilter 178 ■ NumericRangeFilter 179 ■ FieldCacheRangeFilter 179 ■
    Filtering by specific terms 180 ■ Using QueryWrapperFilter 180 ■ Using SpanQueryFilter 181
    Security filters 181 ■ Using BooleanQuery for filtering 183 ■ PrefixFilter 183 ■
    Caching filter results 184 ■ Wrapping a filter as a query 184 ■ Filtering a filter 184 ■
    Beyond the built-in filters 185
5.7 Custom scoring using function queries 185
    Function query classes 185 ■ Boosting recently modified documents using function queries 187
5.8 Searching across multiple Lucene indexes 189
    Using MultiSearcher 189 ■ Multithreaded searching using ParallelMultiSearcher 191
5.9 Leveraging term vectors 191
    Books like this 192 ■ What category? 195 ■ TermVectorMapper 198
5.10 Loading fields with FieldSelector 200
5.11 Stopping a slow search 201
5.12 Summary 202

6 Extending search 204
6.1 Using a custom sort method 205
    Indexing documents for geographic sorting 205 ■ Implementing custom geographic sort 206 ■
    Accessing values used in custom sorting 209
6.2 Developing a custom Collector 210
    The Collector base class 211 ■ Custom collector: BookLinkCollector 212 ■ AllDocCollector 213
6.3 Extending QueryParser 214
    Customizing QueryParser’s behavior 214 ■ Prohibiting fuzzy and wildcard queries 215 ■
    Handling numeric field-range queries 216 ■ Handling date ranges 218 ■
    Allowing ordered phrase queries 220
6.4 Custom filters 221
    Implementing a custom filter 221 ■ Using our custom filter during searching 223 ■
    An alternative: FilteredQuery 224
6.5 Payloads 225
    Producing payloads during analysis 226 ■ Using payloads during searching 227 ■
    Payloads and SpanQuery 230 ■ Retrieving payloads via TermPositions 230
6.6 Summary 231

PART 2 APPLIED LUCENE ............................................. 233

7 Extracting text with Tika 235
7.1 What is Tika? 236
7.2 Tika’s logical design and API 238
7.3 Installing Tika 240
7.4 Tika’s built-in text extraction tool 240
7.5 Extracting text programmatically 242
    Indexing a Lucene document 242 ■ The Tika utility class 245 ■
    Customizing parser selection 246
7.6 Tika’s limitations 246
7.7 Indexing custom XML 247
    Parsing using SAX 248 ■ Parsing and indexing using Apache Commons Digester 250
7.8 Alternatives 253
7.9 Summary 254

8 Essential Lucene extensions 255
8.1 Luke, the Lucene Index Toolbox 256
    Overview: seeing the big picture 257 ■ Document browsing 257 ■
    Using QueryParser to search 260 ■ Files and plugins view 261
8.2 Analyzers, tokenizers, and TokenFilters 262
    SnowballAnalyzer 264 ■ Ngram filters 265 ■ Shingle filters 267 ■
    Obtaining the contrib analyzers 267
8.3 Highlighting query terms 268
    Highlighter components 268 ■ Standalone highlighter example 271 ■
    Highlighting with CSS 272 ■ Highlighting search results 273
8.4 FastVectorHighlighter 275
8.5 Spell checking 277
    Generating a suggestions list 278 ■ Selecting the best suggestion 280 ■
    Presenting the result to the user 281 ■ Some ideas to improve spell checking 281
8.6 Fun and interesting Query extensions 283
    MoreLikeThis 283 ■ FuzzyLikeThisQuery 284 ■ BoostingQuery 284 ■ TermsFilter 284 ■
    DuplicateFilter 285 ■ RegexQuery 285
8.7 Building contrib modules 286
    Get the sources 286 ■ Ant in the contrib directory 286
8.8 Summary 287

9 Further Lucene extensions 288
9.1 Chaining filters 289
9.2 Storing an index in Berkeley DB 292
9.3 Synonyms from WordNet 294
    Building the synonym index 295 ■ Tying WordNet synonyms into an analyzer 297
9.4 Fast memory-based indices 298
9.5 XML QueryParser: Beyond “one box” search interfaces 299
    Using XmlQueryParser 300 ■ Extending the XML query syntax 304
9.6 Surround query language 306
9.7 Spatial Lucene 308
    Indexing spatial data 308 ■ Searching spatial data 312 ■
    Performance characteristics of Spatial Lucene 314
9.8 Searching multiple indexes remotely 316
9.9 Flexible QueryParser 320
9.10 Odds and ends 322
9.11 Summary 323

10 Using Lucene from other programming languages 325
10.1 Ports primer 326
    Trade-offs 327 ■ Choosing the right port 328
10.2 CLucene (C++) 328
    Motivation 329 ■ API and index compatibility 330 ■ Supported platforms 332 ■
    Current and future work 332
10.3 Lucene.Net (C# and other .NET languages) 332
    API compatibility 334 ■ Index compatibility 335
10.4 KinoSearch and Lucy (Perl) 335
    KinoSearch 336 ■ Lucy 338 ■ Other Perl options 338
10.5 Ferret (Ruby) 338
10.6 PHP 340
    Zend Framework 340 ■ PHP Bridge 341
10.7 PyLucene (Python) 341
    API compatibility 342 ■ Other Python options 343
10.8 Solr (many programming languages) 343
10.9 Summary 344

11 Lucene administration and performance tuning 345
11.1 Performance tuning 346
    Simple performance-tuning steps 347 ■ Testing approach 348 ■
    Tuning for index-to-search delay 349 ■ Tuning for indexing throughput 350 ■
    Tuning for search latency and throughput 354
11.2 Threads and concurrency 356
    Using threads for indexing 357 ■ Using threads for searching 361
11.3 Managing resource consumption 364
    Disk space 364 ■ File descriptors 367 ■ Memory 371
11.4 Hot backups of the index 374
    Creating the backup 374 ■ Restoring the index 376
11.5 Common errors 376
    Index corruption 377 ■ Repairing an index 378
11.6 Summary 378

PART 3 CASE STUDIES.................................................. 381

12 Case study 1: Krugle
   Krugle: Searching source code 383
12.1 Introducing Krugle 384
12.2 Appliance architecture 385
12.3 Search performance 386
12.4 Parsing source code 387
12.5 Substring searching 388
12.6 Query vs. search 391
12.7 Future improvements 391
    FieldCache memory usage 392 ■ Combining indexes 392
12.8 Summary 392

13 Case study 2: SIREn
   Searching semistructured documents with SIREn 394
13.1 Introducing SIREn 395
13.2 SIREn’s benefits 396
    Searching across all fields 398 ■ A single efficient lexicon 398 ■ Flexible fields 398 ■
    Efficient handling of multivalued fields 398
13.3 Indexing entities with SIREn 399
    Data model 399 ■ Implementation issues 400 ■ Index schema 400 ■
    Data preparation before indexing 401
13.4 Searching entities with SIREn 402
    Searching content 402 ■ Restricting search within a cell 403 ■
    Combining cells into tuples 404 ■ Querying an entity description 404
13.5 Integrating SIREn in Solr 405
13.6 Benchmark 405
13.7 Summary 407

14 Case study 3: LinkedIn
   Adding facets and real-time search with Bobo Browse and Zoie 409
14.1 Faceted search with Bobo Browse 410
    Bobo Browse design 410 ■ Beyond simple faceting 415
14.2 Real-time search with Zoie 416
    Zoie architecture 418 ■ Real-time vs. near-real-time 421 ■
    Documents and indexing requests 421 ■ Custom IndexReaders 423 ■
    Comparison with Lucene near-real-time search 424 ■ Distributed search 425
14.3 Summary 427

appendix a Installing Lucene 428
appendix b Lucene index format 433
appendix c Lucene/contrib benchmark 443
appendix d Resources 465

index 469

foreword
Lucene started as a self-serving project. In late 1997, my job uncertain, I sought something of my own to market. Java was the hot new programming language, and I
needed an excuse to learn it. I already knew how to write search software, and thought
I might fill a niche by writing search software in Java. So I wrote Lucene.
In 2000, I realized that I didn’t like to market stuff. I had no interest in negotiating
licenses and contracts, and I didn’t want to hire people and build a company. I liked
writing software, not selling it. So I tossed Lucene up on SourceForge, to see if open
source might let me keep doing what I liked.
A few folks started using Lucene right away. In 2001, folks at Apache offered to
adopt Lucene. The number of daily messages on the Lucene mailing lists grew
steadily. Code contributions started to trickle in. Most were additions around the
edges of Lucene: I was still the only active developer who fully grokked its core. Still,
Lucene was on the road to becoming a real collaborative project.
Now, in 2010, Lucene has a pool of active developers with deep understanding of
its core. I’m no longer involved in day-to-day development; substantial additions and
improvements are regularly made by this strong team.
Through the years, Lucene has been translated into several other programming
languages, including C++, C#, Perl, and Python. In the original Java, and in these
other incarnations, Lucene is used much more widely than I ever would have
dreamed. It powers search in diverse applications like discussion groups at Fortune
100 companies, commercial bug trackers, email search supplied by Microsoft, and a
web search engine that scales to billions of pages. When, at industry events, I am introduced to someone as the “Lucene guy,” more often than not folks tell me how they’ve
used Lucene in a project. I figure I’ve only heard about a small fraction of all Lucene
applications.
Lucene is much more widely used than it ever would have been if I had tried to sell
it. Application developers seem to prefer open source. Instead of having to contact
technical support when they have a problem (and then wait for an answer, hoping
they were correctly understood), they can frequently just look at the source code to
diagnose their problems. If that’s not enough, the free support provided by peers on
the mailing lists is better than most commercial support. A functioning open-source
project like Lucene makes application developers more efficient and productive.
Lucene, through open source, has become something much greater than I ever
imagined it would. I set it going, but it took the combined efforts of the Lucene community to make it thrive.
So what’s next for Lucene? I can’t predict the future. What I do know is that even
after over 10 years in existence, Lucene is still going strong, and its user and development communities are bigger and busier than ever, in part thanks to the first edition
of Lucene in Action making it easier for more people to get started with Lucene. With
every new release Lucene is getting better, more mature, more feature-rich, and faster.
Since the first edition of Lucene in Action was published in 2004, Lucene internals
and its API have gone through radical changes that called for more than just minor
book updates. In this totally revised second edition, the authors bring you up to speed
on the latest improvements and new APIs in Lucene.
Armed with the second edition of Lucene in Action, you too are now a member of
the Lucene community, and it’s up to you to take Lucene to new places. Bon voyage!
DOUG CUTTING
FOUNDER OF LUCENE,
NUTCH, AND HADOOP

preface
I first started with Lucene about a year after the first edition of Lucene in Action was
published. I already had experience building search engines, but didn’t know much
about Lucene in particular. So, I picked up a copy of Lucene in Action by Erik and Otis
and read it, cover to cover, and I was hooked!
As I used Lucene, I found small improvements here and there, so I started contributing small patches, updating javadocs, discussing topics on Lucene’s mailing lists,
and so forth. I eventually became an active core committer and PMC member, committing many changes over the years.
It has now been five-and-a-half years since the first edition of Lucene in Action was
published, which is practically an eternity in the fast-paced world of open source
development! Lucene has gone through two major releases, and now has all sorts of
new functionality, such as numeric fields, the reusable analysis API, payloads, near-real-time search, and transactional APIs for indexing and searching.
When Manning first approached me, it was clear that a second edition was sorely
needed. Furthermore, as one of the active core committers largely responsible for
committing so many of these changes, I felt rather obligated to create the second edition. So I said yes, and then worked fiendishly to cover Lucene’s changes, and I’m
quite happy with the results. I hope this Second Edition of Lucene in Action will serve
you well as you create your search applications, and I look forward to seeing you on
the user and developer lists, asking your own interesting questions, and continuing to
drive Lucene's relentless growth!
MICHAEL MCCANDLESS

preface to the first edition
From Erik Hatcher
I’ve been intrigued with searching and indexing from the early days of the Internet. I
have fond memories (circa 1991) of managing an email list using majordomo, MUSH
(Mail User’s Shell), and a handful of Perl, awk, and shell scripts. I implemented a CGI
web interface to allow users to search the list archives and other users’ profiles using
grep tricks under the covers. Then along came Yahoo!, AltaVista, and Excite, all of which
I visited regularly.
After my first child, Jakob, was born, my digital photo archive began growing rapidly. I was intrigued with the idea of developing a system to manage the pictures so
that I could attach meta-data to each picture, such as keywords and date taken, and, of
course, locate the pictures easily in any dimension I chose. In the late 1990s, I prototyped a filesystem-based approach using Microsoft technologies, including Microsoft
Index Server, Active Server Pages, and a third COM component for image manipulation. At the time, my professional life was consumed with these same technologies. I
was able to cobble together a compelling application in a couple of days of spare-time
hacking.
My professional life shifted toward Java technologies, and my computing life consisted of less and less Microsoft Windows. In an effort to reimplement my personal
photo archive and search engine in Java technologies in an operating system–agnostic
way, I came across Lucene. Lucene’s ease of use far exceeded my expectations—I had
experienced numerous other open-source libraries and tools that were far simpler
conceptually yet far more complex to use.
In 2001, Steve Loughran and I began writing Java Development with Ant (Manning).
We took the idea of an image search engine application and generalized it as a document search engine. This application example is used throughout the Ant book and
can be customized as an image search engine. The tie to Ant comes not only from a
simple compile-and-package build process but also from a custom Ant task, <index>,
we created that indexes files during the build process using Lucene. This Ant task now
lives in Lucene’s Sandbox and is described in section 8.4 of the first edition.
This Ant task is in production use for my custom blogging system, which I call
BlogScene. I run an Ant build process, after creating a blog entry, which indexes new entries and uploads them to my server. My blog
server consists of a servlet, some Velocity templates, and a Lucene index, allowing for
rich queries, even syndication of queries. Compared to other blogging systems, BlogScene is vastly inferior in features and finesse, but the full-text search capabilities are
very powerful.
I’m now working with the Applied Research in Patacriticism group at the University of Virginia, where I’m putting my text analysis,
indexing, and searching expertise to the test and stretching my mind with discussions
of how quantum physics relates to literature. “Poets are the unacknowledged engineers of the world.”

From Otis Gospodnetić
My interest in and passion for information retrieval and management began during
my student years at Middlebury College. At that time, I discovered an immense source
of information known as the Web. Although the Web was still in its infancy, the long-term need for gathering, analyzing, indexing, and searching was evident. I became
obsessed with creating repositories of information pulled from the Web, began writing
web crawlers, and dreamed of ways to search the collected information. I viewed
search as the killer application in a largely uncharted territory. With that in the back
of my mind, I began the first in my series of projects that share a common denominator: gathering and searching information.
In 1995, fellow student Marshall Levin and I created WebPh, an open-source program used for collecting and retrieving personal contact information. In essence, it
was a simple electronic phone book with a web interface (CGI), one of the first of its
kind at that time. (In fact, it was cited as an example of prior art in a court case in the
late 1990s!) Universities and government institutions around the world have been the
primary adopters of this program, and many are still using it. In 1997, armed with my
WebPh experience, I proceeded to create Populus, a popular white pages at the time.
Even though the technology (similar to that of WebPh) was rudimentary, Populus carried its weight and was a comparable match to the big players such as WhoWhere, Bigfoot, and Infospace.
After two projects that focused on personal contact information, it was time to
explore new territory. I began my next venture, Infojump, which involved culling
high-quality information from online newsletters, journals, newspapers, and magazines. In addition to my own software, which consisted of large sets of Perl modules
and scripts, Infojump utilized a web crawler called Webinator and a full-text search
product called Texis. The service provided by Infojump in 1998 was much like that of
FindArticles.com today.
Although WebPh, Populus, and Infojump served their purposes and were fully
functional, they all had technical limitations. The missing piece in each of them was a
powerful information-retrieval library that would allow full-text searches backed by
inverted indexes. Instead of trying to reinvent the wheel, I started looking for a solution that I suspected was out there. In early 2000, I found Lucene, the missing piece
I’d been looking for, and I fell in love with it.
I joined the Lucene project early on when it still lived at SourceForge and, later, at
the Apache Software Foundation when Lucene migrated there in 2002. My devotion
to Lucene stems from its being a core component of many ideas that had queued up
in my mind over the years. One of those ideas was Simpy, my latest pet project. Simpy
is a feature-rich personal web service that lets users tag, index, search, and share information found online. It makes heavy use of Lucene, with thousands of its indexes, and
is powered by Nutch, another project of Doug Cutting’s (see chapter 10 of the first
edition). My active participation in the Lucene project resulted in an offer from Manning to co-author Lucene in Action with Erik Hatcher.

Lucene in Action is the most comprehensive source of information about Lucene.
The information contained in the chapters encompasses all the knowledge you need
to create sophisticated applications built on top of Lucene. It’s the result of a very
smooth and agile collaboration process, much like that within the Lucene community.
Lucene and Lucene in Action exemplify what people can achieve when they have similar interests, the willingness to be flexible, and the desire to contribute to the global
knowledge pool, despite the fact that they have yet to meet in person.

acknowledgments
We are sincerely and humbly indebted to Doug Cutting. Without Doug’s generosity to
the world, there would be no Lucene. Without the other Lucene committers, Lucene
would have far fewer features, more bugs, and a much tougher time thriving with its
growing adoption. Many thanks to all the committers, past and present. Similarly, we
thank all those who contributed the case studies that appear in chapters 12, 13 and 14:
Michele Catasta, Renaud Delbru, Mikkel Kamstrup Erlandsen, Toke Eskildsen, Robert
Fuller, Grant Glouser, Ken Krugler, Jake Mannix, Nickolai Toupikov, Giovanni Tummarello, Mads Villadsen, and John Wang. We’d also like to thank Doug Cutting for
penning the foreword to the second edition.
Our thanks to the staff at Manning, including Marjan Bace, Jeff Bleiel, Sebastian
Stirling, Karen Tegtmeyer, Liz Welch, Elizabeth Martin, Dottie Marsico, Mary Piergies,
and Marija Tudor. Manning rounded up a great set of reviewers, whom we thank for
improving our drafts into the book you now read. The reviewers include Chad Davis,
Dave Pawson, Rob Allen, Rick Wagner, Michele Galli, Robi Sen, Stuart Caborn, Jeremy
Flowers, Robert Hanson, Rodney Woodruff, Anton Mazkovoi, Ramarao Kanneganti,
Matt Payne, Curtis Miller, Nathan Levesque, Cos DiFazio, and Andy Dingley. Extra-special thanks go to Shai Erera for his technical editing. Thank you to all our MEAP
readers who posted feedback on Manning’s forums.

Michael McCandless
Writing a book is not easy. Writing a book about something as technically rich as
Lucene is especially challenging. Writing a book about a successful, active, and fast-moving open-source project is nearly impossible! Many things had to happen right for
me to start and finish this book.
I would never have been part of this book without Doug having the initial itch,
technical strength, and generosity to open-source his idea, without a vibrant community relentlessly pushing Lucene forward, without a forward-looking IBM supporting
my involvement with Lucene and this book, and without Erik and Otis writing the
first edition.
My four kids—Mia, Kyra, Joel, Kyle—always inspire me, with everything they do.
Their boundless energy, free thinking, infinite series of insightful questions, amazing
happiness, insatiable curiosity, gentle persistence, free sense of humor, sheer passion,
temper tantrums, and sharp minds keep me very young at heart and inspire me to
tackle big projects like this. You should strive, always, to remain a child.
I thank my wife, Jane, for convincing me to pursue this when Manning came
knocking, and for her unmatched skills in efficiently running our busy family.
Remarkably, she has made lots of time for me to work, write this book and still pursue
all my crazy hobbies, and I can see that this ability is very rare.
My parents, all four of them, raised me with the courage to always stretch myself in
what I try to tackle, but also with the discipline and persistence to finish what I start.
They taught me integrity: if you commit to do something, you do it well. Always underpromise and overdeliver. They also led by example, showing me that individuals can
do big things when they work hard. More importantly, they taught me that you should
spend your life doing the things you love. Life is far too short to do otherwise.

Erik Hatcher
First, and really only, heartfelt thanks go to none other than Mike McCandless. He has
pretty much single-handedly revised this book from its 1.0 release to the current spiffy
“3.0” state. Mike approaches Lucene, this book, and life in general enthusiastically,
with eagerness to tackle any task at hand. The first edition acknowledgments also very
much apply here, as these influences are timelessly felt.
I personally thank Otis for his efforts with this book. Although we’ve yet to meet in
person, Otis has been a joy to work with. He and I have gotten along well and have
agreed on the structure and content of this book throughout. Thanks to Java Java in
Charlottesville, Virginia, for keeping me wired and wireless; thanks, also, to Greenberry’s for staying open later than Java Java and keeping me out of trouble by not having internet access (update: they now have wi-fi, much to the dismay of my
productivity). The people I’ve surrounded myself with enrich my life more than anything. David Smith has been a life-long mentor, and his brilliance continues to challenge me; he gave me lots of food for thought regarding Lucene visualization (most of
which I’m still struggling to fully grasp, and I apologize that it didn’t make it into this
manuscript). Jay Zimmerman and the No Fluff, Just Stuff symposium circuit have
been dramatically influential for me. The regular NFJS speakers, including Dave
Thomas, Stuart Halloway, James Duncan Davidson, Jason Hunter, Ted Neward, Ben
