Trey Grainger
Timothy Potter
FOREWORD BY Yonik Seeley
MANNING
www.it-ebooks.info
Solr in Action
TREY GRAINGER
TIMOTHY POTTER
MANNING
SHELTER ISLAND
For online information and ordering of this and other Manning books, please visit
www.manning.com. The publisher offers discounts on this book when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email:
©2014 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by means electronic, mechanical, photocopying, or otherwise, without prior written
permission of the publisher.
Photographs in this book were created by Martin Evans and Jordan Hochenbaum, unless
otherwise noted. Illustrations were created by Martin Evans, Joshua Noble, and Jordan
Hochenbaum. Fritzing (fritzing.org) was used to create some of the circuit diagrams.
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Development editors: Elizabeth Lexleigh, Susan Conant
Copyeditor: Melinda Rankin
Proofreader: Elizabeth Martin
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor
ISBN: 9781617291029
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 19 18 17 16 15 14
brief contents
PART 1 MEET SOLR
1 ■ Introduction to Solr
2 ■ Getting to know Solr
3 ■ Key Solr concepts
4 ■ Configuring Solr
5 ■ Indexing
6 ■ Text analysis
PART 2 CORE SOLR CAPABILITIES
7 ■ Performing queries and handling results
8 ■ Faceted search
9 ■ Hit highlighting
10 ■ Query suggestions
11 ■ Result grouping/field collapsing
12 ■ Taking Solr to production
PART 3 TAKING SOLR TO THE NEXT LEVEL
13 ■ SolrCloud
14 ■ Multilingual search
15 ■ Complex query operations
16 ■ Mastering relevancy
contents
foreword
preface
acknowledgments
about this book
PART 1 MEET SOLR
1 Introduction to Solr
1.1 Why do I need a search engine?
    Managing text-centric data ■ Common search-engine use cases
1.2 What is Solr?
    Information retrieval engine ■ Flexible schema management ■ Java web application
    Multiple indexes in one server ■ Extendable (plugins) ■ Scalable ■ Fault-tolerant
1.3 Why Solr?
    Solr for the software architect ■ Solr for the system administrator
    Solr for the CEO
1.4 Features overview
    User-experience features ■ Data-modeling features ■ New features in Solr 4
1.5 Summary
2 Getting to know Solr
2.1 Getting started
    Installing Solr ■ Starting the Solr example server ■ Understanding Solr home
    Indexing the example documents
2.2 Searching is what it’s all about
    Exploring Solr’s query form ■ What comes back from Solr when you search
    Ranked retrieval ■ Paging and sorting ■ Expanded search features
2.3 Tour of the Solr administration console
2.4 Adapting the example to your needs
2.5 Summary
3 Key Solr concepts
3.1 Searching, matching, and finding content
    What is a document? ■ The fundamental search problem ■ The inverted index
    Terms, phrases, and Boolean logic ■ Finding sets of documents
    Phrase queries and term positions ■ Fuzzy matching ■ Quick recap
3.2 Relevancy
    Default similarity ■ Term frequency ■ Inverse document frequency ■ Boosting
    Normalization factors
3.3 Precision and Recall
    Precision ■ Recall ■ Striking the right balance
3.4 Searching at scale
    The denormalized document ■ Distributed searching ■ Clusters vs. servers
    The limits of Solr
3.5 Summary
4 Configuring Solr
4.1 Overview of solrconfig.xml
    Common XML data-structure and type elements ■ Applying configuration changes
    Miscellaneous settings
4.2 Query request handling
    Request-handling overview ■ Search handler
    Browse request handler for Solritas: an example
    Extending query processing with search components
4.3 Managing searchers
    New searcher overview ■ Warming a new searcher
4.4 Cache management
    Cache fundamentals ■ Filter cache ■ Query result cache ■ Document cache
    Field value cache
4.5 Remaining configuration options
4.6 Summary
5 Indexing
5.1 Example microblog search application
    Representing content for searching ■ Overview of the Solr indexing process
5.2 Designing your schema
    Document granularity ■ Unique key ■ Indexed fields ■ Stored fields
    Preview of schema.xml
5.3 Defining fields in schema.xml
    Required field attributes ■ Multivalued fields ■ Dynamic fields ■ Copy fields
    Unique key field
5.4 Field types for structured nontext fields
    String fields ■ Date fields ■ Numeric fields ■ Advanced field type attributes
5.5 Sending documents to Solr for indexing
    Indexing documents using XML or JSON
    Using the SolrJ client library to add documents from Java
    Other tools for importing documents into Solr
5.6 Update handler
    Committing documents to the index ■ Transaction log ■ Atomic updates
5.7 Index management
    Index storage ■ Segment merging
5.8 Summary
6 Text analysis
6.1 Analyzing microblog text
6.2 Basic text analysis
    Analyzer ■ Tokenizer ■ Token filter ■ StandardTokenizer
    Removing stop words with StopFilterFactory
    LowerCaseFilterFactory: lowercase letters in terms
    Testing your analysis with Solr’s analysis form
6.3 Defining a custom field type for microblog text
    Collapsing repeated letters with PatternReplaceCharFilterFactory
    Preserving hashtags, mentions, and hyphenated terms
    Removing diacritical marks using ASCIIFoldingFilterFactory
    Stemming with KStemFilterFactory
    Injecting synonyms at query time with SynonymFilterFactory
    Putting it all together
6.4 Advanced text analysis
    Advanced field attributes ■ Per-language text analysis
    Extending text analysis using a Solr plug-in
6.5 Summary
PART 2 CORE SOLR CAPABILITIES
7 Performing queries and handling results
7.1 The anatomy of a Solr request
    Request handlers ■ Search components ■ Query parsers
7.2 Working with query parsers
    Specifying a query parser ■ Local params
7.3 Queries and filters
    The fq and q parameters ■ Handling expensive filters
7.4 The default query parser (Lucene query parser)
    Lucene query parser syntax
7.5 Handling user queries (eDisMax query parser)
    eDisMax query parser overview ■ eDisMax query parameters
    Searching across multiple fields ■ Boosting queries and phrases ■ Field aliasing
    User-accessible fields ■ Minimum match ■ eDisMax benefits and drawbacks
7.6 Other useful query parsers
    Field query parser ■ Term and Raw query parsers
    Function and Function Range query parsers
    Nested queries and the Nested query parser ■ Boost query parser
    Prefix query parser ■ Spatial query parsers ■ Join query parser
    Switch query parser ■ Surround query parser ■ Max Score query parser
    Collapsing query parser
7.7 Returning results
    Choosing a response format ■ Choosing fields to return
    Paging through results
7.8 Sorting results
    Sorting by fields ■ Sorting by functions ■ Fuzzy sorting
7.9 Debugging query results
    Returning debug information
7.10 Summary
8 Faceted search
8.1 Navigating your content at a glance
8.2 Setting up test data
8.3 Field faceting
8.4 Query faceting
8.5 Range faceting
8.6 Filtering upon faceted values
    Applying filters to your facets ■ Safely filtering on faceted values
8.7 Multiselect faceting, keys, and tags
    Keys ■ Tags, excludes, and multiselect faceting
8.8 Beyond the basics
8.9 Summary
9 Hit highlighting
9.1 Overview of hit highlighting
9.2 How highlighting works
    Set up a new Solr core for UFO sightings
    Preprocess UFO sightings before indexing
    Exploring the UFO sightings dataset ■ Hit highlighting out of the box
    Nuts and bolts ■ Refining highlighter results
9.3 Improving performance using FastVectorHighlighter
9.4 PostingsHighlighter
9.5 Summary
10 Query suggestions
10.1 Spell-check
    Indexing Wikipedia articles ■ Spell-check example
    Spell-check search component
10.2 Autosuggesting query terms
    Autosuggest request handler ■ Autosuggest search component
10.3 Suggesting document field values
    Using n-grams for suggestions ■ N-gram-driven request handler
10.4 Suggesting queries based on user activity
10.5 Summary
11 Result grouping/field collapsing
11.1 Result grouping vs. field collapsing
11.2 Skipping duplicate documents
11.3 Returning multiple documents per group
11.4 Grouping by functions and queries
    Grouping by function ■ Grouping by query
11.5 Paging and sorting grouped results
11.6 Grouping gotchas
    Faceting upon result groups ■ Distributed result grouping
    Returning a flat list ■ Grouping on multivalued and tokenized fields
    Grouping performance
11.7 Efficient field collapsing with the collapsing query parser
11.8 Summary
12 Taking Solr to production
12.1 Developing a Solr distribution
12.2 Deploying Solr
    Building your Solr distribution ■ Embedded Solr
12.3 Hardware and server configuration
    RAM and SSDs ■ JVM settings ■ The index shuffle ■ Useful system tricks
12.4 Data acquisition strategies
12.5 Sharding and replication
    Choosing to shard ■ Choosing to replicate
12.6 Solr core management
12.7 Managing clusters of servers
    Load balancers and Solr health check
    Generic vs. customized configuration
12.8 Querying and interacting with Solr
    REST API ■ Available Solr client libraries ■ Using SolrJ from Java
12.9 Monitoring Solr’s performance
    Solr’s Plugins / Stats page ■ Solr cache performance
    Pulling stats from request handlers and MBeans
    External monitoring options ■ Solr logs ■ Load testing
12.10 Upgrading between Solr versions
12.11 Summary
PART 3 TAKING SOLR TO THE NEXT LEVEL
13 SolrCloud
13.1 Getting started with SolrCloud
    Starting Solr in cloud mode ■ Motivation behind the SolrCloud architecture
13.2 Core concepts
    Collections vs. cores ■ ZooKeeper ■ Choosing the number of shards and replicas
    Cluster-state management ■ Shard-leader election
    Important SolrCloud configuration settings
13.3 Distributed indexing
    Document shard assignment ■ Adding documents ■ Near real-time search
    Node recovery process
13.4 Distributed search
    Multistage query process ■ Distributed search limitations
13.5 Collections API
    Create a collection ■ Collection aliasing
13.6 Basic system-administration tasks
    Configuration updates ■ Rolling restart ■ Restarting a failed node
    Is node X active? ■ Adding a replica ■ Offsite backup
13.7 Advanced topics
    Custom hashing ■ Shard splitting
13.8 Summary
14 Multilingual search
14.1 Why linguistic analysis matters
14.2 Stemming vs. lemmatization
14.3 Stemming in action
14.4 Handling edge cases
    KeywordMarkerFilterFactory ■ StemmerOverrideFilterFactory
14.5 Available language libraries in Solr
    Language-specific analyzer chains ■ Dictionary-based stemming (Hunspell)
14.6 Searching content in multiple languages
    Separate field per language ■ Separate index per language
    Multiple languages in one field
    Creating a field type to handle multiple languages per field
14.7 Language identification
    Update processors for language identification
    Dynamically assigning detected language analyzers within a field
14.8 Summary
15 Complex query operations
15.1 Function queries
    Function syntax ■ Searching on functions ■ Returning functions like fields
    Sorting on functions ■ Available functions in Solr
    Implementing a custom function
15.2 Geospatial search
    Searching near a single point ■ Advanced geospatial search
15.3 Pivot faceting
15.4 Referencing external data
15.5 Cross-document and cross-index joins
15.6 Big data analytics with Solr
15.7 Summary
16 Mastering relevancy
16.1 The impact of relevancy tuning
16.2 Debugging the relevancy calculation
16.3 Relevancy boosting
    Per-field boosting ■ Per-term boosting ■ Payload boosting ■ Function boosting
    Term-proximity boosting ■ Elevating the relevancy of important documents
16.4 Pluggable Similarity class implementations
16.5 Personalized search and recommendations
    Search vs. recommendations ■ Attribute-based matching ■ Hierarchical matching
    More Like This ■ Concept-based matching ■ Geographical matching
    Collaborative filtering ■ Hybrid approaches
16.6 Creating a personalized search experience
16.7 Running relevancy experiments
16.8 Summary
appendix A Working with the Solr codebase
appendix B Language-specific field type configurations
appendix C Useful data import configurations
index
foreword
Solr has had a long and successful history, but a major new chapter began recently
with the advent of Solr 4 and SolrCloud. This is the perfect time for Solr in Action. With
clear examples, enlightening diagrams, and coverage from key concepts through the
newest features, Solr in Action will have you successfully using Solr in no time!
Solr was born out of necessity in 2004, at CNET Networks (now CBS Interactive), to
replace a commercial search engine being discontinued by the vendor. Even though I
had no formal search background when I started writing Solr, it felt like a very natural
fit, because I have always enjoyed making software “go fast.” I viewed Solr more as an
alternate type of datastore designed around an inverted index than as a full-text search
engine, and that has helped Solr extend beyond the legacy enterprise search market.
By the end of 2005, Solr was powering the search and faceted navigation of a number of CNET sites, and soon it was made open source. Solr was contributed to the
Apache Software Foundation in January 2006 and became a subproject of the Lucene
PMC (with Lucene Java as its sibling). There had always been a large degree of overlap
with Lucene (the core full-text search library used by Solr) committers, and in 2010
the projects were merged. Separate Lucene and Solr downloads would still be available, but they would be developed by a single unified team. Solr’s version number
jumped to match that of Lucene, and the releases have since been synchronized.
The recent Solr 4 release is a major milestone, adding SolrCloud—the set of highly
scalable features including distributed indexing with no single points of failure. The
NoSQL feature set was also expanded to include transaction logs, update durability,
optimistic concurrency, and atomic updates. Solr in Action, written by longtime Solr
power users and community members, Trey and Timothy, covers these important
recent Solr features and provides an excellent starting point for those new to Solr.
Solr is now used in more places than I could ever have imagined—from integrated
library systems to e-commerce platforms, analytics and business intelligence products,
content-management systems, internet searches, and more. It’s been rewarding to see
Solr grow from a few early adopters to a huge global community of helpful users and
active volunteers cooperatively pushing development forward.
Solr in Action gives you the knowledge and techniques you need to use Solr’s features
that have been under development since 2004. With Solr in Action in hand, you too are
now well equipped to join the global community and help take Solr to new heights!
YONIK SEELEY
CREATOR OF SOLR
preface
In 2008, I was asked to take over leadership of CareerBuilder’s search technology
team. We were using the Microsoft FAST search platform at the time, but realized that
search was too important to the success of our business for us to continue relying on a
commercial vendor instead of developing the domain expertise internally. I immediately began investigating open source alternatives such as Solr, which seemed to provide most of the key features needed for our products. By the summer of 2009, we
decided that we were ready to bring our search expertise in-house and convert our systems to Solr.
The timing was great. Lucene, the open source search library upon which Solr is
built, had become a full top-level Apache project in February 2005, and Solr, which had
been contributed to the Apache Software Foundation in 2006, had become a top-level
Apache project in January of 2007. Both technologies were reaching critical mass and
would soon be merged (in March 2010) into a unified project.
By the summer of 2010, our entire platform was converted to Solr. In the process,
we increased the speed of our searches, significantly reduced the number of servers
necessary to support our search infrastructure, dropped expensive licensing fees,
increased platform stability, and in-sourced much of the search expertise for which we
had previously been dependent on a commercial vendor.
Little did we know at that time how much additional value we would gain by bringing search in-house. We have been able to build entirely new suites of search-based
products—from traditional keyword and semantic search, to big data analytics products,
to real-time recommendation engines—utilizing Solr as a scalable search architecture
to handle billions of documents and millions of queries an hour across hundreds of
servers. We have entered the era of cloud services, elastic scalability, and an explosion
of data that we strive to make meaningful for society, and with Solr we are able to
tackle each of these challenges head-on.
When Manning approached me about writing Solr in Action, I was hesitant because
I knew it would be a large undertaking. My one requirement was that I needed a
strong coauthor, and that is exactly what I found in Timothy Potter. Tim also has years
of experience developing search-based solutions with Lucene and Solr. He has a
wealth of expertise building text analysis systems for social data and architecting real-time analytics solutions using Solr and other cutting-edge big data technologies. With
both of us having received so much help from the Solr community over the years and
with such a clear need for an example-driven guide to Solr, Tim and I are excited to
be able to provide Solr in Action to help the next generation of search engineers. It’s
the book we wish we’d had five years ago when we started with Solr, and we hope that
you find it to be useful, whether you are just getting introduced to Solr or are looking
to take your knowledge to the next level.
TREY GRAINGER
acknowledgments
Much like Solr, this book would not have been possible without the support of a large
community of dedicated people:
■ Lucene/Solr committers who not only write amazing code but also provide
  invaluable expertise and advice, all the while demonstrating patience with new
  members of the community
■ Active Lucene/Solr community members who contribute code, update the wiki
  and other documentation, and answer questions on the Lucene and Solr mailing lists
■ Yonik Seeley, original creator of Solr, who contributed the foreword to our book
■ Our Manning Early Access Program (MEAP) readers who posted comments in
  the Author Online forum
■ The reviewers who provided valuable feedback throughout the development
  process: Alexandre Madurell, Ammar Alrashed, Brandon Harper, Chris Nauroth,
  Craig Smith, Edward Welker, Gregor Zurowski, John Viviano, Leo Cassarani,
  Robert Petersen, Scott Anthony, Sopan Shewale, and Uma Maheshwar Rao
  Gunuganti
■ Ivan Todorović and John Guthrie, who provided a detailed technical proofread
  of the manuscript shortly before it went into production
■ Our Manning editors, Elizabeth Lexleigh, Susan Conant, Melinda Rankin,
  Elizabeth Martin, and Janet Vail
■ Bert Bates at Manning for helping us improve the instructional quality of
  our writing
■ Family and friends who supported us through the many hours of research
  and writing
TREY GRAINGER
First and foremost, I would like to thank my amazing wife, Lindsay, for her support
and patience during the many long days and nights it took to write this book. Without
her understanding and help throughout the journey, this book would have never
been possible (especially with the birth of our daughter midway through the project).
I would also like to thank Paula and Steven Woolf for the countless hours they
spent watching Melodie so that I could push this project to completion. Finally, I
would like to thank the team at CareerBuilder—both the company leadership and my
Search team—for giving me the opportunity to work with such great people and to
build a cutting-edge search platform that benefits society in such a clear way.
TIMOTHY POTTER
I would like to thank Sharon Russom, my mother, for instilling a love of learning and
books early in my childhood, and David Potter, my father, for all of his support
throughout college and my career. This book would not have been possible without
the help of Lori Joy. Thank you for your support and for being understanding during the late evenings and missed weekends, and for being a sounding board early in
the writing process.
I also thank my former team at the Dachis Group. I could not have done this without their insightful questions about Solr and their giving me the opportunity to build
a large-scale search solution using Solr.
about this book
Whether handling big data, building cloud-based services, or developing multitenant
web applications, it’s vital to have a fast, reliable search solution. Apache Solr is a scalable and ready-to-deploy open source full-text search engine powered by Lucene. It
offers key features like multilingual keyword searching, faceted search, intelligent
matching, content clustering, and relevancy weighting right out of the box.
Solr in Action is the definitive guide to implementing fast and scalable search using
Apache Solr. It uses well-documented examples ranging from basic keyword searching
to scaling a system for billions of documents and queries. With this book, you’ll gain a
deep understanding of how to implement core Solr capabilities such as faceted navigation through search results, matched snippet highlighting, field collapsing and
search results grouping, spell-checking, query autocomplete, querying by functions,
and more. You’ll also see how to take Solr to the next level, with deep coverage of
large-scale production use cases, sophisticated multilingual search, complex query
operations, and advanced relevancy tuning strategies.
Roadmap
Solr in Action is divided into three parts: “Meet Solr,” “Core Solr capabilities,” and “Taking Solr to the next level.” If you are new to Solr and to search in general, we strongly
recommend that you read the chapters in part 1 in order, as many of the concepts presented in these chapters build on each other.
The concepts covered in part 2 were chosen because they are common features of
most search applications. You can safely skip any chapter in part 2 that may not apply
to your current needs. For example, result grouping is a common feature in many
search engines, but if your data doesn’t require grouping, then you can safely skip
chapter 11.
The four chapters (13–16) in part 3 are the most challenging as they introduce
advanced topics, including multilingual search, running Solr in a large-scale cluster
environment, advanced data operations, and relevancy tuning.
Most of the chapters use hands-on activities to help you work through the material.
Our goal for each example was that it be easy to use but cover the chapter topic thoroughly. In many examples, we used data from real-world datasets so that you would get
exposure to working with realistic use cases.
Chapter 1 introduces the type of data and use cases Solr was designed to handle.
You’ll learn about the kinds of problems you can solve with Solr and gain an overview
of its key features. Solr 4 is a significant milestone for the Lucene/Solr project, so
even if you’re an expert on previous versions of Solr, we encourage you to read chapter 1 to get a sense for all the new and exciting features in Solr 4.
Chapter 2 shows how to install and run Solr on your local workstation. After starting Solr, we demonstrate how to index and query a set of example documents that
ship with Solr. We also take a brief tour of Solr’s web-based administration console.
Chapter 3 introduces general search theory and how Solr implements that theory in practice. Most interestingly, this chapter covers the inverted search index
and how relevancy scoring works to present the most relevant documents at the top
of search results. Even if you have worked with Solr in the past, we recommend
reading this chapter to refresh your understanding of the fundamental operations
in a search engine.
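As a rough illustration of the inverted index idea mentioned for chapter 3, the following Python sketch maps each term to the set of documents that contain it, then intersects those sets to answer a multi-term query. It is a toy model under stated assumptions, not how Lucene or Solr store an index: the function names `build_inverted_index` and `search_all_terms`, the sample documents, and the whitespace tokenization are all hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return the sorted ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return sorted(matches)


# Hypothetical sample documents, for illustration only.
docs = {
    1: "new home for sale",
    2: "home improvement tips",
    3: "new car for sale",
}
index = build_inverted_index(docs)
print(search_all_terms(index, "new sale"))  # prints [1, 3]
```

Looking up a term is a dictionary access rather than a scan of every document, which is the property that makes this structure attractive for search.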
Chapter 4 shows the basics of Solr’s configuration, primarily focused on Solr’s
main configuration file: solrconfig.xml. Our aim in this chapter is to introduce the most
important configuration settings for Solr, particularly those that impact how Solr processes requests from client applications. The knowledge you gain in this chapter will
be applied throughout the rest of the book.
Chapter 5 teaches how Solr indexes documents, starting with a discussion of
another important configuration file: schema.xml. You’ll learn how to define fields to
represent structured data like numbers, dates, prices, and unique identifiers. We also
cover how update requests are processed and configured using solrconfig.xml.
Chapter 6 builds on the material in chapter 5 by showing how to index text fields
using text analysis. Solr was designed to efficiently search and rank documents requiring full-text search. Text analysis is an important part of the search process in that it
removes the linguistic variations between indexed text and queries.
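To make the idea of removing variations between indexed text and queries more concrete, here is a minimal Python sketch of an analysis step that lowercases text, splits it into tokens, and drops stop words. It is only an assumption-laden illustration: the `analyze` function, the regular expression, and the tiny stop list are hypothetical and are not Solr's analyzers or their configuration.

```python
import re

# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "to"}


def analyze(text):
    """Lowercase the text, split on non-alphanumeric characters, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


# The same analysis is applied to the indexed text and to the query,
# so superficial differences in case and filler words no longer matter.
indexed_terms = analyze("The Streets of San Francisco")
query_terms = analyze("streets san francisco")
print(indexed_terms == query_terms)  # prints True
```

Applying one function on both sides is the design point: a match depends on the normalized terms, not on how the author or the searcher happened to type them.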
At this point in the book, you’ll have a solid foundation and will be ready to put
Solr to work on your own search needs. As your knowledge of search and Solr
grows, so too will your need to go beyond basic keyword searching and implement
common search features such as advanced query parsing, hit highlighting, spell-checking, autosuggest, faceting, and result grouping.
In chapter 7, we cover how to construct queries and how they are executed.
You’ll learn about Solr’s many query parsers, as well as how to sort, format, return,
and debug search results.
In chapter 8, you’ll learn about one of the most powerful and popular features of
Solr—faceting. Solr’s faceting provides tools to refine search criteria and helps users
discover more information by categorizing search results into subgroups.
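The notion of categorizing search results into subgroups can be sketched in a few lines of Python: count how many matching documents carry each value of a field. This is a simplified stand-in for the concept, not Solr's faceting implementation or API; the `facet_counts` function, the field names, and the sample results are hypothetical.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many matching documents carry each value of the given field."""
    return Counter(doc[field] for doc in results if field in doc)


# Hypothetical search results, for illustration only.
results = [
    {"name": "Blue Ridge Pizza", "type": "restaurant", "city": "Atlanta"},
    {"name": "Peachtree Diner", "type": "restaurant", "city": "Atlanta"},
    {"name": "Windy City Grill", "type": "restaurant", "city": "Chicago"},
]
print(facet_counts(results, "city").most_common())
# prints [('Atlanta', 2), ('Chicago', 1)]
```

A user interface could show these counts next to each city so that a user can narrow the result set with one click.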
Chapter 9 explains how to highlight query terms in search results in order to
improve the user experience with your search solution.
In chapter 10, we cover spell-checking and autosuggestions. Solr’s autosuggest features allow a user to start typing a few characters and receive a list of suggested queries
as they type.
Chapter 11 explores Solr’s result grouping and field collapsing support to help you
return an optimal mix of search results when your index includes many similar documents, such as multiple locations of the same restaurant in a city.
Chapter 12 helps you prepare to deploy Solr in a production environment. This
chapter will help you plan your hardware and resource needs, as well as whether you
need to consider sharding and replication to handle a large number of documents
and query requests.
Chapter 13 covers a set of distributed features known as SolrCloud. You’ll learn
how to run Solr in cloud mode so that you can scale your search application to support a large volume of users and documents. You’ll come away from this chapter having a solid understanding of how Solr achieves scalability and fault tolerance by
distributing indexes across multiple servers.
Chapter 14 builds upon the text analysis concepts covered in chapter 6 by teaching you how to handle multilingual text in your search engine. If you need to work
with non-English text or support multiple languages in the same index, this chapter
is a must-read.
Chapter 15 explores advanced query features, including function queries, geospatial search, multilevel faceting, and cross-document and cross-index joins.
In chapter 16, you’ll learn techniques for improving the relevancy of your results,
such as boosting, scoring based upon functions, alternate similarity algorithms, and
debugging relevancy scores. In addition, we provide an in-depth discussion of using
Solr for personalized search and recommendations.
There are three appendixes, which cover a number of subtopics from earlier chapters in greater depth. Appendix A focuses on working with the Solr codebase and how
you can create your own custom Solr distribution if you need features or bug fixes not
available in an official release. This is an extension of some of the material from the
beginning of chapter 12.
Appendix B lists, in table format, out-of-the box configurations for many of the
languages Solr supports. This material is an extended version of the language configurations covered in chapter 14.
Appendix C highlights the Data Import Handler (DIH) in more detail (extending
coverage from chapters 10 and 12), demonstrating the steps necessary for importing a
number of large, publicly available datasets.
How to use this book
Solr in Action is designed to be accessible for any software engineer—no previous experience working with search engines is assumed. The topics covered rise in expertise
level throughout the book, and even the most seasoned Solr professionals are likely to
learn something from the last few chapters. The scope of the book is massive—coming in at over 600 pages—but the engaging and practical real-world examples and
careful balance between theory and practice make the book a real asset to anyone
using Solr, whether you are just getting started or have years of experience.
As mentioned above, the chapters in part 1 provide the foundation upon which
the rest of the book will be built, and they will be critical for anyone new to Solr.
These chapters should be read in sequence to give you the best overview of Solr and
search in general. If you are new to Solr, chapter 2 will show you how to start and use
Solr for the first time, and chapter 3 will provide the key search theory that the rest
of the book builds upon. Configuring your Solr server and setting up field types to
properly analyze your content round out the search topics needed to understand
Solr’s fundamentals.
Many of the chapters in part 2 can be skipped if your work does not include the
features discussed. In particular, chapters 9, 10, and 11 are largely standalone topics
that are not important for understanding later chapters, so you can skip them if you
are not planning on implementing hit highlighting, query suggestions, or result
grouping/field collapsing any time soon. Chapters 7 and 8 cover some of the most
commonly used features of many search applications, so you will want to at least skim
through them before putting the book away.
The remaining chapters cover some of the advanced topics surrounding Solr.
Tough challenges will be tackled, including scaling a cluster of servers, multilingual
search, complex query operations, and advanced relevancy techniques. While all
chapters in parts 2 and 3 build on part 1, chapter 13 (“SolrCloud”) additionally builds
on chapter 12 (“Taking Solr to production”), chapter 15 (“Complex query operations”) builds on chapters 7 (“Performing queries and handling results”) and 8 (“Faceted search”), and chapter 16 (“Mastering relevancy”) further builds on chapter 15.
In order to get the most benefit out of the book, be mindful not to skip any earlier
chapters that provide the necessary background for your understanding of these more
advanced topics.
Many of the chapters include executable examples that you can run as you read
along. These examples demonstrate new topics and provide you with the opportunity
for hands-on exploration of Solr’s capabilities—often through just hitting a running
Solr server from your web browser. While you do not have to run all of the examples
and can simply use them as reference configurations in many cases, running the