Solr in action

Trey Grainger
Timothy Potter
FOREWORD BY Yonik Seeley

MANNING

Solr in Action
TREY GRAINGER
TIMOTHY POTTER

MANNING
SHELTER ISLAND

For online information and ordering of this and other Manning books, please visit
www.manning.com. The publisher offers discounts on this book when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email:
©2014 by Manning Publications Co. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by means electronic, mechanical, photocopying, or otherwise, without prior written
permission of the publisher.
Photographs in this book were created by Martin Evans and Jordan Hochenbaum, unless
otherwise noted. Illustrations were created by Martin Evans, Joshua Noble, and Jordan
Hochenbaum. Fritzing (fritzing.org) was used to create some of the circuit diagrams.
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.
Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of
elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964

Development editors: Elizabeth Lexleigh, Susan Conant
Copyeditor: Melinda Rankin
Proofreader: Elizabeth Martin
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor

ISBN: 9781617291029
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 19 18 17 16 15 14

brief contents

PART 1 MEET SOLR
1 ■ Introduction to Solr
2 ■ Getting to know Solr
3 ■ Key Solr concepts
4 ■ Configuring Solr
5 ■ Indexing
6 ■ Text analysis

PART 2 CORE SOLR CAPABILITIES
7 ■ Performing queries and handling results
8 ■ Faceted search
9 ■ Hit highlighting
10 ■ Query suggestions
11 ■ Result grouping/field collapsing
12 ■ Taking Solr to production

PART 3 TAKING SOLR TO THE NEXT LEVEL
13 ■ SolrCloud
14 ■ Multilingual search
15 ■ Complex query operations
16 ■ Mastering relevancy

contents

foreword
preface
acknowledgments
about this book

PART 1 MEET SOLR

1 Introduction to Solr
1.1 Why do I need a search engine?
    Managing text-centric data ■ Common search-engine use cases
1.2 What is Solr?
    Information retrieval engine ■ Flexible schema management ■ Java web application ■ Multiple indexes in one server ■ Extendable (plugins) ■ Scalable ■ Fault-tolerant
1.3 Why Solr?
    Solr for the software architect ■ Solr for the system administrator ■ Solr for the CEO
1.4 Features overview
    User-experience features ■ Data-modeling features ■ New features in Solr 4
1.5 Summary

2 Getting to know Solr
2.1 Getting started
    Installing Solr ■ Starting the Solr example server ■ Understanding Solr home ■ Indexing the example documents
2.2 Searching is what it’s all about
    Exploring Solr’s query form ■ What comes back from Solr when you search ■ Ranked retrieval ■ Paging and sorting ■ Expanded search features
2.3 Tour of the Solr administration console
2.4 Adapting the example to your needs
2.5 Summary

3 Key Solr concepts
3.1 Searching, matching, and finding content
    What is a document? ■ The fundamental search problem ■ The inverted index ■ Terms, phrases, and Boolean logic ■ Finding sets of documents ■ Phrase queries and term positions ■ Fuzzy matching ■ Quick recap
3.2 Relevancy
    Default similarity ■ Term frequency ■ Inverse document frequency ■ Boosting ■ Normalization factors
3.3 Precision and Recall
    Precision ■ Recall ■ Striking the right balance
3.4 Searching at scale
    The denormalized document ■ Distributed searching ■ Clusters vs. servers ■ The limits of Solr
3.5 Summary

4 Configuring Solr
4.1 Overview of solrconfig.xml
    Common XML data-structure and type elements ■ Applying configuration changes ■ Miscellaneous settings
4.2 Query request handling
    Request-handling overview ■ Search handler ■ Browse request handler for Solritas: an example ■ Extending query processing with search components
4.3 Managing searchers
    New searcher overview ■ Warming a new searcher
4.4 Cache management
    Cache fundamentals ■ Filter cache ■ Query result cache ■ Document cache ■ Field value cache
4.5 Remaining configuration options
4.6 Summary

5 Indexing
5.1 Example microblog search application
    Representing content for searching ■ Overview of the Solr indexing process
5.2 Designing your schema
    Document granularity ■ Unique key ■ Indexed fields ■ Stored fields ■ Preview of schema.xml
5.3 Defining fields in schema.xml
    Required field attributes ■ Multivalued fields ■ Dynamic fields ■ Copy fields ■ Unique key field
5.4 Field types for structured nontext fields
    String fields ■ Date fields ■ Numeric fields ■ Advanced field type attributes
5.5 Sending documents to Solr for indexing
    Indexing documents using XML or JSON ■ Using the SolrJ client library to add documents from Java ■ Other tools for importing documents into Solr
5.6 Update handler
    Committing documents to the index ■ Transaction log ■ Atomic updates
5.7 Index management
    Index storage ■ Segment merging
5.8 Summary

6 Text analysis
6.1 Analyzing microblog text
6.2 Basic text analysis
    Analyzer ■ Tokenizer ■ Token filter ■ StandardTokenizer ■ Removing stop words with StopFilterFactory ■ LowerCaseFilterFactory: lowercase letters in terms ■ Testing your analysis with Solr’s analysis form
6.3 Defining a custom field type for microblog text
    Collapsing repeated letters with PatternReplaceCharFilterFactory ■ Preserving hashtags, mentions, and hyphenated terms ■ Removing diacritical marks using ASCIIFoldingFilterFactory ■ Stemming with KStemFilterFactory ■ Injecting synonyms at query time with SynonymFilterFactory ■ Putting it all together
6.4 Advanced text analysis
    Advanced field attributes ■ Per-language text analysis ■ Extending text analysis using a Solr plug-in
6.5 Summary

PART 2 CORE SOLR CAPABILITIES

7 Performing queries and handling results
7.1 The anatomy of a Solr request
    Request handlers ■ Search components ■ Query parsers
7.2 Working with query parsers
    Specifying a query parser ■ Local params
7.3 Queries and filters
    The fq and q parameters ■ Handling expensive filters
7.4 The default query parser (Lucene query parser)
    Lucene query parser syntax
7.5 Handling user queries (eDisMax query parser)
    eDisMax query parser overview ■ eDisMax query parameters ■ Searching across multiple fields ■ Boosting queries and phrases ■ Field aliasing ■ User-accessible fields ■ Minimum match ■ eDisMax benefits and drawbacks
7.6 Other useful query parsers
    Field query parser ■ Term and Raw query parsers ■ Function and Function Range query parsers ■ Nested queries and the Nested query parser ■ Boost query parser ■ Prefix query parser ■ Spatial query parsers ■ Join query parser ■ Switch query parser ■ Surround query parser ■ Max Score query parser ■ Collapsing query parser
7.7 Returning results
    Choosing a response format ■ Choosing fields to return ■ Paging through results
7.8 Sorting results
    Sorting by fields ■ Sorting by functions ■ Fuzzy sorting
7.9 Debugging query results
    Returning debug information
7.10 Summary

8 Faceted search
8.1 Navigating your content at a glance
8.2 Setting up test data
8.3 Field faceting
8.4 Query faceting
8.5 Range faceting
8.6 Filtering upon faceted values
    Applying filters to your facets ■ Safely filtering on faceted values
8.7 Multiselect faceting, keys, and tags
    Keys ■ Tags, excludes, and multiselect faceting
8.8 Beyond the basics
8.9 Summary

9 Hit highlighting
9.1 Overview of hit highlighting
9.2 How highlighting works
    Set up a new Solr core for UFO sightings ■ Preprocess UFO sightings before indexing ■ Exploring the UFO sightings dataset ■ Hit highlighting out of the box ■ Nuts and bolts ■ Refining highlighter results
9.3 Improving performance using FastVectorHighlighter
9.4 PostingsHighlighter
9.5 Summary

10 Query suggestions
10.1 Spell-check
    Indexing Wikipedia articles ■ Spell-check example ■ Spell-check search component
10.2 Autosuggesting query terms
    Autosuggest request handler ■ Autosuggest search component
10.3 Suggesting document field values
    Using n-grams for suggestions ■ N-gram-driven request handler
10.4 Suggesting queries based on user activity
10.5 Summary

11 Result grouping/field collapsing
11.1 Result grouping vs. field collapsing
11.2 Skipping duplicate documents
11.3 Returning multiple documents per group
11.4 Grouping by functions and queries
    Grouping by function ■ Grouping by query
11.5 Paging and sorting grouped results
11.6 Grouping gotchas
    Faceting upon result groups ■ Distributed result grouping ■ Returning a flat list ■ Grouping on multivalued and tokenized fields ■ Grouping performance
11.7 Efficient field collapsing with the collapsing query parser
11.8 Summary

12 Taking Solr to production
12.1 Developing a Solr distribution
12.2 Deploying Solr
    Building your Solr distribution ■ Embedded Solr
12.3 Hardware and server configuration
    RAM and SSDs ■ JVM settings ■ The index shuffle ■ Useful system tricks
12.4 Data acquisition strategies
12.5 Sharding and replication
    Choosing to shard ■ Choosing to replicate
12.6 Solr core management
12.7 Managing clusters of servers
    Load balancers and Solr health check ■ Generic vs. customized configuration
12.8 Querying and interacting with Solr
    REST API ■ Available Solr client libraries ■ Using SolrJ from Java
12.9 Monitoring Solr’s performance
    Solr’s Plugins / Stats page ■ Solr cache performance ■ Pulling stats from request handlers and MBeans ■ External monitoring options ■ Solr logs ■ Load testing
12.10 Upgrading between Solr versions
12.11 Summary

PART 3 TAKING SOLR TO THE NEXT LEVEL

13 SolrCloud
13.1 Getting started with SolrCloud
    Starting Solr in cloud mode ■ Motivation behind the SolrCloud architecture
13.2 Core concepts
    Collections vs. cores ■ ZooKeeper ■ Choosing the number of shards and replicas ■ Cluster-state management ■ Shard-leader election ■ Important SolrCloud configuration settings
13.3 Distributed indexing
    Document shard assignment ■ Adding documents ■ Near real-time search ■ Node recovery process
13.4 Distributed search
    Multistage query process ■ Distributed search limitations
13.5 Collections API
    Create a collection ■ Collection aliasing
13.6 Basic system-administration tasks
    Configuration updates ■ Rolling restart ■ Restarting a failed node ■ Is node X active? ■ Adding a replica ■ Offsite backup
13.7 Advanced topics
    Custom hashing ■ Shard splitting
13.8 Summary

14 Multilingual search
14.1 Why linguistic analysis matters
14.2 Stemming vs. lemmatization
14.3 Stemming in action
14.4 Handling edge cases
    KeywordMarkerFilterFactory ■ StemmerOverrideFilterFactory
14.5 Available language libraries in Solr
    Language-specific analyzer chains ■ Dictionary-based stemming (Hunspell)
14.6 Searching content in multiple languages
    Separate field per language ■ Separate index per language ■ Multiple languages in one field ■ Creating a field type to handle multiple languages per field
14.7 Language identification
    Update processors for language identification ■ Dynamically assigning detected language analyzers within a field
14.8 Summary

15 Complex query operations
15.1 Function queries
    Function syntax ■ Searching on functions ■ Returning functions like fields ■ Sorting on functions ■ Available functions in Solr ■ Implementing a custom function
15.2 Geospatial search
    Searching near a single point ■ Advanced geospatial search
15.3 Pivot faceting
15.4 Referencing external data
15.5 Cross-document and cross-index joins
15.6 Big data analytics with Solr
15.7 Summary

16 Mastering relevancy
16.1 The impact of relevancy tuning
16.2 Debugging the relevancy calculation
16.3 Relevancy boosting
    Per-field boosting ■ Per-term boosting ■ Payload boosting ■ Function boosting ■ Term-proximity boosting ■ Elevating the relevancy of important documents
16.4 Pluggable Similarity class implementations
16.5 Personalized search and recommendations
    Search vs. recommendations ■ Attribute-based matching ■ Hierarchical matching ■ More Like This ■ Concept-based matching ■ Geographical matching ■ Collaborative filtering ■ Hybrid approaches
16.6 Creating a personalized search experience
16.7 Running relevancy experiments
16.8 Summary

appendix A Working with the Solr codebase
appendix B Language-specific field type configurations
appendix C Useful data import configurations
index

foreword
Solr has had a long and successful history, but a major new chapter began recently
with the advent of Solr 4 and SolrCloud. This is the perfect time for Solr in Action. With
clear examples, enlightening diagrams, and coverage from key concepts through the
newest features, Solr in Action will have you successfully using Solr in no time!
Solr was born out of necessity in 2004, at CNET Networks (now CBS Interactive), to
replace a commercial search engine being discontinued by the vendor. Even though I
had no formal search background when I started writing Solr, it felt like a very natural
fit, because I have always enjoyed making software “go fast.” I viewed Solr more as an
alternate type of datastore designed around an inverted index than as a full-text search
engine, and that has helped Solr extend beyond the legacy enterprise search market.
By the end of 2005, Solr was powering the search and faceted navigation of a number of CNET sites, and soon it was made open source. Solr was contributed to the
Apache Software Foundation in January 2006 and became a subproject of the Lucene
PMC (with Lucene Java as its sibling). There had always been a large degree of overlap
with Lucene (the core full-text search library used by Solr) committers, and in 2010
the projects were merged. Separate Lucene and Solr downloads would still be available, but they would be developed by a single unified team. Solr’s version number
jumped to match that of Lucene, and the releases have since been synchronized.
The recent Solr 4 release is a major milestone, adding SolrCloud—the set of highly
scalable features including distributed indexing with no single points of failure. The
NoSQL feature set was also expanded to include transaction logs, update durability,
optimistic concurrency, and atomic updates. Solr in Action, written by longtime Solr
power users and community members, Trey and Timothy, covers these important
recent Solr features and provides an excellent starting point for those new to Solr.
Solr is now used in more places than I could ever have imagined—from integrated
library systems to e-commerce platforms, analytics and business intelligence products,
content-management systems, internet searches, and more. It’s been rewarding to see
Solr grow from a few early adopters to a huge global community of helpful users and
active volunteers cooperatively pushing development forward.
Solr in Action gives you the knowledge and techniques you need to use Solr’s features
that have been under development since 2004. With Solr in Action in hand, you too are
now well equipped to join the global community and help take Solr to new heights!
YONIK SEELEY
CREATOR OF SOLR

preface
In 2008, I was asked to take over leadership of CareerBuilder’s search technology
team. We were using the Microsoft FAST search platform at the time, but realized that
search was too important to the success of our business for us to continue relying on a
commercial vendor instead of developing the domain expertise internally. I immediately began investigating open source alternatives such as Solr, which seemed to provide most of the key features needed for our products. By the summer of 2009, we
decided that we were ready to bring our search expertise in-house and convert our systems to Solr.
The timing was great. Lucene, the open source search library upon which Solr is
built, had become a full top-level Apache project in February 2005, and Solr, which had
been contributed to the Apache Software Foundation in 2006, had become a top-level
Apache project in January of 2007. Both technologies were reaching critical mass and
would soon be merged (in March 2010) into a unified project.
By the summer of 2010, our entire platform was converted to Solr. In the process,
we increased the speed of our searches, significantly reduced the number of servers
necessary to support our search infrastructure, dropped expensive licensing fees,
increased platform stability, and in-sourced much of the search expertise for which we
had previously been dependent on a commercial vendor.
Little did we know at that time how much additional value we would gain by bringing search in-house. We have been able to build entirely new suites of search-based
products (from traditional keyword and semantic search, to big data analytics products,
to real-time recommendation engines), utilizing Solr as a scalable search architecture
to handle billions of documents and millions of queries an hour across hundreds of
servers. We have entered the era of cloud services, elastic scalability, and an explosion
of data that we strive to make meaningful for society, and with Solr we are able to
tackle each of these challenges head-on.
When Manning approached me about writing Solr in Action, I was hesitant because
I knew it would be a large undertaking. My one requirement was that I needed a
strong coauthor, and that is exactly what I found in Timothy Potter. Tim also has years
of experience developing search-based solutions with Lucene and Solr. He has a
wealth of expertise building text analysis systems for social data and architecting real-time analytics solutions using Solr and other cutting-edge big data technologies. With
both of us having received so much help from the Solr community over the years and
with such a clear need for an example-driven guide to Solr, Tim and I are excited to
be able to provide Solr in Action to help the next generation of search engineers. It’s
the book we wish we’d had five years ago when we started with Solr, and we hope that
you find it to be useful, whether you are just getting introduced to Solr or are looking
to take your knowledge to the next level.
TREY GRAINGER

acknowledgments
Much like Solr, this book would not have been possible without the support of a large
community of dedicated people:

■ Lucene/Solr committers who not only write amazing code but also provide
  invaluable expertise and advice, all the while demonstrating patience with new
  members of the community
■ Active Lucene/Solr community members who contribute code, update the wiki
  and other documentation, and answer questions on the Lucene and Solr mailing lists
■ Yonik Seeley, original creator of Solr, who contributed the foreword to our book
■ Our Manning Early Access Program (MEAP) readers who posted comments in
  the Author Online forum
■ The reviewers who provided valuable feedback throughout the development
  process: Alexandre Madurell, Ammar Alrashed, Brandon Harper, Chris Nauroth,
  Craig Smith, Edward Welker, Gregor Zurowski, John Viviano, Leo Cassarani,
  Robert Petersen, Scott Anthony, Sopan Shewale, and Uma Maheshwar Rao
  Gunuganti
■ Ivan Todorović and John Guthrie who provided a detailed technical proofread
  of the manuscript shortly before it went into production
■ Our Manning editors, Elizabeth Lexleigh, Susan Conant, Melinda Rankin,
  Elizabeth Martin, and Janet Vail
■ Bert Bates at Manning for helping us improve the instructional quality of
  our writing
■ Family and friends who supported us through the many hours of research
  and writing

TREY GRAINGER
First and foremost, I would like to thank my amazing wife, Lindsay, for her support
and patience during the many long days and nights it took to write this book. Without
her understanding and help throughout the journey, this book would have never
been possible (especially with the birth of our daughter midway through the project).
I would also like to thank Paula and Steven Woolf for the countless hours they
spent watching Melodie so that I could push this project to completion. Finally, I
would like to thank the team at CareerBuilder—both the company leadership and my
Search team—for giving me the opportunity to work with such great people and to
build a cutting-edge search platform that benefits society in such a clear way.
TIMOTHY POTTER
I would like to thank Sharon Russom, my mother, for instilling a love of learning and
books early in my childhood, and David Potter, my father, for all of his support
throughout college and my career. This book would not have been possible without
the help of Lori Joy. Thank you for your support and for being understanding during the late evenings and missed weekends, and for being a sounding board early in
the writing process.
I also thank my former team at the Dachis Group. I could not have done this without their insightful questions about Solr and their giving me the opportunity to build
a large-scale search solution using Solr.

about this book
Whether handling big data, building cloud-based services, or developing multitenant
web applications, it’s vital to have a fast, reliable search solution. Apache Solr is a scalable and ready-to-deploy open source full-text search engine powered by Lucene. It
offers key features like multilingual keyword searching, faceted search, intelligent
matching, content clustering, and relevancy weighting right out of the box.
Solr in Action is the definitive guide to implementing fast and scalable search using
Apache Solr. It uses well-documented examples ranging from basic keyword searching
to scaling a system for billions of documents and queries. With this book, you’ll gain a
deep understanding of how to implement core Solr capabilities such as faceted navigation through search results, matched snippet highlighting, field collapsing and
search results grouping, spell-checking, query autocomplete, querying by functions,
and more. You’ll also see how to take Solr to the next level, with deep coverage of
large-scale production use cases, sophisticated multilingual search, complex query
operations, and advanced relevancy tuning strategies.

Roadmap
Solr in Action is divided into three parts: “Meet Solr,” “Core Solr capabilities,” and “Taking Solr to the next level.” If you are new to Solr and to search in general, we strongly
recommend that you read the chapters in part 1 in order, as many of the concepts presented in these chapters build on each other.
The concepts covered in part 2 were chosen because they are common features of
most search applications. You can safely skip any chapter in part 2 that may not apply
to your current needs. For example, result grouping is a common feature in many
search engines, but if your data doesn’t require grouping, then you can safely skip
chapter 11.
The four chapters (13–16) in part 3 are the most challenging as they introduce
advanced topics, including multilingual search, running Solr in a large-scale cluster
environment, advanced data operations, and relevancy tuning.
Most of the chapters use hands-on activities to help you work through the material.
Our goal for each example was that it be easy to use but cover the chapter topic thoroughly. In many examples, we used data from real-world datasets so that you would get
exposure to working with realistic use cases.
Chapter 1 introduces the type of data and use cases Solr was designed to handle.
You’ll learn about the kinds of problems you can solve with Solr and gain an overview
of its key features. Solr 4 is a significant milestone for the Lucene/Solr project, so
even if you’re an expert on previous versions of Solr, we encourage you to read chapter 1 to get a sense for all the new and exciting features in Solr 4.
Chapter 2 shows how to install and run Solr on your local workstation. After starting Solr, we demonstrate how to index and query a set of example documents that
ship with Solr. We also take a brief tour of Solr’s web-based administration console.
Chapter 3 introduces general search theory and how Solr implements that theory in practice. Most interestingly, this chapter covers the inverted search index
and how relevancy scoring works to present the most relevant documents at the top
of search results. Even if you have worked with Solr in the past, we recommend
reading this chapter to refresh your understanding of the fundamental operations
in a search engine.
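To make the idea of an inverted index concrete, here is a minimal sketch in Python. It is an illustration only, not Solr's or Lucene's implementation: the function names `build_inverted_index` and `search_all_terms` and the three sample titles are assumptions made up for this example, and real engines also store term positions and scoring statistics.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all_terms(index, query):
    """Return ids of documents that contain every query term (Boolean AND)."""
    term_sets = [set(index.get(term, [])) for term in query.lower().split()]
    if not term_sets:
        return []
    return sorted(set.intersection(*term_sets))


# Hypothetical sample documents: id -> title
docs = {
    1: "Solr in Action",
    2: "Lucene in Action",
    3: "Scaling Solr",
}
index = build_inverted_index(docs)
print(search_all_terms(index, "solr action"))  # [1]
```

The lookup never scans document text at query time; it only intersects the precomputed lists of ids for each term, which is the property that makes this structure attractive for search.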
Chapter 4 shows the basics of Solr’s configuration, primarily focused on Solr’s
main configuration file: solrconfig.xml. Our aim in this chapter is to introduce the most
important configuration settings for Solr, particularly those that impact how Solr processes requests from client applications. The knowledge you gain in this chapter will
be applied throughout the rest of the book.
Chapter 5 teaches how Solr indexes documents, starting with a discussion of
another important configuration file: schema.xml. You’ll learn how to define fields to
represent structured data like numbers, dates, prices, and unique identifiers. We also
cover how update requests are processed and configured using solrconfig.xml.
Chapter 6 builds on the material in chapter 5 by showing how to index text fields
using text analysis. Solr was designed to efficiently search and rank documents requiring full-text search. Text analysis is an important part of the search process in that it
removes the linguistic variations between indexed text and queries.
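As a rough illustration of what removing linguistic variation means, the following Python sketch applies the same small analysis chain (tokenize, lowercase, drop stop words) to both indexed text and a query. The `analyze` function and the stop-word list are assumptions for this example; they are not Solr's analyzers or its default stop-word set.

```python
import re

# Illustrative stop-word list; an assumption for this sketch, not Solr's default.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}


def analyze(text):
    """Lowercase the text, split it into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


indexed_terms = analyze("The Limits of Solr")
query_terms = analyze("limits SOLR")
print(indexed_terms)  # ['limits', 'solr']
print(query_terms)    # ['limits', 'solr']
```

Because both sides pass through the same chain, the query matches the indexed title even though the casing and the surrounding words differ.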
At this point in the book, you’ll have a solid foundation and will be ready to put
Solr to work on your own search needs. As your knowledge of search and Solr
grows, so too will your need to go beyond basic keyword searching and implement
common search features such as advanced query parsing, hit highlighting, spell-checking, autosuggest, faceting, and result grouping.
In chapter 7, we cover how to construct queries and how they are executed.
You’ll learn about Solr’s many query parsers, as well as how to sort, format, return,
and debug search results.
In chapter 8, you’ll learn about one of the most powerful and popular features of
Solr—faceting. Solr’s faceting provides tools to refine search criteria and helps users
discover more information by categorizing search results into subgroups.
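The idea of categorizing results into subgroups can be sketched in a few lines of Python. This is a conceptual illustration, not how Solr computes facets internally; the `facet_counts` helper and the restaurant records are hypothetical.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results carry each value of `field`, most frequent first."""
    counts = Counter(doc[field] for doc in results if field in doc)
    # Sort by descending count, then by value so ties have a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical search results
results = [
    {"name": "Pizza Palace", "city": "Atlanta"},
    {"name": "Taco Town", "city": "Denver"},
    {"name": "Burger Barn", "city": "Atlanta"},
]
print(facet_counts(results, "city"))  # [('Atlanta', 2), ('Denver', 1)]
```

A search interface would typically show these counts next to each value so that a user can narrow the results by picking one of them.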
Chapter 9 explains how to highlight query terms in search results in order to
improve the user experience with your search solution.

In chapter 10, we cover spell-checking and autosuggestions. Solr’s autosuggest features allow a user to start typing a few characters and receive a list of suggested queries
as they type.
Chapter 11 explores Solr’s result grouping and field collapsing support to help you
return an optimal mix of search results when your index includes many similar documents, such as multiple locations of the same restaurant in a city.
Chapter 12 helps you prepare to deploy Solr in a production environment. This
chapter will help you plan your hardware and resource needs, as well as whether you
need to consider sharding and replication to handle a large number of documents
and query requests.
Chapter 13 covers a set of distributed features known as SolrCloud. You’ll learn
how to run Solr in cloud mode so that you can scale your search application to support a large volume of users and documents. You’ll come away from this chapter having a solid understanding of how Solr achieves scalability and fault tolerance by
distributing indexes across multiple servers.
Chapter 14 builds upon the text analysis concepts covered in chapter 6 by teaching you how to handle multilingual text in your search engine. If you need to work
with non-English text or support multiple languages in the same index, this chapter
is a must-read.
Chapter 15 explores advanced query features, including function queries, geospatial search, multilevel faceting, and cross-document and cross-index joins.
In chapter 16, you’ll learn techniques for improving the relevancy of your results,
such as boosting, scoring based upon functions, alternate similarity algorithms, and
debugging relevancy scores. In addition, we provide an in-depth discussion of using
Solr for personalized search and recommendations.
There are three appendixes, which cover a number of subtopics from earlier chapters in greater depth. Appendix A focuses on working with the Solr codebase and how
you can create your own custom Solr distribution if you need features or bug fixes not
available in an official release. This is an extension of some of the material from the
beginning of chapter 12.
Appendix B lists, in table format, out-of-the box configurations for many of the
languages Solr supports. This material is an extended version of the language configurations covered in chapter 14.

Appendix C highlights the Data Import Handler (DIH) in more detail (extending
coverage from chapters 10 and 12), demonstrating the steps necessary for importing a
number of large, publicly available datasets.

How to use this book
Solr in Action is designed to be accessible for any software engineer—no previous experience working with search engines is assumed. The topics covered rise in expertise
level throughout the book, and even the most seasoned Solr professionals are likely to
learn something from the last few chapters. The scope of the book is massive—coming in at over 600 pages—but the engaging and practical real-world examples and
careful balance between theory and practice make the book a real asset to anyone
using Solr, whether you are just getting started or have years of experience.
As mentioned above, the chapters in part 1 provide the foundation upon which
the rest of the book will be built, and they will be critical for anyone new to Solr.
These chapters should be read in sequence to give you the best overview of Solr and
search in general. If you are new to Solr, chapter 2 will show you how to start and use
Solr for the first time, and chapter 3 will provide the key search theory that the rest
of the book builds upon. Configuring your Solr server and setting up field types to
properly analyze your content round out the search topics needed to understand
Solr’s fundamentals.
Many of the chapters in part 2 can be skipped if your work does not include the
features discussed. In particular, chapters 9, 10, and 11 are largely standalone topics
that are not important for understanding later chapters, so you can skip them if you
are not planning on implementing hit highlighting, query suggestions, or result
grouping/field collapsing any time soon. Chapters 7 and 8 cover some of the most
commonly used features of many search applications, so you will want to at least skim
through them before putting the book away.
The remaining chapters cover some of the advanced topics surrounding Solr.
Tough challenges will be tackled, including scaling a cluster of servers, multilingual
search, complex query operations, and advanced relevancy techniques. While all
chapters in parts 2 and 3 build on part 1, chapter 13 (“SolrCloud”) additionally builds
on chapter 12 (“Taking Solr to production”), chapter 15 (“Complex query operations”) builds on chapters 7 (“Performing queries and handling results”) and 8 (“Faceted search”), and chapter 16 (“Mastering relevancy”) further builds on chapter 15.
In order to get the most benefit out of the book, be mindful not to skip any earlier
chapters that provide the necessary background for your understanding of these more
advanced topics.
Many of the chapters include executable examples that you can run as you read
along. These examples demonstrate new topics and provide you with the opportunity
for hands-on exploration of Solr’s capabilities—often through just hitting a running
Solr server from your web browser. While you do not have to run all of the examples
and can simply use them as reference configurations in many cases, running the
