Elasticsearch: The Definitive Guide

If you’re a newcomer to both search and distributed systems, you’ll
quickly learn how to integrate Elasticsearch into your application. More
experienced users will pick up lots of advanced techniques. Throughout
the book, you’ll follow a problem-based approach to learn why, when, and
how to use Elasticsearch features.
■ Understand how Elasticsearch interprets data in your documents
■ Index and query your data to take advantage of search concepts such as relevance and word proximity
■ Handle human language through the effective use of analyzers and queries
■ Summarize and group data to show overall trends, with aggregations and analytics
■ Use geo-points and geo-shapes—Elasticsearch’s approaches to geolocation
■ Model your data to take advantage of Elasticsearch’s horizontal scalability
■ Learn how to configure and monitor your cluster in production

“The book could easily be retitled as 'Understanding search engines using
Elasticsearch.' Great job. Way beyond just simply using Elasticsearch.”
—Ivan Brusic, Search Consultant

Clinton Gormley was the first user of Elasticsearch and wrote the Perl API back
in 2010. When Elasticsearch formed a company in 2012, he joined as a developer
and the maintainer of the Perl modules.

DATABASES/WEB

US $49.99   CAN $57.99
ISBN: 978-1-449-35854-9

Twitter: @oreillymedia
facebook.com/oreilly

Elasticsearch

The Definitive Guide
A DISTRIBUTED REAL-TIME SEARCH AND ANALYTICS ENGINE

Gormley
& Tong

Zachary Tong has been working with Elasticsearch since 2011, and has written
several tutorials to help beginners using the server. Zach is a developer at
Elasticsearch and maintains the PHP client.

Elasticsearch:
The Definitive Guide

Whether you need full-text search or real-time analytics of structured data—
or both—the Elasticsearch distributed search engine is an ideal way to put
your data to work. This practical guide not only shows you how to search,
analyze, and explore data with Elasticsearch, but also helps you deal with the
complexities of human language, geolocation, and relationships.

Clinton Gormley & Zachary Tong

Elasticsearch: The Definitive Guide
by Clinton Gormley and Zachary Tong
Copyright © 2015 Elasticsearch. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (). For more information, contact our corporate/
institutional sales department: 800-998-9938 or

Editors: Mike Loukides and Brian Anderson
Production Editor: Shiny Kalapurakkel
Proofreader: Sharon Wilkey
Indexer: Ellen Troutman-Zaig

Interior Designer: David Futato
Cover Designer: Ellie Volkhausen
Illustrator: Rebecca Demarest


January 2015: First Edition

Revision History for the First Edition
2015-01-16: First Release

See for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Elasticsearch: The Definitive Guide, the
cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark
claim, the designations have been printed in caps or initial caps.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.



Table of Contents

Foreword
Preface

Part I. Getting Started

1. You Know, for Search…
  Installing Elasticsearch
  Installing Marvel
  Running Elasticsearch
  Viewing Marvel and Sense
  Talking to Elasticsearch
  Java API
  RESTful API with JSON over HTTP
  Document Oriented
  JSON
  Finding Your Feet
  Let’s Build an Employee Directory
  Indexing Employee Documents
  Retrieving a Document
  Search Lite
  Search with Query DSL
  More-Complicated Searches
  Full-Text Search
  Phrase Search
  Highlighting Our Searches
  Analytics
  Tutorial Conclusion
  Distributed Nature
  Next Steps

2. Life Inside a Cluster
  An Empty Cluster
  Cluster Health
  Add an Index
  Add Failover
  Scale Horizontally
  Then Scale Some More
  Coping with Failure

3. Data In, Data Out
  What Is a Document?
  Document Metadata
  _index
  _type
  _id
  Other Metadata
  Indexing a Document
  Using Our Own ID
  Autogenerating IDs
  Retrieving a Document
  Retrieving Part of a Document
  Checking Whether a Document Exists
  Updating a Whole Document
  Creating a New Document
  Deleting a Document
  Dealing with Conflicts
  Optimistic Concurrency Control
  Using Versions from an External System
  Partial Updates to Documents
  Using Scripts to Make Partial Updates
  Updating a Document That May Not Yet Exist
  Updates and Conflicts
  Retrieving Multiple Documents
  Cheaper in Bulk
  Don’t Repeat Yourself
  How Big Is Too Big?

4. Distributed Document Store
  Routing a Document to a Shard
  How Primary and Replica Shards Interact
  Creating, Indexing, and Deleting a Document
  Retrieving a Document
  Partial Updates to a Document
  Multidocument Patterns
  Why the Funny Format?

5. Searching—The Basic Tools
  The Empty Search
  hits
  took
  shards
  timeout
  Multi-index, Multitype
  Pagination
  Search Lite
  The _all Field
  More Complicated Queries

6. Mapping and Analysis
  Exact Values Versus Full Text
  Inverted Index
  Analysis and Analyzers
  Built-in Analyzers
  When Analyzers Are Used
  Testing Analyzers
  Specifying Analyzers
  Mapping
  Core Simple Field Types
  Viewing the Mapping
  Customizing Field Mappings
  Updating a Mapping
  Testing the Mapping
  Complex Core Field Types
  Multivalue Fields
  Empty Fields
  Multilevel Objects
  Mapping for Inner Objects
  How Inner Objects are Indexed
  Arrays of Inner Objects

7. Full-Body Search
  Empty Search
  Query DSL
  Structure of a Query Clause
  Combining Multiple Clauses
  Queries and Filters
  Performance Differences
  When to Use Which
  Most Important Queries and Filters
  term Filter
  terms Filter
  range Filter
  exists and missing Filters
  bool Filter
  match_all Query
  match Query
  multi_match Query
  bool Query
  Combining Queries with Filters
  Filtering a Query
  Just a Filter
  A Query as a Filter
  Validating Queries
  Understanding Errors
  Understanding Queries

8. Sorting and Relevance
  Sorting
  Sorting by Field Values
  Multilevel Sorting
  Sorting on Multivalue Fields
  String Sorting and Multifields
  What Is Relevance?
  Understanding the Score
  Understanding Why a Document Matched
  Fielddata

9. Distributed Search Execution
  Query Phase
  Fetch Phase
  Search Options
  preference
  timeout
  routing
  search_type
  scan and scroll

10. Index Management
  Creating an Index
  Deleting an Index
  Index Settings
  Configuring Analyzers
  Custom Analyzers
  Creating a Custom Analyzer
  Types and Mappings
  How Lucene Sees Documents
  How Types Are Implemented
  Avoiding Type Gotchas
  The Root Object
  Properties
  Metadata: _source Field
  Metadata: _all Field
  Metadata: Document Identity
  Dynamic Mapping
  Customizing Dynamic Mapping
  date_detection
  dynamic_templates
  Default Mapping
  Reindexing Your Data
  Index Aliases and Zero Downtime

11. Inside a Shard
  Making Text Searchable
  Immutability
  Dynamically Updatable Indices
  Deletes and Updates
  Near Real-Time Search
  refresh API
  Making Changes Persistent
  flush API
  Segment Merging
  optimize API

Part II. Search in Depth

12. Structured Search
  Finding Exact Values
  term Filter with Numbers
  term Filter with Text
  Internal Filter Operation
  Combining Filters
  Bool Filter
  Nesting Boolean Filters
  Finding Multiple Exact Values
  Contains, but Does Not Equal
  Equals Exactly
  Ranges
  Ranges on Dates
  Ranges on Strings
  Dealing with Null Values
  exists Filter
  missing Filter
  exists/missing on Objects
  All About Caching
  Independent Filter Caching
  Controlling Caching
  Filter Order

13. Full-Text Search
  Term-Based Versus Full-Text
  The match Query
  Index Some Data
  A Single-Word Query
  Multiword Queries
  Improving Precision
  Controlling Precision
  Combining Queries
  Score Calculation
  Controlling Precision
  How match Uses bool
  Boosting Query Clauses
  Controlling Analysis
  Default Analyzers
  Configuring Analyzers in Practice
  Relevance Is Broken!

14. Multifield Search
  Multiple Query Strings
  Prioritizing Clauses
  Single Query String
  Know Your Data
  Best Fields
  dis_max Query
  Tuning Best Fields Queries
  tie_breaker
  multi_match Query
  Using Wildcards in Field Names
  Boosting Individual Fields
  Most Fields
  Multifield Mapping
  Cross-fields Entity Search
  A Naive Approach
  Problems with the most_fields Approach
  Field-Centric Queries
  Problem 1: Matching the Same Word in Multiple Fields
  Problem 2: Trimming the Long Tail
  Problem 3: Term Frequencies
  Solution
  Custom _all Fields
  cross-fields Queries
  Per-Field Boosting
  Exact-Value Fields

15. Proximity Matching
  Phrase Matching
  Term Positions
  What Is a Phrase
  Mixing It Up
  Multivalue Fields
  Closer Is Better
  Proximity for Relevance
  Improving Performance
  Rescoring Results
  Finding Associated Words
  Producing Shingles
  Multifields
  Searching for Shingles
  Performance

16. Partial Matching
  Postcodes and Structured Data
  prefix Query
  wildcard and regexp Queries
  Query-Time Search-as-You-Type
  Index-Time Optimizations
  Ngrams for Partial Matching
  Index-Time Search-as-You-Type
  Preparing the Index
  Querying the Field
  Edge n-grams and Postcodes
  Ngrams for Compound Words

17. Controlling Relevance
  Theory Behind Relevance Scoring
  Boolean Model
  Term Frequency/Inverse Document Frequency (TF/IDF)
  Vector Space Model
  Lucene’s Practical Scoring Function
  Query Normalization Factor
  Query Coordination
  Index-Time Field-Level Boosting
  Query-Time Boosting
  Boosting an Index
  t.getBoost()
  Manipulating Relevance with Query Structure
  Not Quite Not
  boosting Query
  Ignoring TF/IDF
  constant_score Query
  function_score Query
  Boosting by Popularity
  modifier
  factor
  boost_mode
  max_boost
  Boosting Filtered Subsets
  filter Versus query
  functions
  score_mode
  Random Scoring
  The Closer, The Better
  Understanding the price Clause
  Scoring with Scripts
  Pluggable Similarity Algorithms
  Okapi BM25
  Changing Similarities
  Configuring BM25
  Relevance Tuning Is the Last 10%

Part III. Dealing with Human Language

18. Getting Started with Languages
  Using Language Analyzers
  Configuring Language Analyzers
  Pitfalls of Mixing Languages
  At Index Time
  At Query Time
  Identifying Language
  One Language per Document
  Foreign Words
  One Language per Field
  Mixed-Language Fields
  Split into Separate Fields
  Analyze Multiple Times
  Use n-grams

19. Identifying Words
  standard Analyzer
  standard Tokenizer
  Installing the ICU Plug-in
  icu_tokenizer
  Tidying Up Input Text
  Tokenizing HTML
  Tidying Up Punctuation

20. Normalizing Tokens
  In That Case
  You Have an Accent
  Retaining Meaning
  Living in a Unicode World
  Unicode Case Folding
  Unicode Character Folding
  Sorting and Collations
  Case-Insensitive Sorting
  Differences Between Languages
  Unicode Collation Algorithm
  Unicode Sorting
  Specifying a Language
  Customizing Collations

21. Reducing Words to Their Root Form
  Algorithmic Stemmers
  Using an Algorithmic Stemmer
  Dictionary Stemmers
  Hunspell Stemmer
  Installing a Dictionary
  Per-Language Settings
  Creating a Hunspell Token Filter
  Hunspell Dictionary Format
  Choosing a Stemmer
  Stemmer Performance
  Stemmer Quality
  Stemmer Degree
  Making a Choice
  Controlling Stemming
  Preventing Stemming
  Customizing Stemming
  Stemming in situ
  Is Stemming in situ a Good Idea

22. Stopwords: Performance Versus Precision
  Pros and Cons of Stopwords
  Using Stopwords
  Stopwords and the Standard Analyzer
  Maintaining Positions
  Specifying Stopwords
  Using the stop Token Filter
  Updating Stopwords
  Stopwords and Performance
  and Operator
  minimum_should_match
  Divide and Conquer
  Controlling Precision
  Only High-Frequency Terms
  More Control with Common Terms
  Stopwords and Phrase Queries
  Positions Data
  Index Options
  Stopwords
  common_grams Token Filter
  At Index Time
  Unigram Queries
  Bigram Phrase Queries
  Two-Word Phrases
  Stopwords and Relevance

23. Synonyms
  Using Synonyms
  Formatting Synonyms
  Expand or contract
  Simple Expansion
  Simple Contraction
  Genre Expansion
  Synonyms and The Analysis Chain
  Case-Sensitive Synonyms
  Multiword Synonyms and Phrase Queries
  Use Simple Contraction for Phrase Queries
  Synonyms and the query_string Query
  Symbol Synonyms

24. Typoes and Mispelings
  Fuzziness
  Fuzzy Query
  Improving Performance
  Fuzzy match Query
  Scoring Fuzziness
  Phonetic Matching

Part IV. Aggregations


25. High-Level Concepts
  Buckets
  Metrics
  Combining the Two

26. Aggregation Test-Drive
  Adding a Metric to the Mix
  Buckets Inside Buckets
  One Final Modification

27. Building Bar Charts

28. Looking at Time
  Returning Empty Buckets
  Extended Example
  The Sky’s the Limit

29. Scoping Aggregations

30. Filtering Queries and Aggregations
  Filtered Query
  Filter Bucket
  Post Filter
  Recap

31. Sorting Multivalue Buckets
  Intrinsic Sorts
  Sorting by a Metric
  Sorting Based on “Deep” Metrics

32. Approximate Aggregations
  Finding Distinct Counts
  Understanding the Trade-offs
  Optimizing for Speed
  Calculating Percentiles
  Percentile Metric
  Percentile Ranks
  Understanding the Trade-offs

33. Significant Terms
  significant_terms Demo
  Recommending Based on Popularity
  Recommending Based on Statistics

34. Controlling Memory Use and Latency
  Fielddata
  Aggregations and Analysis
  High-Cardinality Memory Implications
  Limiting Memory Usage
  Fielddata Size
  Monitoring fielddata
  Circuit Breaker
  Fielddata Filtering
  Doc Values
  Enabling Doc Values
  Preloading Fielddata
  Eagerly Loading Fielddata
  Global Ordinals
  Index Warmers
  Preventing Combinatorial Explosions
  Depth-First Versus Breadth-First

35. Closing Thoughts

Part V. Geolocation

36. Geo-Points
  Lat/Lon Formats
  Filtering by Geo-Point
  geo_bounding_box Filter
  Optimizing Bounding Boxes
  geo_distance Filter
  Faster Geo-Distance Calculations
  geo_distance_range Filter
  Caching geo-filters
  Reducing Memory Usage
  Sorting by Distance
  Scoring by Distance

37. Geohashes
  Mapping Geohashes
  geohash_cell Filter

38. Geo-aggregations
  geo_distance Aggregation
  geohash_grid Aggregation
  geo_bounds Aggregation

39. Geo-shapes
  Mapping geo-shapes
  precision
  distance_error_pct
  Indexing geo-shapes
  Querying geo-shapes
  Querying with Indexed Shapes
  Geo-shape Filters and Caching

Part VI. Modeling Your Data

40. Handling Relationships
  Application-side Joins
  Denormalizing Your Data
  Field Collapsing
  Denormalization and Concurrency
  Renaming Files and Directories
  Solving Concurrency Issues
  Global Locking
  Document Locking
  Tree Locking

41. Nested Objects
  Nested Object Mapping
  Querying a Nested Object
  Sorting by Nested Fields
  Nested Aggregations
  reverse_nested Aggregation
  When to Use Nested Objects

42. Parent-Child Relationship
  Parent-Child Mapping
  Indexing Parents and Children
  Finding Parents by Their Children
  min_children and max_children
  Finding Children by Their Parents
  Children Aggregation
  Grandparents and Grandchildren
  Practical Considerations
  Memory Use
  Global Ordinals and Latency
  Multigenerations and Concluding Thoughts

43. Designing for Scale
  The Unit of Scale
  Shard Overallocation
  Kagillion Shards
  Capacity Planning
  Replica Shards
  Balancing Load with Replicas
  Multiple Indices
  Time-Based Data
  Index per Time Frame
  Index Templates
  Retiring Data
  Migrate Old Indices
  Optimize Indices
  Closing Old Indices
  Archiving Old Indices
  User-Based Data
  Shared Index
  Faking Index per User with Aliases
  One Big User
  Scale Is Not Infinite

Part VII. Administration, Monitoring, and Deployment

44. Monitoring
  Marvel for Monitoring
  Cluster Health
  Drilling Deeper: Finding Problematic Indices
  Blocking for Status Changes
  Monitoring Individual Nodes
  indices Section
  OS and Process Sections
  JVM Section
  Threadpool Section
  FS and Network Sections
  Circuit Breaker
  Cluster Stats
  Index Stats
  Pending Tasks
  cat API

45. Production Deployment
  Hardware
  Memory
  CPUs
  Disks
  Network
  General Considerations
  Java Virtual Machine
  Transport Client Versus Node Client
  Configuration Management
  Important Configuration Changes
  Assign Names
  Paths
  Minimum Master Nodes
  Recovery Settings
  Prefer Unicast over Multicast
  Don’t Touch These Settings!
  Garbage Collector
  Threadpools
  Heap: Sizing and Swapping
  Give Half Your Memory to Lucene
  Don’t Cross 32 GB!
  Swapping Is the Death of Performance
  File Descriptors and MMap
  Revisit This List Before Production

46. Post-Deployment
  Changing Settings Dynamically
  Logging
  Slowlog
  Indexing Performance Tips
  Test Performance Scientifically
  Using and Sizing Bulk Requests
  Storage
  Segments and Merging
  Other
  Rolling Restarts
  Backing Up Your Cluster
  Creating the Repository
  Snapshotting All Open Indices
  Snapshotting Particular Indices
  Listing Information About Snapshots
  Deleting Snapshots
  Monitoring Snapshot Progress
  Canceling a Snapshot
  Restoring from a Snapshot
  Monitoring Restore Operations
  Canceling a Restore
  Clusters Are Living, Breathing Creatures

Index



Foreword

One of the most nerve-wracking periods when releasing the first version of an open
source project occurs when the IRC channel is created. You are all alone, eagerly
hoping and wishing for the first user to come along. I still vividly remember those days.
One of the first users that jumped on IRC was Clint, and how excited was I. Well…
for a brief period, until I found out that Clint was actually a Perl user, no less working
on a website that dealt with obituaries. I remember asking myself why couldn’t we get
someone from a more “hyped” community, like Ruby or Python (at the time), and a
slightly nicer use case.
How wrong I was. Clint ended up being instrumental to the success of Elasticsearch.
He was the first user to roll out Elasticsearch into production (version 0.4 no less!),
and the interaction with Clint was pivotal during the early days in shaping
Elasticsearch into what it is today. Clint has a unique insight into what is simple, and he is
very rarely wrong, which has a huge impact on various usability aspects of
Elasticsearch, from management, to API design, to day-to-day usability features. It was a
no-brainer for us to reach out to Clint and ask if he would join our company
immediately after we formed it.
Another one of the first things we did when we formed the company was offer public
training. It’s hard to express how nervous we were about whether or not people
would even sign up for it.
We were wrong.
The trainings were, and still are, a rave success, with waiting lists in all major cities.
One of the people who caught our eye was a young fellow, Zach, who came to one of
our trainings. We knew about Zach from his blog posts about using Elasticsearch
(and secretly envied his ability to explain complex concepts in a very simple manner)
and from a PHP client he wrote for the software. What we found out was that Zach
had actually paid to attend the Elasticsearch training out of his own pocket! You can’t
really ask for more than that, and we reached out to Zach and asked if he would join
our company as well.
Both Clint and Zach are pivotal to the success of Elasticsearch. They are wonderful
communicators who can explain Elasticsearch from its high-level simplicity to its
(and Apache Lucene’s) low-level internal complexities. It’s a unique skill that we
dearly cherish here at Elasticsearch. Clint is also responsible for the Elasticsearch Perl
client, while Zach is responsible for the PHP one, both wonderful pieces of code.
And last, both play an instrumental role in most of what happens daily with the
Elasticsearch project itself. One of the main reasons why Elasticsearch is so popular is its
ability to communicate empathy to its users, and Clint and Zach are both part of the
group that makes this a reality.


Preface

The world is swimming in data. For years we have been simply overwhelmed by the
quantity of data flowing through and produced by our systems. Existing technology
has focused on how to store and structure warehouses full of data. That’s all well and
good—until you actually need to make decisions in real time informed by that data.
Elasticsearch is a distributed, scalable, real-time search and analytics engine. It
enables you to search, analyze, and explore your data, often in ways that you did not
anticipate at the start of a project. It exists because raw data sitting on a hard drive is
just not useful.
Whether you need full-text search, real-time analytics of structured data, or a
combination of the two, this book introduces you to the fundamental concepts required to
start working with Elasticsearch at a basic level. With these foundations laid, it will
move on to more-advanced search techniques, which you will need to shape the
search experience to fit your requirements.
Elasticsearch is not just about full-text search. We explain structured search,
analytics, the complexities of dealing with human language, geolocation, and relationships.
We will also discuss how best to model your data to take advantage of the horizontal
scalability of Elasticsearch, and how to configure and monitor your cluster when
moving to production.

Who Should Read This Book
This book is for anybody who wants to put their data to work. It doesn’t matter
whether you are starting a new project and have the flexibility to design the system
from the ground up, or whether you need to give new life to a legacy system.
Elasticsearch will help you to solve existing problems and open the way to new features that
you haven’t yet considered.
This book is suitable for novices and experienced users alike. We expect you to have
some programming background and, although not required, it would help to have