Elasticsearch: The Definitive Guide
If you’re a newcomer to both search and distributed systems, you’ll
quickly learn how to integrate Elasticsearch into your application. More
experienced users will pick up lots of advanced techniques. Throughout
the book, you’ll follow a problem-based approach to learn why, when, and
how to use Elasticsearch features.
■ Understand how Elasticsearch interprets data in your documents
■ Index and query your data to take advantage of search concepts such as relevance and word proximity
■ Handle human language through the effective use of analyzers and queries
■ Summarize and group data to show overall trends, with aggregations and analytics
■ Use geo-points and geo-shapes, Elasticsearch’s approaches to geolocation
■ Model your data to take advantage of Elasticsearch’s horizontal scalability
■ Learn how to configure and monitor your cluster in production
“The book could easily be retitled as 'Understanding search engines using
Elasticsearch.' Great job. Way beyond just simply using Elasticsearch.”
—Ivan Brusic
Search Consultant
Clinton Gormley was the first user of Elasticsearch and wrote the Perl API back
in 2010. When Elasticsearch formed a company in 2012, he joined as a developer
and the maintainer of the Perl modules.
DATABASES/WEB
US $49.99
Twitter: @oreillymedia
facebook.com/oreilly
CAN $57.99
ISBN: 978-1-449-35854-9
Elasticsearch
The Definitive Guide
A DISTRIBUTED REAL-TIME SEARCH AND ANALYTICS ENGINE
Gormley
& Tong
Zachary Tong has been working with Elasticsearch since 2011, and has written
several tutorials to help beginners using the server. Zach is a developer at
Elasticsearch and maintains the PHP client.
Elasticsearch:
The Definitive Guide
Whether you need full-text search or real-time analytics of structured data—
or both—the Elasticsearch distributed search engine is an ideal way to put
your data to work. This practical guide not only shows you how to search,
analyze, and explore data with Elasticsearch, but also helps you deal with the
complexities of human language, geolocation, and relationships.
Clinton Gormley &
Zachary Tong
Elasticsearch: The Definitive Guide
Clinton Gormley and Zachary Tong
Elasticsearch: The Definitive Guide
by Clinton Gormley and Zachary Tong
Copyright © 2015 Elasticsearch. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (). For more information, contact our corporate/
institutional sales department: 800-998-9938 or
Editors: Mike Loukides and Brian Anderson
Production Editor: Shiny Kalapurakkel
Proofreader: Sharon Wilkey
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Ellie Volkhausen
Illustrator: Rebecca Demarest
January 2015: First Edition
Revision History for the First Edition
2015-01-16: First Release
See for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Elasticsearch: The Definitive Guide, the
cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark
claim, the designations have been printed in caps or initial caps.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
978-1-449-35854-9
[LSI]
Table of Contents
Foreword. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxi
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxiii
Part I. Getting Started
1. You Know, for Search…. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Installing Elasticsearch
Installing Marvel
Running Elasticsearch
Viewing Marvel and Sense
Talking to Elasticsearch
Java API
RESTful API with JSON over HTTP
Document Oriented
JSON
Finding Your Feet
Let’s Build an Employee Directory
Indexing Employee Documents
Retrieving a Document
Search Lite
Search with Query DSL
More-Complicated Searches
Full-Text Search
Phrase Search
Highlighting Our Searches
Analytics
Tutorial Conclusion
Distributed Nature
Next Steps
2. Life Inside a Cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
An Empty Cluster
Cluster Health
Add an Index
Add Failover
Scale Horizontally
Then Scale Some More
Coping with Failure
3. Data In, Data Out. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
What Is a Document?
Document Metadata
_index
_type
_id
Other Metadata
Indexing a Document
Using Our Own ID
Autogenerating IDs
Retrieving a Document
Retrieving Part of a Document
Checking Whether a Document Exists
Updating a Whole Document
Creating a New Document
Deleting a Document
Dealing with Conflicts
Optimistic Concurrency Control
Using Versions from an External System
Partial Updates to Documents
Using Scripts to Make Partial Updates
Updating a Document That May Not Yet Exist
Updates and Conflicts
Retrieving Multiple Documents
Cheaper in Bulk
Don’t Repeat Yourself
How Big Is Too Big?
4. Distributed Document Store. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Routing a Document to a Shard
How Primary and Replica Shards Interact
Creating, Indexing, and Deleting a Document
Retrieving a Document
Partial Updates to a Document
Multidocument Patterns
Why the Funny Format?
5. Searching—The Basic Tools. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
The Empty Search
hits
took
shards
timeout
Multi-index, Multitype
Pagination
Search Lite
The _all Field
More Complicated Queries
6. Mapping and Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Exact Values Versus Full Text
Inverted Index
Analysis and Analyzers
Built-in Analyzers
When Analyzers Are Used
Testing Analyzers
Specifying Analyzers
Mapping
Core Simple Field Types
Viewing the Mapping
Customizing Field Mappings
Updating a Mapping
Testing the Mapping
Complex Core Field Types
Multivalue Fields
Empty Fields
Multilevel Objects
Mapping for Inner Objects
How Inner Objects are Indexed
Arrays of Inner Objects
7. Full-Body Search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Empty Search
Query DSL
Structure of a Query Clause
Combining Multiple Clauses
Queries and Filters
Performance Differences
When to Use Which
Most Important Queries and Filters
term Filter
terms Filter
range Filter
exists and missing Filters
bool Filter
match_all Query
match Query
multi_match Query
bool Query
Combining Queries with Filters
Filtering a Query
Just a Filter
A Query as a Filter
Validating Queries
Understanding Errors
Understanding Queries
8. Sorting and Relevance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Sorting
Sorting by Field Values
Multilevel Sorting
Sorting on Multivalue Fields
String Sorting and Multifields
What Is Relevance?
Understanding the Score
Understanding Why a Document Matched
Fielddata
9. Distributed Search Execution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Query Phase
Fetch Phase
Search Options
preference
timeout
routing
search_type
scan and scroll
10. Index Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Creating an Index
Deleting an Index
Index Settings
Configuring Analyzers
Custom Analyzers
Creating a Custom Analyzer
Types and Mappings
How Lucene Sees Documents
How Types Are Implemented
Avoiding Type Gotchas
The Root Object
Properties
Metadata: _source Field
Metadata: _all Field
Metadata: Document Identity
Dynamic Mapping
Customizing Dynamic Mapping
date_detection
dynamic_templates
Default Mapping
Reindexing Your Data
Index Aliases and Zero Downtime
11. Inside a Shard. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Making Text Searchable
Immutability
Dynamically Updatable Indices
Deletes and Updates
Near Real-Time Search
refresh API
Making Changes Persistent
flush API
Segment Merging
optimize API
Part II. Search in Depth
12. Structured Search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Finding Exact Values
term Filter with Numbers
term Filter with Text
Internal Filter Operation
Combining Filters
Bool Filter
Nesting Boolean Filters
Finding Multiple Exact Values
Contains, but Does Not Equal
Equals Exactly
Ranges
Ranges on Dates
Ranges on Strings
Dealing with Null Values
exists Filter
missing Filter
exists/missing on Objects
All About Caching
Independent Filter Caching
Controlling Caching
Filter Order
13. Full-Text Search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Term-Based Versus Full-Text
The match Query
Index Some Data
A Single-Word Query
Multiword Queries
Improving Precision
Controlling Precision
Combining Queries
Score Calculation
Controlling Precision
How match Uses bool
Boosting Query Clauses
Controlling Analysis
Default Analyzers
Configuring Analyzers in Practice
Relevance Is Broken!
14. Multifield Search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
Multiple Query Strings
Prioritizing Clauses
Single Query String
Know Your Data
Best Fields
dis_max Query
Tuning Best Fields Queries
tie_breaker
multi_match Query
Using Wildcards in Field Names
Boosting Individual Fields
Most Fields
Multifield Mapping
Cross-fields Entity Search
A Naive Approach
Problems with the most_fields Approach
Field-Centric Queries
Problem 1: Matching the Same Word in Multiple Fields
Problem 2: Trimming the Long Tail
Problem 3: Term Frequencies
Solution
Custom _all Fields
cross-fields Queries
Per-Field Boosting
Exact-Value Fields
15. Proximity Matching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Phrase Matching
Term Positions
What Is a Phrase
Mixing It Up
Multivalue Fields
Closer Is Better
Proximity for Relevance
Improving Performance
Rescoring Results
Finding Associated Words
Producing Shingles
Multifields
Searching for Shingles
Performance
16. Partial Matching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
Postcodes and Structured Data
prefix Query
wildcard and regexp Queries
Query-Time Search-as-You-Type
Index-Time Optimizations
Ngrams for Partial Matching
Index-Time Search-as-You-Type
Preparing the Index
Querying the Field
Edge n-grams and Postcodes
Ngrams for Compound Words
17. Controlling Relevance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Theory Behind Relevance Scoring
Boolean Model
Term Frequency/Inverse Document Frequency (TF/IDF)
Vector Space Model
Lucene’s Practical Scoring Function
Query Normalization Factor
Query Coordination
Index-Time Field-Level Boosting
Query-Time Boosting
Boosting an Index
t.getBoost()
Manipulating Relevance with Query Structure
Not Quite Not
boosting Query
Ignoring TF/IDF
constant_score Query
function_score Query
Boosting by Popularity
modifier
factor
boost_mode
max_boost
Boosting Filtered Subsets
filter Versus query
functions
score_mode
Random Scoring
The Closer, The Better
Understanding the price Clause
Scoring with Scripts
Pluggable Similarity Algorithms
Okapi BM25
Changing Similarities
Configuring BM25
Relevance Tuning Is the Last 10%
Part III. Dealing with Human Language
18. Getting Started with Languages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
Using Language Analyzers
Configuring Language Analyzers
Pitfalls of Mixing Languages
At Index Time
At Query Time
Identifying Language
One Language per Document
Foreign Words
One Language per Field
Mixed-Language Fields
Split into Separate Fields
Analyze Multiple Times
Use n-grams
19. Identifying Words. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
standard Analyzer
standard Tokenizer
Installing the ICU Plug-in
icu_tokenizer
Tidying Up Input Text
Tokenizing HTML
Tidying Up Punctuation
20. Normalizing Tokens. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
In That Case
You Have an Accent
Retaining Meaning
Living in a Unicode World
Unicode Case Folding
Unicode Character Folding
Sorting and Collations
Case-Insensitive Sorting
Differences Between Languages
Unicode Collation Algorithm
Unicode Sorting
Specifying a Language
Customizing Collations
21. Reducing Words to Their Root Form. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
Algorithmic Stemmers
Using an Algorithmic Stemmer
Dictionary Stemmers
Hunspell Stemmer
Installing a Dictionary
Per-Language Settings
Creating a Hunspell Token Filter
Hunspell Dictionary Format
Choosing a Stemmer
Stemmer Performance
Stemmer Quality
Stemmer Degree
Making a Choice
Controlling Stemming
Preventing Stemming
Customizing Stemming
Stemming in situ
Is Stemming in situ a Good Idea
22. Stopwords: Performance Versus Precision. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
Pros and Cons of Stopwords
Using Stopwords
Stopwords and the Standard Analyzer
Maintaining Positions
Specifying Stopwords
Using the stop Token Filter
Updating Stopwords
Stopwords and Performance
and Operator
minimum_should_match
Divide and Conquer
Controlling Precision
Only High-Frequency Terms
More Control with Common Terms
Stopwords and Phrase Queries
Positions Data
Index Options
Stopwords
common_grams Token Filter
At Index Time
Unigram Queries
Bigram Phrase Queries
Two-Word Phrases
Stopwords and Relevance
23. Synonyms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
Using Synonyms
Formatting Synonyms
Expand or contract
Simple Expansion
Simple Contraction
Genre Expansion
Synonyms and The Analysis Chain
Case-Sensitive Synonyms
Multiword Synonyms and Phrase Queries
Use Simple Contraction for Phrase Queries
Synonyms and the query_string Query
Symbol Synonyms
24. Typoes and Mispelings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
Fuzziness
Fuzzy Query
Improving Performance
Fuzzy match Query
Scoring Fuzziness
Phonetic Matching
Part IV. Aggregations
25. High-Level Concepts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
Buckets
Metrics
Combining the Two
26. Aggregation Test-Drive. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
Adding a Metric to the Mix
Buckets Inside Buckets
One Final Modification
27. Building Bar Charts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
28. Looking at Time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
Returning Empty Buckets
Extended Example
The Sky’s the Limit
29. Scoping Aggregations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
30. Filtering Queries and Aggregations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
Filtered Query
Filter Bucket
Post Filter
Recap
31. Sorting Multivalue Buckets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
Intrinsic Sorts
Sorting by a Metric
Sorting Based on “Deep” Metrics
32. Approximate Aggregations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
Finding Distinct Counts
Understanding the Trade-offs
Optimizing for Speed
Calculating Percentiles
Percentile Metric
Percentile Ranks
Understanding the Trade-offs
33. Significant Terms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
significant_terms Demo
Recommending Based on Popularity
Recommending Based on Statistics
34. Controlling Memory Use and Latency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
Fielddata
Aggregations and Analysis
High-Cardinality Memory Implications
Limiting Memory Usage
Fielddata Size
Monitoring fielddata
Circuit Breaker
Fielddata Filtering
Doc Values
Enabling Doc Values
Preloading Fielddata
Eagerly Loading Fielddata
Global Ordinals
Index Warmers
Preventing Combinatorial Explosions
Depth-First Versus Breadth-First
35. Closing Thoughts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
Part V. Geolocation
36. Geo-Points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
Lat/Lon Formats
Filtering by Geo-Point
geo_bounding_box Filter
Optimizing Bounding Boxes
geo_distance Filter
Faster Geo-Distance Calculations
geo_distance_range Filter
Caching geo-filters
Reducing Memory Usage
Sorting by Distance
Scoring by Distance
37. Geohashes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
Mapping Geohashes
geohash_cell Filter
38. Geo-aggregations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
geo_distance Aggregation
geohash_grid Aggregation
geo_bounds Aggregation
39. Geo-shapes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
Mapping geo-shapes
precision
distance_error_pct
Indexing geo-shapes
Querying geo-shapes
Querying with Indexed Shapes
Geo-shape Filters and Caching
Part VI. Modeling Your Data
40. Handling Relationships. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
Application-side Joins
Denormalizing Your Data
Field Collapsing
Denormalization and Concurrency
Renaming Files and Directories
Solving Concurrency Issues
Global Locking
Document Locking
Tree Locking
41. Nested Objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
Nested Object Mapping
Querying a Nested Object
Sorting by Nested Fields
Nested Aggregations
reverse_nested Aggregation
When to Use Nested Objects
42. Parent-Child Relationship. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
Parent-Child Mapping
Indexing Parents and Children
Finding Parents by Their Children
min_children and max_children
Finding Children by Their Parents
Children Aggregation
Grandparents and Grandchildren
Practical Considerations
Memory Use
Global Ordinals and Latency
Multigenerations and Concluding Thoughts
43. Designing for Scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
The Unit of Scale
Shard Overallocation
Kagillion Shards
Capacity Planning
Replica Shards
Balancing Load with Replicas
Multiple Indices
Time-Based Data
Index per Time Frame
Index Templates
Retiring Data
Migrate Old Indices
Optimize Indices
Closing Old Indices
Archiving Old Indices
User-Based Data
Shared Index
Faking Index per User with Aliases
One Big User
Scale Is Not Infinite
Part VII. Administration, Monitoring, and Deployment
44. Monitoring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
Marvel for Monitoring
Cluster Health
Drilling Deeper: Finding Problematic Indices
Blocking for Status Changes
Monitoring Individual Nodes
indices Section
OS and Process Sections
JVM Section
Threadpool Section
FS and Network Sections
Circuit Breaker
Cluster Stats
Index Stats
Pending Tasks
cat API
45. Production Deployment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631
Hardware
Memory
CPUs
Disks
Network
General Considerations
Java Virtual Machine
Transport Client Versus Node Client
Configuration Management
Important Configuration Changes
Assign Names
Paths
Minimum Master Nodes
Recovery Settings
Prefer Unicast over Multicast
Don’t Touch These Settings!
Garbage Collector
Threadpools
Heap: Sizing and Swapping
Give Half Your Memory to Lucene
Don’t Cross 32 GB!
Swapping Is the Death of Performance
File Descriptors and MMap
Revisit This List Before Production
46. Post-Deployment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647
Changing Settings Dynamically
Logging
Slowlog
Indexing Performance Tips
Test Performance Scientifically
Using and Sizing Bulk Requests
Storage
Segments and Merging
Other
Rolling Restarts
Backing Up Your Cluster
Creating the Repository
Snapshotting All Open Indices
Snapshotting Particular Indices
Listing Information About Snapshots
Deleting Snapshots
Monitoring Snapshot Progress
Canceling a Snapshot
Restoring from a Snapshot
Monitoring Restore Operations
Canceling a Restore
Clusters Are Living, Breathing Creatures
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 665
Foreword
One of the most nerve-wracking periods when releasing the first version of an open
source project occurs when the IRC channel is created. You are all alone, eagerly hoping
and wishing for the first user to come along. I still vividly remember those days.
One of the first users that jumped on IRC was Clint, and how excited I was. Well…
for a brief period, until I found out that Clint was actually a Perl user, no less working
on a website that dealt with obituaries. I remember asking myself why couldn’t we get
someone from a more “hyped” community, like Ruby or Python (at the time), and a
slightly nicer use case.
How wrong I was. Clint ended up being instrumental to the success of Elasticsearch.
He was the first user to roll out Elasticsearch into production (version 0.4 no less!),
and the interaction with Clint was pivotal during the early days in shaping Elasticsearch
into what it is today. Clint has a unique insight into what is simple, and he is
very rarely wrong, which has a huge impact on various usability aspects of Elasticsearch,
from management, to API design, to day-to-day usability features. It was a
no-brainer for us to reach out to Clint and ask if he would join our company immediately
after we formed it.
Another one of the first things we did when we formed the company was offer public
training. It’s hard to express how nervous we were about whether or not people
would even sign up for it.
We were wrong.
The trainings were and still are a rave success with waiting lists in all major cities.
One of the people who caught our eye was a young fellow, Zach, who came to one of
our trainings. We knew about Zach from his blog posts about using Elasticsearch
(and secretly envied his ability to explain complex concepts in a very simple manner)
and from a PHP client he wrote for the software. What we found out was that Zach
had actually paid to attend the Elasticsearch training out of his own pocket! You can’t
really ask for more than that, and we reached out to Zach and asked if he would join
our company as well.
Both Clint and Zach are pivotal to the success of Elasticsearch. They are wonderful
communicators who can explain Elasticsearch from its high-level simplicity, to its
(and Apache Lucene’s) low-level internal complexities. It’s a unique skill that we
dearly cherish here at Elasticsearch. Clint is also responsible for the Elasticsearch Perl
client, while Zach is responsible for the PHP one; both are wonderful pieces of code.
And last, both play an instrumental role in most of what happens daily with the Elasticsearch
project itself. One of the main reasons why Elasticsearch is so popular is its
ability to communicate empathy to its users, and Clint and Zach are both part of the
group that makes this a reality.
Preface
The world is swimming in data. For years we have been simply overwhelmed by the
quantity of data flowing through and produced by our systems. Existing technology
has focused on how to store and structure warehouses full of data. That’s all well and
good—until you actually need to make decisions in real time informed by that data.
Elasticsearch is a distributed, scalable, real-time search and analytics engine. It enables
you to search, analyze, and explore your data, often in ways that you did not
anticipate at the start of a project. It exists because raw data sitting on a hard drive is
just not useful.
Whether you need full-text search, real-time analytics of structured data, or a combination
of the two, this book introduces you to the fundamental concepts required to
start working with Elasticsearch at a basic level. With these foundations laid, it will
move on to more-advanced search techniques, which you will need to shape the
search experience to fit your requirements.
Elasticsearch is not just about full-text search. We explain structured search, analytics,
the complexities of dealing with human language, geolocation, and relationships.
We will also discuss how best to model your data to take advantage of the horizontal
scalability of Elasticsearch, and how to configure and monitor your cluster when
moving to production.
Who Should Read This Book
This book is for anybody who wants to put their data to work. It doesn’t matter
whether you are starting a new project and have the flexibility to design the system
from the ground up, or whether you need to give new life to a legacy system. Elasticsearch
will help you to solve existing problems and open the way to new features that
you haven’t yet considered.
This book is suitable for novices and experienced users alike. We expect you to have
some programming background and, although not required, it would help to have