Elasticsearch Server
Third Edition
Leverage Elasticsearch to create a robust,
fast, and flexible search solution with ease
Rafał Kuć
Marek Rogoziński
BIRMINGHAM - MUMBAI
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the authors, nor Packt
Publishing, and its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2013
Second edition: February 2015
Third edition: February 2016
Production reference: 1230216
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-881-6
www.packtpub.com
Credits
Authors
  Rafał Kuć
  Marek Rogoziński
Reviewer
  Paige Cook
Commissioning Editor
  Nadeem Bagban
Acquisition Editor
  Divya Poojari
Content Development Editor
  Kirti Patil
Technical Editor
  Utkarsha S. Kadam
Copy Editor
  Alpha Singh
Project Coordinator
  Nidhi Joshi
Proofreader
  Safis Editing
Indexer
  Rekha Nair
Graphics
  Jason Monteiro
Production Coordinator
  Manu Joseph
Cover Work
  Manu Joseph
About the Authors
Rafał Kuć is a software engineer, trainer, speaker, and consultant. He works as
a consultant and software engineer at Sematext Group Inc., where he concentrates on
open source technologies such as Apache Lucene, Solr, and Elasticsearch. He has more
than 14 years of experience in various software domains, from banking software
to e-commerce products. He is mainly focused on Java; however, he is open to every
tool and programming language that might help him to achieve his goals easily and
quickly. Rafał is also one of the founders of the solr.pl site, where he tries to share his
knowledge and help people solve their Solr and Lucene problems. He is also a speaker
at various conferences around the world such as Lucene Eurocon, Berlin Buzzwords,
ApacheCon, Lucene/Solr Revolution, Velocity, and DevOps Days.
Rafał began his journey with Lucene in 2002; however, it wasn't love at first sight.
When he came back to Lucene in late 2003, he revised his thoughts about the
framework and saw the potential in search technologies. Then Solr came and that
was it. He started working with Elasticsearch in the middle of 2010. At present,
Lucene, Solr, Elasticsearch, and information retrieval are his main areas of interest.
Rafał is also the author of the Solr Cookbook series, ElasticSearch Server and its
second edition, and the first and second editions of Mastering ElasticSearch, all
published by Packt Publishing.
Marek Rogoziński is a software architect and consultant with more than 10 years
of experience. His specialization concerns solutions based on open source search
engines, such as Solr and Elasticsearch, and the software stack for big data analytics
including Hadoop, HBase, and Twitter Storm.
He is also a cofounder of the solr.pl site, which publishes information and tutorials
about Solr and Lucene libraries. He is the coauthor of ElasticSearch Server and its
second edition, and the first and second editions of Mastering ElasticSearch, all
published by Packt Publishing.
He is currently the chief technology officer and lead architect at ZenCard, a company
that processes and analyzes large quantities of payment transactions in real time,
allowing automatic and anonymous identification of retail customers on all retailer
channels (m-commerce, e-commerce, and brick-and-mortar) and giving retailers a customer
retention and loyalty tool.
About the Reviewer
Paige Cook works as a software architect for Videa, part of the Cox Family of
Companies, and lives near Atlanta, Georgia. He has twenty years of experience
in software development, primarily with the Microsoft .NET Framework. His
career has been largely focused on building enterprise solutions for the media and
entertainment industry. He is especially interested in search technologies using
the Apache Lucene search engine and has experience with both Elasticsearch and
Apache Solr. Apart from his work, he enjoys DIY home projects and spending time
with his wife and two daughters.
www.PacktPub.com
eBooks, discount offers, and more
Did you know that Packt offers eBook versions of every book published, with PDF
and ePub files available? You can upgrade to the eBook version at www.PacktPub.com
and, as a print book customer, you are entitled to a discount on the eBook copy. Get in
touch with us for more details.
At www.PacktPub.com, you can also read a collection of free technical articles,
sign up for a range of free newsletters and receive exclusive discounts and offers
on Packt books and eBooks.
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital
book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser
Table of Contents
Preface
Chapter 1: Getting Started with Elasticsearch Cluster
  Full text searching
  The Lucene glossary and architecture
  Input data analysis
  Indexing and querying
  Scoring and query relevance
  The basics of Elasticsearch
  Key concepts of Elasticsearch
  Index
  Document
  Document type
  Mapping
  Key concepts of the Elasticsearch infrastructure
  Nodes and clusters
  Shards
  Replicas
  Gateway
  Indexing and searching
  Installing and configuring your cluster
  Installing Java
  Installing Elasticsearch
  Running Elasticsearch
  Shutting down Elasticsearch
  The directory layout
  Configuring Elasticsearch
  The system-specific installation and configuration
  Installing Elasticsearch on Linux
  Configuring Elasticsearch as a system service on Linux
  Elasticsearch as a system service on Windows
  Manipulating data with the REST API
  Understanding the REST API
  Storing data in Elasticsearch
  Creating a new document
  Retrieving documents
  Updating documents
  Dealing with non-existing documents
  Adding partial documents
  Deleting documents
  Versioning
  Usage example
  Versioning from external systems
  Searching with the URI request query
  Sample data
  URI search
  Elasticsearch query response
  Query analysis
  URI query string parameters
  The query
  The default search field
  Analyzer
  The default operator property
  Query explanation
  The fields returned
  Sorting the results
  The search timeout
  The results window
  Limiting per-shard results
  Ignoring unavailable indices
  The search type
  Lowercasing term expansion
  Wildcard and prefix analysis
  Lucene query syntax
  Summary
Chapter 2: Indexing Your Data
  Elasticsearch indexing
  Shards and replicas
  Write consistency
  Creating indices
  Altering automatic index creation
  Settings for a newly created index
  Index deletion
  Mappings configuration
  Type determining mechanism
  Disabling the type determining mechanism
  Tuning the type determining mechanism for numeric types
  Tuning the type determining mechanism for dates
  Index structure mapping
  Type and types definition
  Fields
  Core types
  Multi fields
  The IP address type
  Token count type
  Using analyzers
  Out-of-the-box analyzers
  Defining your own analyzers
  Default analyzers
  Different similarity models
  Setting per-field similarity
  Available similarity models
  Batch indexing to speed up your indexing process
  Preparing data for bulk indexing
  Indexing the data
  The _all field
  The _source field
  Additional internal fields
  Introduction to segment merging
  Segment merging
  The need for segment merging
  The merge policy
  The merge scheduler
  Throttling
  Introduction to routing
  Default indexing
  Default searching
  Routing
  The routing parameters
  Routing fields
  Summary
Chapter 3: Searching Your Data
  Querying Elasticsearch
  The example data
  A simple query
  Paging and result size
  Returning the version value
  Limiting the score
  Choosing the fields that we want to return
  Source filtering
  Using the script fields
  Passing parameters to the script fields
  Understanding the querying process
  Query logic
  Search type
  Search execution preference
  Search shards API
  Basic queries
  The term query
  The terms query
  The match all query
  The type query
  The exists query
  The missing query
  The common terms query
  The match query
  The Boolean match query
  The phrase match query
  The match phrase prefix query
  The multi match query
  The query string query
  Running the query string query against multiple fields
  The simple query string query
  The identifiers query
  The prefix query
  The fuzzy query
  The wildcard query
  The range query
  Regular expression query
  The more like this query
  Compound queries
  The bool query
  The dis_max query
  The boosting query
  The constant_score query
  The indices query
  Using span queries
  A span
  Span term query
  Span first query
  Span near query
  Span or query
  Span not query
  Span within query
  Span containing query
  Span multi query
  Performance considerations
  Choosing the right query
  The use cases
  Limiting results to given tags
  Searching for values in a range
  Boosting some of the matched documents
  Ignoring lower scoring partial queries
  Using Lucene query syntax in queries
  Handling user queries without errors
  Autocomplete using prefixes
  Finding terms similar to a given one
  Matching phrases
  Spans, spans everywhere
  Summary
Chapter 4: Extending Your Querying Knowledge
  Filtering your results
  The context is the key
  Explicit filtering with bool query
  Highlighting
  Getting started with highlighting
  Field configuration
  Under the hood
  Forcing highlighter type
  Configuring HTML tags
  Controlling highlighted fragments
  Global and local settings
  Require matching
  Custom highlighting query
  The Postings highlighter
  Validating your queries
  Using the Validate API
  Sorting data
  Default sorting
  Selecting fields used for sorting
  Sorting mode
  Specifying behavior for missing fields
  Dynamic criteria
  Calculate scoring when sorting
  Query rewrite
  Prefix query as an example
  Getting back to Apache Lucene
  Query rewrite properties
  Summary
Chapter 5: Extending Your Index Structure
  Indexing tree-like structures
  Data structure
  Analysis
  Indexing data that is not flat
  Data
  Objects
  Arrays
  Mappings
  Final mappings
  Sending the mappings to Elasticsearch
  To be or not to be dynamic
  Disabling object indexing
  Using nested objects
  Scoring and nested queries
  Using the parent-child relationship
  Index structure and data indexing
  Child mappings
  Parent mappings
  The parent document
  Child documents
  Querying
  Querying data in the child documents
  Querying data in the parent documents
  Performance considerations
  Modifying your index structure with the update API
  The mappings
  Adding a new field to the existing index
  Modifying fields of an existing index
  Summary
Chapter 6: Make Your Search Better
  Introduction to Apache Lucene scoring
  When a document is matched
  Default scoring formula
  Relevancy matters
  Scripting capabilities of Elasticsearch
  Objects available during script execution
  Script types
  In file scripts
  Inline scripts
  Indexed scripts
  Querying with scripts
  Scripting with parameters
  Script languages
  Using other than embedded languages
  Using native code
  The factory implementation
  Implementing the native script
  The plugin definition
  Installing the plugin
  Running the script
  Searching content in different languages
  Handling languages differently
  Handling multiple languages
  Detecting the language of the document
  Sample document
  The mappings
  Querying
  Queries with an identified language
  Queries with an unknown language
  Combining queries
  Influencing scores with query boosts
  The boost
  Adding the boost to queries
  Modifying the score
  Constant score query
  Boosting query
  The function score query
  When does index-time boosting make sense?
  Defining boosting in the mappings
  Words with the same meaning
  Synonym filter
  Synonyms in the mappings
  Synonyms stored on the file system
  Defining synonym rules
  Using Apache Solr synonyms
  Using WordNet synonyms
  Query or index-time synonym expansion
  Understanding the explain information
  Understanding field analysis
  Explaining the query
  Summary
Chapter 7: Aggregations for Data Analysis
  Aggregations
  General query structure
  Inside the aggregations engine
  Aggregation types
  Metrics aggregations
  Minimum, maximum, average, and sum
  Field value statistics and extended statistics
  Value count
  Field cardinality
  Percentiles
  Percentile ranks
  Top hits aggregation
  Geo bounds aggregation
  Scripted metrics aggregation
  Buckets aggregations
  Filter aggregation
  Filters aggregation
  Terms aggregation
  Range aggregation
  Date range aggregation
  IPv4 range aggregation
  Missing aggregation
  Histogram aggregation
  Date histogram aggregation
  Time zones
  Geo distance aggregations
  Geohash grid aggregation
  Global aggregation
  Significant terms aggregation
  Choosing significant terms
  Multiple value analysis
  Sampler aggregation
  Children aggregation
  Nested aggregation
  Reverse nested aggregation
  Nesting aggregations and ordering buckets
  Buckets ordering
  Pipeline aggregations
  Available types
  Referencing other aggregations
  Gaps in the data
  Pipeline aggregation types
  Summary
Chapter 8: Beyond Full-text Searching
  Percolator
  The index
  Percolator preparation
  Getting deeper
  Controlling the size of returned results
  Percolator and score calculation
  Combining percolators with other functionalities
  Getting the number of matching queries
  Indexed document percolation
  Elasticsearch spatial capabilities
  Mapping preparation for spatial searches
  Example data
  Additional geo_field properties
  Sample queries
  Distance-based sorting
  Bounding box filtering
  Limiting the distance
  Arbitrary geo shapes
  Point
  Envelope
  Polygon
  Multipolygon
  An example usage
  Storing shapes in the index
  Using suggesters
  Available suggester types
  Including suggestions
  Suggester response
  Term suggester
  Term suggester configuration options
  Additional term suggester options
  Phrase suggester
  Completion suggester
  Custom weights
  Context suggester
  The Scroll API
  Problem definition
  Scrolling to the rescue
  Summary
Chapter 9: Elasticsearch Cluster in Detail
  Understanding node discovery
  Discovery types
  Node roles
  Master node
  Data node
  Client node
  Configuring node roles
  Setting the cluster's name
  Zen discovery
  Master election configuration
  Configuring unicast
  Fault detection ping settings
  Cluster state updates control
  Dealing with master unavailability
  Adjusting HTTP transport settings
  Disabling HTTP
  HTTP port
  HTTP host
  The gateway and recovery modules
  The gateway
  Recovery control
  Additional gateway recovery options
  Indices recovery API
  Delayed allocation
  Index recovery prioritization
  Templates and dynamic templates
  Templates
  An example of a template
  Dynamic templates
  The matching pattern
  Field definitions
  Elasticsearch plugins
  The basics
  Installing plugins
  Removing plugins
  Elasticsearch caches
  Fielddata cache
  Fielddata size
  Circuit breakers
  Fielddata and doc values
  Shard request cache
  Enabling and configuring the shard request cache
  Per request shard request cache disabling
  Shard request cache usage monitoring
  Node query cache
  Indexing buffers
  When caches should be avoided
  The update settings API
  The cluster settings API
  The indices settings API
  Summary
Chapter 10: Administrating Your Cluster
  Elasticsearch time machine
  Creating a snapshot repository
  Creating snapshots
  Additional parameters
  Restoring a snapshot
  Cleaning up: deleting old snapshots
  Monitoring your cluster's state and health
  Cluster health API
  Controlling information details
  Additional parameters
  Indices stats API
  Docs
  Store
  Indexing, get, and search
  Additional information
  Nodes info API
  Returned information
  Nodes stats API
  Cluster state API
  Cluster stats API
  Pending tasks API
  Indices recovery API
  Indices shard stores API
  Indices segments API
  Controlling the shard and replica allocation
  Explicitly controlling allocation
  Specifying node parameters
  Configuration
  Index creation
  Excluding nodes from allocation
  Requiring node attributes
  Using the IP address for shard allocation
  Disk-based shard allocation
  The number of shards and replicas per node
  Allocation throttling
  Cluster-wide allocation
  Allocation awareness
  Forcing allocation awareness
  Filtering
  Manually moving shards and replicas
  Moving shards
  Canceling shard allocation
  Forcing shard allocation
  Multiple commands per HTTP request
  Allowing operations on primary shards
  Handling rolling restarts
  Controlling cluster rebalancing
  Understanding rebalance
  Cluster being ready
  The cluster rebalance settings
  Controlling when rebalancing will be allowed
  Controlling the number of shards being moved between nodes concurrently
  Controlling which shards may be rebalanced
  The Cat API
  The basics
  Using Cat API
  Common arguments
  The examples
  Getting information about the master node
  Getting information about the nodes
  Retrieving recovery information for an index
  Warming up
  Defining a new warming query
  Retrieving the defined warming queries
  Deleting a warming query
  Disabling the warming up functionality
  Choosing queries for warming
  Index aliasing and using it to simplify your everyday work
  An alias
  Creating an alias
  Modifying aliases
  Combining commands
  Retrieving aliases
  Removing aliases
  Filtering aliases
  Aliases and routing
  Zero downtime reindexing and aliases
  Summary
Chapter 11: Scaling by Example
  Hardware
  Physical servers or a cloud
  CPU
  RAM memory
  Mass storage
  The network
  How many servers
  Cost cutting
  Preparing a single Elasticsearch node
  The general preparations
  Avoiding swapping
  File descriptors
  Virtual memory
  The memory
  Field data cache and breaking the circuit
  Use doc values
  RAM buffer for indexing
  Index refresh rate
  Thread pools
  Horizontal expansion
  Automatically creating the replicas
  Redundancy and high availability
  Cost and performance flexibility
  Continuous upgrades
  Multiple Elasticsearch instances on a single physical machine
  Preventing a shard and its replicas from being on the same node
  Designated node roles for larger clusters
  Query aggregator nodes
  Data nodes
  Master eligible nodes
  Preparing the cluster for high indexing and querying throughput
  Indexing related advice
  Index refresh rate
  Thread pools tuning
  Automatic store throttling
  Handling time-based data
  Multiple data paths
  Data distribution
  Bulk indexing
  RAM buffer for indexing
  Advice for high query rate scenarios
  Shard request cache
  Think about the queries
  Parallelize your queries
  Field data cache and breaking the circuit
  Keep size and shard size under control
  Monitoring
  Elasticsearch HQ
  Marvel
  SPM for Elasticsearch
  Summary
Index
Preface
Welcome to Elasticsearch Server, Third Edition. This is the third instalment of the
book, dedicated to yet another major release of Elasticsearch: this time, version 2.2.
In the third edition, we followed a route similar to the one we took when we
wrote the second edition of the book. We not only updated the content to match the
new version of Elasticsearch, but also restructured the book by removing and adding
sections and chapters. We read the suggestions we got from you, the readers
of the book, and we carefully tried to incorporate the suggestions and comments
received since the release of the first and second editions.
While reading this book, you will be taken on a journey to the wonderful world of
full-text search provided by the Elasticsearch server. We will start with a general
introduction to Elasticsearch, which covers how to start and run Elasticsearch, its
basic concepts, and how to index and search your data in the most basic way. This
book will also discuss the query language, the so-called Query DSL, which allows you
to create complicated queries and filter returned results. In addition to all of this,
you'll see how you can use the aggregation framework to calculate aggregated data
based on the results returned by your queries. We will implement the autocomplete
functionality together and learn how to use Elasticsearch spatial capabilities and
prospective search.
Finally, this book will show you Elasticsearch's administration API capabilities
with features such as shard placement control, cluster handling, and more, ending
with a dedicated chapter that discusses how to prepare Elasticsearch for small and
large deployments, both ones that concentrate on indexing and ones that
concentrate on querying.
What this book covers
Chapter 1, Getting Started with Elasticsearch Cluster, covers what full-text searching is,
what Apache Lucene is, what text analysis is, how to run and configure Elasticsearch,
and finally, how to index and search your data in the most basic way.
Chapter 2, Indexing Your Data, shows how indexing works, how to prepare index
structure, what data types we are allowed to use, how to speed up indexing, what
segments are, how merging works, and what routing is.
Chapter 3, Searching Your Data, introduces the full-text search capabilities of
Elasticsearch by discussing how to query it, how the querying process works,
and what types of basic and compound queries are available. In addition to this,
we will show how to use position-aware queries in Elasticsearch.
Chapter 4, Extending Your Querying Knowledge, shows how to efficiently narrow down
your search results by using filters, how highlighting works, how to sort your results,
and how query rewrite works.
Chapter 5, Extending Your Index Structure, shows how to index more complex data
structures. We learn how to index tree-like data types, how to index data with
relationships between documents, and how to modify index structure.
Chapter 6, Make Your Search Better, covers Apache Lucene scoring and how to
influence it in Elasticsearch, the scripting capabilities of Elasticsearch, and its
language analysis capabilities.
Chapter 7, Aggregations for Data Analysis, introduces you to the great world of data
analysis by showing you how to use the Elasticsearch aggregation framework.
We will discuss all types of aggregations—metrics, buckets, and the new pipeline
aggregations that have been introduced in Elasticsearch.
Chapter 8, Beyond Full-text Searching, discusses functionality not related to
full-text search, such as the percolator (reversed search) and the geo-spatial capabilities
of Elasticsearch. This chapter also discusses suggesters, which allow us to build
a spellchecking functionality and an efficient autocomplete mechanism, and
shows how to handle deep paging efficiently.
Chapter 9, Elasticsearch Cluster in Detail, discusses the node discovery mechanism,
the Elasticsearch recovery and gateway modules, templates, caches, and the update
settings API.
Chapter 10, Administrating Your Cluster, covers the Elasticsearch backup functionality,
rebalancing, and moving shards. In addition to this, you will learn how to use the
warm up functionality, use the Cat API, and work with aliases.