Elasticsearch Server
Third Edition

Leverage Elasticsearch to create a robust,
fast, and flexible search solution with ease

Rafał Kuć
Marek Rogoziński

BIRMINGHAM - MUMBAI



Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the authors, nor Packt
Publishing, and its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.


First published: October 2013
Second edition: February 2015
Third edition: February 2016

Production reference: 1230216

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78588-881-6
www.packtpub.com


Credits
Authors
Rafał Kuć
Marek Rogoziński

Reviewer
Paige Cook

Commissioning Editor
Nadeem Bagban

Acquisition Editor
Divya Poojari

Content Development Editor
Kirti Patil

Technical Editor
Utkarsha S. Kadam

Copy Editor
Alpha Singh

Project Coordinator
Nidhi Joshi

Proofreader
Safis Editing

Indexer
Rekha Nair

Graphics
Jason Monteiro

Production Coordinator
Manu Joseph

Cover Work
Manu Joseph


About the Authors
Rafał Kuć is a software engineer, trainer, speaker, and consultant. He works as
a consultant and software engineer at Sematext Group Inc., where he concentrates on
open source technologies such as Apache Lucene, Solr, and Elasticsearch. He has more
than 14 years of experience in various software domains, from banking software
to e-commerce products. He is mainly focused on Java; however, he is open to every
tool and programming language that might help him to achieve his goals easily and
quickly. Rafał is also one of the founders of the solr.pl site, where he tries to share his
knowledge and help people solve their Solr and Lucene problems. He is also a speaker
at various conferences around the world such as Lucene Eurocon, Berlin Buzzwords,
ApacheCon, Lucene/Solr Revolution, Velocity, and DevOps Days.
Rafał began his journey with Lucene in 2002; however, it wasn't love at first sight.
When he came back to Lucene in late 2003, he revised his thoughts about the
framework and saw the potential in search technologies. Then Solr came and that
was it. He started working with Elasticsearch in the middle of 2010. At present,
Lucene, Solr, Elasticsearch, and information retrieval are his main areas of interest.
Rafał is also the author of the Solr Cookbook series, ElasticSearch Server and its
second edition, and the first and second editions of Mastering ElasticSearch, all
published by Packt Publishing.


Marek Rogoziński is a software architect and consultant with more than 10 years
of experience. His specialization concerns solutions based on open source search
engines, such as Solr and Elasticsearch, and the software stack for big data analytics
including Hadoop, HBase, and Twitter Storm.
He is also a cofounder of the solr.pl site, which publishes information and tutorials
about Solr and Lucene libraries. He is the coauthor of ElasticSearch Server and its
second edition, and the first and second editions of Mastering ElasticSearch, all
published by Packt Publishing.
He is currently the chief technology officer and lead architect at ZenCard, a company
that processes and analyzes large quantities of payment transactions in real time,
allowing automatic and anonymous identification of retail customers on all retailer
channels (m-commerce, e-commerce, and brick-and-mortar) and giving retailers a customer
retention and loyalty tool.



About the Reviewer
Paige Cook works as a software architect for Videa, part of the Cox Family of
Companies, and lives near Atlanta, Georgia. He has twenty years of experience
in software development, primarily with the Microsoft .NET Framework. His
career has been largely focused on building enterprise solutions for the media and
entertainment industry. He is especially interested in search technologies using
the Apache Lucene search engine and has experience with both Elasticsearch and
Apache Solr. Apart from his work, he enjoys DIY home projects and spending time
with his wife and two daughters.


www.PacktPub.com
eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF
and ePub files available? You can upgrade to the eBook version at www.PacktPub.com
and as a print book customer, you are entitled to a discount on the eBook copy. Get in
touch with us at for more details.
At www.PacktPub.com, you can also read a collection of free technical articles,
sign up for a range of free newsletters and receive exclusive discounts and offers
on Packt books and eBooks.
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital
book library. Here, you can search, access, and read Packt's entire library of books.


Why subscribe?

• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser



Table of Contents
Preface
Chapter 1: Getting Started with Elasticsearch Cluster
  Full text searching
  The Lucene glossary and architecture
  Input data analysis
  Indexing and querying
  Scoring and query relevance
  The basics of Elasticsearch
  Key concepts of Elasticsearch
    Index
    Document
    Document type
    Mapping
  Key concepts of the Elasticsearch infrastructure
    Nodes and clusters
    Shards
    Replicas
    Gateway
  Indexing and searching
  Installing and configuring your cluster
  Installing Java
  Installing Elasticsearch
  Running Elasticsearch
  Shutting down Elasticsearch
  The directory layout
  Configuring Elasticsearch
  The system-specific installation and configuration
    Installing Elasticsearch on Linux
    Configuring Elasticsearch as a system service on Linux
    Elasticsearch as a system service on Windows
  Manipulating data with the REST API
  Understanding the REST API
  Storing data in Elasticsearch
    Creating a new document
  Retrieving documents
  Updating documents
    Dealing with non-existing documents
    Adding partial documents
  Deleting documents
  Versioning
    Usage example
    Versioning from external systems
  Searching with the URI request query
  Sample data
  URI search
    Elasticsearch query response
  Query analysis
  URI query string parameters
    The query
    The default search field
    Analyzer
    The default operator property
    Query explanation
    The fields returned
    Sorting the results
    The search timeout
    The results window
    Limiting per-shard results
    Ignoring unavailable indices
    The search type
    Lowercasing term expansion
    Wildcard and prefix analysis
  Lucene query syntax
  Summary
Chapter 2: Indexing Your Data
  Elasticsearch indexing
  Shards and replicas
    Write consistency
  Creating indices
    Altering automatic index creation
    Settings for a newly created index
    Index deletion
  Mappings configuration
  Type determining mechanism
    Disabling the type determining mechanism
    Tuning the type determining mechanism for numeric types
    Tuning the type determining mechanism for dates
  Index structure mapping
    Type and types definition
    Fields
    Core types
    Multi fields
    The IP address type
    Token count type
  Using analyzers
    Out-of-the-box analyzers
    Defining your own analyzers
    Default analyzers
  Different similarity models
    Setting per-field similarity
    Available similarity models
  Batch indexing to speed up your indexing process
  Preparing data for bulk indexing
  Indexing the data
  The _all field
  The _source field
  Additional internal fields
  Introduction to segment merging
  Segment merging
  The need for segment merging
  The merge policy
  The merge scheduler
  Throttling
  Introduction to routing
  Default indexing
  Default searching
  Routing
  The routing parameters
  Routing fields
  Summary
Chapter 3: Searching Your Data
  Querying Elasticsearch
  The example data
  A simple query
  Paging and result size
  Returning the version value
  Limiting the score
  Choosing the fields that we want to return
  Source filtering
  Using the script fields
    Passing parameters to the script fields
  Understanding the querying process
  Query logic
  Search type
  Search execution preference
  Search shards API
  Basic queries
  The term query
  The terms query
  The match all query
  The type query
  The exists query
  The missing query
  The common terms query
  The match query
    The Boolean match query
    The phrase match query
    The match phrase prefix query
  The multi match query
  The query string query
    Running the query string query against multiple fields
  The simple query string query
  The identifiers query
  The prefix query
  The fuzzy query
  The wildcard query
  The range query
  Regular expression query
  The more like this query
  Compound queries
  The bool query
  The dis_max query
  The boosting query
  The constant_score query
  The indices query
  Using span queries
  A span
  Span term query
  Span first query
  Span near query
  Span or query
  Span not query
  Span within query
  Span containing query
  Span multi query
  Performance considerations
  Choosing the right query
  The use cases
    Limiting results to given tags
    Searching for values in a range
    Boosting some of the matched documents
    Ignoring lower scoring partial queries
    Using Lucene query syntax in queries
    Handling user queries without errors
    Autocomplete using prefixes
    Finding terms similar to a given one
    Matching phrases
    Spans, spans everywhere
  Summary
Chapter 4: Extending Your Querying Knowledge
  Filtering your results
  The context is the key
  Explicit filtering with bool query
  Highlighting
  Getting started with highlighting
  Field configuration
  Under the hood
    Forcing highlighter type
  Configuring HTML tags
  Controlling highlighted fragments
  Global and local settings
  Require matching
  Custom highlighting query
  The Postings highlighter
  Validating your queries
  Using the Validate API
  Sorting data
  Default sorting
  Selecting fields used for sorting
    Sorting mode
  Specifying behavior for missing fields
  Dynamic criteria
  Calculate scoring when sorting
  Query rewrite
  Prefix query as an example
  Getting back to Apache Lucene
  Query rewrite properties
  Summary
Chapter 5: Extending Your Index Structure
  Indexing tree-like structures
  Data structure
  Analysis
  Indexing data that is not flat
  Data
  Objects
  Arrays
  Mappings
    Final mappings
    Sending the mappings to Elasticsearch
  To be or not to be dynamic
  Disabling object indexing
  Using nested objects
  Scoring and nested queries
  Using the parent-child relationship
  Index structure and data indexing
    Child mappings
    Parent mappings
    The parent document
    Child documents
  Querying
    Querying data in the child documents
    Querying data in the parent documents
  Performance considerations
  Modifying your index structure with the update API
  The mappings
    Adding a new field to the existing index
    Modifying fields of an existing index
  Summary
Chapter 6: Make Your Search Better
  Introduction to Apache Lucene scoring
  When a document is matched
  Default scoring formula
  Relevancy matters
  Scripting capabilities of Elasticsearch
  Objects available during script execution
  Script types
    In file scripts
    Inline scripts
    Indexed scripts
  Querying with scripts
  Scripting with parameters
  Script languages
  Using other than embedded languages
  Using native code
    The factory implementation
    Implementing the native script
    The plugin definition
    Installing the plugin
    Running the script
  Searching content in different languages
  Handling languages differently
  Handling multiple languages
  Detecting the language of the document
  Sample document
  The mappings
  Querying
    Queries with an identified language
    Queries with an unknown language
  Combining queries
  Influencing scores with query boosts
  The boost
  Adding the boost to queries
  Modifying the score
    Constant score query
    Boosting query
    The function score query
  When does index-time boosting make sense?
  Defining boosting in the mappings
  Words with the same meaning
  Synonym filter
    Synonyms in the mappings
    Synonyms stored on the file system
  Defining synonym rules
    Using Apache Solr synonyms
    Using WordNet synonyms
  Query or index-time synonym expansion
  Understanding the explain information
  Understanding field analysis
  Explaining the query
  Summary
Chapter 7: Aggregations for Data Analysis
  Aggregations
  General query structure
  Inside the aggregations engine
  Aggregation types
  Metrics aggregations
    Minimum, maximum, average, and sum
    Field value statistics and extended statistics
    Value count
    Field cardinality
    Percentiles
    Percentile ranks
    Top hits aggregation
    Geo bounds aggregation
    Scripted metrics aggregation
  Buckets aggregations
    Filter aggregation
    Filters aggregation
    Terms aggregation
    Range aggregation
    Date range aggregation
    IPv4 range aggregation
    Missing aggregation
    Histogram aggregation
    Date histogram aggregation
      Time zones
    Geo distance aggregations
    Geohash grid aggregation
    Global aggregation
    Significant terms aggregation
      Choosing significant terms
      Multiple value analysis
    Sampler aggregation
    Children aggregation
    Nested aggregation
    Reverse nested aggregation
  Nesting aggregations and ordering buckets
    Buckets ordering
  Pipeline aggregations
    Available types
    Referencing other aggregations
    Gaps in the data
    Pipeline aggregation types
  Summary
Chapter 8: Beyond Full-text Searching
  Percolator
  The index
  Percolator preparation
  Getting deeper
    Controlling the size of returned results
    Percolator and score calculation
    Combining percolators with other functionalities
  Getting the number of matching queries
  Indexed document percolation
  Elasticsearch spatial capabilities
  Mapping preparation for spatial searches
  Example data
  Additional geo_field properties
  Sample queries
    Distance-based sorting
    Bounding box filtering
    Limiting the distance
  Arbitrary geo shapes
    Point
    Envelope
    Polygon
    Multipolygon
    An example usage
    Storing shapes in the index
  Using suggesters
  Available suggester types
  Including suggestions
    Suggester response
  Term suggester
    Term suggester configuration options
    Additional term suggester options
  Phrase suggester
  Completion suggester
    Custom weights
    Context suggester
  The Scroll API
  Problem definition
  Scrolling to the rescue
  Summary
Chapter 9: Elasticsearch Cluster in Detail
  Understanding node discovery
  Discovery types
  Node roles
    Master node
    Data node
    Client node
    Configuring node roles
  Setting the cluster's name
  Zen discovery
    Master election configuration
    Configuring unicast
    Fault detection ping settings
    Cluster state updates control
    Dealing with master unavailability
  Adjusting HTTP transport settings
    Disabling HTTP
    HTTP port
    HTTP host
  The gateway and recovery modules
  The gateway
  Recovery control
    Additional gateway recovery options
    Indices recovery API
    Delayed allocation
    Index recovery prioritization
  Templates and dynamic templates
  Templates
    An example of a template
  Dynamic templates
    The matching pattern
    Field definitions
  Elasticsearch plugins
  The basics
  Installing plugins
  Removing plugins
  Elasticsearch caches
  Fielddata cache
    Fielddata size
    Circuit breakers
  Fielddata and doc values
  Shard request cache
    Enabling and configuring the shard request cache
    Per request shard request cache disabling
    Shard request cache usage monitoring
  Node query cache
  Indexing buffers
  When caches should be avoided
  The update settings API
  The cluster settings API
  The indices settings API
  Summary
Chapter 10: Administrating Your Cluster
  Elasticsearch time machine
  Creating a snapshot repository
  Creating snapshots
    Additional parameters
  Restoring a snapshot
  Cleaning up – deleting old snapshots
  Monitoring your cluster's state and health
  Cluster health API
    Controlling information details
    Additional parameters
  Indices stats API
    Docs
    Store
    Indexing, get, and search
    Additional information
  Nodes info API
    Returned information
  Nodes stats API
  Cluster state API
  Cluster stats API
  Pending tasks API
  Indices recovery API
  Indices shard stores API
  Indices segments API
  Controlling the shard and replica allocation
  Explicitly controlling allocation
    Specifying node parameters
    Configuration
    Index creation
    Excluding nodes from allocation
    Requiring node attributes
    Using the IP address for shard allocation
    Disk-based shard allocation
  The number of shards and replicas per node
  Allocation throttling
  Cluster-wide allocation
    Allocation awareness
    Forcing allocation awareness
    Filtering
  Manually moving shards and replicas
    Moving shards
    Canceling shard allocation
    Forcing shard allocation
    Multiple commands per HTTP request
    Allowing operations on primary shards
  Handling rolling restarts
  Controlling cluster rebalancing
  Understanding rebalance
  Cluster being ready
  The cluster rebalance settings
    Controlling when rebalancing will be allowed
    Controlling the number of shards being moved between nodes concurrently
    Controlling which shards may be rebalanced
  The Cat API
  The basics
  Using Cat API
    Common arguments
  The examples
    Getting information about the master node
    Getting information about the nodes
    Retrieving recovery information for an index
  Warming up
  Defining a new warming query
  Retrieving the defined warming queries
  Deleting a warming query
  Disabling the warming up functionality
  Choosing queries for warming
  Index aliasing and using it to simplify your everyday work
  An alias
  Creating an alias
  Modifying aliases
  Combining commands
  Retrieving aliases
  Removing aliases
  Filtering aliases
  Aliases and routing
  Zero downtime reindexing and aliases
  Summary
Chapter 11: Scaling by Example
  Hardware
  Physical servers or a cloud
  CPU
  RAM memory
  Mass storage
  The network
  How many servers
  Cost cutting
  Preparing a single Elasticsearch node
  The general preparations
    Avoiding swapping
    File descriptors
    Virtual memory
  The memory
  Field data cache and breaking the circuit
  Use doc values
  RAM buffer for indexing
  Index refresh rate
  Thread pools
  Horizontal expansion
  Automatically creating the replicas
  Redundancy and high availability
  Cost and performance flexibility
  Continuous upgrades
  Multiple Elasticsearch instances on a single physical machine
    Preventing a shard and its replicas from being on the same node
  Designated node roles for larger clusters
    Query aggregator nodes
    Data nodes
    Master eligible nodes
  Preparing the cluster for high indexing and querying throughput
  Indexing related advice
    Index refresh rate
    Thread pools tuning
    Automatic store throttling
    Handling time-based data
    Multiple data paths
    Data distribution
    Bulk indexing
    RAM buffer for indexing
  Advice for high query rate scenarios
    Shard request cache
    Think about the queries
    Parallelize your queries
    Field data cache and breaking the circuit
    Keep size and shard size under control
  Monitoring
  Elasticsearch HQ
  Marvel
  SPM for Elasticsearch
  Summary
Index


Preface
Welcome to Elasticsearch Server, Third Edition. This is the third instalment of the
book, dedicated to yet another major release of Elasticsearch, this time version 2.2.
In the third edition, we decided to take a route similar to the one we took when we
wrote the second edition of the book. We not only updated the content to match the
new version of Elasticsearch, but also restructured the book by removing and adding
sections and chapters. We read the suggestions we got from you, the readers
of the book, and we carefully tried to incorporate the suggestions and comments
received since the release of the first and second editions.
While reading this book, you will be taken on a journey to the wonderful world of
full-text search provided by the Elasticsearch server. We will start with a general
introduction to Elasticsearch, which covers how to start and run Elasticsearch, its
basic concepts, and how to index and search your data in the most basic way. This
book will also discuss the query language, the so-called Query DSL, that allows you
to create complicated queries and filter returned results. In addition to all of this,
you'll see how you can use the aggregation framework to calculate aggregated data
based on the results returned by your queries. We will implement the autocomplete
functionality together and learn how to use Elasticsearch spatial capabilities and
prospective search.
Finally, this book will show you Elasticsearch's administration API capabilities
with features such as shard placement control, cluster handling, and more, ending
with a dedicated chapter that will discuss Elasticsearch's preparation for small and
large deployments, both ones that concentrate on indexing and ones that
concentrate on querying.

What this book covers

Chapter 1, Getting Started with Elasticsearch Cluster, covers what full-text searching is,
what Apache Lucene is, what text analysis is, how to run and configure Elasticsearch,
and finally, how to index and search your data in the most basic way.
Chapter 2, Indexing Your Data, shows how indexing works, how to prepare the index
structure, what data types we are allowed to use, how to speed up indexing, what
segments are, how merging works, and what routing is.
Chapter 3, Searching Your Data, introduces the full-text search capabilities of
Elasticsearch by discussing how to query it, how the querying process works,
and what types of basic and compound queries are available. In addition to this,
we will show how to use position-aware queries in Elasticsearch.
Chapter 4, Extending Your Querying Knowledge, shows how to efficiently narrow down
your search results by using filters, how highlighting works, how to sort your results,
and how query rewrite works.
Chapter 5, Extending Your Index Structure, shows how to index more complex data
structures. We learn how to index tree-like data types, how to index data with
relationships between documents, and how to modify the index structure.
Chapter 6, Make Your Search Better, covers Apache Lucene scoring and how to
influence it in Elasticsearch, the scripting capabilities of Elasticsearch, and its
language analysis capabilities.
Chapter 7, Aggregations for Data Analysis, introduces you to the great world of data
analysis by showing you how to use the Elasticsearch aggregation framework.
We will discuss all types of aggregations: metrics, buckets, and the new pipeline
aggregations that have been introduced in Elasticsearch.
Chapter 8, Beyond Full-text Searching, discusses functionality that goes beyond
full-text search, such as the percolator (reversed search) and the geospatial
capabilities of Elasticsearch. This chapter also discusses suggesters, which allow
us to build a spellchecking functionality and an efficient autocomplete mechanism,
and we will show how to handle deep paging efficiently.
Chapter 9, Elasticsearch Cluster in Detail, discusses the node discovery mechanism,
the gateway and recovery modules, templates, caches, and the settings
update API.
Chapter 10, Administrating Your Cluster, covers the Elasticsearch backup functionality,
rebalancing, and moving shards. In addition to this, you will learn how to use the
warm up functionality, use the Cat API, and work with aliases.