
Mastering Elasticsearch 5.x
In 2015, Elasticsearch, after acquiring Kibana, Logstash, Beats, and Found, re-branded the company as Elastic. According to Shay Banon, the name change was part of an initiative to better align the company with the broad range of solutions it provides: future products, and new innovations created by Elastic's massive community of developers and enterprises that use the ELK stack for everything from real-time search, to sophisticated analytics, to building modern data applications.
But having several products under one roof caused discord among them during the release process and started creating confusion for users. As a result, the ELK stack was renamed the Elastic Stack, and the company decided to release all components of the Elastic Stack together so that they all share the same version number. This keeps your deployments in step, simplifies compatibility testing, and makes it even easier for developers to add new functionality across the stack.
The very first GA release under the Elastic Stack is 5.0.0, which will be covered throughout this book. Further, Elasticsearch keeps pace with Lucene releases to incorporate bug fixes and the latest features. Elasticsearch 5.0 is based on Lucene 6, a major Lucene release with some awesome new features and a focus on improved search speed. We will discuss Lucene 6 in upcoming chapters to show how Elasticsearch gains some awesome improvements from it, both from the search and the storage points of view.
Elasticsearch 5.x brings many improvements and has gone through a great deal of refactoring, which caused the removal or deprecation of some features. We will keep discussing the removed, improved, and new features in upcoming chapters, but for now let's take an overview of what is new and improved in Elasticsearch.
The following are some of the most important features introduced in Elasticsearch version 5.0:
- Ingest node: Elasticsearch 5.0 introduces a new type of node, the ingest node, which can pre-process documents before they are indexed and is very lightweight across the board. You can avoid Logstash for these tasks because the ingest node is a Java-based implementation of the Logstash filters and comes as a default in Elasticsearch itself (a short sketch of these APIs follows this list).
- Shrink API: The number of shards in an index cannot be reduced after its creation; Elasticsearch 5.0 introduces the _shrink API to overcome this problem. This API allows you to shrink an existing index into a new index with a fewer number of shards. We will cover the ingest node and the shrink API in detail in Chapter 9, Data Transformation and Federated Search.
- Completion suggester: The completion suggester has undergone a complete rewrite. This means that the syntax and data structure for fields of type completion have changed, as have the syntax and response of completion suggester requests. The completion suggester is now built on top of the first iteration of Lucene's new suggest API.
- Delete by query: Deleting all documents matching a given query is now supported natively through the _delete_by_query REST endpoint.
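To make these concrete, here is a minimal sketch of the new APIs. The logs index, the rename-host pipeline, and the node name shrink_node are all made up for illustration; also note that a shrink requires the source index to be read-only with a copy of every shard collected on one node first:

# Create an ingest pipeline that renames a field at index time
curl -XPUT 'localhost:9200/_ingest/pipeline/rename-host' -d '{
  "description" : "rename hostname to host",
  "processors" : [
    { "rename" : { "field" : "hostname", "target_field" : "host" } }
  ]
}'

# Index a document through the pipeline
curl -XPUT 'localhost:9200/logs/log/1?pipeline=rename-host' -d '{ "hostname" : "web-01" }'

# Prepare the index for shrinking, then shrink it to a single primary shard
curl -XPUT 'localhost:9200/logs/_settings' -d '{
  "index.blocks.write" : true,
  "index.routing.allocation.require._name" : "shrink_node"
}'
curl -XPOST 'localhost:9200/logs/_shrink/logs_shrunk' -d '{
  "settings" : { "index.number_of_shards" : 1 }
}'

# Delete every document matching a query
curl -XPOST 'localhost:9200/logs/_delete_by_query' -d '{
  "query" : { "match" : { "host" : "web-01" } }
}'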
Apart from the features discussed just now, you can also benefit from all of the new features that came with Elasticsearch version 2.x. For those who have not had a look at the 2.x series, let's have a quick recap of the new features introduced in that series:
- Reindex API: Re-indexing existing data into a new index used to require writing your own custom code; the _reindex API makes this task very easy. At the simplest level, this API provides the ability to move data from one index to another, but it also provides great control while re-indexing the documents, such as using scripts for data transformation, along with many other parameters (see the sketch after this list). You can take a look at the reindex API at the following URL: https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-reindex.html.
- Update by query: Similarly, updating all documents matching a given query became possible through the update_by_query REST endpoint in version 2.x.
- Tasks API: The tasks API, exposed through the _tasks REST endpoint, is used for retrieving information about the tasks currently executing on one or more nodes in the cluster. The following examples show the usage of the tasks API:

GET /_tasks
GET /_tasks?nodes=nodeId1,nodeId2
GET /_tasks?nodes=nodeId1&actions=cluster:*

A running task can also be cancelled:

POST /_tasks/taskId1/_cancel

- Profile API: The profile API is an awesome tool for debugging queries, getting insights into why a certain query is slow, and taking steps to improve it. This API was released in version 2.2.0 and provides detailed timing information about the execution of individual components in a search request. You just need to send profile set to true with your query object to get it working. For example:

curl -XGET 'localhost:9200/_search' -d '{
  "profile" : true,
  "query" : {
    "match" : { "message" : "query profiling test" }
  }
}'
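As a quick illustration of the first two endpoints, here is a hedged sketch; the index names old_index and new_index and the views field are made up, and the script assumes the Painless language that 5.x uses by default:

# Copy all documents from one index into another
curl -XPOST 'localhost:9200/_reindex' -d '{
  "source" : { "index" : "old_index" },
  "dest" : { "index" : "new_index" }
}'

# Update every matching document in place with a script
curl -XPOST 'localhost:9200/new_index/_update_by_query' -d '{
  "script" : { "inline" : "ctx._source.views = 0", "lang" : "painless" },
  "query" : { "match_all" : {} }
}'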
The change list is very long, and covering every detail is out of the scope of this book, since most of the changes are internal and not something a user should worry about. However, we will cover the most important changes an existing Elasticsearch user must know.
Although this book is based on Elasticsearch version 5.0, it is very important for the reader to know about the changes made between versions 1.x and 2.x. If you are new to Elasticsearch and not familiar with the older versions, you can skip this section.
Elasticsearch version 2.x was focused on resiliency, reliability, simplification, and features. This release was based on Apache Lucene 5.x and specifically improves query execution and spatial search.
Version 2.x also delivers considerable improvements in index recovery. Historically, Elasticsearch index recovery was extremely painful, whether as part of node maintenance or an upgrade. The bigger the cluster, the bigger the headache. Node failures or a reboot could trigger a shard-reallocation storm, and entire shards were sometimes copied over the network even though the data was already present locally. Users have reported more than a day of recovery time just to restart a single node.
With 2.x, recovery of existing replica shards became almost instant, and reallocation is more lenient, which avoids reshuffling and makes rolling upgrades much easier and faster. Auto-regulating feedback loops in recent updates also eliminate past worries about merge throttling and related settings.
Elasticsearch 2.x also solved many of the known issues that plagued previous versions, including:
Elasticsearch developers originally thought of an index as a database and a type as a table. This allowed users to create multiple types inside the same index, but it eventually became a major source of issues because of restrictions imposed by Lucene.
Fields that share a name across multiple types in a single index are mapped to a single field inside Lucene. A field being an integer in one document type while being a string in another could result in incorrect query outcomes and even index corruption. These and several other issues led to a refactoring of mappings and major restrictions on handling mapping conflicts.
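For instance, an index-creation request along the following lines (the index and type names are made up) declares the same field with two conflicting types. Elasticsearch 2.x rejects such a request at creation time, whereas older versions accepted it and silently mapped both fields onto one Lucene field:

# Two types in one index declaring "identifier" with conflicting types
curl -XPUT 'localhost:9200/catalog' -d '{
  "mappings" : {
    "book" :  { "properties" : { "identifier" : { "type" : "integer" } } },
    "movie" : { "properties" : { "identifier" : { "type" : "string" } } }
  }
}'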
The following are the most significant changes imposed by Elasticsearch version 2.x:
- Type names can no longer start with a dot (.percolator is an exception).
- The index_analyzer and _analyzer parameters were removed from mapping definitions.
- The ignore_conflicts option of the put mappings API was removed; conflicts cannot be ignored anymore.
- Documents can no longer carry metadata fields inside their body; if your documents contain fields named _id or _type, they will not work in version 2.x. You need to reindex your documents after dropping those fields.
- The default date format has changed from date_optional_time to strict_date_optional_time, which expects a four-digit year and a two-digit month and day (and optionally, a two-digit hour, minute, and second). So a dynamically detected date such as "2016-01-01" will be stored inside Elasticsearch in "strict_date_optional_time||epoch_millis" format. Please note that if you are migrating from a version older than 2.x, your date range queries might be affected by this. For example, if in Elasticsearch 1.x you had two documents indexed, one with the date 2017-02-28T12:00:00.000Z and the other with the date 2017-03-01T11:59:59.000Z, and you searched for documents between February 28, 2017 and March 1, 2017, the following query could return both documents:

{
  "range" : {
    "created_at" : {
      "gte" : "2017-02-28",
      "lte" : "2017-03-01"
    }
  }
}
But from version 2.0 onwards, the same query must use the complete date and time to return the same results. For example:

{
  "range" : {
    "created_at" : {
      "gte" : "2017-02-28T00:00:00.000Z",
      "lte" : "2017-03-01T11:59:59.000Z"
    }
  }
}

In addition, you can use date math in combination with date rounding to get the same results, as in the following query:

{
  "range" : {
    "created_at" : {
      "gte" : "2017-02-28",
      "lte" : "2017-02-28||+1d/d",
      "format" : "strict_date_optional_time"
    }
  }
}
Prior to version 2.0.0, Elasticsearch had two different objects for querying data: queries and filters. Each was different in functionality and performance.
Queries were used to find out how relevant a document was to a particular query by calculating a score for each document. Filters were used to match certain criteria and were cacheable to enable faster execution. This means that if a filter matched 1,000 documents, Elasticsearch would cache the resulting bitset in memory so those documents could be retrieved quickly in case the same filter was executed again.
However, with the release of Lucene 5.0, which is used by Elasticsearch version 2.0.0, both queries and filters became the same internal object, taking care of both document relevance and matching.
So, an Elasticsearch query that used to look like the following:

{
  "filtered" : {
    "query" : { query definition },
    "filter" : { filter definition }
  }
}

should now be written like this in version 2.x:

{
  "bool" : {
    "must" : { query definition },
    "filter" : { filter definition }
  }
}
Additionally, the confusion caused by choosing between a bool filter and an and/or filter has been addressed by eliminating the and/or filters in favor of the bool query syntax shown in the preceding example. Rather than incurring the unnecessary caching and memory cost that often resulted from picking the wrong filter, Elasticsearch now tracks and optimizes frequently used filters, and doesn't cache filters for segments with fewer than 10,000 documents or less than 3% of the index.
Starting from 2.x, Elasticsearch runs with the Java Security Manager enabled by default, which limits the permissions available to the Elasticsearch process after startup.
Elasticsearch has applied a durable-by-default approach to reliability and data duplication across multiple nodes. Documents are now synced to disk before indexing requests are acknowledged, and all file renames are now atomic to prevent partially written files.
On the networking side, based on extensive feedback from system administrators, Elasticsearch removed multicast discovery, and the default zen discovery mechanism has been changed to unicast. Elasticsearch also now binds to localhost by default, preventing unconfigured nodes from joining public networks.
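In practice this means a production node needs its bind address and seed hosts configured explicitly. A minimal elasticsearch.yml sketch, with made-up addresses, could look like this:

# Bind to a non-loopback address and list the peers to contact for discovery
network.host: 192.168.56.10
discovery.zen.ping.unicast.hosts: ["192.168.56.11", "192.168.56.12"]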
Before version 2.0.0, Elasticsearch used the SIGAR library for operating-system-dependent statistics. But SIGAR is no longer maintained, and it has been replaced in Elasticsearch by a reliance on the statistics provided by the JVM. Accordingly, we see various changes in the monitoring parameters of the node info and node stats APIs:
- network.* has been removed from nodes info and nodes stats.
- fs.*.dev and fs.*.disk* have been removed from nodes stats.
- os.* has been removed from nodes stats, except for os.timestamp, os.load_average, os.mem.*, and os.swap.*.
- os.mem.total and os.swap.total have been removed from nodes info.
- In the _stats API, the id_cache parameter, which reported the memory used by the parent-child data structure, has also been removed. The id_cache memory usage can now be fetched from fielddata.
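The statistics that remain can still be retrieved by asking the nodes stats API for specific metric groups, for example:

# Fetch only the OS and filesystem statistics from every node
curl -XGET 'localhost:9200/_nodes/stats/os,fs'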
Elasticsearch 2.x did not see too many releases in comparison to the 1.x series. The last release under the 2.x series was 2.3.4, and after that Elasticsearch 5.0 was released. The following are the most important changes an existing Elasticsearch user must know before adopting the latest releases.
Elasticsearch 5.x requires Java 8, so make sure to upgrade your Java version before getting started with Elasticsearch.
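You can verify the installed version from the command line before proceeding:

# Should report a 1.8.x runtime
java -version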
From a user's perspective, the changes to mappings are the most important ones to know, because a wrong mapping will prevent index creation or can lead to unwanted search results. The following are the most important changes under this category that you need to know.
The string type has been removed in favor of the text and keyword data types. In earlier versions of Elasticsearch, the default mapping for string-based fields looked like the following:
{ "content" : { "type" : "string" } }
Starting from version 5.0, the same will be created using the following syntax:
{ "content" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } } }
This allows you to perform a full-text search on the original content field and to sort and run aggregations on the content.keyword sub-field.
Multi-fields are enabled by default for string-based fields and can cause extra overhead if a user is relying on dynamic mapping generation.
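For example, with the default mapping shown above, a search can match on the analyzed field while sorting on its keyword sub-field; the articles index here is made up:

curl -XGET 'localhost:9200/articles/_search' -d '{
  "query" : { "match" : { "content" : "elasticsearch" } },
  "sort" : [ { "content.keyword" : "asc" } ]
}'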
However, if you want to create an explicit mapping for a string field meant for full-text search, it should be created as shown in the following example:

{
  "content" : {
    "type" : "text"
  }
}
Similarly, a not_analyzed string field now needs to be created using the following mapping:

{
  "content" : {
    "type" : "keyword"
  }
}
On all field data types (except for the deprecated string field), the index property now only accepts true/false instead of not_analyzed/no.
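As a small sketch, a field that should be kept in _source but never searched (the field name session_token is made up) can now be mapped like this:

{
  "session_token" : {
    "type" : "keyword",
    "index" : false
  }
}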
Earlier, the default data type for decimal fields used to be double, but now it has been changed to float.
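If you rely on the full 64-bit precision, you can still map such fields explicitly (the field name price is made up):

{
  "price" : {
    "type" : "double"
  }
}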
Numeric fields are now indexed with a completely different data structure, called the BKD tree, which is expected to require less disk space and be faster for range queries. We will look at this in more detail when discussing Lucene 6 in upcoming chapters.
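Nothing changes on the query side; a plain range query such as the following (against a hypothetical products index) simply runs faster on the new structure:

curl -XGET 'localhost:9200/products/_search' -d '{
  "query" : {
    "range" : { "price" : { "gte" : 10, "lte" : 20 } }
  }
}'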
Similar to numeric fields, the geo_point field now also uses the new BKD tree structure, and the following field parameters for geo_point fields are no longer supported: geohash, geohash_prefix, geohash_precision, and lat_lon. Geohashes are still supported from an API perspective, and can still be accessed using the .geohash field extension, but they are no longer used to index geo point data.
For example, in previous versions of Elasticsearch, the mapping of a geo_point field could look like the following:

"location" : {
  "type" : "geo_point",
  "lat_lon" : true,
  "geohash" : true,
  "geohash_prefix" : true,
  "geohash_precision" : "1m"
}
But, starting from Elasticsearch version 5.0, you can only create the mapping of a geo_point field as shown in the following:

"location" : {
  "type" : "geo_point"
}
The following are some very important additional changes you should be aware of:
Running Elasticsearch in production mode now enforces bootstrap checks; among other things, the kernel setting vm.max_map_count must be raised to at least 262144. Please note that if you are using OpenVZ virtualization on your servers, you may find it difficult to set the maximum map count, as this virtualization does not easily allow you to edit kernel parameters. You should either speak to your sysadmin to configure vm.max_map_count correctly, or move to a platform where you can set it, for example, a KVM-based VPS.
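On platforms where you can edit kernel parameters, the setting can be applied as follows (262144 is the minimum value Elasticsearch 5.x expects):

sudo sysctl -w vm.max_map_count=262144
# Persist the setting across reboots:
echo 'vm.max_map_count=262144' | sudo tee -a /etc/sysctl.conf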
The _optimize endpoint, which was deprecated in 2.x, has finally been removed and replaced by the force merge API. For example, an optimize request in version 1.x...

curl -XPOST 'http://localhost:9200/test/_optimize?max_num_segments=5'
...should be converted to:
curl -XPOST 'http://localhost:9200/test/_forcemerge?max_num_segments=5'
In addition to these changes, some major changes have been done in search, settings, allocation, merge, and scripting modules, along with cat and Java APIs, which we will cover in subsequent chapters.