2. Well Hello There! I am Arno Broekhof
Data Engineer (full stack) @Dataworkz
Working with Elasticsearch since 2011
Dutch National Police
3. History of Elasticsearch
• Created by Shay Banon
• Compass
• Elasticsearch == Compass 3.0
• First release in February 2010
• Abstraction layer on top of Lucene
6. Not a database
• Persistency
• Consistency
• Security
• SELECT * FROM pet WHERE name LIKE 'b%';
• Total amount of data < 512GB
7. Shard Sizing
“Too Many Shards or the Gazillion Shards Problem”
• A shard is a Lucene index under the covers, which uses file handles, memory, and CPU cycles.
• Every search request needs to hit a copy of every shard in the index. That’s fine if every shard is
sitting on a different node, but not if many shards have to compete for the same resources.
• Term statistics, used to calculate relevance, are per shard. Having a small amount of data in many
shards leads to poor relevance.
8. How many shards?
• 1,000,000 documents
• Index of 256 GB
• 6 nodes
• Each node has 8 cores and a 30 GB heap
256 GB / (80% of one node's 30 GB heap = 24 GB) ≈ 10 to 11 shards
curl -XGET http://localhost:9200/_cat/indices
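The rule of thumb on this slide can be written out as a small calculation. This is only a sketch of the slide's arithmetic, not an official sizing formula; the class and method names (ShardSizing, estimateShards) are made up for illustration.

```java
final class ShardSizing {
    /**
     * Slide rule of thumb: index size divided by the usable heap of one node.
     * Illustrative helper only; this is not an Elasticsearch API.
     */
    static double estimateShards(double indexSizeGb, double heapPerNodeGb, double usableHeapFraction) {
        double usableHeapGb = heapPerNodeGb * usableHeapFraction; // 30 GB * 0.8 = 24 GB
        return indexSizeGb / usableHeapGb;                        // 256 GB / 24 GB
    }

    public static void main(String[] args) {
        // Values taken from the slide: 256 GB index, 30 GB heap, 80% usable.
        System.out.printf("%.1f shards%n", estimateShards(256, 30, 0.8));
    }
}
```

The result is a starting point to round to a whole number of shards, not a hard limit.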
9. Disable _source field
Think twice before disabling it. Without _source you lose:
• The update, update_by_query, and reindex APIs.
• On the fly highlighting.
• The ability to reindex from one Elasticsearch index to another,
either to change mappings or analysis,
or to upgrade an index to a new major version.
• The ability to debug queries or aggregations
by viewing the original document used at index time.
• Potentially in the future, the ability to repair index corruption automatically.
10. How many indices?
“remember that there is no rule that limits
your application to using only a single index.”
11. Dynamic Mappings
• Not everything needs to be searchable
"avatarLink": {
  "type": "string",
  "index": "not_analyzed",
  "doc_values": true
},
• Use explicit mappings when possible
{
  "job": "Some job description",
  "date": "1-10-2017"
}
{
  "job": "Some job description",
  "date": "NO_DATE"
}
12. Where is my memory?
{
  "aggs": {
    "players": {
      "terms": {
        "field": "players",
        "size": 10
      },
      "aggs": {
        "other": {
          "terms": {
            "field": "players",
            "size": 5
          }
        }
      }
    }
  }
}
• The aggregation returns the top 10 players and, for each of them, the top 5 supporting players
• Only 50 results come back, but the default depth_first mode builds every player/co-player bucket before pruning
• Minimal effort, maximum memory
13. Where is my memory?
{
  "aggs": {
    "players": {
      "terms": {
        "field": "players",
        "size": 10,
        "collect_mode": "breadth_first"
      },
      "aggs": {
        "other": {
          "terms": {
            "field": "players",
            "size": 5
          }
        }
      }
    }
  }
}
• Use "collect_mode": "breadth_first" where possible
• Trims one level at a time, before sub-aggregations are computed
• Minimal change, maximum performance
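To make the memory difference concrete, here is a toy simulation that counts how many (player, co-player) buckets each collect mode would hold. It is a simplified model under stated assumptions, not how Elasticsearch is implemented: documents are plain lists of names, and CollectModeSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CollectModeSketch {
    /** depth_first model: every (player, co-player) bucket is built before anything is pruned. */
    static int depthFirstBuckets(List<List<String>> docs) {
        Set<String> buckets = new HashSet<>();
        for (List<String> players : docs) {
            for (String p : players) {
                for (String q : players) {
                    buckets.add(p + "|" + q);
                }
            }
        }
        return buckets.size();
    }

    /** breadth_first model: the top level is pruned to topN first, then child buckets are built. */
    static int breadthFirstBuckets(List<List<String>> docs, int topN) {
        // Pass 1: count top-level terms only.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> players : docs) {
            for (String p : players) {
                counts.merge(p, 1, Integer::sum);
            }
        }
        // Keep the topN most frequent players (ties broken alphabetically).
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        Set<String> keep = new HashSet<>(ranked.subList(0, Math.min(topN, ranked.size())));
        // Pass 2: build child buckets only under the surviving players.
        Set<String> buckets = new HashSet<>();
        for (List<String> players : docs) {
            for (String p : players) {
                if (!keep.contains(p)) {
                    continue;
                }
                for (String q : players) {
                    buckets.add(p + "|" + q);
                }
            }
        }
        return buckets.size();
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("a", "b"), List.of("a", "c"), List.of("d", "e"));
        System.out.println("depth_first buckets:   " + depthFirstBuckets(docs));
        System.out.println("breadth_first buckets: " + breadthFirstBuckets(docs, 1));
    }
}
```

The gap grows quickly with the number of distinct players, which is why a one-line change to the request can matter so much.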
14. Where is my data?
public void insert(final JsonArray jsonArray) {
    if (jsonArray.size() == 0) {
        return;
    }
    BulkRequestBuilder bulkRequestBuilder = transportClient.prepareBulk();
    this.setEsRefreshInterval("-1");
    jsonArray.forEach(e -> {
        String id = e.getAsJsonObject().get("name").toString();
        bulkRequestBuilder.add(transportClient.prepareIndex(configuration.getEsIndex(),
                configuration.getEsTypeName()).setSource(e.toString(), XContentType.JSON).setId(id));
    });
    BulkResponse bulkResponse = bulkRequestBuilder.get();
    LOGGER.debug("bulk inserted {} items took: {} with failures: {}",
            bulkResponse.getItems().length, bulkResponse.getTook(), bulkResponse.hasFailures());
}
15. Where is my data?
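public void insert(final JsonArray jsonArray) {
    if (jsonArray.size() == 0) {
        return;
    }
    BulkRequestBuilder bulkRequestBuilder = transportClient.prepareBulk();
    // Refresh is switched off to speed up the bulk load. Until it is switched
    // back on, newly indexed documents stay invisible to search.
    this.setEsRefreshInterval("-1");
    try {
        jsonArray.forEach(e -> {
            // getAsString() yields the bare value; toString() would keep the JSON quotes in the id.
            String id = e.getAsJsonObject().get("name").getAsString();
            bulkRequestBuilder.add(transportClient.prepareIndex(configuration.getEsIndex(),
                    configuration.getEsTypeName()).setSource(e.toString(), XContentType.JSON).setId(id));
        });
        BulkResponse bulkResponse = bulkRequestBuilder.get();
        LOGGER.debug("bulk inserted {} items took: {} with failures: {}",
                bulkResponse.getItems().length, bulkResponse.getTook(), bulkResponse.hasFailures());
    } finally {
        // Restore refresh (assuming the 1s default) so the data becomes searchable.
        this.setEsRefreshInterval("1s");
    }
}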
16. Query or Filter?
Queries -> use them for full-text search, or whenever results need to be scored
(think search results ranked by relevance).
Filters -> are much faster than queries, mainly because they don't score the results.
If you just want to return all of the products that are blue,
or that cost more than €50, use filters!
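As a sketch of how the two combine, the helper below builds a bool query body in which the full-text part is scored ("must") and the blue / price conditions are unscored filters ("filter"). The field names description, color and price are assumptions made for the example, and BoolQuerySketch is a hypothetical helper, not part of any client library.

```java
final class BoolQuerySketch {
    /**
     * Builds a search body with one scored match clause and two unscored filters.
     * Assumes 'text' contains no characters that would need JSON escaping.
     */
    static String blueProductsQuery(String text, int minPrice) {
        return "{\n"
            + "  \"query\": {\n"
            + "    \"bool\": {\n"
            + "      \"must\": [\n"
            + "        { \"match\": { \"description\": \"" + text + "\" } }\n"
            + "      ],\n"
            + "      \"filter\": [\n"
            + "        { \"term\": { \"color\": \"blue\" } },\n"
            + "        { \"range\": { \"price\": { \"gt\": " + minPrice + " } } }\n"
            + "      ]\n"
            + "    }\n"
            + "  }\n"
            + "}";
    }

    public static void main(String[] args) {
        // The printed body could be sent to the _search endpoint of an index.
        System.out.println(blueProductsQuery("running shoes", 50));
    }
}
```

Only the match clause contributes to the relevance score; the filter clauses simply include or exclude documents.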
17. _type == _type
• Use unique types
• Why WordPress post_type == _type is a bad idea
• When deleting a post, the document is identified by both its _id and its _type
18. Search limits
• A search returns 10 results by default
• The maximum result window is 10,000 results
• If you want everything, use the scroll API
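The scroll pattern boils down to a loop: open a scroll with the first search, keep requesting the next batch with the returned scroll id until a batch comes back empty, then clear the scroll. The sketch below shows only that control flow. ScrollBackend, Page and InMemoryBackend are hypothetical stand-ins so it runs without a cluster; they are not Elasticsearch client classes.

```java
import java.util.ArrayList;
import java.util.List;

final class ScrollSketch {
    /** One batch of hits plus the id needed to request the next batch. */
    static final class Page {
        final String scrollId;
        final List<String> hits;

        Page(String scrollId, List<String> hits) {
            this.scrollId = scrollId;
            this.hits = hits;
        }
    }

    /** Hypothetical client abstraction; not a real Elasticsearch class. */
    interface ScrollBackend {
        Page search(int pageSize);    // initial search that opens the scroll
        Page scroll(String scrollId); // next batch for an open scroll
        void clear(String scrollId);  // release the scroll context
    }

    /** Keeps scrolling until a batch comes back empty, then clears the scroll. */
    static List<String> fetchAll(ScrollBackend backend, int pageSize) {
        List<String> all = new ArrayList<>();
        Page page = backend.search(pageSize);
        while (!page.hits.isEmpty()) {
            all.addAll(page.hits);
            page = backend.scroll(page.scrollId);
        }
        backend.clear(page.scrollId);
        return all;
    }

    /** In-memory stand-in so the sketch runs without a cluster. */
    static final class InMemoryBackend implements ScrollBackend {
        private final List<String> docs;
        private int pageSize;
        private int offset;

        InMemoryBackend(List<String> docs) {
            this.docs = docs;
        }

        public Page search(int pageSize) {
            this.pageSize = pageSize;
            this.offset = 0;
            return next();
        }

        public Page scroll(String scrollId) {
            return next();
        }

        public void clear(String scrollId) {
            // Nothing to release in memory.
        }

        private Page next() {
            int end = Math.min(offset + pageSize, docs.size());
            List<String> batch = new ArrayList<>(docs.subList(offset, end));
            offset = end;
            return new Page("scroll-1", batch);
        }
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            docs.add("doc-" + i);
        }
        System.out.println(fetchAll(new InMemoryBackend(docs), 10).size());
    }
}
```

Against a real cluster the same loop applies, with the backend methods replaced by the initial search request, the scroll request and the clear-scroll request.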
19. We have a distributed search engine, nodes can fail!
• We have shard replicas
• Single master
• Use dedicated masters
21. What does the future bring?
• Java Transport Client is deprecated, REST is the way to go
• Cross Cluster Searches
• Index sorting during indexing
• Only one type can exist per index
• Better use of transaction logs
• Sparse Doc Values