3. ElasticSearch
● Elasticsearch is a search engine based on the Apache Lucene
library.
● Open Code Business Model
● REST-based
● Distributed
● One of the most popular enterprise search engines
● Used by Netflix, LinkedIn, Amazon, Oracle and many other big names
4. Elastic (ELK) Stack
The Beats are lightweight data shippers, written in
Go, that run on your servers to capture all sorts of
operational data (logs, metrics, or network packet
data). Beats send the operational data to
Elasticsearch, either directly or via Logstash.
Logstash is a server-side data processing
pipeline that ingests data from a multitude of
sources, transforms it, and then sends it to your
favorite "stash."
Kibana is a browser-based analytics and
search dashboard for Elasticsearch.
Elasticsearch is the distributed, RESTful search engine.
5. How do Elasticsearch and Lucene Differ?
Just as a car (ES) differs from its engine (Lucene).
ES makes use of Lucene to manage the indices.
Lucene is a Java library. You can include it in your project and refer to its functions using function calls.
Elasticsearch is a JSON-based, distributed web server built over Lucene. Although Lucene does the actual work
underneath, Elasticsearch provides a convenient layer over it. Each shard created in Elasticsearch is a separate
Lucene instance. To summarize:
1. Elasticsearch is built over Lucene and provides a JSON based REST API to refer to Lucene features.
2. Elasticsearch provides a distributed system on top of Lucene. A distributed system is not something Lucene is
aware of or built for. Elasticsearch provides this abstraction of distributed structure.
3. Elasticsearch provides other supporting features like thread-pool, queues, node/cluster monitoring API, data
monitoring API, Cluster management, etc.
7. Indexing
● Elasticsearch is able to achieve low
latency in responses because, instead of
searching the text directly, it searches in
an index instead.
● Document: the basic unit of data in ES
● Inverted Index (like at the back of a book)
○ Created by tokenizing the terms in
each document
○ Create a sorted list of all unique
terms (terms are normalized,
stemmed etc.)
○ Associate each term with the list of
documents where it can be found
○ Similar to the index at the back of a
book
Doc1: I am learning the cool stuff
Doc2: I am learning to learn
Inverted Index:
Am -> [Doc1, Doc2]
Cool -> [Doc1]
I -> [Doc1, Doc2]
Learn -> [Doc1, Doc2] // root form of learning
the -> [Doc1]
…
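The construction above can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene does it: the `normalize` helper is a hypothetical toy stemmer that only lowercases and strips a trailing "ing", which happens to be enough for the two example documents.

```python
import re
from collections import defaultdict

def normalize(token):
    """Toy normalizer (an assumption for this sketch): lowercase, strip a trailing 'ing'."""
    token = token.lower()
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
    return token

def build_inverted_index(docs):
    """Map each normalized term to the sorted list of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[A-Za-z]+", text):
            postings[normalize(token)].add(doc_id)
    # Sort the terms, and the document list of each term.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

docs = {
    "Doc1": "I am learning the cool stuff",
    "Doc2": "I am learning to learn",
}
index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, "->", doc_ids)
```

A lookup such as `index["learn"]` then returns the matching documents without scanning any document text.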
8. Retrieving
● Term Frequency (TF)
○ Frequency of term in given
document
● Document Frequency (DF)
○ Frequency of the term across all
documents
● IDF (Inverse Document
Frequency)
○ IDF = 1 / DF (a simplification; real
scoring uses a logarithmic form)
● Relevance
○ Relevance = TF * IDF
○ Relevance = TF / DF
Search Term: learn
TF1 = 1 (Doc1)
TF2 = 2 (Doc2)
DF = TF1 + TF2 = 3
IDF = 1 / DF = ⅓
Rev1 = TF1 * IDF = ⅓
Rev2 = TF2 * IDF = ⅔
Rev2 > Rev1, so Doc2 ranks above Doc1
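The arithmetic of this example can be reproduced with a short Python sketch. It follows the slide's simplified model, taking DF as the total number of occurrences of the term across all documents (which is what IDF = ⅓ implies here), and is not the scoring Elasticsearch itself uses. The `stem` helper is a hypothetical toy that maps "learning" to "learn".

```python
from fractions import Fraction

def stem(token):
    """Toy stemmer (an assumption for this sketch): lowercase, strip a trailing 'ing'."""
    token = token.lower()
    return token[:-3] if token.endswith("ing") and len(token) > 5 else token

def term_frequency(term, text):
    """Count how often the stemmed term occurs in one document."""
    return sum(1 for token in text.split() if stem(token) == term)

def relevance_scores(term, docs):
    """Relevance = TF * IDF = TF / DF, with DF summed over all documents."""
    tf = {doc_id: term_frequency(term, text) for doc_id, text in docs.items()}
    df = sum(tf.values())
    if df == 0:
        return {doc_id: Fraction(0) for doc_id in docs}
    return {doc_id: Fraction(count, df) for doc_id, count in tf.items()}

docs = {
    "Doc1": "I am learning the cool stuff",
    "Doc2": "I am learning to learn",
}
scores = relevance_scores("learn", docs)
print(scores["Doc1"], scores["Doc2"])  # prints: 1/3 2/3
```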
10. Node Structure
● Index - Logical Namespace of collection of documents
● Shard - Horizontal Partition of an Index
○ E.g., documents 1-10 in one shard, 11-20 in another, and so on.
○ In Elasticsearch, each Shard is a self-contained Lucene index in itself.
11. Cluster Structure
[Diagram: Node 1 (P1, R4), Node 2 (P2, R1), Node 3 (P3, R2), Node 4 (P4, R3)]
● Here we can see a cluster of 4
nodes
● Each node has 2 shards
● Primary and Replica shards
● For robustness and fault
tolerance, each shard is replicated
● Even if a node goes down, and a
primary shard is lost, a replica can
be made primary until recovery
● Number of primary shards is fixed at
index creation; the number of replica
shards can be changed later
● Writes go to the primary and are
repeated on the replicas; reads can
be served by either
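The failover behaviour described above can be sketched as follows. The layout is an assumption read off the diagram (each node holds one primary and the replica of a different shard), and `fail_node` is a hypothetical helper, not an Elasticsearch API. Re-creating the lost replica copies on the surviving nodes is left out.

```python
# Assumed layout: node name -> shard copies on that node.
cluster = {
    "node1": ["P1", "R4"],
    "node2": ["P2", "R1"],
    "node3": ["P3", "R2"],
    "node4": ["P4", "R3"],
}

def fail_node(cluster, failed):
    """Return the layout after `failed` goes down, promoting replicas of lost primaries."""
    lost = cluster[failed]
    survivors = {node: list(shards) for node, shards in cluster.items() if node != failed}
    for shard in lost:
        if not shard.startswith("P"):
            continue  # a lost replica needs no promotion; its primary is still alive
        replica = "R" + shard[1:]
        for shards in survivors.values():
            if replica in shards:
                shards[shards.index(replica)] = shard  # replica becomes the new primary
    return survivors

after_failure = fail_node(cluster, "node1")
print(after_failure)
```

In this sketch, losing node1 turns the copy of shard 1 on node2 into the primary, so every shard still has a primary and the index stays available.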
12. Types of Nodes
● Master Node
○ Cluster-wide operations (creating and deleting indexes, keeping track of
nodes, assigning shards, health checks etc.)
● Data Node
○ Holds the data and the index
● Client Node
○ Acts as a load balancer (neither a data node nor a master node)
14. Breaking a shard into Segments
● For ES the basic unit of storage is a shard
● For Lucene the basic unit of storage is a segment
● Each segment is an inverted index
● New documents are added to a new segment
● Segments are first created in memory and later persisted to
disk
● Segments are immutable
15. Coordination Stage
● shard_number = hash(document_id) % (num_of_primary_shards)
● All nodes know where a shard exists
● The document is passed to the node that holds the shard with that shard_number
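The routing formula can be tried out with a small Python sketch. `zlib.crc32` is only a stand-in for the hash function (chosen here because, unlike Python's built-in `hash()`, it gives the same value on every run), and `shard_for` is a hypothetical helper name.

```python
import zlib

def shard_for(document_id, num_primary_shards):
    """shard_number = hash(document_id) % num_of_primary_shards, with crc32 as a stand-in hash."""
    return zlib.crc32(document_id.encode("utf-8")) % num_primary_shards

doc_ids = ["doc-1", "doc-2", "doc-3", "doc-4", "doc-5"]
routing = {doc_id: shard_for(doc_id, 4) for doc_id in doc_ids}
print(routing)
```

Because the result depends on the number of primary shards, the same document ID can map to a different shard if that number changes, which is one reason the primary shard count is chosen up front.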
17. Translog and Memory Buffer
● The request is written to the translog
● The document is added to the memory buffer (which stores all the newly indexed documents)
● If the request succeeds on the primary shard, it is sent in parallel to the replica shards.
● In-sync shards are the shard copies that are always in sync with the primary
● The client receives an acknowledgement that the request was successful only after the translog is fsync’ed on the
primary and all in-sync shards.
18. Refresh Operation
● In Elasticsearch, the _refresh operation is set to be executed every second by default.
● During this operation, the contents of the in-memory buffer are copied to a newly created segment in memory.
● As a result, new data becomes available for search.
19. Flush Operation
● Flush essentially means that all the documents in the in-memory buffer are written to new Lucene
segments.
● These, along with all existing in-memory segments, are committed to the disk, which clears the
translog. This commit is essentially a Lucene commit.
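The interplay of translog, memory buffer, refresh and flush can be modelled with a toy class. This is a sketch of the lifecycle described on these slides, with plain Python lists standing in for the translog, segments and disk; `ShardSketch` and its methods are hypothetical names, not Elasticsearch or Lucene APIs.

```python
class ShardSketch:
    def __init__(self):
        self.translog = []         # operations logged since the last flush
        self.memory_buffer = []    # newly indexed documents, not yet searchable
        self.memory_segments = []  # searchable segments that are not committed yet
        self.disk_segments = []    # committed segments

    def index(self, doc):
        self.translog.append(doc)       # log the request first
        self.memory_buffer.append(doc)  # then buffer the document

    def refresh(self):
        """Copy the buffer into a new in-memory segment, making it searchable."""
        if self.memory_buffer:
            self.memory_segments.append(tuple(self.memory_buffer))  # immutable segment
            self.memory_buffer.clear()

    def flush(self):
        """Write the buffer to a segment, commit all segments, clear the translog."""
        self.refresh()
        self.disk_segments.extend(self.memory_segments)
        self.memory_segments.clear()
        self.translog.clear()

    def search(self, word):
        segments = self.memory_segments + self.disk_segments
        return [doc for segment in segments for doc in segment if word in doc.split()]

shard = ShardSketch()
shard.index("I am learning to learn")
before_refresh = shard.search("learn")  # buffer is not searchable yet
shard.refresh()
after_refresh = shard.search("learn")   # the new segment is searchable
shard.flush()
print(before_refresh, after_refresh, shard.translog)
```

The sketch shows why a freshly indexed document only shows up in search results after the next refresh, while durability in the meantime rests on the translog.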
21. Elasticsearch Delete
● Documents in Elasticsearch are immutable and hence cannot be deleted or modified in place to
represent any changes.
● Every segment on disk has a .del file associated with it.
● When a delete request is sent, the document is not really deleted, but marked as deleted
in the .del file.
● This document may still match a search query but is filtered out of the results.
● When segments are merged, the documents marked as deleted in the .del file are not
included in the new merged segment.
22. Elasticsearch Update
● When a new document is created, Elasticsearch assigns a version number to that
document.
● Every change to the document results in a new version number.
● When an update is performed, the old version is marked as deleted in the .del file and
the new version is indexed in a new segment.
● The older version may still match a search query; however, it is filtered out of the
results.
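Delete and update as described on the last two slides can be sketched with a toy store. A Python set stands in for the .del files, and `SegmentStore` is a hypothetical name; the real on-disk format is different.

```python
class SegmentStore:
    def __init__(self):
        self.segments = []    # each segment: list of (doc_id, version, body), never edited in place
        self.deleted = set()  # stand-in for .del files: (doc_id, version) pairs marked as deleted
        self.versions = {}    # doc_id -> latest version number

    def index(self, doc_id, body):
        """Create or update: mark the old version as deleted, index the new one in a new segment."""
        old = self.versions.get(doc_id)
        if old is not None:
            self.deleted.add((doc_id, old))
        version = (old or 0) + 1
        self.versions[doc_id] = version
        self.segments.append([(doc_id, version, body)])
        return version

    def delete(self, doc_id):
        """Only mark the latest version as deleted; the segment itself is untouched."""
        version = self.versions.get(doc_id)
        if version is not None:
            self.deleted.add((doc_id, version))

    def search(self, word):
        """Marked entries may still match, but are filtered out of the results."""
        return [entry for segment in self.segments for entry in segment
                if word in entry[2].split() and (entry[0], entry[1]) not in self.deleted]

    def merge(self):
        """Merge all segments into one, leaving out entries marked as deleted."""
        live = [entry for segment in self.segments for entry in segment
                if (entry[0], entry[1]) not in self.deleted]
        self.segments = [live]
        self.deleted.clear()

store = SegmentStore()
store.index("1", "learning cool stuff")
store.index("1", "learning to learn")   # update: version 1 is marked as deleted
store.index("2", "cool stuff")
hits = store.search("learning")         # only version 2 of document "1"
store.delete("1")
store.merge()                           # deleted entries are dropped for good
print(hits, store.segments)
```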
24. ElasticSearch Read
● In the query phase, the coordinating node routes the search request to a copy (primary or
replica) of every shard in the index.
● The shards perform search independently and create a set of results sorted by
relevance score.
● All the shards return the document IDs of the matched documents and their relevance
scores to the coordinating node.
● By default, each shard sends the top 10 results to the coordinating node
● The coordinating node sorts the results globally, and creates a list of the top 10 hits.
● In the fetch phase, the coordinating node requests the original documents from the shards
that hold them. These shards enrich the documents and return them to the coordinating node.
● Results are aggregated and sent to the clients
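The two phases of a read can be sketched as a scatter-gather over in-memory shards. Each shard here is just a dict of document ID to text, and the score is a plain term count; both are assumptions made for the sketch, as are the function names.

```python
import heapq

def query_phase(shard, term, size=10):
    """Each shard scores its own documents and returns (score, doc_id) for its top hits."""
    scored = ((text.lower().split().count(term), doc_id) for doc_id, text in shard.items())
    return heapq.nlargest(size, (hit for hit in scored if hit[0] > 0))

def search(shards, term, size=10):
    # Query phase: collect each shard's top hits (IDs and scores only).
    candidates = []
    for shard_no, shard in enumerate(shards):
        for score, doc_id in query_phase(shard, term, size):
            candidates.append((score, doc_id, shard_no))
    # The coordinating node sorts globally and keeps the overall top hits.
    top = heapq.nlargest(size, candidates)
    # Fetch phase: request the original documents from the shards that hold them.
    return [(doc_id, score, shards[shard_no][doc_id]) for score, doc_id, shard_no in top]

shards = [
    {"Doc1": "learn the cool stuff"},
    {"Doc2": "learn to learn", "Doc3": "nothing relevant here"},
]
results = search(shards, "learn")
print(results)
```

Only IDs and scores travel in the first phase, so the full documents are fetched for the final top hits alone rather than for every per-shard candidate.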