Intro to elasticsearch

Your Data,
Your Search !
问志光
2016-06-27

Outline
 Information retrieval
 Indexing & Searching
 Elasticsearch

Information retrieval
 Information Retrieval (IR) is finding
material (usually documents) of an unstructured
nature (usually text) that satisfies an information
need from within large collections (usually stored
on computers).
 A search engine is a software system
designed to search for information. It is one
implementation of IR.

What is a search engine?
 A search engine is
 An indexing engine for documents
 A query engine over those indexes
 A search engine is more powerful for
searching:
it is designed for it!

Problems?
 How to store the data?
 How to index the data?
 How to search the data?

How to store the data?
INVERTED LIST (also called an inverted index)

Consider the following two files
 File1: Students should be allowed to go
out with their friends, but not allowed
to drink beer.
 File2: My friend Jerry went to school to
see his students but found them drunk
which is not allowed.

Step 1: Tokenizer
 Split the document into words
 Remove punctuation
 Remove stop words (the, a, this, that, etc.)
“Students”, “allowed”, “go”, “their”,
“friends”, “allowed”, “drink”, “beer”, “My”,
“friend”, “Jerry”, “went”, “school”, “see”,
“his”, “students”, “found”, “them”, “drunk”,
“allowed”
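The tokenizer step can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene implements it: the function name `tokenize` is made up here, and the stop-word list is an assumption chosen so that the output reproduces the token list on the slide (the slide itself only names "the, a, this, that etc.").

```python
import re

# Assumed stop-word list: the slide only names "the, a, this, that etc.",
# so the remaining entries are chosen to reproduce the slide's token list.
STOP_WORDS = {
    "the", "a", "this", "that", "should", "be", "to", "out",
    "with", "but", "not", "which", "is",
}


def tokenize(text):
    """Split text into words, dropping punctuation and stop words."""
    words = re.findall(r"[A-Za-z]+", text)  # letters only, so punctuation is dropped
    return [w for w in words if w.lower() not in STOP_WORDS]


file1 = ("Students should be allowed to go out with their friends, "
         "but not allowed to drink beer.")
file2 = ("My friend Jerry went to school to see his students "
         "but found them drunk which is not allowed.")

tokens = tokenize(file1) + tokenize(file2)
print(tokens)
```

Case is kept at this stage on purpose; lowercasing belongs to the next step.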

Step 2: Linguistic Processor
 Lowercase
 Stemming: cars -> car, etc.
 Lemmatization: drove -> drive, etc.
“student”, “allow”, “go”, “their”, “friend”,
“allow”, “drink”, “beer”, “my”, “friend”,
“jerry”, “go”, “school”, “see”, “his”,
“student”, “find”, “them”, “drink”, “allow”
The resulting words are called terms.
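As a rough sketch of the linguistic processing step, the snippet below lowercases each token and then maps it to a root form. The two lookup tables are hypothetical stand-ins for a real stemmer and lemmatizer and cover only the words of the two example files; a production analyzer would use an algorithm or dictionary instead.

```python
# Hypothetical lookup tables standing in for a real stemmer and lemmatizer;
# they cover only the words of the two example files.
STEMS = {"students": "student", "friends": "friend", "allowed": "allow"}
LEMMAS = {"went": "go", "found": "find", "drunk": "drink"}


def normalize(token):
    """Lowercase a token, then reduce it to its root form."""
    word = token.lower()
    word = STEMS.get(word, word)    # stemming: strip the suffix
    return LEMMAS.get(word, word)   # lemmatization: map irregular forms


tokens = ["Students", "allowed", "go", "their", "friends", "allowed", "drink",
          "beer", "My", "friend", "Jerry", "went", "school", "see", "his",
          "students", "found", "them", "drunk", "allowed"]
terms = [normalize(t) for t in tokens]
print(terms)
```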

Step 3: Index
Term       Document ID
student    1
allow      1
go         1
their      1
friend     1
allow      1
…          …
 Build a dictionary of (term, document ID) pairs
 Sort the dictionary by term
 Merge identical terms into posting lists
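The indexing step can be illustrated with a small in-memory inverted index built from the terms of the two example files. This is only a sketch: `build_index` is a made-up name, and storing a term frequency next to each document ID is an assumption about what a posting holds, not a description of Lucene's on-disk format.

```python
from collections import defaultdict

# Terms per document ID, as produced by the tokenizer and linguistic processor.
docs = {
    1: ["student", "allow", "go", "their", "friend", "allow", "drink", "beer"],
    2: ["my", "friend", "jerry", "go", "school", "see", "his",
        "student", "find", "them", "drink", "allow"],
}


def build_index(docs):
    """Return {term: [(doc_id, term_frequency), ...]} with terms in sorted order."""
    postings = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    # Sort the dictionary by term and each posting list by document ID.
    return {term: sorted(postings[term].items()) for term in sorted(postings)}


index = build_index(docs)
for term, posting_list in index.items():
    print(term, posting_list)
```

Looking up a term is then a single dictionary access, for example `index["allow"]`.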

Step 1: User search query
• Suppose you have the following query:
lucene AND learned NOT hadoop

Step 2: Lexical & Syntax Analysis
 Identify words and keywords
 Words: lucene, learned, hadoop
 Keywords: AND, NOT
 Build a syntax tree
          NOT
         /   \
       AND   hadoop
      /   \
 lucene   learned
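To make the analysis step concrete, here is a minimal sketch that turns the example query into a nested tuple tree. It assumes a flat query of the form word, keyword, word, and so on, evaluated left to right, and it treats NOT as a binary "and not" operator, which matches the tree on the slide. The function name `parse` is hypothetical, and a real query parser handles far more syntax.

```python
KEYWORDS = {"AND", "NOT"}


def parse(query):
    """Parse a flat boolean query left to right into a nested tuple tree."""
    tokens = query.split()
    tree = tokens[0]
    # After the first word, tokens come in (keyword, word) pairs.
    for op, word in zip(tokens[1::2], tokens[2::2]):
        if op not in KEYWORDS:
            raise ValueError(f"expected a keyword, got {op!r}")
        tree = (op, tree, word)
    return tree


tree = parse("lucene AND learned NOT hadoop")
print(tree)
```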

Step 3: Search
 Look up each term in the inverted list
 Merge the sorted posting lists (conjunction, disjunction)
 Score the matching documents
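The search step can be sketched by evaluating a syntax tree against posting lists. The posting lists below are hypothetical document IDs invented for illustration, the tree is written out by hand as nested tuples, and scoring is left out entirely; the sketch only shows how conjunction, disjunction, and NOT combine sorted lists.

```python
# Hypothetical posting lists (term -> sorted document IDs), for illustration only.
POSTINGS = {
    "lucene": [1, 2, 4, 7],
    "learned": [2, 4, 5, 7],
    "hadoop": [4, 9],
}


def intersect(a, b):
    """Conjunction: walk two sorted lists in step, keeping common IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """Disjunction: IDs present in either list, kept sorted."""
    return sorted(set(a) | set(b))


def difference(a, b):
    """NOT: IDs in a that are absent from b."""
    excluded = set(b)
    return [doc_id for doc_id in a if doc_id not in excluded]


OPS = {"AND": intersect, "OR": union, "NOT": difference}


def evaluate(tree):
    """Evaluate a nested (operator, left, right) tree bottom-up."""
    if isinstance(tree, str):
        return POSTINGS.get(tree, [])
    op, left, right = tree
    return OPS[op](evaluate(left), evaluate(right))


result = evaluate(("NOT", ("AND", "lucene", "learned"), "hadoop"))
print(result)
```

Because the posting lists are sorted, the conjunction runs in a single pass over both lists instead of comparing every pair of IDs.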

Elasticsearch
 Full-text search
 Real-time search and analytics engine
 Open source
 High availability
 Schema free
 JSON over HTTP
 RESTful API
 Lucene based
 Distributed

Elasticsearch
 Distributed and highly available search engine.
 Each index is fully sharded with a configurable number of shards.
 Each shard can have one or more replicas.
 Read / search operations can be performed on any of the replica shards.
 Multi Tenant with Multi Types.
 Support for more than one index.
 Support for more than one type per index.
 Index level configuration (number of shards, index storage, ...).
 Document oriented
 No need for upfront schema definition.
 Schema can be defined per type for customization of the indexing process.
 A varied set of APIs
 HTTP RESTful API
 Native Java API.
 All APIs perform automatic node operation rerouting.
 (Near) Real Time Search.
 Reliable, asynchronous write-behind for long-term persistence.
 Built on top of Lucene
 Each shard is a fully functional Lucene index
 All the power of Lucene easily exposed through simple configuration / plugins.
 Per operation consistency
 Single document level operations are atomic, consistent, isolated and durable.
 Open Source under the Apache License, version 2 ("ALv2")

Elasticsearch terminology
 Cluster
 Node
 Index
 Shard

Cluster
● A cluster is a collection of one or more
nodes (servers) that together hold your
entire data and provide federated indexing
and search capabilities across all nodes
● A cluster is identified by a unique name,
which by default is "elasticsearch"

Node
● A node is an Elasticsearch instance (a Java process)
● A node is created when an Elasticsearch instance is
started
● By default, a node is given the name of a random
Marvel character

Index
● An index is a collection of documents that have
somewhat similar characteristics, e.g. customer data or a
product catalog
● Its name is used to refer to it when performing indexing, search, update,
and delete operations against the documents in it
● You can define as many indexes as you want in a single cluster

Document
● A document is the basic unit of information that can be
indexed
● It is expressed as JSON (key: value pairs), e.g.
{"user": "nullcon"}
● Every document is associated with a type and a
unique ID.

Shard
● Every index can be split into multiple shards so
that its data can be distributed.
● The shard is the atomic part of an index, which
can be distributed over the cluster if you add
more nodes.
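How a document ends up on a particular shard can be sketched as a hash of its routing value (by default the document ID) taken modulo the number of primary shards. The snippet below is only an illustration of that idea: `pick_shard` is a made-up name, and `zlib.crc32` is a stand-in hash chosen because it is deterministic in Python, not the hash function Elasticsearch uses.

```python
import zlib


def pick_shard(doc_id, number_of_shards):
    """Map a document ID to a shard number in [0, number_of_shards)."""
    # zlib.crc32 is a stand-in hash for this sketch, not what Elasticsearch uses.
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


# Distribute six hypothetical document IDs over five shards.
shards = {n: [] for n in range(5)}
for doc_id in ["1", "2", "3", "4", "5", "6"]:
    shards[pick_shard(doc_id, 5)].append(doc_id)

print(shards)
```

The modulo also shows why the number of primary shards is fixed when an index is created: changing it would send existing IDs to different shards.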

A terminology comparison
Relational database    Elasticsearch
Database               Index
Table                  Type
Row                    Document
Column                 Field
Schema                 Mapping
Index                  Everything is indexed
SQL                    Query DSL
SELECT * FROM tb …     GET http://
UPDATE tb SET …        PUT http://

Playing with Elasticsearch
REST API:
http://host:port/[index]/[type]/[_action/id]
HTTP methods: GET, POST, PUT, DELETE

Playing with Elasticsearch
• Search
– curl -XGET http://localhost:9200/my_index/test/_search
– curl -XGET http://localhost:9200/my_index/_search
– curl -XGET http://localhost:9200/_search
• Meta Data
– curl -XGET http://localhost:9200/my_index/_status
• Documents:
– curl -XPUT http://localhost:9200/my_index/test/1
– curl -XGET http://localhost:9200/my_index/test/1
– curl -XDELETE http://localhost:9200/my_index/test/1

Example: Index
curl -XPUT http://localhost:9200/my_index/test/1 -d '{
  "name": "joeywen",
  "value": 100
}'
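The same indexing request can be prepared from Python with the standard library. This is a sketch under the assumption that a node is listening on localhost:9200, as in the curl example; `build_index_request` is a hypothetical helper, and the request is only built here, with the sending step left commented out.

```python
import json
import urllib.request


def build_index_request(url, document):
    """Build (but do not send) a PUT request carrying the document as JSON."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_index_request(
    "http://localhost:9200/my_index/test/1",
    {"name": "joeywen", "value": 100},
)
print(request.get_method(), request.full_url)

# To actually send it (requires a node listening on localhost:9200):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```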

Example: Search
curl -XGET http://localhost:9200/my_index/_search -d '{
  "query": {
    "match_all": {}
  }
}'
The response reports:
Search time (took, in milliseconds)
Total number of docs (hits.total)
Max score (hits.max_score)
Relevance of each hit (_score)
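To show where those values sit in a search response, the sketch below parses a hand-written sample and pulls them out. The sample is illustrative and was not captured from a real cluster; it assumes the response shape of the type-based API used in these slides, where `hits.total` is a plain number. The helper name `summarize` is made up.

```python
import json

# Hand-written sample in the shape of a search response; the values are
# illustrative, not captured from a real cluster.
SAMPLE_RESPONSE = """
{
  "took": 3,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [
      {"_index": "my_index", "_type": "test", "_id": "1", "_score": 1.0,
       "_source": {"name": "joeywen", "value": 100}}
    ]
  }
}
"""


def summarize(response):
    """Extract search time, total hits, max score, and per-hit relevance."""
    hits = response["hits"]
    return {
        "search_time_ms": response["took"],
        "total_docs": hits["total"],
        "max_score": hits["max_score"],
        "scores": [(hit["_id"], hit["_score"]) for hit in hits["hits"]],
    }


summary = summarize(json.loads(SAMPLE_RESPONSE))
print(summary)
```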

Creating, indexing, or deleting a single document
