99 Problems, But The Search Ain't One

Andrei Zmievski • PHP UK • Feb 25, 2011

who am I?
curl http://localhost:9200/speaker/info/andrei

{"name": "Andrei Zmievski",
 "projects": ["PHP", "PHP-GTK", "Smarty", "Unicode/i18n"],
 "likes": ["coding", "beer", "brewing", "photography"],
 "twitter": "@a",
 "email": "andrei@zmievski.org"}

what is elasticsearch?

a search engine for the NoSQL generation

domain-driven

distributed

RESTful

Hitchhiker’s Guide to the Galaxy (no, really)

document model

document-oriented

JSON-based

schema-free

engine

based on Lucene

multi-tenancy

distributed, out of the box

nomenclature

index

type

document

_id

node

1. index
request

curl -XPOST http://localhost:9200/conf/speaker/1 -d '
{
    "name": "Andrei Zmievski",
    "talk": "99 Problems, But the Search Ain'\''t One",
    "likes": ["coding", "beer", "photography"],
    "twitter": "a",
    "height": 1…
}'

response

{
    "ok": true,
    "_index": "conf",
    "_type": "speaker",
    "_id": "1"
}

2. search
request

curl http://localhost:9200/conf/speaker/_search?q=beer

response

{ "took" : …,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.…90…09,
    "hits" : [ {
      "_index" : "conf",
      "_type" : "speaker",
      "_id" : "2",
      "_score" : 0.…90…09,
      "_source" :
{
    "name": "Andrei Zmievski",
    "talk": "99 Problems, But the Search Ain't One",
    "likes": ["coding", "beer", "photography"],
    "twitter": "a",
    "height": 1…
} } ] } }

2. search
response fields

took: the execution time
hits.total: total number of hits
_index: the index of the doc
_type: the type of the doc
_id: the id of the doc
_score: the hit score
_source: the original source

3. profit

that’s up to you

distributed model

provides:

performance

resiliency (high-availability)

shards
a portion of the document space

each one is a separate Lucene index

thus, many per-index settings are available

document is sharded by its _id value

but can be assigned (routed) to a shard
deterministically
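
A minimal sketch of the routing idea above, under stated assumptions: CRC32 stands in for whatever hash elasticsearch really uses, and `shard_for` with its parameters is a hypothetical name chosen for illustration. It only shows that hashing the _id (or an explicit routing value) modulo the shard count gives a deterministic shard.

```python
import zlib


def shard_for(doc_id, num_shards, routing=None):
    """Pick a shard number for a document (illustrative only).

    By default the _id is hashed. If a routing value is given, it is
    hashed instead, so documents sharing that value land on one shard.
    CRC32 is a stand-in, not elasticsearch's actual hash function.
    """
    key = routing if routing is not None else doc_id
    return zlib.crc32(key.encode("utf-8")) % num_shards


# The same _id always maps to the same shard.
first = shard_for("1", num_shards=2)
second = shard_for("1", num_shards=2)

# Two different documents routed by the same value share a shard.
same_shard = shard_for("1", 2, routing="andrei") == shard_for("2", 2, routing="andrei")
```

Because the result depends on the shard count, this kind of scheme also hints at why the number of shards is fixed when the index is created while the number of replicas is not.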

zero-conf discovery

zen (multicast and unicast)

cloud (EC2 via API)

auto-routing

master node:

maintains cluster state

reassigns shards if nodes leave/join cluster

any node can serve as the request router

the query is handled via scatter-gather mechanism
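
The scatter-gather step can be sketched roughly as follows. This is a toy model, not elasticsearch code: each shard is a plain dict, the score is just a term count, and `search_shard` / `scatter_gather` are hypothetical names.

```python
import heapq


def search_shard(shard_docs, term):
    """Score the documents of one shard; here score = occurrences of the term."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term)
        if score:
            hits.append((score, doc_id))
    return hits


def scatter_gather(shards, term, size=10):
    """Send the query to every shard (scatter), then merge the partial
    results and keep the best `size` hits (gather)."""
    partial_results = [search_shard(docs, term) for docs in shards]
    merged = (hit for hits in partial_results for hit in hits)
    return heapq.nlargest(size, merged)


shards = [
    {"1": "thomas likes beer"},
    {"2": "beer beer thomas"},
    {"3": "nothing relevant"},
]
top = scatter_gather(shards, "beer", size=2)
```

In this model any node could play the coordinating role, since all it needs is the list of shards to query.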

replicas

each shard can have 1 or more replicas

# of replicas can be updated dynamically after
index creation

replicas can be used for querying in parallel

shard allocation
node 1

start with a single node

shard allocation
node 1
person1
person2

PUT /person {
  "index": {
    "number_of_shards": 2,
    "number_of_replicas": 1
}}

shard allocation
node 1 node 2
person1 person1
person2 person2

start the second node

shard allocation
node 1 node 2 node 3 node 4
person1 person1
person2 person2

start 2 more nodes
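
The allocation steps above can be approximated with a small sketch. It assumes one simple rule (never place two copies of the same shard on one node, prefer the least-loaded node); the real allocator is more involved, and `allocate` is a hypothetical helper.

```python
def allocate(num_shards, num_replicas, nodes):
    """Spread shard copies over nodes (illustrative only).

    Returns (allocation, unassigned). A copy is a (shard, copy_number)
    pair, where copy_number 0 is the primary.
    """
    allocation = {node: [] for node in nodes}
    unassigned = []
    copies = [(s, c) for c in range(num_replicas + 1) for s in range(num_shards)]
    for shard, copy in copies:
        # Only nodes that do not already hold a copy of this shard qualify.
        candidates = [
            n for n in nodes if all(s != shard for s, _ in allocation[n])
        ]
        if not candidates:
            unassigned.append((shard, copy))
            continue
        target = min(candidates, key=lambda n: len(allocation[n]))
        allocation[target].append((shard, copy))
    return allocation, unassigned


one_node = allocate(2, 1, ["node1"])
two_nodes = allocate(2, 1, ["node1", "node2"])
four_nodes = allocate(2, 1, ["node1", "node2", "node3", "node4"])
```

With a single node the replicas stay unassigned; with two nodes each node holds one copy of every shard; with four nodes each node holds a single copy.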

document sharding
person1 person1
person2 person2

PUT /person/info/1
{…}
hashed to shard 1, then replicated

PUT /person/info/2
{…}
hashed to shard 2, then replicated

scatter-gather
person1 person1
person2 person2

GET /person/_search?q=name:thomas

transactional model

per-document consistency

no need to commit/flush

uses write-behind transaction log

write consistency (W) can be controlled

one, quorum, or all
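
As a rough illustration of the three write consistency levels, the sketch below computes how many shard copies must be available before a write proceeds. It assumes quorum means a simple majority of all copies (primary plus replicas); the exact rule elasticsearch applies, especially for small replica counts, may differ. `required_copies` is a hypothetical helper.

```python
def required_copies(consistency, number_of_replicas):
    """Number of shard copies that must be active for a write (illustrative)."""
    total = 1 + number_of_replicas  # the primary plus its replicas
    if consistency == "one":
        return 1
    if consistency == "quorum":
        return total // 2 + 1  # assumption: simple majority
    if consistency == "all":
        return total
    raise ValueError("unknown consistency level: %r" % consistency)


levels = {w: required_copies(w, number_of_replicas=2) for w in ("one", "quorum", "all")}
```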

(near) real-time search

1 second refresh rate by default

_refresh API also
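
What "near real-time" means can be shown with a toy model: a document is accepted immediately but only becomes visible to search after a refresh. `ToyIndex` is a hypothetical class written for this illustration; it does not reflect how Lucene segments work internally.

```python
class ToyIndex:
    """Indexed documents become searchable only after refresh()."""

    def __init__(self):
        self._buffer = []      # accepted, not yet visible to search
        self._searchable = []  # visible to search

    def index(self, text):
        self._buffer.append(text)

    def refresh(self):
        # In the real system this happens about once a second by default,
        # or on demand through the _refresh API.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, term):
        return [doc for doc in self._searchable if term in doc.split()]


idx = ToyIndex()
idx.index("beer and brewing")
before = idx.search("beer")
idx.refresh()
after = idx.search("beer")
```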

index storage

node data considered transient

can be stored in local file system, JVM heap,
native OS memory, or FS & memory combination

persistent storage requires a gateway

gateways
persistent store for cluster state and indices

asynchronous, translog-based write strategy

allows full recovery if a cluster restart is needed

supported gateways:
local
shared FS
Hadoop via HDFS
S3

mapping
describes document structure to the search
engine

automatically created with sensible defaults

explicit mapping can be provided (generally, a
good idea)

can run into merge conflicts

mapping

important meta fields:

_source

_all

_boost

mapping types

simple:

string, integer/long, float/double, boolean, and
null

complex:

array, object

sample mapping
document

{"user":      "derick",
 "title":     "Don't Panic",
 "tags":      ["profiling", "debugging", "php"],
 "postDate":  "2010-12-22T1…:1…:12",
 "priority":  2}

mapping

{"post": {
  "properties" : {
    "user": {"type": "string", "index": "not_analyzed"},
    "message": {"type": "string", "boost": 1.…},
    "tags": {"type": "string", "include_in_all": "no"},
    "postDate" : {"type" : "date", "store": "no"},
    "priority" : {"type" : "integer"}
}}}

analyzers
break down (tokenize) and normalize fields during
indexing and query strings at search time

analyzer = tokenizer + token filters (0 or more)

Standard Analyzer =
   Standard Tokenizer +
       Standard Token Filter +
       Lowercase Token Filter +
       Stop Token Filter
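
The "tokenizer + token filters" composition can be sketched as a small pipeline. This is only an approximation: the regular expression and the stop-word list are assumptions made for the example and are much simpler than Lucene's standard tokenizer and stop filter. All function names are hypothetical.

```python
import re

# Assumption: a tiny stop-word list, just for illustration.
STOP_WORDS = {"a", "an", "and", "but", "the", "of", "to"}


def simple_tokenizer(text):
    """Split text into word-like tokens (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[A-Za-z0-9']+", text)


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text, tokenizer, filters):
    """analyzer = tokenizer + zero or more token filters, applied in order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "99 Problems, But the Search Ain't One",
    simple_tokenizer,
    [lowercase_filter, stop_filter],
)
```

The order of the filters matters: the stop filter here only recognizes lowercase words, so it has to run after the lowercase filter.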

analyzers
analyzers, tokenizers, and filters can be
customized

elasticsearch.yml

index:
  analysis:
    analyzer:
      eulang:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, stop,
                 asciifolding, porterStem]

mapping

…
"title": {"type": "string", "analyzer": "eulang"},
…

API conventions

append ?pretty=true to get readable JSON

boolean values: false/0/off = false, rest is true

JSONP support via callback parameter
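
The boolean convention above is easy to mirror in client code. A minimal sketch, assuming the comparison is on the exact strings listed on the slide; `parse_bool` is a hypothetical helper.

```python
FALSE_VALUES = ("false", "0", "off")


def parse_bool(value):
    """false/0/off mean False; anything else counts as True."""
    return str(value) not in FALSE_VALUES


parsed = {v: parse_bool(v) for v in ("false", "0", "off", "true", "1", "yes")}
```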

API structure

http://host:port/[index]/[type]/[_action/id]

GET http://es:9200/_status

GET http://es:9200/twitter/_status

POST http://es:9200/twitter/tweet/1

GET http://es:9200/twitter/tweet/1

API structure
http://host:port/[index]/[type]/[_action/id]

GET http://es:9200/twitter/tweet/_search

GET http://es:9200/twitter/user/_search

GET http://es:9200/twitter/tweet,user/_search

GET http://es:9200/twitter,facebook/_search

GET http://es:9200/_search
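
The URL pattern can be captured in a small helper. This is a sketch for illustration only; `es_url` and its parameters are hypothetical names, and it merely joins the optional index, type, action, and id parts in the order shown above.

```python
def es_url(host, port, indices=(), types=(), action=None, doc_id=None):
    """Build http://host:port/[index]/[type]/[_action/id].

    Several indices or types are joined with commas; every part is optional.
    """
    parts = []
    if indices:
        parts.append(",".join(indices))
    if types:
        parts.append(",".join(types))
    if action:
        parts.append(action)
    if doc_id is not None:
        parts.append(str(doc_id))
    return "http://%s:%d/%s" % (host, port, "/".join(parts))


one_doc = es_url("es", 9200, ["twitter"], ["tweet"], doc_id=1)
two_types = es_url("es", 9200, ["twitter"], ["tweet", "user"], action="_search")
everything = es_url("es", 9200, action="_search")
```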

_cluster API structure

GET /_cluster/health

GET /_cluster/health/index1,index2

GET /_cluster/nodes/stats

GET /_cluster/nodes/nodeId1,nodeId2/stats

API {core}
index              search
bulk                 query
delete               from/size paging
delete by query      sort
get                  highlighting
count                selective fields

API {indices}
create                    optimize
delete                    snapshot
open/close                update settings
get/put/delete mapping    analyze
status
refresh
flush

API {cluster}

health

state

nodes info

nodes stats

nodes shutdown

Query DSL
term / terms       query_string
range                default_operator
prefix               analyzer
bool                 phrase_slop
fuzzy                etc
wildcard
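
For orientation, here is how two of these queries might look as JSON request bodies, built as Python dictionaries. The field name `likes` and the query text are made-up illustration values; the option names (`default_operator`, `phrase_slop`) are the ones listed above.

```python
import json

# A term query: matches documents whose "likes" field contains the exact term.
term_query = {"query": {"term": {"likes": "beer"}}}

# A query_string query with two of the options listed above.
query_string_query = {
    "query": {
        "query_string": {
            "query": "search engine",
            "default_operator": "AND",
            "phrase_slop": 2,
        }
    }
}

# The serialized body would be sent to a _search endpoint.
term_body = json.dumps(term_query)
query_string_body = json.dumps(query_string_query)
```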

filters

share some similar features with queries (term,
range, etc)

why use a filter?

filters
faster than queries

cached (depends on the filter)

the cache is used for different queries against
the same filter

no scoring

more useful ones: term, terms, range, prefix, and,
or, not, exists, missing, query
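
Why caching and the lack of scoring make filters cheap can be illustrated with a toy cache: a filter is reduced to the set of matching document ids, which is computed once and reused by later queries. `CachedFilters` is a hypothetical class; elasticsearch's real filter cache works on Lucene bitsets and differs in detail.

```python
class CachedFilters:
    """Caches the set of matching document ids per (field, value) term filter."""

    def __init__(self, docs):
        self.docs = docs
        self.cache = {}
        self.evaluations = 0  # how many times a filter was actually computed

    def term(self, field, value):
        key = (field, value)
        if key not in self.cache:
            self.evaluations += 1
            matching = set()
            for doc_id, doc in self.docs.items():
                field_value = doc.get(field)
                values = field_value if isinstance(field_value, list) else [field_value]
                if value in values:
                    matching.add(doc_id)  # yes/no match only, no score
            self.cache[key] = frozenset(matching)
        return self.cache[key]


docs = {
    1: {"likes": ["beer", "coding"]},
    2: {"likes": ["photography"]},
}
filters = CachedFilters(docs)
first_call = filters.term("likes", "beer")
second_call = filters.term("likes", "beer")  # served from the cache
```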

facets

provide aggregated data based on the search
request

terms, histogram, date histogram, range,
statistical, and more
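
A terms facet is essentially a count of field values across the documents that matched the search. A minimal sketch, assuming the hits are plain dictionaries; `terms_facet` is a hypothetical helper, not an elasticsearch API.

```python
from collections import Counter


def terms_facet(hits, field, size=10):
    """Return the `size` most frequent values of `field` among the hits."""
    counts = Counter()
    for doc in hits:
        value = doc.get(field, [])
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    return counts.most_common(size)


hits = [
    {"name": "a", "likes": ["beer", "coding"]},
    {"name": "b", "likes": ["beer"]},
    {"name": "c", "likes": ["photography", "beer"]},
]
top_term = terms_facet(hits, "likes", size=1)
```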

geo search

implemented as filters (and a facet)

geo_distance

geo_bounding_box

geo_polygon
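
The idea behind the first two filters can be sketched in a few lines. This is an illustration under assumptions: distances use the haversine formula on a spherical Earth, the bounding box ignores the date line, documents are dictionaries with `lat` and `lon` keys, and all function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geo_distance_filter(docs, center, max_km):
    """Keep documents within max_km of the (lat, lon) center."""
    lat, lon = center
    return [d for d in docs if haversine_km(lat, lon, d["lat"], d["lon"]) <= max_km]


def geo_bounding_box_filter(docs, top_left, bottom_right):
    """Keep documents inside the box given by two (lat, lon) corners."""
    top, left = top_left
    bottom, right = bottom_right
    return [d for d in docs if bottom <= d["lat"] <= top and left <= d["lon"] <= right]


places = [
    {"name": "London", "lat": 51.5074, "lon": -0.1278},
    {"name": "Paris", "lat": 48.8566, "lon": 2.3522},
]
near_london = geo_distance_filter(places, (51.5, -0.1), max_km=100)
in_box = geo_bounding_box_filter(places, top_left=(50.0, 1.0), bottom_right=(48.0, 3.0))
```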

interfaces
REST

including memcached

Java / Groovy

Language clients (REST/Thrift):

pyes, PHP (standalone and symfony), Ruby, Perl

Flume sink implementation

elastica

similar to the other PHP ElasticSearch client

API naming is consistent with Zend Framework

can be extended for new filters, facets, etc

still under development

elastica example

// connect to the node and (re)create the 'test' index
$es = new Elastica_Client('vm', 9200);
$index = new Elastica_Index($es, 'test');
$index->create(array(), true);

// index one document of type 'person' with id 1
$type = new Elastica_Type($index, 'person');
$doc = new Elastica_Document(1, array('name' => 'Andrei Zmievski',
                                      'email' => 'andrei@test.com',
                                      'username' => 'andrei',
                                      'bills' => array(2, 3, 5)));
$type->addDocument($doc);

// run a query_string search and print the number of hits
$qs = new Elastica_Query_QueryString('andrei');
$query = new Elastica_Query($qs);
$resultSet = $type->search($query);
print $resultSet->count();

data import

ES is not the primary data store (usually)

to import/synchronize data:

write an agent (Gearman, message queues, etc)

use rivers (CouchDB, RabbitMQ, Twitter)

10 more features
versioning                   load balancing nodes
index aliases                plugins
parent/child docs            more_like_this
scripting                    multi_field mapping
dynamic mapping templates    percolation

References

http://github.com/elasticsearch/elasticsearch

http://www.elasticsearch.org/community/forum

IRC: #elasticsearch on irc.freenode.net

twitter: @elasticsearch

HTTP://ZMIEVSKI.ORG/TALKS
