2. About Me
• Software Engineer & Data Nerd since 1997
• Doing "media stuff" since 2012
• Principal @ AOL since 2014
• @ieatkillerbees
• http://samanthaquinones.com
5. What is Elasticsearch?
• Near real-time search engine powered by Lucene (documents become searchable shortly after being indexed)
• Clustered for high availability (H/A) and performance via federation with shards and replicas
6. What's it Used For?
• Logging (we use Elasticsearch to centralize traffic logs, exception logs, and audit logs)
• Content management and search
• Statistical analysis
7. Installing Elasticsearch
$ curl -L -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.1.1/elasticsearch-2.1.1.tar.gz
$ tar -zxvf elasticsearch*
$ cd elasticsearch-2.1.1/bin
$ ./elasticsearch
8. Connecting to Elasticsearch
• Via Java, there are two native clients which connect to an ES cluster on port 9300
• Most commonly, we access Elasticsearch via its HTTP API
12. Index Madness
• "Index" is an overloaded term.
• As a verb, to index a document is to store it in an index. This is analogous to an SQL INSERT operation.
• As a noun, an index is a collection of documents.
• Fields within a document have inverted indexes, similar to how a column in an SQL table may have an index.
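To make the "inverted index" sense concrete, here is a minimal Python sketch: a map from each term to the IDs of the documents that contain it. The sample documents and the helper name build_inverted_index are made up for illustration, and splitting on whitespace is a simplification of what Lucene really does.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the sorted IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical documents, keyed by ID.
docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick red fox",
}

index = build_inverted_index(docs)
print(index["brown"])  # [1, 2]
print(index["quick"])  # [1, 3]
```

Looking up a term is then a single dictionary access, which is why term lookups stay fast as the number of documents grows.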
13. Indexing Our First Document
curl -X PUT "http://localhost:9200/test_document/test/1" -d '{ "name": "test_name" }'
14. Retrieving Our First Document
curl -X GET "http://localhost:9200/test_document/test/1"
15. Let's Look at Some Stack Overflow Posts!
$ vi queries/bulk_insert_so_data.json
16. Bulk Insert
curl -X PUT "http://localhost:9200/_bulk" --data-binary "@queries/bulk_insert_so_data.json"
22. Relevancy
• When searching (in query context), results are scored by a relevancy algorithm
• Results are presented in order from highest to lowest score
25. Aggregations
• Run statistical operations over your data
• Also near real-time!
• Complex aggregations are abstracted away behind simple interfaces; you don't need to be a statistician
29. Under the Hood
• Elasticsearch is designed from the ground up to run in a distributed fashion.
• Indices (collections of documents) are partitioned into shards.
• Shards can be stored on a single node or spread across multiple nodes.
• Shards are balanced across the cluster to improve performance.
• Shards are replicated for redundancy and high availability.
30. What is a Cluster?
• One or more nodes (servers) that work together to…
  • serve a dataset that exceeds the capacity of a single server…
  • provide federated indexing (writes) and searching (reads)…
  • provide H/A through sharing and replication of data
31. What are Nodes?
• Individual servers within a cluster
• Can provide indexing and searching capabilities
32. What is an Index?
• Logically, an index is a collection of documents, roughly analogous to a database in MySQL
• In reality, an index is a namespace that points to one or more physical shards, which contain the data
• When indexing a document, if the specified index does not exist, it will be created automatically
33. What are Shards?
• Low-level units that each hold a slice of the available data
• A shard is a single instance of Lucene and is a fully functional, self-contained search engine
• Shards are either primaries or replicas, and are assigned to nodes
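As an illustration of how a document ends up on one particular primary shard: Elasticsearch hashes a routing value (the document ID by default) and takes the result modulo the number of primary shards. The sketch below uses CRC32 as a stand-in hash, which is an assumption made purely for illustration, so the shard numbers it prints will not match a real cluster. The helper name pick_shard is hypothetical.

```python
import zlib

def pick_shard(routing_value, number_of_primary_shards):
    """Return the primary shard number for a routing value.

    zlib.crc32 is a stand-in hash for illustration only.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards

# With 5 primary shards, every document ID maps to exactly one shard in 0..4.
for doc_id in ["1", "2", "3", "42"]:
    print(doc_id, "->", pick_shard(doc_id, 5))
```

Because the shard is derived from the number of primaries, that number is fixed when the index is created. Changing it would change where existing documents are expected to live.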
34. What is Replication?
• Shards can have replicas
• Replicas primarily provide redundancy for when shards or nodes fail
• A replica should not be allocated on the same node as the shard it replicates
37. Cluster Health
curl -X GET "http://localhost:9200/_cluster/health"
curl -X GET "http://localhost:9200/_cat/health?v"
38. _cat API
• Display human-readable information about parts of the ES system
• Provides some limited documentation of functions
39. aliases
> $ http GET ':9200/_cat/aliases?v'
alias        index               filter routing.index routing.search
posts        posts_561729df8ce4e *      -             -
posts.public posts_561729df8ce4e *      -             -
posts.write  posts_561729df8ce4e -      -             -
Display all configured aliases
40. allocation
> $ http GET ':9200/_cat/allocation?v'
shards disk.used disk.avail disk.total disk.percent host
33 2.6gb 21.8gb 24.4gb 10 host1
33 3gb 21.4gb 24.4gb 12 host2
34 2.6gb 21.8gb 24.4gb 10 host3
Show how many shards are allocated per node, with disk utilization info
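Since _cat output is whitespace-separated columns with a header row (when ?v is given), it is easy to post-process. A small Python sketch, using the allocation output above as sample input. parse_cat is a hypothetical helper, and it assumes no column value contains spaces.

```python
def parse_cat(output):
    """Turn _cat text output (with a ?v header row) into a list of dicts."""
    lines = [line for line in output.strip().splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]

sample = """\
shards disk.used disk.avail disk.total disk.percent host
33 2.6gb 21.8gb 24.4gb 10 host1
33 3gb 21.4gb 24.4gb 12 host2
34 2.6gb 21.8gb 24.4gb 10 host3
"""

rows = parse_cat(sample)
total_shards = sum(int(row["shards"]) for row in rows)
print(total_shards)     # 100
print(rows[1]["host"])  # host2
```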
41. count
> $ http GET ':9200/_cat/count?v'
epoch timestamp count
1453790185 06:36:25 182763
> $ http GET ':9200/_cat/count/posts?v'
epoch timestamp count
1453790467 06:41:07 164169
> $ http GET ':9200/_cat/count/posts.public?v'
epoch timestamp count
1453790472 06:41:12 164169
Display a count of documents in the cluster, or in a specific index
42. fielddata
> $ http -b GET ':9200/_cat/fielddata?v'
id                     host  ip            node  total site_id published
7tjeJNY3TMajqRkmYsJyrA host1 10.97.183.146 node1 1.1mb 170.1kb 996.5kb
__xrpsKAQW6yyCY8luLQdQ host2 10.97.180.138 node2 1.6mb 329.3kb 1.3mb
bdoNNXHXRryj22YqjnqECw host3 10.97.181.190 node3 1.1mb 154.7kb 991.7kb
Shows how much memory is allocated to fielddata (in-memory field values used for sorts and aggregations)
43. health
> $ http -b GET ':9200/_cat/health?v'
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks
1453829723 17:35:23 ampehes_prod_cluster green 3 3 100 50 0 0 0 0
44. indices
> $ http -b GET 'eventhandler-prod.elasticsearch.amppublish.aws.aol.com:9200/_cat/indices?v'
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open posts_561729df8ce4e 5 1 468629 20905 4gb 2gb
green open slideshows 5 1 3893 6 86mb 43mb
45. master
> $ http -b GET ':9200/_cat/master?v'
id host ip node
7tjeJNY3TMajqRkmYsJyrA host1 10.97.183.146 node1
46. nodes
> $ http -b GET ':9200/_cat/nodes?v'
host ip heap.percent ram.percent load node.role master name
127.0.0.1 127.0.0.1 50 100 2.47 d * Mentus
47. pending tasks
% curl 'localhost:9200/_cat/pending_tasks?v'
insertOrder timeInQueue priority source
1685 855ms HIGH update-mapping [foo][t]
1686 843ms HIGH update-mapping [foo][t]
1693 753ms HIGH refresh-mapping [foo][[t]]
1688 816ms HIGH update-mapping [foo][t]
1689 802ms HIGH update-mapping [foo][t]
1690 787ms HIGH update-mapping [foo][t]
1691 773ms HIGH update-mapping [foo][t]
48. shards
> $ http -b GET ':9200/_cat/shards?v'
index shard prirep state docs store ip node
posts_561729df8ce4e 2 r STARTED 94019 410.5mb 10.97.180.138 host1
posts_561729df8ce4e 2 p STARTED 94019 412.7mb 10.97.181.190 host2
posts_561729df8ce4e 0 p STARTED 93307 413.6mb 10.97.183.146 host3
posts_561729df8ce4e 0 r STARTED 93307 415mb 10.97.180.138 host1
posts_561729df8ce4e 3 p STARTED 94182 407.1mb 10.97.183.146 host2
posts_561729df8ce4e 3 r STARTED 94182 403.4mb 10.97.180.138 host1
posts_561729df8ce4e 1 r STARTED 94130 447.1mb 10.97.180.138 host1
posts_561729df8ce4e 1 p STARTED 94130 447mb 10.97.181.190 host2
posts_561729df8ce4e 4 r STARTED 93299 421.5mb 10.97.183.146 host3
posts_561729df8ce4e 4 p STARTED 93299 398.8mb 10.97.181.190 host2
51. Document Model
• Documents represent objects
• By default, all fields in all documents are analyzed and indexed
52. Metadata
• _index - The index in which a document resides
• _type - The class of object that a document represents
• _id - The document's unique identifier. Auto-generated when not provided
53. Retrieving Documents
curl -X GET "http://localhost:9200/test_document/test/1"
curl -I "http://localhost:9200/test_document/test/1"
curl -I "http://localhost:9200/test_document/test/2"
54. Updating Documents
curl -X PUT "http://localhost:9200/test_document/test/1" -d '{
"name": "test_name",
"conference": "php benelux"
}'
curl -X GET "http://localhost:9200/test_document/test/1"
58. Bulk API
• Perform many operations in a single request
• Efficient batching of actions
• Bulk queries take the form of a stream of single-line JSON objects that define actions and document bodies
59. Bulk Actions
• create - Index a document only if it does not already exist
• index - Index a document, replacing it if it exists
• update - Apply a partial update to a document
• delete - Delete a document
60. Bulk API Format
{ action: { metadata }}\n
{ request body }\n
{ action: { metadata }}\n
{ request body }\n
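A request body in this format can be assembled with a few lines of Python. This is a sketch only: build_bulk_body is a hypothetical helper, and the index, type, and ID values are borrowed from the earlier test_document examples.

```python
import json

def build_bulk_body(actions):
    """Serialize (action, metadata, document) tuples into a bulk request body.

    Each JSON object goes on its own line and the body ends with a newline.
    `document` is None for actions that carry no body, such as delete.
    """
    lines = []
    for action, metadata, document in actions:
        lines.append(json.dumps({action: metadata}))
        if document is not None:
            lines.append(json.dumps(document))
    return "\n".join(lines) + "\n"

body = build_bulk_body([
    ("index", {"_index": "test_document", "_type": "test", "_id": "1"}, {"name": "test_name"}),
    ("delete", {"_index": "test_document", "_type": "test", "_id": "2"}, None),
])
print(body, end="")
```

The result can be written to a file and sent with curl --data-binary, which (unlike -d) preserves the newlines the format depends on.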
61. Sizing Bulk Requests
• Balance the number of documents against the size of each document
• The docs put the sweet spot at 5-15 MB per request
• The AOL Analytics Cluster indexes 5000 documents per batch (approx. 7 MB)
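One way to respect both limits is to close a batch as soon as either the document count or the byte size would be exceeded. A minimal Python sketch follows. chunk_documents is a hypothetical helper, and its defaults simply reuse the figures from this slide (5000 documents, 15 MB) rather than tuned values.

```python
def chunk_documents(docs, max_docs=5000, max_bytes=15 * 1024 * 1024):
    """Yield lists of serialized documents, closing a batch when either limit is hit.

    `docs` is an iterable of JSON strings. A single document larger than
    max_bytes still goes out, alone in its own batch.
    """
    batch, batch_bytes = [], 0
    for doc in docs:
        size = len(doc.encode("utf-8"))
        if batch and (len(batch) >= max_docs or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch

# Twelve tiny documents with a limit of five per batch.
docs = ['{"n": %d}' % i for i in range(12)]
print([len(b) for b in chunk_documents(docs, max_docs=5)])  # [5, 5, 2]
```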
62. Searching Documents
• Structured queries - queries against concrete fields like "title" or "score" which return specific documents.
• Full-text queries - queries that find documents which match a search query and return them sorted by relevance
63. Search Elements
• Mappings - Define how data in fields is interpreted
• Analysis - How text is parsed and processed to make it searchable
• Query DSL - Elasticsearch's query language
64. About Queries
• Leaf Queries - Search for a value in a given field. These queries are standalone. Examples: match, range, term
• Compound Queries - Wrap leaf queries and other compound queries, either combining them logically (e.g. bool queries) or altering their behavior (e.g. score queries)
65. Empty Search
curl -X GET "http://localhost:9200/stack_overflow/_search"
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{
"query": { "match_all": {} }
}'
66. Timing Out Searches
curl -X GET "http://localhost:9200/stack_overflow/_search?timeout=1s"
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{
"timeout": "1s",
"query": { "match_all": {} }
}'
70. Pagination Concerns
• Since searches are distributed across multiple shards, paged queries must be sorted on each shard, combined, and re-sorted
• The cost of paging in distributed data sets rises steeply with page depth: every shard must produce its top from + size hits before they are merged
• It is wise to limit how many pages of results can be returned
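A back-of-the-envelope Python sketch of why deep pages are expensive. It assumes each shard returns its own top from + size hits to the coordinating node, which then merges them. The function name and the page and shard numbers are made up for illustration.

```python
def docs_merged_by_coordinator(page, page_size, shards):
    """Number of candidate hits the coordinating node must merge for one page.

    Each shard returns its top (from + size) hits, where
    from = (page - 1) * page_size.
    """
    offset = (page - 1) * page_size
    return shards * (offset + page_size)

for page in (1, 100, 1000):
    print(page, docs_merged_by_coordinator(page, page_size=10, shards=5))
# 1 50
# 100 5000
# 1000 50000
```

Only ten of those hits are returned to the caller. The rest are sorted and thrown away on every request.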
71. Full Text Queries
• match - Basic term matching query
• multi_match - Match query which spans multiple fields
• common_terms - Match query which gives preference to uncommon words
• query_string - Match documents using a search "mini-DSL"
• simple_query_string - A simpler version of query_string that never throws exceptions, suitable for exposing to users
72. Term Queries
• term - Search for an exact value
• terms - Search for any of several exact values in a field
• range - Find documents where a value is in a certain range
• exists - Find documents that have any non-null value in a field
• missing - Inverse of `exists`
• prefix - Match terms that begin with a string
• wildcard - Match terms with a wildcard
• regexp - Match terms against a regular expression
• fuzzy - Match terms with configurable fuzziness
73. Compound Queries
• constant_score - Wraps a query in filter context, giving all results a constant score
• bool - Combines multiple queries with `must`, `should`, `must_not` and `filter` clauses
• dis_max - Similar to bool, but creates a union of subquery results, scoring each document with the maximum score of the query that produced it
• function_score - Modifies the scores of documents returned by a query. Useful for altering the distribution of results based on recency, popularity, etc.
• boosting - Takes a `positive` and `negative` query, returning the results of `positive` while reducing the scores of documents that also match `negative`
• filtered - Combines a query clause in query context with one in filter context
• limit - Perform the query over a limited number of documents in each shard
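As a sketch of how leaf queries nest inside a compound query, the Python below builds a bool query body as a plain dictionary. The field names "title" and "score" are the ones mentioned on the Searching Documents slide. The search text and the threshold are made up.

```python
import json

query = {
    "query": {
        "bool": {
            # must runs in query context: it contributes to the relevancy score
            "must": [{"match": {"title": "elasticsearch php"}}],
            # filter runs in filter context: documents match or do not, no scoring
            "filter": [{"range": {"score": {"gte": 10}}}],
        }
    }
}

print(json.dumps(query, indent=2))
```

The printed JSON has the same shape as the -d bodies used in the _search examples, with the bool query in place of match_all.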
74. What are Mappings?
• Similar to schemas, mappings define the types of data found in fields
• Determine how individual fields are analyzed & stored
• Set the format of date fields
• Set rules for mapping dynamic fields
75. Mapping Types
• Indices have one or more mapping types, which group documents logically.
• Types contain meta fields, which can be used to customize metadata like _index, _id, _type, and _source
• Types can also list fields that have consistent structure across types.
76. Data Types
• Scalar Values - string, long, double, boolean
• Special Scalars - date, ip
• Structural Types - object, nested
• Special Types - geo_shape, geo_point, completion
• Compound Types - string arrays, nested objects
77. Dynamic vs Explicit Mapping
• Dynamic fields are not defined prior to indexing
• Elasticsearch selects the most likely type for dynamic fields, based on configurable rules
• Explicit fields are defined exactly prior to indexing
• Types cannot accept data that is the wrong type for an explicit mapping
78. Shared Fields
• Fields that are defined in multiple mapping types must be identical if they:
  • Have the same name
  • Live in the same index
  • Map to the same field internally
80. Dynamic Mappings
• Mappings are generated when a type is created, if no mapping was previously specified.
• Elasticsearch is good at identifying fields much of the time, but it's far from perfect!
• Fields can contain basic data types, but importantly, mappings optimize a field for either structured (exact) or full-text searching
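A rough Python sketch of the kind of guessing dynamic mapping performs on a parsed JSON value. It is a simplification under stated assumptions: the date check is a crude regular expression standing in for real date detection, arrays are judged by their first element only, and guess_field_type is a hypothetical name.

```python
import re

DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}")  # crude stand-in for date detection

def guess_field_type(value):
    """Guess a field type from a parsed JSON value (simplified)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        return "date" if DATE_PATTERN.match(value) else "string"
    if isinstance(value, list) and value:
        return guess_field_type(value[0])  # simplified: look at the first element only
    return None  # null or empty array: nothing to map yet

print(guess_field_type(42))            # long
print(guess_field_type("2016-01-26"))  # date
print(guess_field_type("123"))         # string
```

The last line shows the "far from perfect" part: a number that arrives as a string is mapped as a string, which is a good reason to define important fields explicitly.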
81. Structured Data vs Full Text
• Exact values contain exact strings which are not subject to natural language interpretation.
• Full-text values must be interpreted in the context of natural language
83. Natural Language
• "us" can be interpreted in several ways in natural language
  • Abbreviation for "United States"
  • The English dative personal pronoun
  • An alternative symbol for µs
  • The French word us
84. Analyzing Text
• Elasticsearch is optimized for full text search
• Text is analyzed in a two-step process
  • First, text is tokenized into individual terms
  • Second, terms are normalized through a filter
85. Analyzers
• Analyzers perform the analysis process
• Character filters clean up text by removing or modifying characters
• Tokenizers break the text down into terms
• Token filters modify, remove, or add terms
86. Standard Analyzer
• General-purpose analyzer that works for most natural languages.
• Splits text on word boundaries, removes punctuation, and lowercases all tokens.
87. Standard Analyzer
curl -X GET 'http://localhost:9200/_analyze?analyzer=standard&text=Reverse+text+with+strrev($text)!'
89. Whitespace Analyzer
curl -X GET 'http://localhost:9200/_analyze?analyzer=whitespace&text=Reverse+text+with+strrev($text)!'
90. Keyword Analyzer
• Tokenizes the entire text as a single string.
• Used for things that should be kept whole, like ID numbers, postal codes, etc.
91. Keyword Analyzer
curl -X GET 'http://localhost:9200/_analyze?analyzer=keyword&text=Reverse+text+with+strrev($text)!'
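To compare the three analyzers without a running cluster, the Python sketch below approximates each one on the same sample text. These are stand-ins, not the real implementations: in particular, the regular expression used for the standard analyzer is only a rough approximation of its word-boundary rules.

```python
import re

TEXT = "Reverse text with strrev($text)!"

def standard_like(text):
    """Rough stand-in for the standard analyzer: runs of word characters, lowercased."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def whitespace_like(text):
    """Whitespace analyzer: split on whitespace and change nothing else."""
    return text.split()

def keyword_like(text):
    """Keyword analyzer: the whole input becomes a single token."""
    return [text]

print(standard_like(TEXT))    # ['reverse', 'text', 'with', 'strrev', 'text']
print(whitespace_like(TEXT))  # ['Reverse', 'text', 'with', 'strrev($text)!']
print(keyword_like(TEXT))     # ['Reverse text with strrev($text)!']
```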
92. Language Analyzers
• Analyzers optimized for specific natural languages.
• Reduce tokens to stems (jumper, jumped → jump)
93. Language Analyzers
curl -X GET 'http://localhost:9200/_analyze?analyzer=english&text=Reverse+text+with+strrev($text)!'
94. Analyzers
• Analyzers are applied when documents are indexed
• Analyzers are applied when a full-text search is performed against a field, in order to produce the correct set of terms to search for
95. Character Filters
• html_strip - Removes HTML from text
• mapping - Replaces characters based on a map of original → new ({ "ph": "f" })
• pattern_replace - Similar to mapping, using regular expressions
96. Index Templates
• Template mappings that are applied to newly created indices
• Templates also contain index configuration information
• Powerful when combined with dated indices
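A small Python sketch of the dated-index idea: one index per day, all matching a single wildcard pattern that a template could be registered with. The "logs" prefix, the date format, and the "logs-*" pattern are assumptions for illustration, and fnmatch is used here only to mimic wildcard matching.

```python
from datetime import date, timedelta
from fnmatch import fnmatch

TEMPLATE_PATTERN = "logs-*"  # hypothetical pattern a template would be registered with

def dated_index_name(day, prefix="logs"):
    """Build a per-day index name such as logs-2016.01.26."""
    return "{}-{}".format(prefix, day.strftime("%Y.%m.%d"))

start = date(2016, 1, 26)
names = [dated_index_name(start + timedelta(days=n)) for n in range(3)]
print(names)  # ['logs-2016.01.26', 'logs-2016.01.27', 'logs-2016.01.28']
print(all(fnmatch(name, TEMPLATE_PATTERN) for name in names))  # True
```

Each new day's index is created automatically on first write and picks up the template's mappings and settings, so nothing has to be configured by hand at midnight.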
97. Scoring
• Scoring is based on a boolean model and a scoring function
• The boolean model applies AND/OR logic to an inverted index to produce a list of matching documents
98. Term Frequency
• Terms that appear frequently in a document increase the document's relevancy score.
• term_frequency(term in document) = √number_of_appearances
99. Inverse Document Frequency
• Terms that appear in many documents reduce a document's relevancy score
• inverse_doc_frequency(term) = 1 + log(number_of_docs / (frequency + 1))
100. Field Length Normalization
• Terms that appear in shorter fields increase the relevancy of a document.
• norm(document) = 1 / √number_of_terms
101. Example from the Docs
• Given the text "quick brown fox", the term "fox" scores…
  • Term Frequency: 1.0
  • Inverse Doc Frequency: 0.30685282
  • Field Norm: 0.5
  • Score: 0.15342641
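The numbers in this example can be reproduced from the three formulas on the preceding slides. The Python sketch below assumes an index holding a single document, which is what yields the IDF shown. The field norm is taken from the slide as given: the formula yields about 0.577 for a three-term field, and the 0.5 on the slide is assumed here to be that value after lossy storage.

```python
import math

def term_frequency(appearances):
    return math.sqrt(appearances)

def inverse_doc_frequency(number_of_docs, docs_with_term):
    return 1 + math.log(number_of_docs / (docs_with_term + 1))

def field_norm(number_of_terms):
    return 1 / math.sqrt(number_of_terms)

tf = term_frequency(1)                                            # "fox" appears once
idf = inverse_doc_frequency(number_of_docs=1, docs_with_term=1)   # one document in the index
exact_norm = field_norm(3)                                        # about 0.577 for three terms
stored_norm = 0.5                                                 # value reported on the slide (assumption: lossy storage)

print(round(tf, 8))                      # 1.0
print(round(idf, 8))                     # 0.30685282
print(round(tf * idf * stored_norm, 8))  # 0.15342641
```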
107. Importing Energy Data
curl -X PUT "http://localhost:9200/energy_use" --data-binary "@queries/mapping_energy.json"
curl -X PUT "http://localhost:9200/_bulk" --data-binary "@queries/bulk_insert_energy_data.json"
curl -X GET "http://localhost:9200/energy_use/_search"