Managing Your Content With
Elasticsearch
Samantha Quiñones / @ieatkillerbees
About Me
• Software Engineer & Data Nerd since 1997
• Doing “media stuff” since 2012
• Principal @ AOL since 2014
• @ieatkillerbees
• http://samanthaquinones.com
What We’ll Cover
• Intro to Elasticsearch
• CRUD
• Creating Mappings
• Analyzers
• Basic Querying & Searching
• Scoring & Relevance
• Aggregations Basics
But First…
• Download - https://www.elastic.co/downloads/elasticsearch
• Clone - https://github.com/squinones/elasticsearch-tutorial.git
What is Elasticsearch?
• Near real-time (documents are available for search quickly after
being indexed) search engine powered by Lucene
• Clustered for H/A and performance via federation with shards and
replicas
What’s it Used For?
• Logging (we use Elasticsearch to centralize traffic logs, exception
logs, and audit logs)
• Content management and search
• Statistical analysis
Installing Elasticsearch
$ curl -L -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.1.1/elasticsearch-2.1.1.tar.gz
$ tar -zxvf elasticsearch*
$ cd elasticsearch-2.1.1/bin
$ ./elasticsearch
Connecting to Elasticsearch
• Via Java, there are two native clients which connect to an ES
cluster on port 9300
• Most commonly, we access Elasticsearch via the HTTP API
HTTP API
curl -X GET "http://localhost:9200/?pretty"
Data Format
• Elasticsearch is a document-oriented database
• All operations are performed against documents (object graphs
expressed as JSON)
Analogues
Elasticsearch   MySQL      MongoDB
Index           Database   Database
Type            Table      Collection
Document        Row        Document
Field           Column     Field
Index Madness
• Index is an overloaded term.
• As a verb, to index a document is to store a document in an index.
This is analogous to an SQL INSERT operation.
• As a noun, an index is a collection of documents.
• Fields within a document have inverted indexes, similar to how a
column in an SQL table may have an index.
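The inverted-index idea from the last bullet can be sketched in a few lines of Python. This is a toy illustration only, not how Lucene actually stores its data; the `build_inverted_index` helper and the sample titles are made up for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical documents, keyed by id.
docs = {
    1: "PHP for loop",
    2: "Python for beginners",
    3: "PHP arrays",
}
index = build_inverted_index(docs)
print(index["php"])  # ids of the documents whose text contains "php"
```

Looking up a term is then a single dictionary access, which is why searching by term is cheap compared with scanning every document.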
Indexing Our First Document
curl -X PUT "http://localhost:9200/test_document/test/1" -d '{ "name": "test_name" }'
Retrieving Our First Document
curl -X GET "http://localhost:9200/test_document/test/1"
Let’s Look at Some Stack Overflow Posts!
$ vi queries/bulk_insert_so_data.json
Bulk Insert
curl -X PUT "http://localhost:9200/_bulk" --data-binary "@queries/bulk_insert_so_data.json"
First Search
curl -X GET "http://localhost:9200/stack_overflow/_search"
Query String Searches
curl -X GET "http://localhost:9200/stack_overflow/_search?q=title:php"
Query DSL
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{
"query" : {
"match" : {
"title" : "php"
}
}
}'
Compound Queries
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{
"query" : {
"filtered": {
"query" : {
"query_string" : {
"default_field" : "title",
"query" : "(php OR python) AND (flask OR laravel)"
}
},
"filter": {
"range": {
"score": {
"gt": 3
}
}
}
}
}
}'
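In Elasticsearch 2.x the `filtered` query is deprecated in favour of a `bool` query with a `filter` clause. Below is a minimal sketch of that rewrite using only the Python standard library. The helper name `filtered_to_bool` is hypothetical, and the `match` clause is a simplified stand-in for the text query above.

```python
import json

def filtered_to_bool(query_clause, filter_clause):
    """Build a bool query: `must` is scored, `filter` only includes or excludes."""
    return {"query": {"bool": {"must": query_clause, "filter": filter_clause}}}

body = filtered_to_bool(
    {"match": {"title": "php"}},      # scored, runs in query context
    {"range": {"score": {"gt": 3}}},  # unscored, runs in filter context
)
print(json.dumps(body, indent=2))  # the output can be passed to curl with -d
```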
Full-Text Searching
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{
"query" : {
"match" : {
"title" : "php loop"
}
}
}'
Relevancy
• When searching (in query context), results are scored by a
relevancy algorithm
• Results are presented in order from highest to lowest score
Phrase Searching
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{
"query" : {
"match" : {
"title": {
"query": "for loop",
"type": "phrase"
}
}
}
}'
Highlighting Searches
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{
"query" : {
"match" : {
"title": {
"query": "for loop",
"type": "phrase"
}
}
},
"highlight": {
"fields" : {
"title" : {}
}
}
}'
Aggregations
• Run statistical operations over your data
• Also near real-time!
• Complex aggregations are abstracted away behind simple
interfaces; you don’t need to be a statistician
Analyzing Tags
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{
"size": 0,
"aggs": {
"all_tags": {
"terms": {
"field": "tags",
"size": 0
}
}
}
}'
Nesting Aggregations
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{
"size": 0,
"aggs": {
"all_tags": {
"terms": {
"field": "tags",
"size": 0
},
"aggs": {
"avg_score": {
"avg": { "field": "score"}
}
}
}
}
}'
Break Time!
Under the Hood
• Elasticsearch is designed from the ground up to run in a distributed
fashion.
• Indices (collections of documents) are partitioned into shards.
• Shards can be stored on a single node or spread across multiple nodes.
• Shards are balanced across the cluster to improve performance
• Shards are replicated for redundancy and high availability
What is a Cluster?
• One or more nodes (servers) that work together to…
• serve a dataset that exceeds the capacity of a single server…
• provide federated indexing (writes) and searching (reads)…
• provide H/A through sharing and replication of data
What are Nodes?
• Individual servers within a cluster
• Can provide indexing and searching capabilities
What is an Index?
• An index is logically a collection of documents, roughly analogous
to a database in MySQL
• An index is in reality a namespace that points to one or more
physical shards which contain data
• When indexing a document, if the specified index does not exist, it
will be created automatically
What are Shards?
• Low-level units that hold a slice of available data
• A shard represents a single instance of Lucene and is a fully
functional, self-contained search engine
• Shards are either primaries or replicas and are assigned to nodes
What is Replication?
• Shards can have replicas
• Replicas primarily provide redundancy for when shards/nodes fail
• A replica should not be allocated on the same node as the shard it
replicates
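The rule that a replica never shares a node with its primary can be illustrated with a toy allocator. This is a sketch under simplifying assumptions (plain round-robin placement, a hypothetical `allocate` function), not Elasticsearch's actual allocation algorithm.

```python
def allocate(nodes, primaries=5, replicas=1):
    """Round-robin shard copies over nodes; copies of one shard never share a node."""
    placement = {node: [] for node in nodes}
    unassigned = []
    for shard in range(1, primaries + 1):
        used = set()
        copies = ["P"] + ["R"] * replicas
        for i, kind in enumerate(copies):
            # Start from a different node for each shard and each copy.
            candidates = [nodes[(shard + i + k) % len(nodes)] for k in range(len(nodes))]
            target = next((n for n in candidates if n not in used), None)
            if target is None:
                unassigned.append(f"{kind}{shard}")  # no eligible node is left
            else:
                placement[target].append(f"{kind}{shard}")
                used.add(target)
    return placement, unassigned

placement, unassigned = allocate(["node1", "node2"])
print(placement)
print(unassigned)
```

With a single node, `allocate(["node1"])` leaves every replica unassigned, which mirrors why a one-node cluster cannot offer redundancy.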
Default Topology
• 5 primary shards per index
• 1 replica per shard
[Diagram: a single node]
Clustering & Replication
[Diagram: two nodes sharing primary shards P1-P5 and replica shards R1-R5]
Cluster Health
curl -X GET "http://localhost:9200/_cluster/health"
curl -X GET "http://localhost:9200/_cat/health?v"
_cat API
• Display human-readable information about parts of the ES system
• Provides some limited documentation of functions
aliases
> $ http GET ':9200/_cat/aliases?v'
alias index filter routing.index routing.search
posts posts_561729df8ce4e * - -
posts.public posts_561729df8ce4e * - -
posts.write posts_561729df8ce4e - - -
Display all configured aliases
allocation
> $ http GET ':9200/_cat/allocation?v'
shards disk.used disk.avail disk.total disk.percent host
33 2.6gb 21.8gb 24.4gb 10 host1
33 3gb 21.4gb 24.4gb 12 host2
34 2.6gb 21.8gb 24.4gb 10 host3
Show how many shards are allocated per node, with disk utilization info
count
> $ http GET ':9200/_cat/count?v'
epoch timestamp count
1453790185 06:36:25 182763
> $ http GET ':9200/_cat/count/posts?v'
epoch timestamp count
1453790467 06:41:07 164169
> $ http GET ':9200/_cat/count/posts.public?v'
epoch timestamp count
1453790472 06:41:12 164169
Display a count of documents in the cluster, or a specific index
fielddata
> $ http -b GET ':9200/_cat/fielddata?v'
id                     host  ip            node  total site_id published
7tjeJNY3TMajqRkmYsJyrA host1 10.97.183.146 node1 1.1mb 170.1kb 996.5kb
__xrpsKAQW6yyCY8luLQdQ host2 10.97.180.138 node2 1.6mb 329.3kb 1.3mb
bdoNNXHXRryj22YqjnqECw host3 10.97.181.190 node3 1.1mb 154.7kb 991.7kb
Shows how much memory is allocated to fielddata (metadata used for sorts)
health
> $ http -b GET ':9200/_cat/health?v'
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks
1453829723 17:35:23 ampehes_prod_cluster green 3 3 100 50 0 0 0 0
indices
> $ http -b GET 'eventhandler-prod.elasticsearch.amppublish.aws.aol.com:9200/_cat/indices?v'
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open posts_561729df8ce4e 5 1 468629 20905 4gb 2gb
green open slideshows 5 1 3893 6 86mb 43mb
master
> $ http -b GET ':9200/_cat/master?v'
id host ip node
7tjeJNY3TMajqRkmYsJyrA host1 10.97.183.146 node1
nodes
> $ http -b GET ':9200/_cat/nodes?v'
host ip heap.percent ram.percent load node.role master name
127.0.0.1 127.0.0.1 50 100 2.47 d * Mentus
pending tasks
% curl 'localhost:9200/_cat/pending_tasks?v'
insertOrder timeInQueue priority source
1685 855ms HIGH update-mapping [foo][t]
1686 843ms HIGH update-mapping [foo][t]
1693 753ms HIGH refresh-mapping [foo][[t]]
1688 816ms HIGH update-mapping [foo][t]
1689 802ms HIGH update-mapping [foo][t]
1690 787ms HIGH update-mapping [foo][t]
1691 773ms HIGH update-mapping [foo][t]
shards
> $ http -b GET ':9200/_cat/shards?v'
index shard prirep state docs store ip node
posts_561729df8ce4e 2 r STARTED 94019 410.5mb 10.97.180.138 host1
posts_561729df8ce4e 2 p STARTED 94019 412.7mb 10.97.181.190 host2
posts_561729df8ce4e 0 p STARTED 93307 413.6mb 10.97.183.146 host3
posts_561729df8ce4e 0 r STARTED 93307 415mb 10.97.180.138 host1
posts_561729df8ce4e 3 p STARTED 94182 407.1mb 10.97.183.146 host2
posts_561729df8ce4e 3 r STARTED 94182 403.4mb 10.97.180.138 host1
posts_561729df8ce4e 1 r STARTED 94130 447.1mb 10.97.180.138 host1
posts_561729df8ce4e 1 p STARTED 94130 447mb 10.97.181.190 host2
posts_561729df8ce4e 4 r STARTED 93299 421.5mb 10.97.183.146 host3
posts_561729df8ce4e 4 p STARTED 93299 398.8mb 10.97.181.190 host2
segments
> $ http -b GET ':9200/_cat/segments?v'
index shard prirep ip segment generation docs.count docs.deleted size size.memory committed searchable version compound
posts_561726fecd9c6 0 p 10.97.183.146 _a 10 24 0 227.7kb 69554 true true 4.10.4 true
posts_561726fecd9c6 0 p 10.97.183.146 _b 11 108 0 659.1kb 103242 true true 4.10.4 false
posts_561726fecd9c6 0 p 10.97.183.146 _c 12 7 0 90.7kb 54706 true true 4.10.4 true
posts_561726fecd9c6 0 p 10.97.183.146 _d 13 6 0 82.2kb 49706 true true 4.10.4 true
posts_561726fecd9c6 0 p 10.97.183.146 _e 14 8 0 119kb 67162 true true 4.10.4 true
posts_561726fecd9c6 0 p 10.97.183.146 _f 15 1 0 35.9kb 32122 true true 4.10.4 true
posts_561726fecd9c6 0 r 10.97.180.138 _a 10 24 0 227.7kb 69554 true true 4.10.4 true
posts_561726fecd9c6 0 r 10.97.180.138 _b 11 108 0 659.1kb 103242 true true 4.10.4 false
CRUD Operations
Document Model
• Documents represent objects
• By default, all fields in all documents are analyzed and indexed
Metadata
• _index - The index in which a document resides
• _type - The class of object that a document represents
• _id - The document’s unique identifier. Auto-generated when not
provided
Retrieving Documents
curl -X GET "http://localhost:9200/test_document/test/1"
curl -X HEAD "http://localhost:9200/test_document/test/1"
curl -X HEAD "http://localhost:9200/test_document/test/2"
Updating Documents
curl -X PUT "http://localhost:9200/test_document/test/1" -d '{
"name": "test_name",
"conference": "php benelux"
}'
curl -X GET "http://localhost:9200/test_document/test/1"
Explicit Creates
curl -X PUT "http://localhost:9200/test_document/test/1/_create" -d '{
"name": "test_name",
"conference": "php benelux"
}'
Auto-Generated IDs
curl -X POST "http://localhost:9200/test_document/test" -d '{
"name": "test_name",
"conference": "php benelux"
}'
Deleting Documents
curl -X DELETE "http://localhost:9200/test_document/test/1"
Bulk API
• Perform many operations in a single request
• Efficient batching of actions
• Bulk queries take the form of a stream of single-line JSON objects
that define actions and document bodies
Bulk Actions
• create - Index a document IFF it doesn’t exist already
• index - Index a document, replacing it if it exists
• update - Apply a partial update to a document
• delete - Delete a document
Bulk API Format
{ action: { metadata }}\n
{ request body }\n
{ action: { metadata }}\n
{ request body }\n
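A minimal sketch of producing that newline-delimited format from Python, using only the standard library. The `bulk_body` helper is hypothetical, and the index, type and id values are reused from the earlier examples purely for illustration.

```python
import json

def bulk_body(actions):
    """Serialise (action, metadata, document) triples into a _bulk request body."""
    lines = []
    for action, metadata, document in actions:
        lines.append(json.dumps({action: metadata}))
        if document is not None:  # delete actions carry no request body
            lines.append(json.dumps(document))
    return "\n".join(lines) + "\n"  # the final line must also end with a newline

body = bulk_body([
    ("index", {"_index": "test_document", "_type": "test", "_id": "1"}, {"name": "test_name"}),
    ("delete", {"_index": "test_document", "_type": "test", "_id": "2"}, None),
])
print(body, end="")
```

Because `json.dumps` never emits raw newlines, each action and each document stays on a single line, which is what the bulk endpoint expects.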
Sizing Bulk Requests
• Balance quantity of documents with size of documents
• The docs suggest a sweet spot of 5-15 MB per request
• AOL Analytics Cluster indexes 5000 documents per batch (approx.
7 MB)
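One way to keep batches near a target size is to cut them by serialised byte count rather than by document count. The sketch below assumes a hypothetical `batches` generator and a 10 MB default chosen from within the range above; it is not taken from the tutorial repository.

```python
import json

def batches(documents, max_bytes=10 * 1024 * 1024):
    """Yield lists of documents whose serialised size stays under max_bytes."""
    batch, size = [], 0
    for doc in documents:
        doc_size = len(json.dumps(doc).encode("utf-8")) + 1  # +1 for the newline
        if batch and size + doc_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_size
    if batch:
        yield batch

# Ten small made-up documents, batched with a deliberately tiny limit.
docs = [{"id": i, "body": "x" * 100} for i in range(10)]
groups = list(batches(docs, max_bytes=500))
print([len(group) for group in groups])
```

A single document larger than `max_bytes` still gets a batch of its own, so nothing is dropped.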
Searching Documents
• Structured queries - queries against concrete fields like “title” or
“score” which return specific documents.
• Full-text queries - queries that find documents which match a search
query and return them sorted by relevance
Search Elements
• Mappings - Define how data in fields is interpreted
• Analysis - How text is parsed and processed to make it searchable
• Query DSL - Elasticsearch’s query language
About Queries
• Leaf Queries - Search for a value in a given field. These queries
are standalone. Examples: match, range, term
• Compound Queries - Combinations of leaf queries and other
compound queries which combine operations together either
logically (e.g. bool queries) or alter their behavior (e.g. score
queries)
Empty Search
curl -X GET "http://localhost:9200/stack_overflow/_search"
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{
"query": { "match_all": {} }
}'
Timing Out Searches
curl -X GET "http://localhost:9200/stack_overflow/_search?timeout=1s"
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{
"timeout": "1s",
"query": { "match_all": {} }
}'
Multi-Index/Type Searches
curl -X GET "http://localhost:9200/test_document,stack_overflow/_search"
Multi-Index Use Cases
• Dated indices for logging
• Roll-off indices for content-aging
• Analytic roll-ups
Pagination
curl -X GET "http://localhost:9200/stack_overflow/_search?size=5&from=5"
curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{
"size": 5,
"from": 5,
"query": { "match_all": {} }
}'
Pagination Concerns
• Since searches are distributed across multiple shards, paged
queries must be sorted at each shard, combined, and re-sorted
• Each shard has to return its top (from + size) hits, so the cost of
paging in distributed data sets climbs steeply the deeper you page
• It is wise to limit how many pages of results can be returned
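The sort, combine and re-sort steps can be simulated to see how much work a deep page causes. This is a simplified sketch with made-up scores and a hypothetical `paged_search` function; real shards return document ids and sort values rather than bare numbers.

```python
import heapq

def paged_search(shards, from_, size):
    """Return one page plus the number of hits the coordinating node had to sort."""
    # Every shard must return its own top (from_ + size) hits...
    per_shard = [heapq.nlargest(from_ + size, scores) for scores in shards]
    # ...and the coordinating node merges them before discarding the first from_.
    merged = sorted((s for hits in per_shard for s in hits), reverse=True)
    return merged[from_:from_ + size], len(merged)

# Five shards with 1000 made-up, unique scores each.
shards = [[i * 5 + s for i in range(1000)] for s in range(5)]
first_page, first_cost = paged_search(shards, 0, 10)
deep_page, deep_cost = paged_search(shards, 990, 10)
print(first_cost, deep_cost)
```

Both calls return ten hits, but the second one forces the coordinating node to sort a hundred times as many candidates.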
Full Text Queries
• match - Basic term matching query
• multi_match - Match which spans multiple fields
• common_terms - Match query which gives preference to uncommon words
• query_string - Match documents using a search “mini-DSL”
• simple_query_string - A simpler version of query_string that never
throws exceptions, suitable for exposing to users
Term Queries
• term - Search for an exact value
• terms - Search for any of several exact values in a field
• range - Find documents where a value is in a certain range
• exists - Find documents that have any non-null value in a field
• missing - Inversion of `exists`
• prefix - Match terms that begin with a string
• wildcard - Match terms with a wildcard
• regexp - Match terms against a regular expression
• fuzzy - Match terms with configurable fuzziness
Compound Queries
• constant_score - Wraps a query in filter context, giving all results a constant score
• bool - Combines multiple leaf queries with `must`, `should`, `must_not` and `filter` clauses
• dis_max - Similar to bool, but creates a union of subquery results, scoring each document with the
maximum score of the query that produced it
• function_score - Modifies the scores of documents returned by a query. Useful for altering the
distribution of results based on recency, popularity, etc.
• boosting - Takes a `positive` and `negative` query, returning the results of `positive` while
reducing the scores of documents that also match `negative`
• filtered - Combines a query clause in query context with one in filter context
• limit - Perform the query over a limited number of documents in each shard
What are Mappings?
• Similar to schemas, they define the types of data found in fields
• Determine how individual fields are analyzed & stored
• Set the format of date fields
• Set rules for mapping dynamic fields
Mapping Types
• Indices have one or more mapping types which group documents
logically.
• Types contain meta fields, which can be used to customize
metadata like _index, _id, _type, and _source
• Types can also list fields that have consistent structure across types.
Data Types
• Scalar Values - string, long, double, boolean
• Special Scalars - date, ip
• Structural Types - object, nested
• Special Types - geo_shape, geo_point, completion
• Compound Types - string arrays, nested objects
Dynamic vs Explicit Mapping
• Dynamic fields are not defined prior to indexing
• Elasticsearch selects the most likely type for dynamic fields, based
on configurable rules
• Explicit fields are defined exactly prior to indexing
• Types cannot accept data that is the wrong type for an explicit
mapping
Shared Fields
• Fields that are defined in multiple mapping types must be identical
if:
• They have the same name
• Live in the same index
• Map to the same field internally
Examining Mappings
curl -X GET "http://localhost:9200/stack_overflow/post/_mapping"
Dynamic Mappings
• Mappings are generated when a type is created, if no mapping
was previously specified.
• Elasticsearch is good at identifying fields much of the time, but it’s
far from perfect!
• Fields can contain basic data-types, but importantly, mappings
optimize a field for either structured (exact) or full-text searching
Structured Data vs Full Text
• Exact values contain exact strings which are not subject to natural
language interpretation.
• Full-text values must be interpreted in the context of natural
language
Exact Value
• “samantha@tembies.com” is an email address in all contexts
Natural Language
• “us” can be interpreted differently in natural language
• Abbreviation for “United States”
• The English dative personal pronoun
• An alternative symbol for µs
• The French word us
Analyzing Text
• Elasticsearch is optimized for full text search
• Text is analyzed in a two-step process
• First, text is tokenized into individual terms
• Second, terms are normalized through a filter
Analyzers
• Analyzers perform the analysis process
• Character filters clean up text, removing or modifying the text
• Tokenizers break the text down into terms
• Token filters modify, remove, or add terms
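The three stages above can be sketched as a chain of plain Python functions. This is a rough imitation for illustration only: the function names and the stopword list are made up, and real analyzers follow Unicode text-segmentation rules rather than a simple regular expression.

```python
import re

def html_strip(text):
    """Character filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", text)

def word_tokenizer(text):
    """Tokenizer: split the text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def lowercase(tokens):
    """Token filter: normalise case."""
    return [t.lower() for t in tokens]

def stop(tokens, stopwords=frozenset({"the", "a", "with"})):
    """Token filter: remove common words (hypothetical stopword list)."""
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    """Run character filter, tokenizer, then token filters, in that order."""
    return stop(lowercase(word_tokenizer(html_strip(text))))

print(analyze("<p>Reverse text with strrev($text)!</p>"))
```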
Standard Analyzer
• General purpose analyzer that works for most natural language.
• Splits text on word boundaries, removes punctuation, and
lowercases all tokens.
Standard Analyzer
curl -X GET 'http://localhost:9200/_analyze?analyzer=standard&text=Reverse+text+with+strrev($text)!'
Whitespace Analyzer
• Analyzer that splits on whitespace only; it does not lowercase tokens
Whitespace Analyzer
curl -X GET 'http://localhost:9200/_analyze?analyzer=whitespace&text=Reverse+text+with+strrev($text)!'
Keyword Analyzer
• Tokenizes the entire text as a single string.
• Used for things that should be kept whole, like ID numbers, postal
codes, etc.
Keyword Analyzer
curl -X GET 'http://localhost:9200/_analyze?analyzer=keyword&text=Reverse+text+with+strrev($text)!'
Language Analyzers
• Analyzers optimized for specific natural languages.
• Reduce tokens to stems (jumper, jumped → jump)
Language Analyzers
curl -X GET 'http://localhost:9200/_analyze?analyzer=english&text=Reverse+text+with+strrev($text)!'
Analyzers
• Analyzers are applied when documents are indexed
• Analyzers are applied when a full-text search is performed against
a field, in order to produce the correct set of terms to search for
Character Filters
• html_strip - Removes HTML from text
• mapping - Filter based on a map of original → new ({ "ph": "f" })
• pattern_replace - Similar to mapping, using regular expressions
Index Templates
• Template mappings that are applied to newly created indices
• Templates also contain index configuration information
• Powerful when combined with dated indices
Scoring
• Scoring is based on a boolean model and a scoring function
• The boolean model applies AND/OR logic to an inverted index to
produce a list of matching documents
Term Frequency
• Terms that appear frequently in a document increase the
document’s relevancy score.
• term_frequency(term in document) = √number_of_appearances
Inverse Document Frequency
• Terms that appear in many documents reduce a document’s
relevancy score
• inverse_doc_frequency(term) = 1 + log(number_of_docs /
(frequency + 1))
Field Length Normalization
• Terms that appear in shorter fields increase the relevancy of a
document.
• norm(document) = 1 / √number_of_terms
Example from the Docs
• Given the text “quick brown fox” the term “fox” scores…
• Term Frequency: 1.0
• Inverse Doc Frequency: 0.30685282
• Field Norm: 0.5
• Score: 0.15342641
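The numbers in this example can be reproduced from the three formulas above. The sketch assumes an index holding a single document that contains “fox”, which is what yields the IDF figure shown. The field norm of 0.5 is taken from the slide as given; 1/√3 is about 0.577, and the lower value appears to come from Lucene storing norms at reduced precision.

```python
import math

def term_frequency(appearances):
    """Square root of how often the term occurs in the field."""
    return math.sqrt(appearances)

def inverse_doc_frequency(number_of_docs, docs_with_term):
    """1 + natural log of total docs over (docs containing the term + 1)."""
    return 1 + math.log(number_of_docs / (docs_with_term + 1))

tf = term_frequency(1)             # "fox" appears once
idf = inverse_doc_frequency(1, 1)  # assumed: one document, and it contains "fox"
field_norm = 0.5                   # taken from the slide, not computed here
score = tf * idf * field_norm
print(round(idf, 8), round(score, 8))
```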
Basic Relevancy
{
"size": 100,
"query": {
"filtered": {
"query": {
"match": {
"contents": "miley cyrus"
}
},
"filter": {
"and": [ { "terms": { "site_id": [ 698 ] } } ]
}
}
}
}
Non-Preferenced Result Recency
Recency-Adjusted Query
{
"query": {
"function_score": {
"functions": [
{
"gauss": {
"published": {
"origin": "now",
"scale": "10d",
"offset": "1d",
"decay": 0.3
}
}
}
],
"query": {
"filtered": {
"query": { "match": { "contents": "miley cyrus" } },
"filter": { "and": [ { "terms": { "site_id": [ 698 ] } } ] }
}
}
}
}
}
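To get a feel for what the `gauss` function does to a result, the decay curve can be computed by hand. The sketch follows the Gaussian decay formula from the Elasticsearch documentation, with ages expressed in days and the parameter values copied from the query above. The `gauss_decay` function itself is illustrative, and it assumes the default behaviour in which the query score is multiplied by the function's result.

```python
import math

def gauss_decay(age_days, scale=10.0, offset=1.0, decay=0.3):
    """Multiplier applied to a document's score, given its age in days."""
    # No penalty inside the offset window around the origin ("now").
    distance = max(0.0, abs(age_days) - offset)
    # Chosen so that the multiplier equals `decay` at offset + scale.
    sigma_squared = -(scale ** 2) / (2.0 * math.log(decay))
    return math.exp(-(distance ** 2) / (2.0 * sigma_squared))

for age in (0, 1, 11, 30):
    print(age, round(gauss_decay(age), 4))
```

Documents up to one day old keep their full score, an 11-day-old document is scaled to 0.3 of its score, and older content fades quickly after that.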
Preferenced Result Recency
Aggregations & Analytics
Importing Energy Data
curl -X PUT "http://localhost:9200/energy_use" --data-binary "@queries/mapping_energy.json"
curl -X PUT "http://localhost:9200/_bulk" --data-binary "@queries/bulk_insert_energy_data.json"
curl -X GET "http://localhost:9200/energy_use/_search"
Average Energy Use
curl -X POST "http://localhost:9200/energy_use/_search" -d '{
"size": 0,
"aggs": {
"average_laundry_use": {
"avg": {
"field": "laundry"
}
},
"average_kitchen_use": {
"avg": {
"field": "kitchen"
}
},
"average_heater_use": {
"avg": {
"field": "heater"
}
},
"average_other_use": {
"avg": {
"field": "other"
}
}
}
}'
Multiple Aggregations
curl -X POST "http://localhost:9200/energy_use/_search" -d '{
"size": 0,
"aggs": {
"average_laundry_use": { "avg": { "field": "laundry" } },
"min_laundry_use": { "min": { "field": "laundry"} },
"max_laundry_use": { "max": { "field": "laundry"} }
}
}'
Nesting Aggregations
curl -X POST "http://localhost:9200/energy_use/_search" -d '{
"size": 0,
"aggs": {
"by_date": {
"terms": { "field": "date" },
"aggs": {
"average_laundry_use": { "avg": { "field": "laundry" } },
"min_laundry_use": { "min": { "field": "laundry"} },
"max_laundry_use": { "max": { "field": "laundry"} }
}
}
}
}'
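Conceptually, a nested aggregation first sorts documents into buckets and then computes metrics inside each bucket. The following Python sketch mimics the request above on a few made-up readings; the `by_date_stats` function and the sample values are hypothetical, and the real data set lives in the tutorial repository.

```python
from collections import defaultdict

def by_date_stats(readings, field="laundry"):
    """Group readings by date (the bucket), then compute metrics per bucket."""
    buckets = defaultdict(list)
    for reading in readings:
        buckets[reading["date"]].append(reading[field])
    return {
        date: {
            "doc_count": len(values),
            "average": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }
        for date, values in buckets.items()
    }

# Made-up readings for illustration.
readings = [
    {"date": "2016-01-01", "laundry": 2.0},
    {"date": "2016-01-01", "laundry": 4.0},
    {"date": "2016-01-02", "laundry": 1.0},
]
stats = by_date_stats(readings)
print(stats["2016-01-01"])
```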
Stats/Extended Stats
curl -X POST "http://localhost:9200/energy_use/_search" -d '{
"size": 0,
"aggs": {
"by_date": {
"terms": { "field": "date" },
"aggs": {
"laundry_stats": { "extended_stats": { "field": "laundry" } }
}
}
}
}'
Bucket Aggregations
• Date Histogram
• Term/Terms
• Geo*
• Significant Terms
Questions?
Use Cases?
Exploration Ideas?
https://joind.in/talk/e2e4b

Data science at the command line
Ā 
GraphConnect 2014 SF: From Zero to Graph in 120: Scale
GraphConnect 2014 SF: From Zero to Graph in 120: ScaleGraphConnect 2014 SF: From Zero to Graph in 120: Scale
GraphConnect 2014 SF: From Zero to Graph in 120: Scale
Ā 
How I Learned to Stop Worrying and Love the Cloud - Wesley Beary, Engine Yard
How I Learned to Stop Worrying and Love the Cloud - Wesley Beary, Engine YardHow I Learned to Stop Worrying and Love the Cloud - Wesley Beary, Engine Yard
How I Learned to Stop Worrying and Love the Cloud - Wesley Beary, Engine Yard
Ā 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014
Ā 
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Ā 
Service Discovery using etcd, Consul and Kubernetes
Service Discovery using etcd, Consul and KubernetesService Discovery using etcd, Consul and Kubernetes
Service Discovery using etcd, Consul and Kubernetes
Ā 
TIAD : Automating the modern datacenter
TIAD : Automating the modern datacenterTIAD : Automating the modern datacenter
TIAD : Automating the modern datacenter
Ā 
ACM BPM and elasticsearch AMIS25
ACM BPM and elasticsearch AMIS25ACM BPM and elasticsearch AMIS25
ACM BPM and elasticsearch AMIS25
Ā 
Using Document Databases with TYPO3 Flow
Using Document Databases with TYPO3 FlowUsing Document Databases with TYPO3 Flow
Using Document Databases with TYPO3 Flow
Ā 
Deploy Rails Application by Capistrano
Deploy Rails Application by CapistranoDeploy Rails Application by Capistrano
Deploy Rails Application by Capistrano
Ā 
Can we run the Whole Web on Apache Sling?
Can we run the Whole Web on Apache Sling?Can we run the Whole Web on Apache Sling?
Can we run the Whole Web on Apache Sling?
Ā 

More from Samantha QuiƱones

Drinking from the Firehose - Real-time Metrics
Drinking from the Firehose - Real-time MetricsDrinking from the Firehose - Real-time Metrics
Drinking from the Firehose - Real-time MetricsSamantha QuiƱones
Ā 
Supercharging Content Delivery with Varnish
Supercharging Content Delivery with VarnishSupercharging Content Delivery with Varnish
Supercharging Content Delivery with VarnishSamantha QuiƱones
Ā 
TDD: Team-Driven Development
TDD: Team-Driven DevelopmentTDD: Team-Driven Development
TDD: Team-Driven DevelopmentSamantha QuiƱones
Ā 

More from Samantha QuiƱones (6)

Hacking The Human Interface
Hacking The Human InterfaceHacking The Human Interface
Hacking The Human Interface
Ā 
Conference Speaking 101
Conference Speaking 101Conference Speaking 101
Conference Speaking 101
Ā 
Drinking from the Firehose - Real-time Metrics
Drinking from the Firehose - Real-time MetricsDrinking from the Firehose - Real-time Metrics
Drinking from the Firehose - Real-time Metrics
Ā 
Supercharging Content Delivery with Varnish
Supercharging Content Delivery with VarnishSupercharging Content Delivery with Varnish
Supercharging Content Delivery with Varnish
Ā 
Demystifying the REST API
Demystifying the REST APIDemystifying the REST API
Demystifying the REST API
Ā 
TDD: Team-Driven Development
TDD: Team-Driven DevelopmentTDD: Team-Driven Development
TDD: Team-Driven Development
Ā 

Recently uploaded

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
Ā 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
Ā 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
Ā 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
Ā 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
Ā 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
Ā 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
Ā 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
Ā 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
Ā 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
Ā 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
Ā 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
Ā 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
Ā 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
Ā 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
Ā 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
Ā 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
Ā 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
Ā 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
Ā 

Recently uploaded (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Ā 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
Ā 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
Ā 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
Ā 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
Ā 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Ā 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
Ā 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Ā 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
Ā 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
Ā 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
Ā 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Ā 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
Ā 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
Ā 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Ā 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Ā 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
Ā 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Ā 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Ā 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
Ā 

Managing Your Content with Elasticsearch

  • 1. Managing Your Content With Elasticsearch Samantha Quiñones / @ieatkillerbees
  • 2. About Me • Software Engineer & Data Nerd since 1997 • Doing “media stuff” since 2012 • Principal @ AOL since 2014 • @ieatkillerbees • http://samanthaquinones.com
  • 3. What We’ll Cover • Intro to Elasticsearch • CRUD • Creating Mappings • Analyzers • Basic Querying & Searching • Scoring & Relevance • Aggregations Basics
  • 4. But First… • Download - https://www.elastic.co/downloads/elasticsearch • Clone - https://github.com/squinones/elasticsearch-tutorial.git
  • 5. What is Elasticsearch? • Near real-time (documents are available for search quickly after being indexed) search engine powered by Lucene • Clustered for H/A and performance via federation with shards and replicas
  • 6. What’s it Used For? • Logging (we use Elasticsearch to centralize traffic logs, exception logs, and audit logs) • Content management and search • Statistical analysis
  • 7. Installing Elasticsearch $ curl -L -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.1.1/elasticsearch-2.1.1.tar.gz $ tar -zxvf elasticsearch* $ cd elasticsearch-2.1.1/bin $ ./elasticsearch
  • 8. Connecting to Elasticsearch • Via Java, there are two native clients which connect to an ES cluster on port 9300 • Most commonly, we access Elasticsearch via the HTTP API
  • 9. HTTP API curl -X GET "http://localhost:9200/?pretty"
  • 10. Data Format • Elasticsearch is a document-oriented database • All operations are performed against documents (object graphs expressed as JSON)
  • 11. Analogues
      Elasticsearch  MySQL     MongoDB
      Index          Database  Database
      Type           Table     Collection
      Document       Row       Document
      Field          Column    Field
  • 12. Index Madness • Index is an overloaded term. • As a verb, to index a document is to store a document in an index. This is analogous to an SQL INSERT operation. • As a noun, an index is a collection of documents. • Fields within a document have inverted indexes, similar to how a column in an SQL table may have an index.
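
To make the last sense of "index" concrete, here is a minimal sketch of an inverted index in plain Python. It is only an illustration of the idea: the sample titles are made up, and the whitespace tokenizer is an assumption that is far simpler than what Lucene actually does.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased token to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}

# Hypothetical titles standing in for a "title" field.
docs = {
    1: "PHP for loop",
    2: "Python for loop",
    3: "PHP array sort",
}
index = build_inverted_index(docs)

print(index["php"])   # [1, 3]
print(index["loop"])  # [1, 2]
```

Looking up a term is then a single dictionary access, which is why searching by term is cheap once the document has been indexed.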
  • 13. Indexing Our First Document curl -X PUT "http://localhost:9200/test_document/test/1" -d '{ "name": "test_name" }'
  • 14. Retrieving Our First Document curl -X GET "http://localhost:9200/test_document/test/1"
  • 15. Let’s Look at Some Stack Overflow Posts! $ vi queries/bulk_insert_so_data.json
  • 16. Bulk Insert curl -X PUT "http://localhost:9200/_bulk" --data-binary "@queries/bulk_insert_so_data.json"
  • 17. First Search curl -X GET "http://localhost:9200/stack_overflow/_search"
  • 18. Query String Searches curl -X GET "http://localhost:9200/stack_overflow/_search?q=title:php"
  • 19. Query DSL curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title" : "php" } } }'
  • 20. Compound Queries curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "filtered": { "query" : { "match" : { "title" : "(php OR python) AND (flask OR laravel)" } }, "filter": { "range": { "score": { "gt": 3 } } } } } }'
  • 21. Full-Text Searching curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title" : "php loop" } } }'
  • 22. Relevancy • When searching (in query context), results are scored by a relevancy algorithm • Results are presented in order from highest to lowest score
  • 23. Phrase Searching curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title": { "query": "for loop", "type": "phrase" } } } }'
  • 24. Highlighting Searches curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query" : { "match" : { "title": { "query": "for loop", "type": "phrase" } } }, "highlight": { "fields" : { "title" : {} } } }'
  • 25. Aggregations • Run statistical operations over your data • Also near real-time! • Complex aggregations are abstracted away behind simple interfaces; you don’t need to be a statistician
  • 26. Analyzing Tags curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "size": 0, "aggs": { "all_tags": { "terms": { "field": "tags", "size": 0 } } } }'
  • 27. Nesting Aggregations curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "size": 0, "aggs": { "all_tags": { "terms": { "field": "tags", "size": 0 }, "aggs": { "avg_score": { "avg": { "field": "score"} } } } } }'
  • 29. Under the Hood • Elasticsearch is designed from the ground up to run in a distributed fashion. • Indices (collections of documents) are partitioned into shards. • Shards can be stored on a single node or across multiple nodes. • Shards are balanced across the cluster to improve performance • Shards are replicated for redundancy and high availability
  • 30. What is a Cluster? • One or more nodes (servers) that work together to… • serve a dataset that exceeds the capacity of a single server… • provide federated indexing (writes) and searching (reads)… • provide H/A through sharing and replication of data
  • 31. What are Nodes? • Individual servers within a cluster • Can provide indexing and searching capabilities
  • 32. What is an Index? • An index is logically a collection of documents, roughly analogous to a database in MySQL • An index is in reality a namespace that points to one or more physical shards which contain data • When indexing a document, if the specified index does not exist, it will be created automatically
  • 33. What are Shards? • Low-level units that hold a slice of available data • A shard represents a single instance of Lucene and is a fully functional, self-contained search engine • Shards are either primaries or replicas and are assigned to nodes
  • 34. What is Replication? • Shards can have replicas • Replicas primarily provide redundancy for when shards/nodes fail • A replica should not be allocated on the same node as the shard it replicates
  • 35. Default Topology • 5 primary shards per index • 1 replica per shard
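
The arithmetic behind the default topology, and the rule that a replica should not share a node with its primary, can be sketched as follows. This is a toy round-robin placement written for illustration only; the function names and node names are hypothetical, and it is not how Elasticsearch's own allocator decides where shards go.

```python
def total_shards(primaries, replicas_per_primary):
    """Every primary exists once, plus one copy per configured replica."""
    return primaries * (1 + replicas_per_primary)

def allocate(primaries, nodes):
    """Toy placement: primary i goes on node i, its single replica on the next node."""
    if len(nodes) < 2:
        raise ValueError("a replica needs a different node than its primary")
    layout = {node: [] for node in nodes}
    for i in range(primaries):
        layout[nodes[i % len(nodes)]].append(f"P{i + 1}")
        layout[nodes[(i + 1) % len(nodes)]].append(f"R{i + 1}")
    return layout

# Default topology from the slide: 5 primaries, 1 replica each.
print(total_shards(5, 1))
for node, shards in allocate(5, ["node_a", "node_b"]).items():
    print(node, shards)
```

With two nodes, each node ends up holding a full copy of the index, so either node can fail without data loss.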
  • 36. Clustering & Replication (diagram: primary shards P1-P5 and replica shards R1-R5 distributed across two nodes)
  • 37. Cluster Health curl -X GET "http://localhost:9200/_cluster/health" curl -X GET "http://localhost:9200/_cat/health?v"
  • 38. _cat API • Display human-readable information about parts of the ES system • Provides some limited documentation of functions
  • 39. aliases
      > $ http GET ':9200/_cat/aliases?v'
      alias         index                filter  routing.index  routing.search
      posts         posts_561729df8ce4e  *       -              -
      posts.public  posts_561729df8ce4e  *       -              -
      posts.write   posts_561729df8ce4e  -       -              -
      Display all configured aliases
  • 40. allocation
      > $ http GET ':9200/_cat/allocation?v'
      shards  disk.used  disk.avail  disk.total  disk.percent  host
      33      2.6gb      21.8gb      24.4gb      10            host1
      33      3gb        21.4gb      24.4gb      12            host2
      34      2.6gb      21.8gb      24.4gb      10            host3
      Show how many shards are allocated per node, with disk utilization info
  • 41. count
      > $ http GET ':9200/_cat/count?v'
      epoch       timestamp  count
      1453790185  06:36:25   182763
      > $ http GET ':9200/_cat/count/posts?v'
      epoch       timestamp  count
      1453790467  06:41:07   164169
      > $ http GET ':9200/_cat/count/posts.public?v'
      epoch       timestamp  count
      1453790472  06:41:12   164169
      Display a count of documents in the cluster, or a specific index
  • 42. fielddata
      > $ http -b GET ':9200/_cat/fielddata?v'
      id                      host   ip             node   total  site_id  published
      7tjeJNY3TMajqRkmYsJyrA  host1  10.97.183.146  node1  1.1mb  170.1kb  996.5kb
      __xrpsKAQW6yyCY8luLQdQ  host2  10.97.180.138  node2  1.6mb  329.3kb  1.3mb
      bdoNNXHXRryj22YqjnqECw  host3  10.97.181.190  node3  1.1mb  154.7kb  991.7kb
      Shows how much memory is allocated to fielddata (metadata used for sorts)
  • 43. health
      > $ http -b GET ':9200/_cat/health?v'
      epoch       timestamp  cluster               status  node.total  node.data  shards  pri  relo  init  unassign  pending_tasks
      1453829723  17:35:23   ampehes_prod_cluster  green   3           3          100     50   0     0     0         0
  • 44. indices
      > $ http -b GET 'eventhandler-prod.elasticsearch.amppublish.aws.aol.com:9200/_cat/indices?v'
      health  status  index                pri  rep  docs.count  docs.deleted  store.size  pri.store.size
      green   open    posts_561729df8ce4e  5    1    468629      20905         4gb         2gb
      green   open    slideshows           5    1    3893        6             86mb        43mb
  • 45. master
      > $ http -b GET ':9200/_cat/master?v'
      id                      host   ip             node
      7tjeJNY3TMajqRkmYsJyrA  host1  10.97.183.146  node1
  • 46. nodes
      > $ http -b GET ':9200/_cat/nodes?v'
      host       ip         heap.percent  ram.percent  load  node.role  master  name
      127.0.0.1  127.0.0.1  50            100          2.47  d          *       Mentus
  • 47. pending tasks
      % curl 'localhost:9200/_cat/pending_tasks?v'
      insertOrder  timeInQueue  priority  source
      1685         855ms        HIGH      update-mapping [foo][t]
      1686         843ms        HIGH      update-mapping [foo][t]
      1693         753ms        HIGH      refresh-mapping [foo][[t]]
      1688         816ms        HIGH      update-mapping [foo][t]
      1689         802ms        HIGH      update-mapping [foo][t]
      1690         787ms        HIGH      update-mapping [foo][t]
      1691         773ms        HIGH      update-mapping [foo][t]
  • 48. shards
      > $ http -b GET ':9200/_cat/shards?v'
      index                shard  prirep  state    docs   store    ip             node
      posts_561729df8ce4e  2      r       STARTED  94019  410.5mb  10.97.180.138  host1
      posts_561729df8ce4e  2      p       STARTED  94019  412.7mb  10.97.181.190  host2
      posts_561729df8ce4e  0      p       STARTED  93307  413.6mb  10.97.183.146  host3
      posts_561729df8ce4e  0      r       STARTED  93307  415mb    10.97.180.138  host1
      posts_561729df8ce4e  3      p       STARTED  94182  407.1mb  10.97.183.146  host2
      posts_561729df8ce4e  3      r       STARTED  94182  403.4mb  10.97.180.138  host1
      posts_561729df8ce4e  1      r       STARTED  94130  447.1mb  10.97.180.138  host1
      posts_561729df8ce4e  1      p       STARTED  94130  447mb    10.97.181.190  host2
      posts_561729df8ce4e  4      r       STARTED  93299  421.5mb  10.97.183.146  host3
      posts_561729df8ce4e  4      p       STARTED  93299  398.8mb  10.97.181.190  host2
  • 49. segments
      > $ http -b GET ':9200/_cat/segments?v'
      index                shard  prirep  ip             segment  generation  docs.count  docs.deleted  size     size.memory  committed  searchable  version  compound
      posts_561726fecd9c6  0      p       10.97.183.146  _a       10          24          0             227.7kb  69554        true       true        4.10.4   true
      posts_561726fecd9c6  0      p       10.97.183.146  _b       11          108         0             659.1kb  103242       true       true        4.10.4   false
      posts_561726fecd9c6  0      p       10.97.183.146  _c       12          7           0             90.7kb   54706        true       true        4.10.4   true
      posts_561726fecd9c6  0      p       10.97.183.146  _d       13          6           0             82.2kb   49706        true       true        4.10.4   true
      posts_561726fecd9c6  0      p       10.97.183.146  _e       14          8           0             119kb    67162        true       true        4.10.4   true
      posts_561726fecd9c6  0      p       10.97.183.146  _f       15          1           0             35.9kb   32122        true       true        4.10.4   true
      posts_561726fecd9c6  0      r       10.97.180.138  _a       10          24          0             227.7kb  69554        true       true        4.10.4   true
      posts_561726fecd9c6  0      r       10.97.180.138  _b       11          108         0             659.1kb  103242       true       true        4.10.4   false
  • 51. Document Model • Documents represent objects • By default, all fields in all documents are analyzed and indexed
  • 52. Metadata • _index - The index in which a document resides • _type - The class of object that a document represents • _id - The document’s unique identifier. Auto-generated when not provided
  • 53. Retrieving Documents curl -X GET "http://localhost:9200/test_document/test/1" curl -X HEAD "http://localhost:9200/test_document/test/1" curl -X HEAD "http://localhost:9200/test_document/test/2"
  • 54. Updating Documents curl -X PUT "http://localhost:9200/test_document/test/1" -d '{ "name": "test_name", "conference": "php benelux" }' curl -X GET "http://localhost:9200/test_document/test/1"
  • 55. Explicit Creates curl -X PUT "http://localhost:9200/test_document/test/1/_create" -d '{ "name": "test_name", "conference": "php benelux" }'
  • 56. Auto-Generated IDs curl -X POST "http://localhost:9200/test_document/test" -d '{ "name": "test_name", "conference": "php benelux" }'
  • 57. Deleting Documents curl -X DELETE "http://localhost:9200/test_document/test/1"
  • 58. Bulk API • Perform many operations in a single request • Efficient batching of actions • Bulk queries take the form of a stream of single-line JSON objects that define actions and document bodies
  • 59. Bulk Actions • create - Index a document if and only if it doesn’t exist already • index - Index a document, replacing it if it exists • update - Apply a partial update to a document • delete - Delete a document
  • 60. Bulk API Format { action: { metadata }}\n { request body }\n { action: { metadata }}\n { request body }\n
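
A short sketch of how a request body in that newline-delimited format might be assembled. The helper name `bulk_body` and the tuple layout are assumptions made for this example; only the line format itself comes from the slide. The body ends with a newline, which the bulk endpoint expects, and a `delete` action carries no request body line.

```python
import json

def bulk_body(actions):
    """Serialize (action, metadata, document-or-None) tuples into bulk format.

    Each action becomes one single-line JSON object, followed by a second
    single-line JSON object when the action carries a document.
    """
    lines = []
    for action, metadata, doc in actions:
        lines.append(json.dumps({action: metadata}))
        if doc is not None:
            lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# Index names and IDs reuse the test_document examples from earlier slides.
body = bulk_body([
    ("index", {"_index": "test_document", "_type": "test", "_id": "1"},
     {"name": "test_name"}),
    ("delete", {"_index": "test_document", "_type": "test", "_id": "2"}, None),
])
print(body, end="")
```

The resulting string could be written to a file and sent with `--data-binary`, as in the Bulk Insert slide.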
  • 61. Sizing Bulk Requests • Balance quantity of documents with size of documents • The docs put the sweet spot at 5-15 MB per request • AOL Analytics Cluster indexes 5000 documents per batch (approx. 7 MB)
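
One way to keep bulk requests inside a size budget is to group pre-serialized actions by their encoded byte length. The sketch below is an assumption about how such batching could be done on the client side, not a description of any particular client library; the tiny 50-byte budget is only there to keep the example readable.

```python
def batch_by_size(chunks, max_bytes):
    """Group serialized bulk actions into request bodies of at most max_bytes.

    Each chunk is one complete action (action line plus optional body line,
    newline-terminated). A single chunk larger than max_bytes is sent alone.
    """
    batch, size = [], 0
    for chunk in chunks:
        chunk_size = len(chunk.encode("utf-8"))
        if batch and size + chunk_size > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(chunk)
        size += chunk_size
    if batch:
        yield "".join(batch)

# Ten hypothetical 21-byte actions, batched under a 50-byte budget.
chunks = ['{"index":{}}\n{"n":%d}\n' % i for i in range(10)]
requests = list(batch_by_size(chunks, max_bytes=50))
print(len(requests), [len(r.encode("utf-8")) for r in requests])
```

In practice the budget would be set in megabytes, in the range the slide mentions, and tuned against the cluster's observed indexing throughput.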
  • 62. Searching Documents • Structured queries - queries against concrete fields like “title” or “score” which return specific documents. • Full-text queries - queries that find documents which match a search query and return them sorted by relevance
  • 63. Search Elements • Mappings - Define how data in fields is interpreted • Analysis - How text is parsed and processed to make it searchable • Query DSL - Elasticsearch’s query language
  • 64. About Queries • Leaf Queries - Search for a value in a given field. These queries are standalone. Examples: match, range, term • Compound Queries - Combinations of leaf queries and other compound queries which either combine operations logically (e.g. bool queries) or alter their behavior (e.g. score queries)
  • 65. Empty Search curl -X GET "http://localhost:9200/stack_overflow/_search" curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "query": { "match_all": {} } }'
  • 66. Timing Out Searches curl -X GET "http://localhost:9200/stack_overflow/_search?timeout=1s" curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "timeout": "1s", "query": { "match_all": {} } }'
  • 67. Multi-Index/Type Searches curl -X GET "http://localhost:9200/test_document,stack_overflow/_search"
  • 68. Multi-Index Use Cases • Dated indices for logging • Roll-off indices for content-aging • Analytic roll-ups
  • 69. Pagination curl -X GET "http://localhost:9200/stack_overflow/_search?size=5&from=5" curl -X POST "http://localhost:9200/stack_overflow/_search" -d '{ "size": 5, "from": 5, "query": { "match_all": {} } }'
  • 70. Pagination Concerns • Since searches are distributed across multiple shards, paged queries must be sorted at each shard, combined, and re-sorted • The cost of paging in distributed data sets grows rapidly with page depth, because every shard must return its top from + size hits • It is a wise practice to limit how many pages of results can be returned
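
The sort-combine-re-sort step can be simulated in a few lines. This is a simplified model with made-up scores, not Elasticsearch's actual search phases: each shard returns its top `from + size` hits, and the coordinating step merges them and slices out the requested page.

```python
import heapq

def shard_top_hits(scores, from_, size):
    """A shard cannot know which of its hits land on the global page,
    so it returns its own top (from_ + size) scores."""
    return heapq.nlargest(from_ + size, scores)

def search_page(shards, from_, size):
    """Merge per-shard candidates, re-sort, and slice out the requested page.

    Returns the page and the number of candidates that had to be collected.
    """
    candidates = []
    for scores in shards:
        candidates.extend(shard_top_hits(scores, from_, size))
    candidates.sort(reverse=True)
    return candidates[from_:from_ + size], len(candidates)

# Five hypothetical shards with 100 distinct scores each (0.0 .. 499.0 overall).
shards = [[float(s + 5 * k) for k in range(100)] for s in range(5)]

page, collected = search_page(shards, from_=10, size=5)
print(page, collected)
```

Here a 5-hit page at offset 10 requires collecting 75 candidates; the same page at offset 1000 on larger shards would require 5 × 1005, which is why deep pages are usually capped.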
  • 71. Full Text Queries • match - Basic term matching query • multi_match - Match which spans multiple fields • common_terms - Match query which gives preference to uncommon words • query_string - Match documents using a search “mini-DSL” • simple_query_string - A simpler version of query_string that never throws exceptions, suitable for exposing to users
  • 72. Term Queries • term - Search for an exact value • terms - Search for any of several exact values in a field • range - Find documents where a value is in a certain range • exists - Find documents that have any non-null value in a field • missing - Inversion of `exists` • prefix - Match terms that begin with a string • wildcard - Match terms with a wildcard • regexp - Match terms against a regular expression • fuzzy - Match terms with configurable fuzziness
  • 73. Compound Queries • constant_score - Wraps a query in filter context, giving all results a constant score • bool - Combines multiple leaf queries with `must`, `should`, `must_not` and `filter` clauses • dis_max - Similar to bool, but creates a union of subquery results, scoring each document with the maximum score of the query that produced it • function_score - Modifies the scores of documents returned by a query. Useful for altering the distribution of results based on recency, popularity, etc. • boosting - Takes a `positive` and `negative` query, returning the results of `positive` while reducing the scores of documents that also match `negative` • filtered - Combines a query clause in query context with one in filter context • limit - Perform the query over a limited number of documents in each shard
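
Since `filtered` and `bool` with a `filter` clause overlap, it may help to see the two request bodies side by side. The sketch below only builds JSON; the helper names are hypothetical, and the clauses are a simplified version of the Compound Queries example from slide 20. As far as I know, the 2.x line deprecates `filtered` in favor of the `bool` form, so treat the second shape as the one more likely to keep working.

```python
import json

def filtered_query(query, filter_clause):
    """Request body in the style of slide 20: a scored query plus an unscored filter."""
    return {"query": {"filtered": {"query": query, "filter": filter_clause}}}

def bool_query(must=(), filter_clauses=(), should=(), must_not=()):
    """Request body using a bool query; empty clause lists are left out."""
    clauses = {
        "must": list(must),
        "filter": list(filter_clauses),
        "should": list(should),
        "must_not": list(must_not),
    }
    return {"query": {"bool": {name: c for name, c in clauses.items() if c}}}

title_match = {"match": {"title": "php"}}      # scored, query context
score_range = {"range": {"score": {"gt": 3}}}  # unscored, filter context

old_style = filtered_query(title_match, score_range)
new_style = bool_query(must=[title_match], filter_clauses=[score_range])

print(json.dumps(new_style, indent=2))
```

Either dictionary can be serialized with `json.dumps` and passed as the `-d` payload of the `_search` requests shown earlier.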
  • 74. What are Mappings? • Similar to schemas, they define the types of data found in fields • Determines how individual fields are analyzed & stored • Sets the format of date fields • Sets rules for mapping dynamic fields
  • 75. Mapping Types • Indices have one or more mapping types which group documents logically. • Types contain meta fields, which can be used to customize metadata like _index, _id, _type, and _source • Types can also list fields that have consistent structure across types.
  • 76. Data Types • Scalar Values - string, long, double, boolean • Special Scalars - date, ip • Structural Types - object, nested • Special Types - geo_shape, geo_point, completion • Compound Types - string arrays, nested objects
  • 77. Dynamic vs Explicit Mapping • Dynamic fields are not defined prior to indexing • Elasticsearch selects the most likely type for dynamic fields, based on configurable rules • Explicit fields are defined exactly prior to indexing • Types cannot accept data that is the wrong type for an explicit mapping
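A rough sketch of what an explicit mapping might look like, written as a Python dictionary in the 2.x mapping syntax this deck targets. The `post` type name comes from the tutorial's `stack_overflow` index; the field names `title`, `tags`, and `created` are assumptions made up for illustration, not the tutorial's actual mapping.

```python
import json

# Hypothetical explicit mapping for a "post" type (Elasticsearch 2.x syntax).
explicit_mapping = {
    "mappings": {
        "post": {
            # "strict" rejects documents that contain fields not listed below,
            # instead of mapping them dynamically.
            "dynamic": "strict",
            "properties": {
                # Analyzed string: suited to full-text search.
                "title": {"type": "string", "analyzer": "english"},
                # Not analyzed: stored as one exact value per tag.
                "tags": {"type": "string", "index": "not_analyzed"},
                # Dates need a format so incoming values can be parsed.
                "created": {"type": "date", "format": "yyyy-MM-dd"},
            },
        }
    }
}

mapping_json = json.dumps(explicit_mapping, indent=2)
```

The JSON in `mapping_json` is the kind of body that would be sent when creating the index, before any documents are indexed.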
  • 78. Shared Fields • Fields that are defined in multiple mapping types must be identical if: • They have the same name • Live in the same index • Map to the same field internally
  • 79. Examining Mappings curl -X GET "http://localhost:9200/stack_overflow/post/_mapping"
  • 80. Dynamic Mappings • Mappings are generated when a type is created, if no mapping was previously specified. • Elasticsearch is good at identifying fields much of the time, but it’s far from perfect! • Fields can contain basic data-types, but importantly, mappings optimize a field for either structured (exact) or full-text searching
  • 81. Structured Data vs Full Text • Exact values contain exact strings which are not subject to natural language interpretation. • Full-text values must be interpreted in the context of natural language
  • 82. Exact Value • “samantha@tembies.com” is an email address in all contexts
  • 83. Natural Language • “us” can be interpreted differently in natural language • Abbreviation for “United States” • The English dative personal pronoun • An alternative symbol for µs • The French word us
  • 84. Analyzing Text • Elasticsearch is optimized for full text search • Text is analyzed in a two-step process • First, text is tokenized into individual terms • Second, terms are normalized through a filter
  • 85. Analyzers • Analyzers perform the analysis process • Character filters clean up text, removing or modifying the text • Tokenizers break the text down into terms • Token filters modify, remove, or add terms
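The three analyzer stages can be mimicked with a toy Python pipeline. This is only a sketch of the idea: the regular expressions below are crude stand-ins chosen for illustration and do not reproduce the rules of any real Elasticsearch character filter, tokenizer, or token filter.

```python
import re

def strip_html(text):
    """Character filter stand-in: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def word_tokenizer(text):
    """Tokenizer stand-in: split into runs of word characters."""
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    """Token filter stand-in: normalize every token to lower case."""
    return [token.lower() for token in tokens]

def analyze(text):
    """Run the stages in order: character filter, tokenizer, token filter."""
    return lowercase_filter(word_tokenizer(strip_html(text)))

terms = analyze("<b>Reverse</b> text with strrev($text)!")
```

Swapping any one stage for a different function changes the terms that end up in the index, which is the practical difference between the analyzers compared in the following slides.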
  • 86. Standard Analyzer • General purpose analyzer that works for most natural language. • Splits text on word boundaries, removes punctuation, and lowercases all tokens.
  • 87. Standard Analyzer curl -X GET 'http://localhost:9200/_analyze?analyzer=standard&text=Reverse+text+with+strrev($text)!'
  • 88. Whitespace Analyzer • Analyzer that splits on whitespace only; it does not lowercase tokens or remove punctuation
  • 89. Whitespace Analyzer curl -X GET 'http://localhost:9200/_analyze?analyzer=whitespace&text=Reverse+text+with+strrev($text)!'
  • 90. Keyword Analyzer • Tokenizes the entire text as a single string. • Used for things that should be kept whole, like ID numbers, postal codes, etc.
  • 91. Keyword Analyzer curl -X GET 'http://localhost:9200/_analyze?analyzer=keyword&text=Reverse+text+with+strrev($text)!'
  • 92. Language Analyzers • Analyzers optimized for specific natural languages. • Reduce tokens to stems (jumper, jumped → jump)
  • 93. Language Analyzers curl -X GET 'http://localhost:9200/_analyze?analyzer=english&text=Reverse+text+with+strrev($text)!'
  • 94. Analyzers • Analyzers are applied when documents are indexed • Analyzers are applied when a full-text search is performed against a field, in order to produce the correct set of terms to search for
  • 95. Character Filters • html_strip - Removes HTML from text • mapping - Filter based on a map of original → new ({ "ph": "f" }) • pattern_replace - Similar to mapping, using regular expressions
  • 96. Index Templates • Template mappings that are applied to newly created indices • Templates also contain index configuration information • Powerful when combined with dated indices
  • 97. Scoring • Scoring is based on a boolean model and scoring function • Boolean model applies AND/OR logic to an inverted index to produce a list of matching documents
  • 98. Term Frequency • Terms that appear frequently in a document increase the document’s relevancy score. • term_frequency(term in document) = √number_of_appearances
  • 99. Inverse Document Frequency • Terms that appear in many documents reduce a document’s relevancy score • inverse_doc_frequency(term) = 1 + log(number_of_docs / (frequency + 1))
  • 100. Field Length Normalization • Terms that appear in shorter fields increase the relevancy of a document. • norm(document) = 1 / √number_of_terms
  • 101. Example from the Docs • Given the text “quick brown fox” the term “fox” scores… • Term Frequency: 1.0 • Inverse Doc Frequency: 0.30685282 • Field Norm: 0.5 • Score: 0.15342641
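The numbers in this example can be reproduced from the three formulas on the preceding slides. The sketch below assumes an index that holds only the one document (so `number_of_docs` and the term's document frequency are both 1) and uses the natural logarithm. The field norm is taken from the slide as a given: the formula yields about 0.577 for a three-term field, and the 0.5 shown is, as far as I understand it, the result of Lucene storing norms at reduced precision.

```python
import math

def term_frequency(appearances):
    """Square root of how often the term appears in the field."""
    return math.sqrt(appearances)

def inverse_doc_frequency(number_of_docs, doc_frequency):
    """1 + ln(number_of_docs / (doc_frequency + 1))."""
    return 1 + math.log(number_of_docs / (doc_frequency + 1))

def field_norm(number_of_terms):
    """Inverse square root of the number of terms in the field."""
    return 1 / math.sqrt(number_of_terms)

tf = term_frequency(1)                      # "fox" appears once
idf = inverse_doc_frequency(1, 1)           # assumed: one document, which contains "fox"
exact_norm = field_norm(3)                  # "quick brown fox" has three terms
stored_norm = 0.5                           # value reported on the slide
score = tf * idf * stored_norm

print(f"tf={tf:.1f} idf={idf:.8f} norm={stored_norm} score={score:.8f}")
```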
  • 102. Basic Relevancy { "size": 100, "query": { "filtered": { "query": { "match": { "contents": "miley cyrus" } }, "filter": { "and": [ { "terms": { "site_id": [ 698 ] } } ] } } } }
  • 104. Recency-Adjusted Query { "query": { "function_score": { "functions": [ { "gauss": { "published": { "origin": "now", "scale": "10d", "offset": "1d", "decay": 0.3 } } } ], "query": { "filtered": { "query": { "match": { "contents": "miley cyrus" } }, "filter": { "and": [ { "terms": { "site_id": [ 698 ] } } ] } } } } } }
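To get a feel for what the `gauss` parameters in this query do, the decay curve can be evaluated by hand. The sketch below follows the Gaussian decay formula given in the Elasticsearch function_score documentation, with distances expressed in days; the function name and the sample ages are illustrative.

```python
import math

def gauss_decay(age_days, scale=10.0, offset=1.0, decay=0.3):
    """Multiplier for a document `age_days` away from the origin.

    Within `offset` the multiplier stays at 1.0; at `offset + scale`
    it has dropped to exactly `decay`.
    """
    sigma_squared = -(scale ** 2) / (2 * math.log(decay))
    effective_distance = max(0.0, abs(age_days) - offset)
    return math.exp(-(effective_distance ** 2) / (2 * sigma_squared))

for age in (0, 1, 6, 11, 30):
    print(f"{age:>2} days old -> multiplier {gauss_decay(age):.4f}")
```

With the values from the slide, a document published within the last day keeps its full score and one published 11 days ago keeps 30% of it, assuming the default behavior of multiplying the function result into the query score.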
  • 107. Importing Energy Data curl -X PUT "http://localhost:9200/energy_use" --data-binary "@queries/mapping_energy.json" curl -X PUT "http://localhost:9200/_bulk" --data-binary "@queries/bulk_insert_energy_data.json" curl -X GET "http://localhost:9200/energy_use/_search"
  • 108. Average Energy Use curl -X POST "http://localhost:9200/energy_use/_search" -d '{ "size": 0, "aggs": { "average_laundry_use": { "avg": { "field": "laundry" } }, "average_kitchen_use": { "avg": { "field": "kitchen" } }, "average_heater_use": { "avg": { "field": "heater" } }, "average_other_use": { "avg": { "field": "other" } } } }'
  • 109. Multiple Aggregations curl -X POST "http://localhost:9200/energy_use/_search" -d '{ "size": 0, "aggs": { "average_laundry_use": { "avg": { "field": "laundry" } }, "min_laundry_use": { "min": { "field": "laundry"} }, "max_laundry_use": { "max": { "field": "laundry"} } } }'
  • 110. Nesting Aggregations curl -X POST "http://localhost:9200/energy_use/_search" -d '{ "size": 0, "aggs": { "by_date": { "terms": { "field": "date" }, "aggs": { "average_laundry_use": { "avg": { "field": "laundry" } }, "min_laundry_use": { "min": { "field": "laundry"} }, "max_laundry_use": { "max": { "field": "laundry"} } } } } }'
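Conceptually, the nested aggregation buckets documents by `date` and then computes the metrics inside each bucket. A plain-Python equivalent, run over a few made-up readings rather than the tutorial's data set, might look like this:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical documents; only the field names match the tutorial index.
readings = [
    {"date": "2016-01-01", "laundry": 1.0},
    {"date": "2016-01-01", "laundry": 3.0},
    {"date": "2016-01-02", "laundry": 2.0},
]

def laundry_stats_by_date(docs):
    """Bucket step (like `terms` on date), then metric step (avg/min/max)."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc["date"]].append(doc["laundry"])
    return {
        date: {
            "doc_count": len(values),
            "average_laundry_use": mean(values),
            "min_laundry_use": min(values),
            "max_laundry_use": max(values),
        }
        for date, values in buckets.items()
    }

stats = laundry_stats_by_date(readings)
```

The difference in practice is that Elasticsearch performs both steps on each shard and merges the partial results, so the documents themselves never have to leave the cluster.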
  • 111. Stats/Extended Stats curl -X POST "http://localhost:9200/energy_use/_search" -d '{ "size": 0, "aggs": { "by_date": { "terms": { "field": "date" }, "aggs": { "laundry_stats": { "extended_stats": { "field": "laundry" } } } } } }'
  • 112. Bucket Aggregations • Date Histogram • Term/Terms • Geo* • Significant Terms