2. I’ve done things
Used Elasticsearch since v.0.18 (2011)
Been on-call for production systems using Elasticsearch since 2013
Paired it with (mostly) Python, also Ruby and Javascript
Used it as the sole place to hold data
Also used it in a more usual way - paired with a database
3. Elasticsearch is
a really fast and easily scalable
Open source
Distributed
RESTful
Search and Analytics
Engine
Part of an ecosystem of tools for analytics
(massage, store and graph data)
12. Search
through
Natural
Language
~30 minutes to prototype
Ingredients
The text you want to search through
The searches you want to do (queries)
Elasticsearch
Preparation
Put text into Elasticsearch. No schema or
configuration necessary (for basics).
Put queries into Elasticsearch
Get results
Let me show you quickly.
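The three preparation steps above can be sketched in Python. This is a minimal sketch, not the code from the live demo: the index name `talks`, the field `description` and the URL `http://localhost:9200` are all assumptions, and the `_doc` endpoint assumes a recent Elasticsearch version. The helpers only build requests; `send` is defined but never called here, so the block runs without a server.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local, unsecured node
INDEX = "talks"                   # hypothetical index name


def index_request(doc_id, doc):
    """Step 1: put text in. No mapping has been created beforehand."""
    return ("PUT", f"/{INDEX}/_doc/{doc_id}", doc)


def search_request(text):
    """Step 2: put a query in. `match` is an analysed full-text query."""
    body = {"query": {"match": {"description": text}}}
    return ("POST", f"/{INDEX}/_search", body)


def send(method, path, body):
    """Step 3: get results. Sends the request and decodes the JSON reply."""
    request = urllib.request.Request(
        ES_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


put = index_request(1, {"description": "An intro to search engines"})
query = search_request("search engines")
print(put[0], put[1])
print(json.dumps(query[2]))
```

With a node running, `send(*put)` followed by `send(*query)` should return matching documents under `hits.hits` in the response.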
13. Logs
~60 minutes to prototype
Put logs in. Run aggregations.
Get insight into app and traffic.
The Elastic Stack is geared towards
this with multiple products tackling
log formats, ingestion and analysis.
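"Put logs in. Run aggregations." could look roughly like the request body below. A sketch only: the field names `@timestamp` and `status` are assumptions about how the logs were indexed, and `fixed_interval` assumes Elasticsearch 7.2 or later (older versions used `interval`).

```python
import json


def log_overview(time_field="@timestamp", status_field="status"):
    """Build a search body that counts log lines per hour, split by status."""
    return {
        "size": 0,  # only aggregation results are wanted, no individual hits
        "aggs": {
            "per_hour": {
                "date_histogram": {"field": time_field, "fixed_interval": "1h"},
                # sub-aggregation: within each hour, count lines per status
                "aggs": {"by_status": {"terms": {"field": status_field}}},
            }
        },
    }


body = log_overview()
print(json.dumps(body, indent=2))
```

The body would be sent as a POST to the `_search` endpoint of whichever index holds the logs.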
14. Custom
Dashboards
~180 minutes to prototype
Put data in. Run aggregations.
Get insight.
Plays really well with D3 and other
common visualisation libraries.
Can also use Kibana + Elasticsearch
16. Do you have a nail? Elasticsearch is a
hammer
ES is not great at:
● Relational
integrity
● Transactions
Problems you should not try to solve with ES:
● Calculate inventory
● Grand totals
● Rollback-able stuff
● User accounts
18. I was your host
and would love feedback
Emanuil Tolev
emanuil@cottagelabs.com
@emanuil_tolev on Twitter
Link to slides: http://tinyurl.com/es-intro-slides
Really, really good intro blog post to ES with use cases and further reading,
like securing your Elasticsearch: http://tinyurl.com/es-intro-blog .
US State map came from http://greasethewheels.org/cpi/ , actually a US corruption research paper.
Editor's notes
Am a consultant, specialising in performance and robust technical architecture. The right tools for the right problems, etc.
Work in a loose partnership of other consultants and freelancers called Cottage Labs.
About to use it a lot more with RDBMS
Open source - 1-2 of the usual positives. Strong resilient community in this case.
Distributed - stuff can go down and the system rebalances itself automatically.
RESTful - Very easy to use - only need a browser. Very good, simple HTTP API speaking in JSON.
Note Search vs. Analytics distinction
The Elastic Stack is more than Elasticsearch, but out of scope here.
Indexing (= putting data in)
Querying (= find a needle in haystack). Includes things like searching, fuzzy searching, autocompletion and instant searches (train apps).
Aggregating (= analysing data and counting things)
Throw data at it: ES will guess data types and enforce them for you. You can’t save a number into a field that ES has learned is a date. Of course, you can also be much more careful and thorough - use Mappings.
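Being "more careful and thorough" with Mappings might look like the sketch below. The index name `properties` and the three fields are hypothetical, and the typeless `mappings` layout assumes Elasticsearch 7 or later. `"dynamic": "strict"` is optional; it asks ES to reject documents containing fields that are not listed, instead of guessing their types.

```python
import json


def create_index_request(index):
    """Build a create-index request that declares field types up front."""
    body = {
        "mappings": {
            "dynamic": "strict",  # optional: refuse unknown fields
            "properties": {
                "title": {"type": "text"},      # analysed full text
                "listed_on": {"type": "date"},
                "price": {"type": "integer"},
            },
        }
    }
    return ("PUT", f"/{index}", body)


method, path, body = create_index_request("properties")
print(method, path)
print(json.dumps(body, indent=2))
```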
ES will always analyse by default. Is it possible that we might not always want that?
Advanced: asciifolding, tokenisation, find a document by its translation, and more.
Index-time analysis and analysers
Common pitfall: avoiding analysis for exact string matches
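To illustrate the pitfall: an analysed field stores tokens, not the original string, so an exact-match lookup needs an un-analysed copy. The sketch below assumes a hypothetical `us_state.raw` sub-field of type `keyword`. `rough_standard_analysis` is only a crude imitation of default analysis (lowercase, split on non-word characters), not what ES actually runs.

```python
import re

# Hypothetical mapping: full-text field plus an un-analysed keyword sub-field.
MAPPING = {
    "properties": {
        "us_state": {
            "type": "text",
            "fields": {"raw": {"type": "keyword"}},
        }
    }
}


def rough_standard_analysis(text):
    """Crude stand-in for default analysis: lowercase and split into words."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


def exact_match_query(value):
    """`term` is not analysed, so it is compared against the keyword as-is."""
    return {"query": {"term": {"us_state.raw": value}}}


print(rough_standard_analysis("New York"))  # ['new', 'york']
print(exact_match_query("New York"))
```

A `term` query for "New York" against the analysed `us_state` field would find nothing, because only the tokens `new` and `york` were stored there.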
Paging and sorting directly in the URL, or in JSON: ?sort ?size
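Both styles, sketched side by side. The index `properties` and the field `price` are hypothetical; `size`, `from` and `sort` are the actual parameter names.

```python
import json
from urllib.parse import urlencode


def search_url(index, size, offset, sort_field, order="asc"):
    """Paging and sorting in the URL: ?size=..&from=..&sort=field:order"""
    params = {"size": size, "from": offset, "sort": f"{sort_field}:{order}"}
    return f"/{index}/_search?" + urlencode(params, safe=":")


def search_body(size, offset, sort_field, order="asc"):
    """The same paging and sorting expressed in the JSON request body."""
    return {"from": offset, "size": size, "sort": [{sort_field: order}]}


# Third page of ten results, cheapest first.
url = search_url("properties", 10, 20, "price")
body = search_body(10, 20, "price")
print(url)
print(json.dumps(body))
```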
Queries: match, terms, geo, More Like This (takes doc as input to return similar docs)
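Rough shapes of the four query kinds named above, as request bodies. All field names, values and the document ID are made up for illustration; `geo_distance` stands in for "geo" here, and it assumes `location` is mapped as a `geo_point`.

```python
import json

# Full-text search on an analysed field.
match_query = {"query": {"match": {"description": "garden flat"}}}

# Exact match against any of several values.
terms_query = {"query": {"terms": {"us_state": ["NY", "CA"]}}}

# Everything within 5 km of a point.
geo_query = {
    "query": {
        "geo_distance": {
            "distance": "5km",
            "location": {"lat": 40.7, "lon": -74.0},
        }
    }
}

# More Like This: takes an already indexed document as the input.
mlt_query = {
    "query": {
        "more_like_this": {
            "fields": ["description"],
            "like": [{"_index": "properties", "_id": "1"}],
            "min_term_freq": 1,
        }
    }
}

for body in (match_query, terms_query, geo_query, mlt_query):
    print(json.dumps(body))
```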
Types: matrix, metrics, bucket, pipeline
Buckets are very useful, especially Terms buckets.
Aggregations are cached with some very clever algorithms and great cache management by default, ensuring both low resource use and no stale results.
Say we have a field called “us_state” in some data we’ve got.
A Terms aggregation over that data will tell us the unique US state codes which are present in our data. If it’s a comprehensive dataset, we’ll essentially just get a list of the US states. Not that useful, right? But you can nest aggregations, so you have sub-aggregations. Which means, we could ask
Show a Terms aggregation drilling further and further down into some category. Fashion may be a good metaphor, e.g. All Stock -> Shoes -> Ladies’ -> Red -> Size 6.5 TODO replace with housing example
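A nested Terms aggregation, sketched two ways: the request body that would be sent to ES, and a small pure-Python imitation of what comes back. The `category` field and the sample documents are invented. The imitation orders buckets by document count, largest first, which roughly follows the default Terms ordering; it is not how ES computes anything internally.

```python
from collections import defaultdict

# Request body: bucket by state, then by category within each state.
REQUEST = {
    "size": 0,
    "aggs": {
        "states": {
            "terms": {"field": "us_state"},
            "aggs": {"categories": {"terms": {"field": "category"}}},
        }
    },
}


def terms_buckets(documents, field, sub_field=None):
    """Group documents by `field`; optionally drill down by `sub_field`."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc[field]].append(doc)
    buckets = []
    for key, members in groups.items():
        bucket = {"key": key, "doc_count": len(members)}
        if sub_field is not None:
            bucket[sub_field] = terms_buckets(members, sub_field)
        buckets.append(bucket)
    # Largest bucket first, ties broken alphabetically.
    return sorted(buckets, key=lambda b: (-b["doc_count"], b["key"]))


docs = [
    {"us_state": "NY", "category": "art"},
    {"us_state": "NY", "category": "science"},
    {"us_state": "NY", "category": "art"},
    {"us_state": "CA", "category": "art"},
]
result = terms_buckets(docs, "us_state", "category")
print(result)
```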
Bucketing: all the buckets criteria are evaluated on every document in the context and when a criterion matches, the document is considered to "fall in" the relevant bucket. By the end of the aggregation process, we’ll end up with a list of buckets - each one with a set of documents that "belong" to it.
Metric: Aggregations that keep track of and compute metrics over a set of documents. Min, max, avg, sum, ranking, geo bounds and geo centroid. (If asked) Geo bounds gives you the box containing all locations. Geo centroid gives you the center given other points.
Matrix: operate on multiple fields and produce a matrix result based on the values. Experimental. Statistics (variance, covariance, correlation).
Pipeline: Aggregations that aggregate the output of other aggregations and their associated metrics. More advanced.
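One request body can mix the remaining three types. A sketch with hypothetical fields `price`, `area` and `listed_on`; `calendar_interval` assumes Elasticsearch 7.2 or later. The pipeline part (`avg_bucket`) sits next to the bucket aggregation it reads from and points at it through `buckets_path`.

```python
import json

body = {
    "size": 0,
    "aggs": {
        # Metric: one number computed over all matching documents.
        "avg_price": {"avg": {"field": "price"}},
        # Matrix: statistics across several fields at once.
        "price_vs_area": {"matrix_stats": {"fields": ["price", "area"]}},
        # Bucket with a metric inside: total price per calendar month.
        "per_month": {
            "date_histogram": {"field": "listed_on", "calendar_interval": "month"},
            "aggs": {"total": {"sum": {"field": "price"}}},
        },
        # Pipeline: average of the monthly totals produced above.
        "avg_monthly_total": {"avg_bucket": {"buckets_path": "per_month>total"}},
    },
}

print(json.dumps(body, indent=2))
```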
Just an example. Example aggregation using geo centroid and the number of, say, museums in the USA - the exact data is not important. But now, let’s see what bucketing the documents by US state gives us.
So this is what “bucketing” is. You’ll find it very useful for building intuitive analytics dashboards and user interfaces that deal with search and discovery.
I’ll give you a sneak peek of what the data, the request and the response might look like. The Elastic example is museums in Europe.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geocentroid-aggregation.html
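Adapted to the US state example, the request and the response handling might look like this. It is a sketch modelled on the linked Elastic page, with `us_state` swapped in for the grouping field and `location` assumed to be a `geo_point`. `SAMPLE` is a hand-written placeholder showing the expected response shape, not real data.

```python
def centroid_per_state(state_field="us_state", location_field="location"):
    """Bucket documents by state and compute a geo centroid per bucket."""
    return {
        "size": 0,
        "aggs": {
            "states": {
                "terms": {"field": state_field},
                "aggs": {"centroid": {"geo_centroid": {"field": location_field}}},
            }
        },
    }


def read_centroids(response):
    """Map each state to (document count, latitude, longitude)."""
    result = {}
    for bucket in response["aggregations"]["states"]["buckets"]:
        point = bucket["centroid"]["location"]
        result[bucket["key"]] = (bucket["doc_count"], point["lat"], point["lon"])
    return result


# Made-up response fragment, for shape only.
SAMPLE = {
    "aggregations": {
        "states": {
            "buckets": [
                {
                    "key": "NY",
                    "doc_count": 2,
                    "centroid": {"location": {"lat": 41.0, "lon": -74.5}, "count": 2},
                }
            ]
        }
    }
}

print(centroid_per_state())
print(read_centroids(SAMPLE))
```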
Predefined aggregations available. Logstash capable of understanding many log formats, and you can add custom ones.
Why the ugly dashboard?
Dashboards should be useful first, pretty … later.
Netflix built an open source application metrics project based on Java and ES, called Servo.
Searching a large number of descriptions for the best match for a specific phrase (e.g. property search, say “no pets”) and returning the best results
Faceting: get a breakdown of the types of dwelling that forbid pets :(
“Did you mean …?” suggestions
Auto-completing a search box from partially typed words, using previously issued searches and accounting for mis-spellings
Searching text for words that sound like another word
Product and information suggestions: “People who were interested in / bought this also look at…”
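Two of the use cases above as request bodies, in a hedged sketch. The fields `description` and `dwelling_type` are hypothetical, and `dwelling_type` is assumed to be a `keyword` field so that a Terms aggregation can run on it.

```python
import json


def phrase_search_with_facets(phrase):
    """Best matches for a phrase, plus a breakdown by dwelling type."""
    return {
        "query": {"match_phrase": {"description": phrase}},
        "aggs": {"dwelling_types": {"terms": {"field": "dwelling_type"}}},
    }


def forgiving_search(text):
    """Full-text search that tolerates small mis-spellings."""
    return {
        "query": {
            "match": {"description": {"query": text, "fuzziness": "AUTO"}}
        }
    }


no_pets = phrase_search_with_facets("no pets")
typo = forgiving_search("appartment")
print(json.dumps(no_pets))
print(json.dumps(typo))
```

The facet counts come back alongside the hits in a single response, so one request can feed both the result list and the filter sidebar.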
Not great at:
Instant availability in search results after indexing
High cardinality & high precision analysis
Problems you should not try to solve:
Very limited resource projects (embedded devices, tiny websites)
Elasticsearch is generally fantastic at providing approximate answers from data, such as scoring the results by quality.
While Elasticsearch can perform exact matching and statistical calculations, its primary task of search is an inherently approximate task.
Finding approximate answers is a property that separates Elasticsearch from more traditional databases.
That being said, traditional relational databases excel at precision and data integrity.
The Elastic website has a lot of blogs and videos on user stories, including top senior dogs from Netflix, Rightmove, banks, supercomputer and AI people, fighting Ebola, the BBC and many more!
It was a pleasure!
I hope you had fun. Please leave a comment on the meetup page or send me an email with feedback.