Elastic pivorak
1. ELASTICSEARCH
MAKE YOUR SOFTWARE SMARTER!
OLEKSIY PANCHENKO / #PIVORAK / 2015
2. MY NAME IS…
Oleksiy Panchenko
Software engineer, Lohika
E-mail: oleksij@gmail.com
Twitter: oleskiyp
LinkedIn:
https://ua.linkedin.com/in/opanchenko
3. AGENDA
• Introduction. What is it all about?
• Jump start Elastic. Demo time
• Architecture and deployment. Why is
Elasticsearch elastic?
• Case studies. 4 real-life projects
• Query API in depth + Demo
• Using Elastic in Rails applications. Approaches
and tools
• Kinda summary
• Q & A
6. HOW TO MAKE YOUR SITE
SEARCHABLE?
http://www.imbusstop.com/wp-content/uploads/2015/02/websites.png
7. • Google search
• Why not use plain vanilla SQL? RDBMS rocks!
select *
from books
join authors
on …
where …
• Sphinx (hello Craigslist, Habrahabr, The Pirate Bay, 1C);
Xapian
• Lucene Family: Apache Lucene, Elasticsearch,
Apache Solr, Amazon CloudSearch, …
8. WHO HAS EVER USED
ELASTICSEARCH/SOLR/SPHINX?
http://dolhomeschoolcenter.com/wp-content/uploads/2013/02/FAQ.png
9. LUCENE AS A CORE
• Lucene = Low-level Java library (JAR) which
implements search functionality
• Lucene stores its index as a local binary file
• Can be used in both web and standalone
applications (desktop, mobile)
• Implemented in Java, ports to other languages
available
• Initial version: 1999
• Apache project since 2001
• Latest stable release: 5.3.1 (September 24, 2015)
10. LUCENE AS A CORE
• Lucene was originally written in
1999 by Doug Cutting (creator
of Hadoop and Nutch)
http://www.china-cloud.com/uploads/allimg/121018/54-12101P92R1U7.jpg
12. TIME TO TALK ABOUT
ELASTICSEARCH
https://www.elastic.co/products/elasticsearch
Near Real-Time Data (NRT)
Full-Text Search
Multilingual search, geolocation,
fuzzy search, did-you-mean
suggestions, autocomplete
17. ELASTICSEARCH – PAST &
PRESENT
• 2004. Shay Banon (aka
Kimchy) started working on
Compass – distributed and
scalable Java Search
Engine on top of Lucene
• 2010. Initial release of ES
• Latest stable release: 1.7.2
(September 14, 2015)
• 2.0 to be released in
November
• 500K downloads per
• https://github.com/elastic/elasticsearch
http://opensource.hk/sites/default/files/u1/shay-banon.jpg
18. ELASTICSEARCH
AS A COMPANY
• 2012. Elasticsearch BV; Funding: $104M in 3
rounds, 100+ employees
• https://www.elastic.co/
• Product portfolio:
– Elasticsearch, Logstash, Kibana (ELK stack)
– Watcher
– Shield
– Marvel
– es-hadoop
– found
24. • Cluster: one or more nodes which share the same cluster name
• Node: a running instance of Elasticsearch which belongs to a cluster
• Shard: a portion of data, a single Lucene instance. Default: 5 shards in an index
• Primary shard: master copy of data
• Replica shard: exact copy of a primary shard. Default: 1 replica
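The defaults above can be spelled out explicitly when an index is created. A minimal sketch in Python that only builds the request body; `number_of_shards` and `number_of_replicas` are the Elasticsearch index settings, while the helper names are hypothetical and no cluster is contacted:

```python
import json


def index_settings(primary_shards=5, replicas=1):
    """Build the settings body for an index-creation request."""
    return {
        "settings": {
            "number_of_shards": primary_shards,   # primary shards per index
            "number_of_replicas": replicas,       # copies of each primary
        }
    }


def total_shard_copies(primary_shards=5, replicas=1):
    """Every primary shard is stored once plus once per replica."""
    return primary_shards * (1 + replicas)


body = index_settings()
print(json.dumps(body, indent=2))
print(total_shard_copies())
```

With the defaults, the cluster has to place ten shard copies (five primaries and five replicas), which is why a replica only helps once there is a second node to hold it.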
27. BENEFITS OF SHARDING
• Take advantage of multi-core CPUs (each shard is
a separate Lucene instance, so shards on one node can be searched in parallel)
• Horizontal scalability. Dynamic rebalancing
• Fault tolerance and cluster resilience
• NB! The number of primary shards cannot be changed
on the fly: changing it requires a full
reindexing
• Max number of documents per shard:
2,147,483,519 – imposed by Lucene
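Why the shard count is fixed: a document is assigned to a shard by hashing its routing value (by default the document id) modulo the number of primary shards. A minimal sketch in Python, assuming `zlib.crc32` as a stand-in for Elasticsearch's internal hash function; the document ids are made up:

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards.

    crc32 is only a stand-in here; Elasticsearch uses its own hash.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


doc_ids = ["book-1", "book-2", "book-3", "book-4"]  # hypothetical ids

with_5_shards = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}
with_6_shards = {doc_id: pick_shard(doc_id, 6) for doc_id in doc_ids}

# Documents whose shard changes when the shard count changes.
moved = [d for d in doc_ids if with_5_shards[d] != with_6_shards[d]]
print(with_5_shards, with_6_shards, moved)
```

Any document in `moved` would be looked up on the wrong shard after the change, so the data has to be reindexed into a new index with the new shard count.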
28. ELASTICSEARCH NODE TYPES
• Data node: node.data = true
• Master node: node.master = true
• Communication client: http.enabled = true
• Default TCP ports: 9200 (external, HTTP), 9300 (internal, node-to-node)
• A node can play 2 or 3 roles at the same time
• Multicast discovery (true by default):
discovery.zen.ping.multicast.enabled
34. CASE STUDIES
4 R E A L - L I F E P R O J E C T S
http://vignette1.wikia.nocookie.net/fallout/images/9/9d/FNV_Rake.png/revision/latest?cb=20140618212609&path-prefix=ru
35. GENERAL INFO
• 4 projects, ~2 years
• RDBMS (MySQL, PostgreSQL) as a primary data
storage
• Both self-managed Elasticsearch installations (AWS,
MS Azure) and SaaS (Bonsai @ Heroku)
• 1 or 2 instances in a cluster
• Data volume: Gigabytes; millions of documents
• Back-end: Java, Ruby
37. • Document types: Blog Posts, Bloggers
(Influencers)
• Elasticsearch usage:
– search and rank Influencers by category,
keywords, tags, location, audience,
influence
– search blog posts by keywords etc.
• Amount of data:
– Influencers: hundreds of thousands
– Blog Posts: millions
• ES cluster size: 2 instances
• Technology stack: Java, MySQL, Dynamo
45. 1996 VW PASSAT SEDAN B4 TDI TURBO DIESEL 44+MPG
WAT???
• Fuzzy Search (Levenshtein Distance Algorithm) used to
parse ads and classify cars
• Elasticsearch index contains dictionary (Year, Make,
Model, Trim)
• Used in conjunction with other approaches: regular
expressions, dictionaries of synonyms (VW → Volkswagen,
Chevy → Chevrolet), normalization (e.g. LX-370 → LX370)
• Algorithm approach:
– Parse Year (1996)
– Search most relevant Make (VW, volkswagon →
Volkswagen)
– Search most relevant Model (Passat) for Make =
Volkswagen, Year = 1996
– Search most relevant Trim (TDi 4dr Sedan)
• Parsing quality: 90%
https://www.elastic.co/guide/en/elasticsearch/reference/1.6/query-dsl-fuzzy-query.html
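The Make-matching step can be illustrated outside Elasticsearch. A minimal sketch in Python, assuming a tiny in-memory dictionary instead of an Elasticsearch index; the `MAKES` and `SYNONYMS` contents, the `max_distance` threshold and the function names are illustrative and not taken from the project:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical dictionaries; the real ones lived in an Elasticsearch index.
SYNONYMS = {"vw": "volkswagen", "chevy": "chevrolet"}
MAKES = ["volkswagen", "chevrolet", "volvo"]


def closest_make(token: str, max_distance: int = 2):
    """Return the dictionary make closest to the token, or None if too far."""
    token = token.lower()
    token = SYNONYMS.get(token, token)
    best = min(MAKES, key=lambda make: levenshtein(token, make))
    return best if levenshtein(token, best) <= max_distance else None


print(closest_make("VW"))
print(closest_make("volkswagon"))
print(closest_make("passat"))
```

The misspelling "volkswagon" is one substitution away from "volkswagen", so it still resolves to the right make, while "passat" is too far from every make and is left for the Model step.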
47. SOME UNCOVERED INFO
• Check documents against duplicate content
• Shingle analysis (commonly used by copywriters and SEO
experts)
– I have a dream that one day this nation will rise up and
live…
– Normalization
have dream that this nation will…
– Splitting a text into shingles (n-grams), n = 3..10
have dream that
dream that this
that this nation
this nation will
…
– Replacement of look-alike characters: Latin ‘c’ → Cyrillic ‘с’
https://en.wikipedia.org/wiki/W-shingling
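The shingling steps can be sketched in a few lines of Python. The normalization rule used here (lowercase, keep only words of four or more letters) is an assumption chosen because it reproduces the shingles listed on the slide; the original project may have used a different rule. Function names are hypothetical:

```python
import re


def normalize(text: str, min_len: int = 4):
    """Lowercase and drop short words (assumed rule, see lead-in)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= min_len]


def shingles(words, n: int = 3):
    """Split a word list into overlapping word n-grams, in order."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def resemblance(text_a: str, text_b: str, n: int = 3) -> float:
    """Jaccard similarity of the two shingle sets (1.0 = same shingles)."""
    a = set(shingles(normalize(text_a), n))
    b = set(shingles(normalize(text_b), n))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


quote = "I have a dream that one day this nation will rise up and live"
for shingle in shingles(normalize(quote)):
    print(shingle)
```

A duplicate check then compares the `resemblance` of a new document against stored ones and flags anything above a chosen threshold.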
49. FILTERS VS. QUERIES
As a general rule, filters should be used:
• for binary yes/no searches
• for queries on exact values
Filters are much faster than queries because they skip relevance scoring
Filters are usually great candidates for caching
27 Filters available (Elasticsearch 1.7.1)
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-filters.html
50. QUERIES VS. FILTERS
As a general rule, queries should be used instead
of filters:
• for full text search
• where the result depends on a relevance score
Common approach: filter out as many records as
possible, then query the rest.
38 Queries available (Elasticsearch 1.7.1)
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-queries.html
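The filter-then-query approach maps onto the `filtered` query of the Elasticsearch 1.x query DSL. A minimal sketch in Python that only builds the request body; the field names `category` and `title` and the helper name are hypothetical:

```python
import json


def filtered_search(text: str, category: str) -> dict:
    """Exact-value filter narrows the set; full-text query scores what is left."""
    return {
        "query": {
            "filtered": {
                # Yes/no match on an exact value: no scoring, cacheable.
                "filter": {"term": {"category": category}},
                # Full-text match: contributes the relevance score.
                "query": {"match": {"title": text}},
            }
        }
    }


body = filtered_search("full text search", "databases")
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of a search request; only documents that pass the `term` filter are scored by the `match` query.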
52. SOME THEORY BEHIND
RELEVANCE SCORING
full AND text AND search AND (elasticsearch OR
lucene)
• Term Frequency: How often does the term
appear in the document?
• Inverse Document Frequency: How often does
the term appear in all documents in the
collection?
• Field-length norm: How long is the field?
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
http://blog.qbox.io/optimizing-search-results-in-elasticsearch-with-scoring-and-boosting
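The three factors can be written down as small functions. A sketch in Python following the formulas given in the Elasticsearch guide linked above for Lucene's classic TF/IDF similarity; the combined `term_weight` is a simplification that leaves out query normalization, the coordination factor and boosts, and the sample numbers are made up:

```python
import math


def tf(term_freq: int) -> float:
    """Term frequency: more occurrences in the field, higher weight."""
    return math.sqrt(term_freq)


def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: rarer terms weigh more."""
    return 1 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms: int) -> float:
    """Field-length norm: a match in a short field weighs more."""
    return 1 / math.sqrt(num_terms)


def term_weight(term_freq, num_docs, doc_freq, num_terms) -> float:
    """Simplified per-term weight: tf * idf^2 * norm."""
    return tf(term_freq) * idf(num_docs, doc_freq) ** 2 * field_norm(num_terms)


# Made-up collection of 1000 documents, field of 16 terms, one occurrence.
rare = term_weight(term_freq=1, num_docs=1000, doc_freq=9, num_terms=16)
common = term_weight(term_freq=1, num_docs=1000, doc_freq=499, num_terms=16)
print(rare, common)
```

With everything else equal, the term that appears in only a few documents gets the larger weight, which is why "elasticsearch" or "lucene" influence the ranking more than "full" or "text" in the example query.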
53. MORE COOL FEATURES
• Indexing attachments: MS Office, ePub, PDF
(Apache Tika)
• Autocomplete suggestion:
• Did-you-mean suggestion:
• Highlight results:
57. ELASTICSEARCH-RAILS
• https://github.com/elastic/elasticsearch-rails
• Includes three packages:
elasticsearch-model + elasticsearch-persistence
+ elasticsearch-rails
• ActiveModel integration with adapters for
ActiveRecord and Mongoid
• Enumerable-based wrapper for search results;
ActiveRecord::Relation-based wrapper for
returning search results as records
• Support for Kaminari and WillPaginate
pagination
• Convenience methods for (re)creating the
index, setting up mappings, indexing
documents, …
58. MY WAY (RAILS 4 APP)
Gemfile
config/environments/production.rb
67. ELASTICSEARCH DRAWBACKS
• No transaction support. Elasticsearch is not a
database.
• No joins, constraints and other RDBMS features
• Durability and consistency issues, data loss:
– https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0
– https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
69. SUMMARY
• ES is not a silver bullet, but it is a really powerful
tool
• Elasticsearch is not an RDBMS and is not supposed
to act as a database. Choose your tools
properly. Leverage the synergy of DB + ES
• Elasticsearch is dead simple at the start but
can get complicated as you go
• Kick off easily, then hire a good DevOps
engineer for best results
• Ecosystem around Elasticsearch is just amazing
• Give it a try – it can bring a lot of value to your
product and your CV ;)
http://www.aperfectworld.org/clipart/gestures/rockhard11.png