2. Me
• Python developer
• DevOps role
• using and learning Elasticsearch for 3 years (since version 0.16)
• Synopsi.tv, Reactor
3. Synopsi.tv
• movie recommendations service
• database of movies, TV shows
• need for search - Elasticsearch
• search-box on every page (prefix search)
• advanced search (search + facets)
• PostgreSQL as main datastore
• import to ES with script, hooks on add/update
4. Synopsi.tv - lessons
• Good for search
• Mappings are powerful
• need for reindexing - format/mapping change
• sometimes missing documents (Bruce Willis) - in the index but not searchable
• probably not yet suitable as the only datastore
5. Reactor
• service for communication with users of your application
• send data about users and their activity (events)
• filter users, define segments
• set rules for reactions - email, webhook, SMS, etc.
6. Data structure
• small and simple pieces of data - JSON
• simple relations (application <-> user <-> field)
• a “noSQL” store is more suitable, but doable in e.g. PostgreSQL
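The kind of document Reactor works with might look like this (the field names are illustrative, not the real schema):

```python
import json

# A hypothetical Reactor event: a small, mostly flat JSON document
# tied to an application and a user (field names are made up).
event = {
    "application_id": "app-42",
    "user_id": "user-1337",
    "event": "signup",
    "fields": {"plan": "free", "source": "landing-page"},
    "timestamp": "2014-05-01T12:00:00Z",
}

payload = json.dumps(event, sort_keys=True)
print(payload)
```

Documents this simple map naturally onto a document store, which is why "noSQL" fits, even though a JSON column in PostgreSQL would also work.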
7. Backends - theory
• application saves data in several forms:
• raw (as they come)
• cleaned and sanitized
• formatted for specific datastore
• the save method iterates over the configured set of datastore backends and sends the same data to each of them
• different backends for different operations - get, filter, analytics
• slight duplication of functionality in application
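The save-method loop above can be sketched in a few lines of Python (class and method names are assumptions, not the real code):

```python
# Sketch of the multi-backend save: the application keeps a configured
# list of backends and sends the same cleaned data to all of them.

class InMemoryBackend:
    """Stand-in for a real backend (PostgreSQL, Mongo, Elasticsearch)."""
    def __init__(self):
        self.docs = {}

    def save(self, doc_id, doc):
        # A real backend would also reformat the data for its own storage.
        self.docs[doc_id] = doc


class Store:
    def __init__(self, backends):
        self.backends = backends

    def save(self, doc_id, raw):
        # Clean and sanitize once, then fan out to every backend.
        cleaned = {k: v for k, v in raw.items() if v is not None}
        for backend in self.backends:
            backend.save(doc_id, cleaned)


primary, search = InMemoryBackend(), InMemoryBackend()
store = Store([primary, search])
store.save("u1", {"name": "Alice", "junk": None})
```

The duplication mentioned above shows up in the per-backend formatting step, which each backend implements in its own way.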
8. Backends - practice
• Mongo/DynamoDB - raw data, cold storage
• PostgreSQL - working data
• Elasticsearch - added later, working data, analytics
• different sets in different environments (devel machine, production)
9. Source of truth
• one backend is trusted by definition - cold storage, used to repopulate the data in the other backends
• should be simple (hard to break) and scalable (probably in cloud)
• possible forms:
• JSON files, Hadoop, etc.
• noSQL database (Mongo, Cassandra, DynamoDB)
• Elasticsearch - different format (indices, nodes) from the working data
10. Input to Elasticsearch
• regularly run an import/update script - if you do not need (almost-)live data
• logic in application (our case)
• river - an input channel from another source into Elasticsearch (CouchDB, MongoDB, Hadoop)
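The import-script option boils down to turning rows from the main datastore into Elasticsearch index requests; a minimal sketch (index and type names are assumptions):

```python
import json

# One row from the main datastore (e.g. PostgreSQL) becomes one
# PUT request against the Elasticsearch HTTP API.
def row_to_request(row, index="movies", doc_type="movie"):
    """Build the (method, path, body) triple for one document."""
    path = "/%s/%s/%s" % (index, doc_type, row["id"])
    body = json.dumps({"title": row["title"], "year": row["year"]})
    return ("PUT", path, body)

method, path, body = row_to_request(
    {"id": 7, "title": "Die Hard", "year": 1988})
# In the real script each triple would be sent to the cluster over
# HTTP (with urllib or the official client), typically from cron.
```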
11. Elasticsearch pros
• easy start
• easy scaling (e.g. with the AWS plugin a new node can automatically join the cluster and be ready in a short time)
• search capabilities
• analytics with facets/aggregations (Kibana plugin)
• easy backup with snapshots
• easy to deploy - just one service
• highly tweakable yet sane defaults
12. Elasticsearch cons
• only limited relations between documents - parent/child and nested objects
• higher need for reindexing - repopulating data from scratch (change of format, new mappings for fields)
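In the Elasticsearch versions discussed here, the two relation mechanisms look roughly like this in a mapping (type and field names are made up):

```python
import json

# Sketch of the only relations Elasticsearch offers: nested objects
# inside one document, and parent/child links between types.
mapping = {
    "movie": {
        "properties": {
            # "nested" keeps each cast member's fields together,
            # instead of flattening all names and roles into one list
            "cast": {
                "type": "nested",
                "properties": {"name": {"type": "string"}},
            },
        },
    },
    "comment": {
        # a comment document is stored as a child of a movie document
        "_parent": {"type": "movie"},
        "properties": {"text": {"type": "string"}},
    },
}
print(json.dumps(mapping))
```

Anything richer (many-to-many, joins across indices) has to be denormalized into the documents, which is one driver of the reindexing need mentioned above.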
14. Clusters
• Possible to use several clusters - one for data, one for monitoring
• Clusters can talk to each other via tribe nodes - nodes from cluster1 send monitoring data (Marvel plugin) to a tribe node, which saves it to cluster2
15. Nodes
• Use data and non-data nodes
• Data nodes for storage
• Non-data nodes for admin, local nodes on web server, etc.
• Use tags
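The data/non-data split and the tags are set per node in elasticsearch.yml; a sketch (the tag name is an assumption):

```yaml
# Data node: stores shards, tagged so allocation rules can target it
node.data: true
node.master: false
node.tag: big-box

# A non-data node on the web server would instead set:
#   node.data: false
#   node.master: false
# It holds no shards and only routes requests into the cluster.
```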
16. Indices
• Split data into different indices (per client, segment, etc.) - reactor_app-id
• Use aliases (migrations, time windows, etc.) - reactor_app-id as an alias for reactor_app-id_timestamp1 and reactor_app-id_timestamp2
• Possible to allocate bigger indices to more performant servers - node.tag and index.routing.allocation.include.tag
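The alias migration trick above comes down to one atomic request body: reindex into a new timestamped index, then repoint the alias (index names follow the slide):

```python
import json

# Body for the _aliases endpoint: remove the alias from the old
# timestamped index and add it to the new one in a single step.
actions = {
    "actions": [
        {"remove": {"index": "reactor_app-id_timestamp1",
                    "alias": "reactor_app-id"}},
        {"add": {"index": "reactor_app-id_timestamp2",
                 "alias": "reactor_app-id"}},
    ]
}
# POST this body to /_aliases; both steps apply atomically, so
# searches against reactor_app-id never see a gap.
print(json.dumps(actions))
```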
17. Shards
• Use several shards (default 5)
• Utilize more machines
• Shard count is immutable for an existing index - plan your infrastructure
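Because the shard count is fixed at creation time, it is a one-shot decision made in the index settings (the numbers here are examples, not recommendations):

```python
# Settings body sent when creating the index (PUT /<index-name>).
settings = {
    "settings": {
        "number_of_shards": 10,   # immutable after creation
        "number_of_replicas": 1,  # can still be changed later
    }
}
```

Replicas can be adjusted on a live index, so only the shard number needs to be planned against the expected machine count up front.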
18. Indexing
• If possible, use batch indexing
• Use the update script wisely - too big a batch of updates can slow down the application
• Send indexing traffic to (local) non-data node
• Play with mappings - use not_analyzed and fielddata settings
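Batch indexing goes through the _bulk endpoint, whose body is newline-delimited JSON: one action line followed by one source line per document. A sketch of building it (index and type names are examples):

```python
import json

def bulk_body(docs, index="reactor_app-id", doc_type="event"):
    """Build a _bulk request body from (doc_id, doc) pairs."""
    lines = []
    for doc_id, doc in docs:
        # action line: what to do and where
        lines.append(json.dumps(
            {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        # source line: the document itself
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # trailing newline is required

body = bulk_body([("1", {"event": "signup"}), ("2", {"event": "login"})])
# POST the body to /_bulk; keep batches modest so updates do not
# stall the application, as noted above.
```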
19. Machines
• Use local SSD, not network storage
• Replicate, make snapshots
• Run benchmarks - a few bigger machines can outperform many small machines (network delays, cluster management)
• If in the cloud, use the Elasticsearch plugin for that cloud provider (cluster discovery, snapshots on cloud storage)