Search and analyze your data with elasticsearch

  1. SEARCH AND ANALYZE YOUR DATA WITH ELASTICSEARCH Anton Udovychenko JEEConf May 20, 2016
  2. ABOUT ME Software Architect @ Levi9 8+ years of Java experience Passionate about agile methodology and clean code http://ua.linkedin.com/in/antonudovychenko http://www.slideshare.net/antonudovychenko
  3. AGENDA • Why does search matter to you • Why Elasticsearch • Basic Concepts • Comparison with SQL • Elasticsearch usage • Elasticsearch and Java • Q&A
  4. WHY DOES SEARCH MATTER TO YOU
  5. WHY DOES SEARCH MATTER TO YOU
  6. WHAT IS IT ABOUT Elasticsearch is a distributed, open source, document-oriented, schema-free, RESTful, full text search and analytics engine, designed for horizontal scalability, high availability
  7. WHY ELASTICSEARCH Elasticsearch is a distributed, open source, document-oriented, schema-free, RESTful, full text search and analytics engine, designed for horizontal scalability, high availability
  8. WHY ELASTICSEARCH Elasticsearch is a distributed, open source, document-oriented, schema-free, RESTful, full text search and analytics engine, designed for horizontal scalability, high availability
  9. WHY ELASTICSEARCH Elasticsearch is a distributed, open source, document-oriented, schema-free, RESTful, full text search and analytics engine, designed for horizontal scalability, high availability Apache 2.0 License
  10. WHY ELASTICSEARCH Elasticsearch is a distributed, open source, document-oriented, schema-free, RESTful, full text search and analytics engine, designed for horizontal scalability, high availability { "title": "My blogpost", "body": "Having a lot of text...", "user": "es_user", "postDate": "2016-01-01 15:03:32" }
  11. WHY ELASTICSEARCH Elasticsearch is a distributed, open source, document-oriented, schema-free, RESTful, full text search and analytics engine, designed for horizontal scalability, high availability REST API
  12. WHY ELASTICSEARCH Elasticsearch is a distributed, open source, document-oriented, schema-free, RESTful, full text search and analytics engine, designed for horizontal scalability, high availability
  13. Image via batman-news.com
  14. WHY ELASTICSEARCH - ALTERNATIVES Lucene: + More fine-grained control – Complex logic (no additional level of abstraction) = Elasticsearch is based on Lucene
  15. WHY ELASTICSEARCH - ALTERNATIVES Sphinx: + Faster on a cold start + Occupies less memory – Proprietary protocol – Real-time caveats – Difficult to go to cloud – More difficult to start using – Smaller community = Non-Java based (C++)
  16. WHY ELASTICSEARCH - ALTERNATIVES Solr: + Truly open-source + Primary support from Hadoop distributors + ZooKeeper is more mature than Zen = Near real-time search = Similar performance – More difficult to start using – SolrCloud (vs ES clustering out of the box) – ZooKeeper is harder to use than Zen – Worse operational tools – Worse monitoring tools – Worse analytical abilities
  17. WHY ELASTICSEARCH
  18. BASIC CONCEPTS • Near realtime • Cluster • Node • Index • Type • Document • Shards and replicas
  19. BASIC CONCEPTS Cluster
  20. BASIC CONCEPTS Node Node Node
  21. BASIC CONCEPTS Shard Shard Shard Shard Shard Shard Shard Shard
  22. BASIC CONCEPTS Shard Shard Shard Shard Shard Shard Shard Shard Index
  23. BASIC CONCEPTS Shard Segment Segment Segment Segment Lucene Index
  24. BASIC CONCEPTS Segment core
      Inverted index (Term | Freq | DocIds): brown 2 0,1 | dog 2 0,1 | fox 2 0,1 | in 1 1 | jump 2 0,1 | lazy 2 0,1 | over 2 0,1 | quick 2 0,1 | summer 1 1 | the 2 0,1
      Document store (DocId: Fields): 0: Text: "The quick brown fox jumped over the lazy dog", Author: Bob | 1: Text: "Quick brown foxes leap over lazy dogs in summer", Author: Bill
      Column store (per-field doc values, e.g. Likes and Shared): 0 210, 1 90 | 0 59, 1 23
  25. BASIC CONCEPTS Segment core, same structures as above, now resolving the search term "Leaping brown Fox": each analyzed query term is looked up in the inverted index to find the matching DocIds.
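      To make the inverted-index lookup concrete, here is a minimal self-contained Java sketch (not from the slides) that builds the same kind of term-to-DocIds map for the two example documents and then looks up the query terms. The crude lower-casing and suffix-stripping are assumptions standing in for a real Lucene analyzer, which also handles synonyms and proper stemming (e.g. folding "leap" and "jumped" together as on the slide).

      import java.util.*;

      public class InvertedIndexSketch {

          // Toy normalization: lower-case plus crude suffix stripping, standing in for a real analyzer.
          static String normalize(String token) {
              String t = token.toLowerCase().replaceAll("[^a-z]", "");
              if (t.endsWith("ing")) t = t.substring(0, t.length() - 3);      // leaping -> leap
              if (t.endsWith("es")) t = t.substring(0, t.length() - 2);       // foxes   -> fox
              else if (t.endsWith("s") && t.length() > 3) t = t.substring(0, t.length() - 1); // dogs -> dog
              if (t.endsWith("ed")) t = t.substring(0, t.length() - 2);       // jumped  -> jump
              return t;
          }

          public static void main(String[] args) {
              String[] docs = {
                  "The quick brown fox jumped over the lazy dog",   // DocId 0
                  "Quick brown foxes leap over lazy dogs in summer" // DocId 1
              };

              // Build the inverted index: term -> sorted set of DocIds that contain it.
              Map<String, SortedSet<Integer>> index = new TreeMap<>();
              for (int docId = 0; docId < docs.length; docId++) {
                  for (String token : docs[docId].split("\\s+")) {
                      index.computeIfAbsent(normalize(token), k -> new TreeSet<>()).add(docId);
                  }
              }

              // Searching means analyzing the query the same way and looking each term up in the index.
              for (String token : "Leaping brown Fox".split("\\s+")) {
                  String term = normalize(token);
                  System.out.println(term + " -> " + index.getOrDefault(term, new TreeSet<>()));
              }
          }
      }

      Run as-is this prints leap -> [1], brown -> [0, 1], fox -> [0, 1]; a real Elasticsearch analyzer chain would also recognize that "leaping" and "jumped" share a lemma, as the slide's index does.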
  26. SQL ELASTIC
  27. COMPARISON WITH SQL (SQL → Elasticsearch): Database → Index | Table → Type | Row → Document | Column/field → Field
  28. COMPARISON WITH SQL (SQL → Elasticsearch): Database → Index | Table → Type | Row → Document with properties | Column/field → Field
  29. COMPARISON WITH SQL Sample post table (id | title | body | user | postDate):
      1 | My first blogpost | Having a lot of text... | es_user | 2016-01-01 15:03:32
      2 | About search | The search data sometimes has a peculiar property… | es_user | 2016-01-01 19:22:03
      3 | Introduction to Elasticsearch | Once I have stumbled upon this idea… | es_user | 2016-01-03 11:55:41
  30. COMPARISON WITH SQL
      SQL: CREATE DATABASE blog; USE blog; CREATE TABLE post( id bigint(20) AUTO_INCREMENT, title varchar(250), body text, user varchar(50), postDate timestamp, PRIMARY KEY(id) );
      Elasticsearch: POST http://localhost:9200/blog with body {"mappings": { "post": { "properties": { "title": { "type": "string" }, "body": { "type": "string" }, "user": { "type": "string" }, "postDate": { "type": "date" } } } } } (defining the mapping up front is not obligatory)
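      As a rough illustration (not code from the talk), the same index creation can be done from Java with the 2.x-era native TransportClient; the host, port, and mapping string below simply mirror the example above and may need adjusting for your cluster:

      import java.net.InetAddress;

      import org.elasticsearch.client.Client;
      import org.elasticsearch.client.transport.TransportClient;
      import org.elasticsearch.common.transport.InetSocketTransportAddress;

      public class CreateBlogIndex {
          public static void main(String[] args) throws Exception {
              // The transport client talks to port 9300 (binary protocol), not the 9200 REST port.
              Client client = TransportClient.builder().build()
                      .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300));

              // Same mapping as the REST body above; defining it up front is optional.
              String postMapping = "{ \"post\": { \"properties\": {"
                      + " \"title\":    { \"type\": \"string\" },"
                      + " \"body\":     { \"type\": \"string\" },"
                      + " \"user\":     { \"type\": \"string\" },"
                      + " \"postDate\": { \"type\": \"date\"   } } } }";

              client.admin().indices()
                    .prepareCreate("blog")
                    .addMapping("post", postMapping)
                    .get();

              client.close();
          }
      }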
  31. COMPARISON WITH SQL (CREATE)
      SQL: INSERT INTO post( title, body, user, postDate ) VALUES( 'My blogpost', 'Having a lot of text...', 'es_user', '2016-01-01 15:03:32' );
      Elasticsearch: POST http://localhost:9200/blog/post with body { "title": "My blogpost", "body": "Having a lot of text...", "user": "es_user", "postDate": "2016-01-01 15:03:32" }
  32. COMPARISON WITH SQL (UPDATE)
      SQL: UPDATE post SET title='My blogpost' WHERE id=1;
      Elasticsearch: POST http://localhost:9200/blog/post/1/_update with body { "doc": { "title": "My blogpost" } }
  33. COMPARISON WITH SQL (DELETE)
      SQL: DELETE FROM post WHERE id=1
      Elasticsearch: DELETE http://localhost:9200/blog/post/1
  34. COMPARISON WITH SQL (READ)
      SQL: SELECT * FROM post WHERE id=1 → Elasticsearch: GET http://localhost:9200/blog/post/1
      SQL: SELECT * FROM post → Elasticsearch: GET http://localhost:9200/blog/post/_search
      SQL: SELECT * FROM post WHERE user='es_user' → Elasticsearch: GET http://localhost:9200/blog/post/_search?q=user:es_user
  35. COMPARISON WITH SQL (READ)
      SQL: SELECT * FROM post WHERE body LIKE '%Having%';
      Elasticsearch: POST http://localhost:9200/blog/post/_search with body { "query": { "match": { "body": "Having" } } }
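      A hedged Java sketch of the same full-text query through the 2.x-era native client (an illustration rather than the exact code used in the demo; unlike LIKE, match analyzes the text and ranks the hits by relevance):

      import java.net.InetAddress;

      import org.elasticsearch.action.search.SearchResponse;
      import org.elasticsearch.client.Client;
      import org.elasticsearch.client.transport.TransportClient;
      import org.elasticsearch.common.transport.InetSocketTransportAddress;
      import org.elasticsearch.index.query.QueryBuilders;
      import org.elasticsearch.search.SearchHit;

      public class SearchBlogPosts {
          public static void main(String[] args) throws Exception {
              Client client = TransportClient.builder().build()
                      .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300));

              // Equivalent of: SELECT * FROM post WHERE body LIKE '%Having%'
              SearchResponse response = client.prepareSearch("blog")
                      .setTypes("post")
                      .setQuery(QueryBuilders.matchQuery("body", "Having"))
                      .get();

              // Each hit carries its id, relevance score, and the original JSON source.
              for (SearchHit hit : response.getHits().getHits()) {
                  System.out.println(hit.getId() + " -> " + hit.getSourceAsString());
              }

              client.close();
          }
      }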
  36. DEMO TIME
  37. ELASTICSEARCH AND JAVA • Native Java client • Spring Data Elasticsearch • REST endpoints • Jest (https://github.com/searchbox-io/Jest) https://github.com/terrafant/es-feeder
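      As a rough sketch of the first option, indexing the blog post from the earlier slides with the native client could look like the snippet below (assumes a local 2.x node; Jest or Spring Data Elasticsearch wrap the same operation behind their own APIs):

      import java.net.InetAddress;

      import org.elasticsearch.action.index.IndexResponse;
      import org.elasticsearch.client.Client;
      import org.elasticsearch.client.transport.TransportClient;
      import org.elasticsearch.common.transport.InetSocketTransportAddress;
      import org.elasticsearch.common.xcontent.XContentFactory;

      public class IndexBlogPost {
          public static void main(String[] args) throws Exception {
              Client client = TransportClient.builder().build()
                      .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300));

              // Equivalent of: POST http://localhost:9200/blog/post with the JSON document as the body.
              // The postDate value is taken from the slides; it must match the date format in the mapping.
              IndexResponse response = client.prepareIndex("blog", "post")
                      .setSource(XContentFactory.jsonBuilder()
                              .startObject()
                                  .field("title", "My blogpost")
                                  .field("body", "Having a lot of text...")
                                  .field("user", "es_user")
                                  .field("postDate", "2016-01-01 15:03:32")
                              .endObject())
                      .get();

              System.out.println("Indexed document id: " + response.getId());
              client.close();
          }
      }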
  38. DEMO TIME
  39. ELASTICSEARCH USAGE (diagram): the application talks to the DB over JDBC (SQL requests, binary protocol) and to the Elasticsearch cluster through an ES client (REST or native transport, exchanging JSON).
  40. ELASTICSEARCH USAGE (DETAILS) (diagram): a load balancer sits in front of client nodes, which route requests into the Elasticsearch cluster; the cluster itself consists of master-eligible nodes (one of them the elected master node) and a pool of data nodes.
  41. ELASTICSEARCH USAGE (ELK) (diagram): Logstash shippers on the frontend, the backend, and the DB send logs through a broker into Elasticsearch; Kibana, accessed from the browser, visualizes the indexed data.
  42. TOP 10 PRODUCTION RECOMMENDATIONS 1. Take care of security
  43. TOP 10 PRODUCTION RECOMMENDATIONS 1. Take care of security 2. Avoid split-brain
  44. TOP 10 PRODUCTION RECOMMENDATIONS 1. Take care of security 2. Avoid split-brain 3. Use dedicated master nodes
  45. TOP 10 PRODUCTION RECOMMENDATIONS 1. Take care of security 2. Avoid split-brain 3. Use dedicated master nodes 4. Use unicast (not multicast)
  46. TOP 10 PRODUCTION RECOMMENDATIONS 1. Take care of security 2. Avoid split-brain 3. Use dedicated master nodes 4. Use unicast (not multicast) 5. Configure recovery settings
  47. TOP 10 PRODUCTION RECOMMENDATIONS 1. Take care of security 2. Avoid split-brain 3. Use dedicated master nodes 4. Use unicast (not multicast) 5. Configure recovery settings 6. Number of replicas is not less than 2
  48. TOP 10 PRODUCTION RECOMMENDATIONS 1. Take care of security 2. Avoid split-brain 3. Use dedicated master nodes 4. Use unicast (not multicast) 5. Configure recovery settings 6. Number of replicas is not less than 2 7. Allocate enough physical memory
  49. TOP 10 PRODUCTION RECOMMENDATIONS 1. Take care of security 2. Avoid split-brain 3. Use dedicated master nodes 4. Use unicast (not multicast) 5. Configure recovery settings 6. Number of replicas is not less than 2 7. Allocate enough physical memory 8. Configure OS user
  50. TOP 10 PRODUCTION RECOMMENDATIONS 1. Take care of security 2. Avoid split-brain 3. Use dedicated master nodes 4. Use unicast (not multicast) 5. Configure recovery settings 6. Number of replicas is not less than 2 7. Allocate enough physical memory 8. Configure OS user 9. Use monitoring tools
  51. TOP 10 PRODUCTION RECOMMENDATIONS 1. Take care of security 2. Avoid split-brain 3. Use dedicated master nodes 4. Use unicast (not multicast) 5. Configure recovery settings 6. Number of replicas is not less than 2 7. Allocate enough physical memory 8. Configure OS user 9. Use monitoring tools 10. Use the Oracle JDK
  52. THANK YOU! Get social @elastic Explore the docs elastic.co/guide Give it a try elastic.co/downloads/elasticsearch Join the community discuss.elastic.co Check ELK stack demo.elastic.co

Editor's notes

  1. Elasticsearch builds distributed capabilities on top of Apache Lucene to provide the most powerful full-text search capabilities available. A powerful, developer-friendly query API supports multilingual search, geolocation, contextual did-you-mean suggestions, autocomplete, and result snippets.
  2. A cluster may contain multiple indices that can be queried independently or as a group. Index aliases allow filtered views of an index, and may be updated transparently to your application. Elasticsearch allows you to start small and scale horizontally as you grow. Simply add more nodes, and let the cluster automatically take advantage of the extra hardware. Petabytes of data? Thousands of nodes? No problem.
  3. Elasticsearch clusters are resilient — they will detect new or failed nodes, and reorganize and rebalance data automatically, to ensure that your data is safe and accessible.
  4. Elasticsearch can be downloaded, used, and modified free of charge. It is available under the Apache 2 license, one of the most flexible open source licenses available.
  5. Store complex real world entities in Elasticsearch as structured JSON documents. All fields are indexed by default, and all the indices can be used in a single query, to easily return complex results at breathtaking speed.
  6. Elasticsearch is API driven. Almost any action can be performed using a simple RESTful API using JSON over HTTP. Client libraries are available for many programming languages.
  7. How long can you wait for insights on your fast-moving data? With Elasticsearch, all data is immediately made available for search and analytics. Combining the speed of search with the power of analytics changes your relationship with your data. Interactively search, discover, and analyze to gain insights that improve your products or streamline your business.
  8. Apache Lucene is a high-performance, full-featured information retrieval library written in Java. Elasticsearch uses Lucene internally to build its state-of-the-art distributed search and analytics capabilities. The comparison is similar to comparing a car with its engine.
  9. You can do distributed indexes with Sphinx, but ES's sharding/auto-allocation system is really nice when dealing with more than 1 server.
  10. ES was invented by Shay Banon in 2010, originally to ease his chef wife's life when she had to find a recipe in her huge pile of recipes. Hadoop distributors: Cloudera, MapR, HortonWorks/HDP, DataStax. Solr uses ZooKeeper for cluster management.
  11. An index is divided into immutable segments. To add more documents, add more segments. In-place updates are not supported; to update a document, delete it and add it again. Keep the number of segments low for fast search and to reclaim space from deleted documents. Merging is expensive (so the IndexBuffer keeps documents in memory).
  12. Lucene started out as a library for full-text search. Today Lucene is used for so much more: Analytics, Geo Search, Suggestions, Data store, structured search. To support these functionalities the Inverted Index was not enough any more, the Column store was introduced in 4.0. This enabled faster aggregations and sorting
  14. The term frequency (tf) for term t in document d is the square root of the number of times the term appears in the document. The inverse document frequency (idf) of term t is the logarithm of the number of documents in the index, divided by the number of documents that contain the term. The field-length norm (norm) is the inverse square root of the number of terms in the field.
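      Written out as formulas (a sketch of how these three factors are usually stated for Lucene's classic TF/IDF scoring; the exact constants inside Lucene differ slightly between versions):

      \mathrm{tf}(t, d) = \sqrt{\mathrm{freq}(t, d)}
      \mathrm{idf}(t) = 1 + \log\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}
      \mathrm{norm}(d) = \frac{1}{\sqrt{\mathrm{numTerms}(d)}}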
  15. node.data: true; node.master: true; http.enabled: false; discovery.zen.minimum_master_nodes: (number of master-eligible nodes / 2) + 1, e.g. with 3 master-eligible nodes this gives 3 / 2 + 1 = 2.
  16. Old school: cat, grep, awk, cut. Good luck with analyzing 200 GB of unstructured logs. X-Pack. Do not hardcode fields; beware of notification-email flooding (1000x) (solution: Zabbix); check the quality of your logs. Broker: RabbitMQ, Redis, Apache Kafka.
  17. Monitoring: Marvel + Watcher (X-Pack), elasticsearch-head, Bigdesk, ElasticHQ, New Relic, or AppDynamics (PHP). Disable deletion of all indexes. Ensure unique, non-random node naming. For the OS user, increase the number of open file descriptors and the max locked memory. Memory allocation: keep it simple, e.g. 50% of RAM for the OS and 50% for the ES JVM (with 16 GB of RAM, 8 GB assigned to ES). I used 5000 as a batch size for a document size of ~1.5 KB; storing a batch of 5000 documents took about 1.5-3 seconds, mostly around 1.5-2 seconds. Recovery settings: gateway.recover_after_nodes: 8; gateway.expected_nodes: 10; gateway.recover_after_time: 5m.