We went over what Big Data is and its value. This talk will cover the details of Elasticsearch, a Big Data solution. Elasticsearch is a distributed search engine built on Lucene that stores its data NoSQL-style, as schema-free JSON documents.
We'll cover:
• Elasticsearch basics
• Setting up a development environment
• Loading data
• Searching data using REST
• Searching data using NEST, the .NET interface
• Understanding Scores
Finally, I'll show a use case for data mining using Elasticsearch.
You'll walk away from this armed with the knowledge to add Elasticsearch to your data analysis toolkit and your applications.
4. Big Data
“Big data is an all-encompassing term for any collection of data sets so
large and complex that it becomes difficult to process using traditional
data processing applications.”
- Wikipedia
5. The 3 Vs
• Volume
• From a few gigabytes up to petabytes
• Velocity
• Arrives quickly
• Variety
• Multiple types of data
6. What is Elasticsearch?
• You know, for search…
• Elasticsearch is a search server based on Lucene. It provides a
distributed, multitenant-capable full-text search engine with a RESTful
web interface and schema-free JSON documents. Elasticsearch is
developed in Java and is released as open source under the terms of
the Apache License.
7. Let’s break that down…
• Distributed
• Run on multiple servers simultaneously
• Multitenant
• The same system serving different groups of data
• REST
• Web-based programming interface
• NoSQL for storage
• Uses JSON
• Open Source
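To make the REST and JSON points concrete, here is a minimal Python sketch of what storing one document could look like. It only builds the HTTP request and does not send it. The server address `http://localhost:9200`, the index name `library`, the type name `book`, and the helper `build_index_request` are all assumptions made up for illustration. The URL shape `/{index}/{type}/{id}` matches the Elasticsearch versions that still have types; newer versions use `_doc` in place of the type name.

```python
import json
from urllib.request import Request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that would store one JSON document.

    Hypothetical helper for illustration only; it is not part of any
    Elasticsearch client library.
    """
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Assumed values: a local node, an index called "library", a type called "book".
document = {"title": "Elasticsearch basics", "year": 2015, "tags": ["search", "nosql"]}
request = build_index_request("http://localhost:9200", "library", "book", 1, document)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing `request` to `urllib.request.urlopen` would send it to a running node. No schema is declared anywhere: the document is just JSON.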
8. So what is Elasticsearch?
• It’s a search engine
• Stores data on multiple machines
• Stores multiple types of data
• Stores in JSON format
• REST interface
• There are managed and unmanaged programming interfaces
• .NET
• Java
• Node.js
• JavaScript
• Scala
• Clojure
• PHP
• Perl
• Python
• Ruby
• Haskell
• Erlang
• ColdFusion
• Smalltalk
• OCaml
• CommandLine
• EventMachine
• Go
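Whichever client you pick, it ends up talking to the same REST interface. As a rough sketch of what a client produces under the hood, the Python below builds a search request with a `match` query. The host, the index name `library`, the field name `title`, and the helper `build_search_request` are assumptions for illustration; the request is built but not sent.

```python
import json
from urllib.request import Request


def build_search_request(base_url, index, field, text):
    """Build (but do not send) a full-text search request against one index.

    Hypothetical helper for illustration only.
    """
    url = f"{base_url}/{index}/_search"
    # A "match" query analyzes the text and searches the given field.
    query = {"query": {"match": {field: text}}}
    return Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


search = build_search_request("http://localhost:9200", "library", "title", "elasticsearch basics")
print(search.get_method(), search.full_url)
print(search.data.decode("utf-8"))
```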
10. Definitions
• Cluster
• One or more nodes
• Document
• A stored record
• Field
• A document has a list of fields, or key-value pairs
• Index
• Think of this as a database
• Term
• An exact value to be matched: “FOO”, “Foo”, and “foo” are not the same term
• Type
• Similar to a table in a database
• Text
• Field value
• Analyzed into terms
• Stored in the index
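The difference between text and terms is easier to see in a toy model. The Python sketch below analyzes text into lowercase terms and stores them in a small inverted index. It is a simplification for illustration, not how Lucene or Elasticsearch is implemented; the function names `analyze`, `build_inverted_index`, and `lookup_term` are made up, and the analyzer (split on non-word characters, then lowercase) is only an assumption loosely modeled on a typical default.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: split text on non-word characters and lowercase each piece."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_inverted_index(documents):
    """Map each (field, term) pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for field, value in fields.items():
            for term in analyze(str(value)):
                index[(field, term)].add(doc_id)
    return index


def lookup_term(index, field, term):
    """Exact term lookup: the term is NOT analyzed, so case matters."""
    return index.get((field, term), set())


# Two documents, each with a single "title" field.
documents = {
    1: {"title": "Foo Fighters"},
    2: {"title": "The FOO bar"},
}
index = build_inverted_index(documents)

print(analyze("Foo Fighters"))
print(lookup_term(index, "title", "foo"))
print(lookup_term(index, "title", "Foo"))
```

Because the analyzer lowercased everything at index time, only the term `foo` exists in the index. An exact lookup for `Foo` finds nothing, which is the sense in which “FOO”, “Foo”, and “foo” are different terms.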