Wanna search? Piece of cake!

A fast, scalable, and easy-to-set-up search engine for your
data.
by Alexey Kursov
http://www.linkedin.com/in/kursov

ElasticSearch is a
● distributed
● RESTful
● free/open source search server
● based on Apache Lucene.
It is developed by Shay Banon (@kimchy) and released
under the terms of the Apache License. ElasticSearch is
written in Java.
http://elasticsearch.org/
http://elasticsearch.com/
WTF?

Apache Lucene is a
● free/open source information retrieval software library
● originally created in Java
● it is supported by the Apache Software Foundation
● it is released under the Apache Software License
While suitable for any application which requires full text indexing and
searching capability, Lucene has been widely recognized for its utility in the
implementation of Internet search engines and local, single-site searching.
http://lucene.apache.org/core/
Lucene?

Indexing.
ElasticSearch achieves fast search responses because it searches an
index rather than the text itself.
This type of index is called an inverted index, because it inverts
a page-centric data structure (page->words) into a keyword-centric
data structure (word->pages).
ElasticSearch uses Apache Lucene to create and manage this inverted
index.
Basic Concepts

In computer science, an inverted index is an index data structure storing a mapping
from content, such as words or numbers, to its locations in a database file, or in a
document or a set of documents. The purpose of an inverted index is to allow fast full
text searches, at a cost of increased processing when a document is added to the
database.
Simple example:
Given the texts:
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
we have the following inverted file index (where the integers in the set notation brackets refer to
the indexes (or keys) of the text symbols, T[0], T[1] etc.):
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
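
The example above can be reproduced with a few lines of Python. This is only a minimal sketch for illustration, not how Lucene stores its index: the function names `build_inverted_index` and `search_all` are made up here, and tokenization is assumed to be plain lowercase splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(texts):
    """Map each word to the set of text ids (list positions) containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        # Assumption: words are separated by whitespace and compared in lowercase.
        for word in text.lower().split():
            index[word].add(doc_id)
    return dict(index)


def search_all(index, query):
    """Return the ids of the texts that contain every word of the query."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()


texts = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(texts)

# Print the index in the same word -> ids form as the slide.
for word in sorted(index):
    print(f'"{word}": {sorted(index[word])}')

# A multi-word query is answered by intersecting the per-word sets.
print(search_all(index, "what is it"))
```

Building the index costs extra work for every added text, but a query such as "what is it" then only needs set lookups and one intersection (texts 0 and 1 here) instead of a scan over all texts.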
Inverted index

Basic Concepts
Data representation.
In ElasticSearch, a Document is the unit of search and index. An index
consists of one or more Documents, and a Document consists of one or more
Fields (in database terminology, a Document corresponds to a table row, and a
Field corresponds to a table column).
Schema declares:
- what fields there are
- which field should be used as the unique/primary key
- which fields are required
- how to index and search each field
- etc.
An index may store documents of different "mapping types".
You can associate multiple mapping definitions for each mapping type.
A mapping type is a way of separating the documents in an index into
logical groups.
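
To make the Document / Field / schema vocabulary concrete, here is a rough Python model. It is a sketch under stated assumptions, not ElasticSearch's mapping syntax or API: `ARTICLE_SCHEMA`, `validate`, `add_document`, and the "article" type with its fields are all hypothetical names invented for this illustration.

```python
# Hypothetical schema for one mapping type; NOT ElasticSearch's mapping format.
ARTICLE_SCHEMA = {
    "key": "id",                      # field used as the unique/primary key
    "required": {"id", "title"},      # fields every document must carry
    "fields": {"id": int, "title": str, "body": str},  # known fields and types
}


def validate(doc, schema):
    """Check a document (a dict of field -> value) against the schema."""
    missing = schema["required"] - set(doc)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    for name, value in doc.items():
        expected = schema["fields"].get(name)
        if expected is None:
            raise ValueError(f"unknown field: {name}")
        if not isinstance(value, expected):
            raise TypeError(f"field {name!r} must be {expected.__name__}")
    return doc


def add_document(index, type_name, doc, schema):
    """Store the document under its mapping type, keyed by the schema's key field."""
    validate(doc, schema)
    index.setdefault(type_name, {})[doc[schema["key"]]] = doc


# An index groups documents by mapping type: type name -> {key: document}.
index = {}
add_document(index, "article", {"id": 1, "title": "Inverted index"}, ARTICLE_SCHEMA)
print(index["article"][1]["title"])
```

In the database analogy from the slide, each dict passed to `add_document` plays the role of a table row and each of its keys the role of a column.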

Competitors?
http://lucene.apache.org/solr/
http://sphinxsearch.com/

What's the same?
Solr vs. ElasticSearch
Lucene query, facet, and index functionality.
Implementation:
Very similar, with some differences and nuances on each side. A lot has been
written about this online; see, for example, this series of articles:
http://blog.sematext.com/2012/08/23/solr-vs-elasticsearch-part-1-overview/

What's the difference?
Solr vs. ElasticSearch
ElasticSearch main advantages (IMHO):
1. Low barrier to entry. ElasticSearch is a more "intuitive, accessible" system
(significantly less configuration, thanks to sensible defaults and a schema that
can be built dynamically over HTTP)
2. JSON-based API is cleaner and easier to use
3. The replication and sharding capabilities are much simpler to configure
4. Complex documents (nested)
5. Multiple document types per schema
6. Joins (parent/child relationships)
7. Online schema changes
8. Self-contained cluster

What's the difference?
Solr vs. ElasticSearch
Solr main advantages (IMHO):
1. Solr has a bigger, more mature user, dev, and contributor community
2. Solr is more mature and maybe more stable
3. Solr has more response formats (XML, CSV, JSON)
4. Better 3rd-party product integration
5. Pivot Facets
6. More customizable

ES Clients and "river" plugins
There are clients for the following languages and platforms (per the official site):
Java, .NET, Perl, Python, Ruby, PHP, JavaScript, Scala, Clojure, Go,
Erlang, EventMachine, OCaml, Smalltalk
There are "river" (data import) plugins for:
JDBC, CouchDB, Wikipedia, Twitter, RabbitMQ, RSS, MongoDB, Open
Archives Initiative (OAI), St9, Sofa, Amazon SQS, LDAP, Dropbox, ActiveMQ,
Solr, CSV, JMS

How to connect from my code?
NEST
(The folks at stackoverflow.com and I think it is the best .NET client for ElasticSearch.)
NEST aims to be a .NET client with a very concise API. (http://github.com/Mpdreamz/NEST)
Its main goal is to provide a solid, strongly typed ElasticSearch client. It also has string/dynamic
overloads for more dynamic use cases.
Why NEST?
● Fluent. Looks like:
ElasticClient.Search<Foo>(s => s.From(0).Size(10).SortAscending(f => f.Name).Query(...
● JSON serializer/deserializer: Newtonsoft Json.NET, with all its advantages
● Strongly typed
● Useful attributes for configuration
● Actively developed and improved
● Open source
● Clean, readable source code
● Available on NuGet
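
Because ElasticSearch is RESTful, a client call like the fluent one above ends up as a JSON search request. The Python sketch below builds a body with the same paging and sorting options; it is an approximation, not NEST's actual serialized output. The helper `build_search_body` is hypothetical, the field name `name` is an assumed JSON name for `f.Name`, and since the slide elides the query, `match_all` is used purely as a placeholder.

```python
import json


def build_search_body(start, size, sort_field, query=None):
    """Build a search request body as a plain dict (hypothetical helper)."""
    return {
        "from": start,                              # offset of the first hit
        "size": size,                               # number of hits to return
        "sort": [{sort_field: {"order": "asc"}}],   # ascending sort on one field
        # Assumption: the original query is elided, so match_all stands in for it.
        "query": query if query is not None else {"match_all": {}},
    }


body = build_search_body(0, 10, "name")
print(json.dumps(body, indent=2))
```

A body like this would typically be sent to an index's `_search` endpoint over HTTP; the sketch stops at building the JSON so it runs without a server.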
