3. Information retrieval
Information retrieval (IR) is finding
material (usually documents) of an unstructured
nature (usually text) that satisfies an information
need from within large collections (usually stored
on computers).
A search engine is a software system
designed to search for information. It is one
implementation of IR.
4. What is a search engine?
A search engine is:
An index engine for documents
A search engine on indexes
A search engine is more powerful for
searching because it is designed for it!
12. The following two files
File1: Students should be allowed to go
out with their friends, but not allowed
to drink beer.
File2: My friend Jerry went to school to
see his students but found them drunk
which is not allowed.
13. Step 1: Tokenizer
Split the document into words
Remove punctuation
Remove stop words (the, a, this, that, etc.)
“Students”,“allowed”,“go”,“their”,
“friends”,“allowed”,“drink”,“beer”,“My”,
“friend”,“Jerry”,“went”,“school”,“see”,
“his”,“students”,“found”,“them”,“drunk”,
“allowed”
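The tokenizer step can be sketched in a few lines of Python. This is a minimal illustration, not Lucene's tokenizer: the `tokenize` function and the `STOP_WORDS` set are assumptions, and the stop-word list was chosen only so that the output matches the token list on this slide.

```python
import re

# Assumed stop-word list: the slide names "the, a, this, that";
# the remaining entries are guesses that reproduce the slide's output.
STOP_WORDS = {
    "the", "a", "this", "that",
    "should", "be", "to", "out", "with", "but", "not", "which", "is",
}

FILE1 = ("Students should be allowed to go out with their friends, "
         "but not allowed to drink beer.")
FILE2 = ("My friend Jerry went to school to see his students "
         "but found them drunk which is not allowed.")


def tokenize(text):
    """Split text into words, dropping punctuation and stop words."""
    # Keeping only runs of letters both splits the text and removes punctuation.
    words = re.findall(r"[A-Za-z]+", text)
    # Stop words are compared case-insensitively; the original case is kept.
    return [w for w in words if w.lower() not in STOP_WORDS]


tokens = tokenize(FILE1) + tokenize(FILE2)
print(tokens)
```

With this stop-word list, `tokens` holds the 20 words listed above, still in their original case.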
14. Step 2: Linguistic Processor
Lowercase
Stemming: cars -> car, etc.
Lemmatization: drove -> drive, etc.
The results are called terms:
“student”,“allow”,“go”,“their”,“friend”,
“allow”,“drink”,“beer”,“my”,“friend”,
“jerry”,“go”,“school”,“see”,“his”,
“student”,“find”,“them”,“drink”,“allow”
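A rough sketch of the linguistic processor follows. Real engines use an actual stemmer or lemmatizer; here a small hand-written lookup table (`NORMAL_FORMS`, a hypothetical name) stands in for both, covering only the words that occur in the two example files.

```python
# Hypothetical lookup table standing in for a real stemmer/lemmatizer.
NORMAL_FORMS = {
    "students": "student",  # stemming: plural -> singular
    "friends": "friend",
    "allowed": "allow",     # stemming: strip the "-ed" suffix
    "went": "go",           # lemmatization: irregular verb forms
    "found": "find",
    "drunk": "drink",
}

TOKENS = [
    "Students", "allowed", "go", "their", "friends", "allowed", "drink",
    "beer", "My", "friend", "Jerry", "went", "school", "see", "his",
    "students", "found", "them", "drunk", "allowed",
]


def to_terms(tokens):
    """Lowercase each token, then map it to its normal form if one is known."""
    terms = []
    for token in tokens:
        lowered = token.lower()
        terms.append(NORMAL_FORMS.get(lowered, lowered))
    return terms


terms = to_terms(TOKENS)
print(terms)
```

Under these assumptions, `terms` equals the term list shown on this slide.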
15. Step 3: Index
Term       Document ID
student    1
allow      1
go         1
their      1
friend     1
allow      1
…          …
Build the dictionary, sort it by term, then
merge equal terms into posting lists
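The indexing step can be sketched as follows, assuming the terms of the two example files are already available. `build_index` and `DOC_TERMS` are illustrative names, and a plain Python dict stands in for the on-disk structures a real engine would use.

```python
from collections import defaultdict

# Terms per document ID, as produced by the linguistic processor.
DOC_TERMS = {
    1: ["student", "allow", "go", "their", "friend", "allow", "drink", "beer"],
    2: ["my", "friend", "jerry", "go", "school", "see", "his", "student",
        "find", "them", "drink", "allow"],
}


def build_index(doc_terms):
    """Return a mapping of term -> sorted list of document IDs."""
    # 1. Dictionary: one (term, document ID) pair per occurrence.
    pairs = [(term, doc_id)
             for doc_id, terms in doc_terms.items()
             for term in terms]
    # 2. Sort by term, then by document ID.
    pairs.sort()
    # 3. Merge equal terms into posting lists, skipping duplicate IDs.
    index = defaultdict(list)
    for term, doc_id in pairs:
        postings = index[term]
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return dict(index)


index = build_index(DOC_TERMS)
print(index["student"])  # posting list for "student"
```

Here `index["student"]` is `[1, 2]` because the term occurs in both files, while `index["beer"]` is `[1]`.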
18. Step 1: User search query
• Suppose you have the following query:
lucene AND learned NOT hadoop
19. Step 2: Lexical & Syntax Analysis
Identify words and keywords
Words: lucene, learned, hadoop
Keywords: AND, NOT
Building a syntax tree
NOT
  AND
    lucene
    learned
  hadoop
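Lexical and syntax analysis for this query can be sketched with a toy grammar of the form `word (KEYWORD word)*`, read left to right. This is an assumption made for illustration and is not Lucene's actual query parser; `lex` and `parse` are hypothetical helpers.

```python
KEYWORDS = {"AND", "NOT"}


def lex(query):
    """Lexical analysis: tag each token as a KEYWORD or a WORD."""
    tokens = []
    for raw in query.split():
        if raw in KEYWORDS:
            tokens.append(("KEYWORD", raw))
        else:
            tokens.append(("WORD", raw.lower()))
    return tokens


def parse(tokens):
    """Syntax analysis: build a left-associative tree of (op, left, right)."""
    if not tokens or tokens[0][0] != "WORD":
        raise ValueError("query must start with a word")
    tree = tokens[0][1]
    i = 1
    while i < len(tokens):
        kind, op = tokens[i]
        if kind != "KEYWORD":
            raise ValueError("expected AND or NOT")
        if i + 1 >= len(tokens) or tokens[i + 1][0] != "WORD":
            raise ValueError("keyword must be followed by a word")
        # The tree built so far becomes the left child of the new operator.
        tree = (op, tree, tokens[i + 1][1])
        i += 2
    return tree


tree = parse(lex("lucene AND learned NOT hadoop"))
print(tree)  # ('NOT', ('AND', 'lucene', 'learned'), 'hadoop')
```

The nested tuple has NOT at the root, the AND subtree on the left, and hadoop on the right.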
20. Step 3: Search
Search the inverted index
Sort, Conjunction, Disjunction
Scorer
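A minimal sketch of the search step, assuming the query has already been turned into a tree of `(operator, left, right)` tuples and that posting lists are sorted lists of document IDs. The `POSTINGS` data is made up for illustration, and scoring is left out.

```python
# Hypothetical posting lists: term -> sorted document IDs.
POSTINGS = {
    "lucene": [1, 2, 4, 7],
    "learned": [2, 4, 5, 7],
    "hadoop": [4, 6],
}


def intersect(a, b):
    """Conjunction: walk two sorted lists in step, keeping common IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def difference(a, b):
    """NOT: keep the IDs of a that do not appear in b."""
    excluded = set(b)
    return [doc_id for doc_id in a if doc_id not in excluded]


def evaluate(node):
    """Evaluate a query tree against the posting lists."""
    if isinstance(node, str):
        return POSTINGS.get(node, [])
    op, left, right = node
    left_ids, right_ids = evaluate(left), evaluate(right)
    if op == "AND":
        return intersect(left_ids, right_ids)
    if op == "NOT":
        return difference(left_ids, right_ids)
    raise ValueError("unknown operator: " + op)


result = evaluate(("NOT", ("AND", "lucene", "learned"), "hadoop"))
print(result)  # [2, 7]
```

With this data, documents 2, 4 and 7 contain both lucene and learned, and document 4 is dropped because it also contains hadoop.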
21. Elasticsearch
Real-time search and analytics engine
Full-text search
RESTful API
JSON over HTTP
Open source
High availability
Schema free
Lucene based
Distributed
22. Elasticsearch
● Distributed and highly available search engine.
  ○ Each index is fully sharded with a configurable number of shards.
  ○ Each shard can have one or more replicas.
  ○ Read / search operations are performed on any one of the replica shards.
● Multi-tenant with multiple types.
  ○ Support for more than one index.
  ○ Support for more than one type per index.
  ○ Index-level configuration (number of shards, index storage, ...).
● Document oriented
  ○ No need for upfront schema definition.
  ○ Schema can be defined per type to customize the indexing process.
● Varied set of APIs
  ○ HTTP RESTful API.
  ○ Native Java API.
  ○ All APIs perform automatic node operation rerouting.
● (Near) real-time search.
● Reliable, asynchronous write-behind for long-term persistence.
● Built on top of Lucene
  ○ Each shard is a fully functional Lucene index.
  ○ All the power of Lucene easily exposed through simple configuration / plugins.
● Per-operation consistency
  ○ Single document-level operations are atomic, consistent, isolated and durable.
● Open source under the Apache License, version 2 ("ALv2")
24. Cluster
● A cluster is a collection of one or more
nodes (servers) that together hold your
entire data set and provide federated indexing
and search capabilities across all nodes
● A cluster is identified by a unique name,
which by default is "elasticsearch"
Terminologies of Elastic Search
25. Node
● A node is an Elasticsearch instance (a Java process)
● A node is created when an Elasticsearch instance is
started
● A random Marvel character name is assigned to it by
default
26. Index
● An index is a collection of documents that have
somewhat similar characteristics, e.g. customer data,
product catalog
● An index is identified by a name, which is used to refer
to it when performing indexing, search, update,
and delete operations against the documents in it
● You can define as many indexes as you want in a single cluster
27. Document
● A document is the basic unit of information that can be
indexed
● It is expressed in JSON (key: value pairs), e.g.
'{"user": "nullcon"}'
● Every document is associated with a type and a
unique ID.
28. Shard
● Every index can be split into multiple shards to
be able to distribute data.
● The shard is the atomic part of an index, which
can be distributed over the cluster if you add
more nodes.
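How a document ends up on a particular shard can be sketched as a hash of its routing value (by default the document ID) modulo the number of primary shards. The code below is only an illustration of that idea: `shard_for` is a hypothetical helper, and CRC32 is used as a stand-in hash, which is not the function Elasticsearch itself uses.

```python
import zlib


def shard_for(routing, number_of_shards):
    """Pick a shard: hash(routing) % number_of_shards (CRC32 as a stand-in hash)."""
    digest = zlib.crc32(routing.encode("utf-8"))
    return digest % number_of_shards


# Assign a few hypothetical document IDs to the shards of a 5-shard index.
assignments = {doc_id: shard_for(doc_id, 5) for doc_id in ["1", "2", "3", "42"]}
print(assignments)
```

Because the result depends on the number of shards, the same document ID can map to a different shard if that number changes, which is one reason the number of primary shards is fixed when an index is created.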
31. A terminology comparison
Relational database     Elasticsearch
Database                Index
Table                   Type
Row                     Document
Column                  Field
Schema                  Mapping
Index                   Everything is indexed
SQL                     Query DSL
SELECT * FROM tb …      GET http://
UPDATE tb SET …         PUT http://
35. Example: Search
curl -XGET -H 'Content-Type: application/json' http://localhost:9200/my_index/_search -d '{
  "query": {
    "match_all": {}
  }
}'
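The response reports:
Total number of matching docs (hits.total)
Relevance of each hit (_score)
Search time in milliseconds (took)
Max score (hits.max_score)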
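To show where these values sit in the JSON response, here is a small sketch that pulls them out of a parsed response. The `SAMPLE_RESPONSE` body is hand-written for illustration and not real output, and `summarize` is a hypothetical helper. It accepts `hits.total` either as a plain number or as an object with a `value` field, since the shape differs between Elasticsearch versions.

```python
import json

# Hand-written sample shaped like a _search response (not real output).
SAMPLE_RESPONSE = json.loads("""
{
  "took": 3,
  "hits": {
    "total": 2,
    "max_score": 1.0,
    "hits": [
      {"_index": "my_index", "_id": "1", "_score": 1.0, "_source": {"user": "nullcon"}},
      {"_index": "my_index", "_id": "2", "_score": 1.0, "_source": {"user": "jerry"}}
    ]
  }
}
""")


def summarize(response):
    """Extract search time, total hits, max score and per-hit relevance."""
    hits = response["hits"]
    total = hits["total"]
    if isinstance(total, dict):
        # Some versions report the total as {"value": ..., "relation": ...}.
        total = total["value"]
    return {
        "search_time_ms": response["took"],
        "total_docs": total,
        "max_score": hits["max_score"],
        "scores": [hit["_score"] for hit in hits["hits"]],
    }


summary = summarize(SAMPLE_RESPONSE)
print(summary)
```

With `match_all`, every document matches equally, so in this sample each hit carries the same score as `max_score`.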