INTRODUCTION TO
ELASTICSEARCH
2

Agenda
• Me
• ElasticSearch Basics
• Concepts
• Network / Discovery
• Data Structure
• Inverted Index
• The REST API
• Sample Deployment
3

Me
• Roy Russo
• JBoss Portal Co-Founder
• LoopFuse Co-Founder
• ElasticHQ
• http://www.elastichq.org
• AltiSource Labs Architect
4

ElasticSearch in One Slide
• Document-Oriented Search Engine
• JSON
• Apache Lucene
• No Schema
• Mapping Types
• Horizontal Scale, Distributed
• REST API
• Vibrant Ecosystem
• Tooling, Plugins, Hosting, Client-Libs
5

When to use ElasticSearch
• Full-Text Search
• Fast Read Database
• “Simple” Data Structures
• Minimize Impedance Mismatch
6

When to use ElasticSearch - Logs
• Logstash + ElasticSearch + Kibana
7

How to use ElasticSearch - CQRS
Client

Command Sent
Ack Resp.

Remote Interfaces
Services
Domain Objects
Data
Storage

Request DTO
DTO Returned
8

How to use ElasticSearch - CQRS
Request DTO
DTO Returned

Client

Command Sent
Ack Resp.

Remote Interfaces

Remote Interfaces

Services

DTO Read Layer

Domain Objects
Event
Storage

?

Data
Storage
9

A note on Rivers
• JDBC
• CouchDB
• MongoDB
• RabbitMQ

• Twitter
• And more…

"type" : "jdbc",
"jdbc" : {
    "driver" : "com.mysql.jdbc.Driver",
    "url" : "jdbc:mysql://localhost:3306/my_db",
    "user" : "root",
    "password" : "mypassword",
    "sql" : "select * from products"
}
10

ElasticSearch at Work
REALTrans

REALServicing
REALSearch

ElasticSearch

REALDoc
11

What sucks about ElasticSearch
• No AUTH/AUTHZ
• No Usage Metrics
12

How the World Uses ElasticSearch
13

The Basics - Distro
• Download and Run
├── bin                                  (Executables)
│   ├── elasticsearch
│   ├── elasticsearch.in.sh
│   └── plugin
├── config                               (Node Configs)
│   ├── elasticsearch.yml
│   └── logging.yml
├── data                                 (Data Storage)
│   └── cluster1
├── lib
│   ├── elasticsearch-x.y.z.jar
│   └── ...
└── logs                                 (Log files)
    ├── elasticsearch.log
    ├── elasticsearch_index_search_slowlog.log
    └── elasticsearch_index_indexing_slowlog.log
14

The Basics - Glossary
• Node = One ElasticSearch instance (one Java process)
• Cluster = 1..N Nodes w/ same Cluster Name
• Index = Similar to a DB
• Named Collection of Documents
• Maps to 1..N Primary shards && 0..N Replica shards
• Mapping Type = Similar to a DB Table
• Document Definition
• Shard = One Lucene instance
• Distributed across all nodes in the cluster.
15

The Basics - Document Structure
• Modeled as a JSON object
{
    "genre": "Crime",
    "language": "English",
    "country": "USA",
    "runtime": 170,
    "title": "Scarface",
    "year": 1983
}

{
    "_index": "imdb",
    "_type": "movie",
    "_id": "u17o8zy9RcKg6SjQZqQ4Ow",
    "_version": 1,
    "_source": {
        "genre": "Crime",
        "language": "English",
        "country": "USA",
        "runtime": 170,
        "title": "Scarface",
        "year": 1983
    }
}
16

The Basics - Document Structure
• Document Metadata fields
• _id
• _type : mapping type
• _source : enabled/disabled
• _timestamp
• _ttl
• _size : size of uncompressed _source
• _version
17

The Basics - Document Structure
• Mapping:
• ES will auto-map (type) fields
• You can specify mapping, if needed
• Data Types:
• String
• Number
• Int, long, float, double, short, byte

• Boolean
• Datetime
• formatted

• geo_point, geo_shape
• Array
• Nested
• IP
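
A rough sketch of what auto-mapping amounts to, assuming simplified rules (whole numbers become long, decimals become double, text becomes string). The function names are hypothetical, and real dynamic mapping also handles cases such as date detection that are left out here.

```python
def infer_field_type(value):
    """Guess a mapping type for one JSON value (illustrative rules only)."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        # An array takes the type of its elements; empty arrays stay unmapped.
        return infer_field_type(value[0]) if value else None
    if isinstance(value, dict):
        return "object"
    return None


def infer_mapping(document):
    """Build a {"properties": ...} mapping from a sample document."""
    return {
        "properties": {
            field: {"type": infer_field_type(value)}
            for field, value in document.items()
        }
    }


movie = {"title": "Scarface", "year": 1983, "runtime": 170, "genre": "Crime"}
mapping = infer_mapping(movie)
```

Specifying the mapping by hand is what lets you override guesses like these.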
18

A Mapping Type
"imdb": {
    "movie": {
        "properties": {
            "country": {
                "type": "string",
                "store": true,
                "index": false
            },
            "genre": {
                "type": "string",
                "null_value": "na",
                "store": false,
                "index": true
            },
            "year": {
                "type": "long"
            }
        }
    }
}
19

Lucene – Inverted Index
• Which presidential speeches contain the word “fair”?
• Go over every speech, word by word, and mark the speeches that contain it
• Fails at large scale
20

Lucene – Inverted Index
• Inverting
• Take all the speeches
• Break them down by word (tokenize)
• For each word, store the IDs of the speeches
• Sort all words (tokens)
• Searching
• Finding the word is fast
• Iterate over document IDs that are referenced
Token   Doc Frequency   Doc IDs
Jobs    2               4,8
Fair    5               1,2,4,8,42
Bush    300             1,2,3,4,5,6, …
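
The inverting and searching steps above can be sketched in a few lines. This is a toy illustration, not how Lucene stores its index; the sample speeches and function names are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each token to the sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[token].add(doc_id)
    # Sorting the tokens mirrors the "sort all words" step.
    return {token: sorted(ids) for token, ids in sorted(postings.items())}


def search(index, word):
    """Look the word up directly instead of scanning every document."""
    return index.get(word.lower(), [])


# Hypothetical sample data.
speeches = {
    1: "A fair deal for every worker",
    2: "Jobs, jobs, jobs",
    3: "Fair trade creates jobs",
}
index = build_inverted_index(speeches)
fair_docs = search(index, "fair")       # IDs of speeches containing "fair"
doc_frequency = len(index["jobs"])      # number of speeches containing "jobs"
```

The lookup cost depends on finding one token, not on how many speeches exist, which is why the word-by-word scan loses at scale.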
21

Lucene – Inverted Index
• Not an algorithm
• Implementations vary
22

Cluster Topology
• 4 Node Cluster
• Index Configuration:
• “A”: 2 Shards, 1 Replica
• “B”: 3 Shards, 1 Replica

[Diagram: the 10 shard copies (A1, A2, B1, B2, B3, plus one replica of each) distributed across the 4 nodes]
23

Building a Cluster
Start Cluster…
start cmd.exe /C elasticsearch -Des.node.name=Primus
start cmd.exe /C elasticsearch -Des.node.data=true -Des.node.master=false -Des.node.name=Slayer
start cmd.exe /C elasticsearch -Des.node.data=true -Des.node.master=true -Des.node.name=Maiden

Create Index…
curl -XPUT 'http://localhost:9200/imdb/' -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 3,
            "number_of_replicas" : 1
        }
    }
}'

Index Document…
curl -XPOST 'http://localhost:9200/imdb/movie/' -d '{
    "genre": "Comedy",
    "language": "English",
    "country": "USA",
    "runtime": 99,
    "title": "Big Trouble in Little China",
    "year": 1986
}'
24

Cluster State
• Cluster State
• Node Membership
• Indices Settings and Mappings (Types)
• Shard Allocation Table
• Shard State
• curl -XGET 'http://localhost:9200/_cluster/state?pretty=1'
25

Cluster State
• Changes in State published from Master to other nodes
[Diagram: PUT /newIndex arrives at node 1, the master (M). The master moves from cluster state CS1 to CS2, then publishes CS2 to nodes 2 and 3.]
26

Discovery
• Nodes discover each other using multicast.
• Unicast is an option
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["host1", "host2:port", "host3"]

• Each cluster has an elected master node
• Beware of split-brain
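
One common guard against split-brain, not spelled out on the slide, is to require a majority of master-eligible nodes before a master can be elected (the `discovery.zen.minimum_master_nodes` setting). A minimal sketch of that arithmetic, with hypothetical function names:

```python
def minimum_master_nodes(master_eligible_nodes):
    """Smallest majority of the master-eligible nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def can_elect_master(visible_master_eligible, quorum):
    """A network partition may elect a master only if it sees a quorum."""
    return visible_master_eligible >= quorum


quorum = minimum_master_nodes(3)
# A 3-node cluster split into partitions of 2 nodes and 1 node:
majority_side = can_elect_master(2, quorum)
minority_side = can_elect_master(1, quorum)
```

With a majority required, at most one side of any partition can hold an election, so two masters cannot coexist.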
27

The Basics - Shards
• Primary Shard:
• First time Indexing
• Index has 1..N primary shards (default: 5)
• Number of primary shards cannot be changed once the index is created
• Replica Shard:
• Copy of the primary shard
• Can be changed later
• Each primary has 0..N replicas
• HA:
• Promoted to primary if primary fails
• Get/Search handled by primary or replica
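
Why the primary shard count is fixed: a document's shard is chosen by hashing its routing key (the ID by default) modulo the number of primary shards. The sketch below assumes `zlib.crc32` as a stand-in hash, which is not the function Elasticsearch uses, and the document IDs are invented.

```python
import zlib


def shard_for(routing_key, number_of_primary_shards):
    """Pick the primary shard for a document from a hash of its routing key."""
    # crc32 is a stand-in: it is deterministic across runs, unlike Python's
    # built-in hash(), but it is not the hash function Elasticsearch uses.
    digest = zlib.crc32(routing_key.encode("utf-8"))
    return digest % number_of_primary_shards


doc_ids = ["movie-%d" % n for n in range(1000)]
with_5_shards = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}
with_6_shards = {doc_id: shard_for(doc_id, 6) for doc_id in doc_ids}
# Documents that would be looked up on a different shard after the change.
moved = [d for d in doc_ids if with_5_shards[d] != with_6_shards[d]]
```

Changing the modulus sends most lookups to the wrong shard, whereas adding replicas only copies existing shards and leaves routing untouched.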
28

Shard Auto-Allocation
• Add a node - Shards Relocate

[Diagram: shards 0P, 1P, 0R and 1R relocating between Node 1 and Node 2 as a node is added]

• Shard Stages
• UNASSIGNED
• INITIALIZING
• STARTED
• RELOCATING
29

The Basics – Searching
• How it works:
• Search request hits a node
• Node broadcasts to every shard in the index
• Each shard performs query
• Each shard returns results
• Results merged, sorted, and returned to client.
• Problems:
• ES has no idea where your document is
• Broadcast query to 100 nodes
• Performance degrades
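
The broadcast-and-merge flow can be sketched as below. It is a toy model: each shard is a plain dict, the score is a simple term count, and the function names are hypothetical.

```python
import heapq


def search_shard(shard_docs, term, size):
    """Run the query on one shard and return its top hits as (score, id)."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term)  # toy score: term frequency
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(size, hits)


def search_index(shards, term, size=10):
    """Broadcast the query to every shard, then merge the per-shard results."""
    per_shard = [search_shard(docs, term.lower(), size) for docs in shards]
    merged = heapq.nlargest(size, (hit for hits in per_shard for hit in hits))
    return [doc_id for _score, doc_id in merged]


# Three shards of one index, with made-up documents.
shards = [
    {"a": "star wars", "b": "scarface"},
    {"c": "a star is born star", "d": "blade runner"},
    {"e": "star trek"},
]
top = search_index(shards, "star", size=2)
```

Every shard does work for every query, even shards holding no matches, which is where the cost grows with cluster size.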
30

The Basics - Shards
• Shard Allocation Awareness
• cluster.routing.allocation.awareness.attributes: rack_id
• Example:
• 2 Nodes with node.rack_id=rack_one
• Create Index 5 shards / 1 replica (10 shards)
• Add 2 Nodes with node.rack_id=rack_two
• Shards RELOCATE to even distribution
• Primary & Replica will NOT be on the same rack_id value.

• Shard Allocation Filtering
• node.tag=val1
• index.routing.allocation.include.tag:val1,val2
curl -XPUT localhost:9200/newIndex/_settings -d '{
"index.routing.allocation.include.tag" : "val1,val2"
}'
31

Nodes
• Master node handles cluster-wide (Meta-API) events:
• Node participation
• New indices create/delete
• Re-Allocation of shards
• Data Nodes
• Indexing / Searching operations
• Client Nodes
• REST calls
• Light-weight load balancers
32

REST API
• Create Index
• action.auto_create_index: 0
• Index Document
• Dynamic type mapping
• Versioning
• ID specification
• Parent / Child (/1122?parent=1111)
33

REST API – Versioning
• Every document is Versioned
• Version assigned on creation
• Version number can be assigned
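
Versions are what make optimistic concurrency possible: a writer states the version it last saw, and the write is rejected if the document has changed since. A minimal in-memory sketch, assuming a hypothetical `VersionedStore` class rather than any real client API:

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, expected_version=None):
        """Write a document and return its new version."""
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                "expected %s, found %s" % (expected_version, current_version)
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, dict(source))
        return new_version


store = VersionedStore()
v1 = store.index("1", {"title": "Scarface"})
v2 = store.index("1", {"title": "Scarface", "year": 1983}, expected_version=v1)
```

A second writer still holding `v1` would now get a `VersionConflict` instead of silently overwriting the newer document.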
34

REST API - Update
• Update using partial data
• Partial doc merged with existing
• Fails if document doesn’t exist
• “Upsert” data used to create a doc, if doesn’t exist
{
    "upsert" : {
        "title": "Blade Runner"
    }
}
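
The update rules above, sketched against a plain dict standing in for the index. The `update` function and `DocumentMissing` exception are hypothetical names, and the merge here is shallow, which is enough for flat documents.

```python
class DocumentMissing(Exception):
    """Raised when a partial update targets a document that does not exist."""


def update(store, doc_id, doc=None, upsert=None):
    """Merge a partial doc into the stored one, or create it from upsert."""
    if doc_id not in store:
        if upsert is None:
            raise DocumentMissing(doc_id)
        store[doc_id] = dict(upsert)
        return store[doc_id]
    store[doc_id].update(doc or {})  # shallow merge, enough for flat documents
    return store[doc_id]


store = {"1": {"title": "Scarface", "year": 1983}}
update(store, "1", doc={"runtime": 170})
update(store, "2", doc={"year": 1982}, upsert={"title": "Blade Runner"})
```

Document "1" gains a field, while document "2" did not exist and is created from the upsert body alone.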
35

REST API
• Exists
• No overhead in loading
• Status Code Result
• Delete
• Get
• Multi-Get

{
    "docs" : [
        {
            "_id" : "1",
            "_index" : "imdb",
            "_type" : "movie"
        },
        {
            "_id" : "5",
            "_index" : "oldmovies",
            "_type" : "movie",
            "_fields" : ["title", "genre"]
        }
    ]
}
36

REST API - Search
• Free Text Search
• URL Request
• http://localhost:9200/imdb/movie/_search?q=scar*
• Complex Query
• http://localhost:9200/imdb/movie/_search?q=scarface+OR+star
• http://localhost:9200/imdb/movie/_search?q=(scarface+OR+star)+AND+year:[1981+TO+1984]
37

REST API - Search
• Search Types:
• http://localhost:9200/imdb/movie/_search?q=(scarface+OR+star)+AND+year:[1941+TO+1984]&search_type=count
• http://localhost:9200/imdb/movie/_search?q=(scarface+OR+star)+AND+year:[1941+TO+1984]&search_type=query_then_fetch
• Query and Fetch (fastest):
• Executes on all shards and returns results
• Query then Fetch (default):
• Executes on all shards. Only the information needed for ranking/sorting is returned first; then only the relevant shards are asked for the data
38

REST API – Query DSL
http://localhost:9200/imdb/movie/_search?q=(scarface+OR+star)+AND+year:[1981+TO+1984]

Becomes…
curl -XPOST 'localhost:9200/imdb/movie/_search?pretty' -d '{
    "query" : {
        "bool" : {
            "must" : [
                {
                    "query_string" : {
                        "query" : "scarface OR star"
                    }
                },
                {
                    "range" : {
                        "year" : { "gte" : 1981, "lte" : 1984 }
                    }
                }
            ]
        }
    }
}'
39

REST API – Query DSL
• Query string requests use Lucene query syntax
• Limited
• Instead use “match” query
• Automatically builds a boolean query

curl -XPOST 'localhost:9200/_search?pretty' -d '{
    "query" : {
        "bool" : {
            "must" : [
                {
                    "match" : {
                        "message" : "scarface star"
                    }
                },
                {
                    "range" : {
                        "year" : { "gte" : 1981 }
                    }
                }
            ]
…
40

REST API – Query DSL
• Match Query
{
    "match": {
        "title": {
            "type": "phrase",
            "query": "quick fox",
            "slop": 1
        }
    }
}

• Boolean Query
• Must: document must match query
• Must_not: document must not match query
• Should: document doesn’t have to match
• If it matches… higher score

{
    "bool": {
        "must": [
            { "match": { "color": "blue" } },
            { "match": { "title": "shirt" } }
        ],
        "must_not": [
            { "match": { "size": "xxl" } }
        ],
        "should": [
            { "match": { "textile": "cotton" } }
        ]
    }
}
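
The must / must_not / should semantics can be sketched as a small evaluator over in-memory documents. This is a simplification under stated assumptions: `match` is reduced to exact, case-insensitive equality, the score is a plain count of matching clauses, and the function names and sample shirts are invented.

```python
def matches(doc, clause):
    """A clause here is a {"match": {field: value}} dict; compare exactly."""
    (field, value), = clause["match"].items()
    return str(doc.get(field, "")).lower() == str(value).lower()


def bool_score(doc, must=(), must_not=(), should=()):
    """Return None if the doc is excluded, else a toy relevance score."""
    if not all(matches(doc, c) for c in must):
        return None
    if any(matches(doc, c) for c in must_not):
        return None
    # Each matching clause adds one point; "should" only affects the score.
    return len(must) + sum(1 for c in should if matches(doc, c))


query = {
    "must": [{"match": {"color": "blue"}}, {"match": {"title": "shirt"}}],
    "must_not": [{"match": {"size": "xxl"}}],
    "should": [{"match": {"textile": "cotton"}}],
}
cotton = {"color": "blue", "title": "shirt", "size": "m", "textile": "cotton"}
linen = {"color": "blue", "title": "shirt", "size": "m", "textile": "linen"}
huge = {"color": "blue", "title": "shirt", "size": "xxl", "textile": "cotton"}
scores = [bool_score(d, **query) for d in (cotton, linen, huge)]
```

The cotton shirt outranks the linen one because of the `should` clause, and the XXL shirt is dropped by `must_not` no matter what else matches.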
41

REST API – Query DSL
• Range Query
• Numeric / Date Types
• Prefix/Wildcard Query
• Match on partial terms
• RegExp Query

{
"range":{
"founded_year":{
"gte":1990,
"lt":2000
}
}
}
42

REST API – Query DSL
• Geo_bbox
• Bounding box filter
• Geo_distance
• Geo_distance_range

{
    "query": {
        "filtered": {
            "query": {
                "match_all": {}
            },
            "filter": {
                "geo_bbox": {
                    "location": {
                        "top_left": {
                            "lat": 40.73,
                            "lon": -74.1
                        },
                        "bottom_right": {
                            "lat": 40.717,
                            "lon": -73.99
                        }
…

{
    "query": {
        "filtered": {
            "query": {
                "match_all": {}
            },
            "filter": {
                "geo_distance": {
                    "distance": "400km",
                    "location": {
                        "lat": 40.73,
                        "lon": -74.1
                    }
                }
…
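
What the two geo filters test, sketched in plain Python. The haversine formula is used as a stand-in for whatever distance calculation the engine applies, and the function names and sample locations are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(doc, center, distance_km):
    """Keep the document if its location is within distance_km of center."""
    loc = doc["location"]
    return haversine_km(loc["lat"], loc["lon"],
                        center["lat"], center["lon"]) <= distance_km


def within_bbox(doc, top_left, bottom_right):
    """Keep the document if its location lies inside the bounding box."""
    loc = doc["location"]
    return (bottom_right["lat"] <= loc["lat"] <= top_left["lat"]
            and top_left["lon"] <= loc["lon"] <= bottom_right["lon"])


center = {"lat": 40.73, "lon": -74.1}
nearby = {"location": {"lat": 40.72, "lon": -74.0}}      # hypothetical doc
far_away = {"location": {"lat": 34.05, "lon": -118.24}}  # hypothetical doc
near_ok = within_distance(nearby, center, 400)
far_ok = within_distance(far_away, center, 400)
in_box = within_bbox(nearby, {"lat": 40.73, "lon": -74.1},
                     {"lat": 40.717, "lon": -73.99})
```

The bounding box check is only four comparisons per document, while the distance check needs trigonometry.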
43

REST API – Bulk Operations
• Bulk API
• Minimize round trips with index/delete ops
• Individual response for every request action
• In order

• Failure of one action will not stop subsequent actions.

• localhost:9200/_bulk

{ "delete" : { "_index" : "imdb", "_type" : "movie", "_id" : "2" } }\n
{ "index" : { "_index" : "imdb", "_type" : "actor", "_id" : "1" } }\n
{ "first_name" : "Tony", "last_name" : "Soprano" }\n
...
{ "update" : { "_index" : "imdb", "_type" : "movie", "_id" : "3" } }\n
{ "doc" : { "title" : "Blade Runner" } }\n
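
A bulk body is newline-delimited JSON: one action line, optionally followed by a payload line. A small sketch of building it with the standard library; the `bulk_body` helper is a hypothetical name, and sending the body to `_bulk` is left out.

```python
import json


def bulk_body(actions):
    """Serialize (action, metadata, payload) triples as newline-delimited JSON."""
    lines = []
    for action, metadata, payload in actions:
        lines.append(json.dumps({action: metadata}))
        if payload is not None:  # "delete" carries no payload line
            lines.append(json.dumps(payload))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"


body = bulk_body([
    ("delete", {"_index": "imdb", "_type": "movie", "_id": "2"}, None),
    ("index", {"_index": "imdb", "_type": "actor", "_id": "1"},
     {"first_name": "Tony", "last_name": "Soprano"}),
    ("update", {"_index": "imdb", "_type": "movie", "_id": "3"},
     {"doc": {"title": "Blade Runner"}}),
])
```

Building the body with `json.dumps` per line avoids pretty-printed JSON, which would break the one-object-per-line format.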
44

Percolate API
• Reversing Search
• Store queries and filter (percolate) documents through them.
• Useful for Alert/Monitoring systems
curl -XPUT localhost:9200/_percolator/stocks/alert-on-nokia -d '{
    "query" : {
        "bool" : {
            "must" : [
                { "term" : { "company" : "NOK" }},
                { "range" : { "value" : { "lt" : "2.5" }}}
            ]
        }
    }
}'

curl -XPUT 'localhost:9200/stocks/stock/1?percolate=*' -d '{
    "doc" : {
        "company" : "NOK",
        "value" : 2.4
    }
}'
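
The "reversed search" idea in miniature: queries are stored first, and each incoming document is run through them. The `Percolator` class below is a hypothetical in-memory model in which a stored query is just a Python predicate.

```python
class Percolator:
    """Stores named queries and reports which ones a document matches."""

    def __init__(self):
        self._queries = {}

    def register(self, name, predicate):
        self._queries[name] = predicate

    def percolate(self, doc):
        """Return the names of all stored queries that match the document."""
        return sorted(n for n, p in self._queries.items() if p(doc))


percolator = Percolator()
percolator.register(
    "alert-on-nokia",
    lambda doc: doc.get("company") == "NOK"
    and "value" in doc and doc["value"] < 2.5,
)
alerts = percolator.percolate({"company": "NOK", "value": 2.4})
quiet = percolator.percolate({"company": "NOK", "value": 3.0})
```

An alerting system would act on the returned query names, for example by notifying whoever registered each one.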
45

Clients
• Client list: http://www.elasticsearch.org/guide/clients/
• Java Client, JS, PHP, Perl, Python, Ruby
• Spring Data:
• Uses TransportClient
• Implementation of ElasticsearchRepository aligns with generic Repository interfaces.
• ElasticsearchCrudRepository extends PagingAndSortingRepository
• https://github.com/spring-projects/spring-data-elasticsearch
@Document(indexName = "book", type = "book", indexStoreType = "memory", shards = 1, replicas = 0, refreshInterval = "-1")
public class Book {
…
}
public interface ElasticSearchBookRepository extends ElasticsearchRepository<Book, String> {
}
46

B’what about Mongo?
• Mongo:
• General purpose DB
• ElasticSearch:
• Distributed text search engine

… that’s all I have to say about that.
47

Questions?

ElasticSearch - DevNexus Atlanta - 2014

  • 2. 2 Agenda • Me • ElasticSearch Basics • Concepts • Network / Discovery • Data Structure • Inverted Index • The REST API • Sample Deployment
  • 3. 3 Me • Roy Russo • JBoss Portal Co-Founder • LoopFuse Co-Founder • ElasticHQ • http://www.elastichq.org • AltiSource Labs Architect
  • 4. 4 ElasticSearch in One Slide • Document - Oriented Search Engine • JSON • Apache Lucene • No Schema • Mapping Types • Horizontal Scale, Distributed • REST API • Vibrant Ecosystem • Tooling, Plugins, Hosting, Client-Libs
  • 5. 5 When to use ElasticSearch • Full-Text Search • Fast Read Database • “Simple” Data Structures • Minimize Impedance Mismatch
  • 6. 6 When to use ElasticSearch - Logs • Logstash + ElasticSearch + Kibana
  • 7. 7 How to use ElasticSearch - CQRS Client Command Sent Ack Resp. Remote Interfaces Services Domain Objects Data Storage Request DTO DTO Returned
  • 8. 8 How to use ElasticSearch - CQRS Request DTO DTO Returned Client Command Sent Ack Resp. Remote Interfaces Remote Interfaces Services DTO Read Layer Domain Objects Event Storage ? Data Storage
  • 9. 9 A note on Rivers • JDBC • CouchDB • MongoDB • RabbitMQ • Twitter • And more… "type" : "jdbc", "jdbc" : { "driver" : "com.mysql.jdbc.Driver", "url" : "jdbc:mysql://localhost:3306/my_db", "user" : "root", "password" : "mypassword", "sql" : "select * from products" }
  • 11. 11 What sucks about ElasticSearch • No AUTH/AUTHZ • No Usage Metrics
  • 12. 12 How the World Uses ElasticSearch
  • 13. 13 The Basics - Distro • Download and Run Executables Node Configs Data Storage Log files ├── bin │ ├── elasticsearch │ ├── elasticsearch.in.sh │ └── plugin ├── config │ ├── elasticsearch.yml │ └── logging.yml ├── data │ └── cluster1 ├── lib │ ├── elasticsearch-x.y.z.jar │ ├── ... │ └── └── logs ├── elasticsearch.log └── elasticsearch_index_search_slowlog.log └── elasticsearch_index_indexing_slowlog.log
  • 14. 14 The Basics - Glossary • Node = One ElasticSearch instance (1 java proc) • Cluster = 1..N Nodes w/ same Cluster Name • Index = Similar to a DB • Named Collection of Documents • Maps to 1..N Primary shards && 0..N Replica shards • Mapping Type = Similar to a DB Table • Document Definition • Shard = One Lucene instance • Distributed across all nodes in the cluster.
  • 15. 15 The Basics - Document Structure • Modeled as a JSON object { { "genre": "Crime", “language": "English", "country": "USA", "runtime": 170, "title": "Scarface", "year": 1983 "_index": "imdb", "_type": "movie", "_id": "u17o8zy9RcKg6SjQZqQ4Ow", "_version": 1, "_source": { "genre": "Crime", "language": "English", "country": "USA", "runtime": 170, "title": "Scarface", "year": 1983 } } }
  • 16. 16 The Basics - Document Structure • Document Metadata fields • _id • _type : mapping type • _source : enabled/disabled • _timestamp • _ttl • _size : size of uncompressed _source • _version
  • 17. 17 The Basics - Document Structure • Mapping: • ES will auto-map (type) fields • You can specify mapping, if needed • Data Types: • String • Number • Int, long, float, double, short, byte • Boolean • Datetime • formatted • geo_point, geo_shape • Array • Nested • IP
  • 18. 18 A Mapping Type "imdb": { "movie": { "properties": { "country": { "type": "string“, “store”:true, “index”:false }, "genre": { "type": "string“, "null_value" : "na“, “store”:false, “index:true }, "year": { "type": "long" } } } }
  • 19. 19 Lucene – Inverted Index • Which presidential speeches contain the words “fair” • Go over every speech, word by word, and mark the speeches that contain it • Fails at large scale
  • 20. 20 Lucene – Inverted Index • Inverting • Take all the speeches • Break them down by word (tokenize) • For each word, store the IDs of the speeches • Sort all words (tokens) • Searching • Finding the word is fast • Iterate over document IDs that are referenced Token Doc Frequency Doc IDs Jobs 2 4,8 Fair 5 1,2,4,8,42 Bush 300 1,2,3,4,5,6, …
  • 21. 21 Lucene – Inverted Index • Not an algorithm • Implementations vary
  • 22. 22 Cluster Topology • 4 Node Cluster • Index Configuration: • “A”: 2 Shards, 1 Replica • “B”: 3 Shards, 1 Replica A1 B3 B2 A2 B1 B2 A1 B1 B3 A2
  • 23. 23 Building a Cluster Start Cluster… start cmd.exe /C elasticsearch -Des.node.name=Primus start cmd.exe /C elasticsearch -Des.node.data=true -Des.node.master=false -Des.node.name=Slayer start cmd.exe /C elasticsearch -Des.node.data=true -Des.node.master=true -Des.node.name=Maiden Create Index… curl -XPUT 'http://localhost:9200/imdb/' -d '{ "settings" : { "index" : { "number_of_shards" : 3, "number_of_replicas" : 1 } } }' Index Document… curl -XPOST 'http://localhost:9200/imdb/movie/' -d '{ "genre": “Comedy", "language": "English", "country": "USA", "runtime": 99, "title": “Big Trouble in Little China", "year": 1986 }'
  • 24. 24 Cluster State • Cluster State • Node Membership • Indices Settings and Mappings (Types) • Shard Allocation Table • Shard State • cURL -XGET http://localhost:9200/_cluster/state?pretty=1'
  • 25. 25 Cluster State • Changes in State published from Master to other nodes 1 (M) 3 2 PUT /newIndex CS1 1 (M) CS1 1 (M) CS2 3 2 CS2 CS1 CS1 CS1 3 2 CS2 CS2
  • 26. 26 Discovery • Nodes discover each other using multicast. • Unicast is an option discovery.zen.ping.multicast.enabled: false discovery.zen.ping.unicast.hosts: ["host1", "host2:port", "host3"] • Each cluster has an elected master node • Beware of split-brain
  • 27. 27 The Basics - Shards • Primary Shard: • First time Indexing • Index has 1..N primary shards (default: 5) • # Not changeable once index created • Replica Shard: • Copy of the primary shard • Can be changed later • Each primary has 0..N replicas • HA: • Promoted to primary if primary fails • Get/Search handled by primary||replica
  • 28. 28 Shard Auto-Allocation • Add a node - Shards Relocate Node 1 Node 2 0P 1P 1R 0R Node 2 0R • Shard Stages • UNASSIGNED • INITIALIZING • STARTED • RELOCATING
  • 29. 29 The Basics – Searching • How it works: • Search request hits a node • Node broadcasts to every shard in the index • Each shard performs query • Each shard returns results • Results merged, sorted, and returned to client. • Problems: • ES has no idea where your document is • Broadcast query to 100 nodes • Performance degrades
  • 30. 30 The Basics - Shards • Shard Allocation Awareness • cluster.routing.allocation.awareness.attributes: rack_id • Example: • • • • • 2 Nodes with node.rack_id=rack_one Create Index 5 shards / 1 replica (10 shards) Add 2 Nodes with node.rack_id=rack_two Shards RELOCATE to even distribution Primary & Replica will NOT be on the same rack_id value. • Shard Allocation Filtering • node.tag=val1 • index.routing.allocation.include.tag:val1,val2 curl -XPUT localhost:9200/newIndex/_settings -d '{ "index.routing.allocation.include.tag" : "val1,val2" }'
  • 31. 31 Nodes • Master node handles cluster-wide (Meta-API) events: • Node participation • New indices create/delete • Re-Allocation of shards • Data Nodes • Indexing / Searching operations • Client Nodes • REST calls • Light-weight load balancers
  • 32. 32 REST API • Create Index • action.auto_create_index: 0 • Index Document • Dynamic type mapping • Versioning • ID specification • Parent / Child (/1122?parent=1111)
  • 33. 33 REST API – Versioning • Every document is Versioned • Version assigned on creation • Version number can be assigned
  • 34. 34 REST API - Update • Update using partial data • Partial doc merged with existing • Fails if document doesn’t exist • “Upsert” data used to create a doc, if doesn’t exist { “upsert" : { “title": “Blade Runner” } }
  • 35. 35 REST API • Exists • No overhead in loading • Status Code Result • Delete • Get • Multi-Get { "docs" : [ { "_id" : "1" "_index" : "imdb" "_type" : "movie" }, { "_id" : "5" "_index" : "oldmovies" "_type" : "movie" "_fields" " ["title", "genre"] } ] }
  • 36. 36 REST API - Search • Free Text Search • URL Request • http://localhost:9200/imdb/movie/_search?q=scar* • Complex Query • http://localhost:9200/imdb/movie/_search?q=scarface+OR +star • http://localhost:9200/imdb/movie/_search?q=(scarface+O R+star)+AND+year:[1981+TO+1984]
  • 37. 37 REST API - Search • Search Types: • http://localhost:9200/imdb/movie/_search?q=(scarface+OR+star)+AND+year:[1941+TO+1984]&search_type=count • http://localhost:9200/imdb/movie/_search?q=(scarface+OR+star)+AND+year:[1941+TO+1984]&search_type=query_then_fetch • Query and Fetch (fastest): • Executes the query on all shards and returns the results from each • Query then Fetch (default): • Executes the query on all shards, but each returns only enough information to rank/sort; only the relevant shards are then asked for the documents
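The two phases of Query then Fetch can be made concrete with a toy coordinator. This is an illustration only: `query_then_fetch` is a hypothetical function, the scores are invented, and each shard is a dict of doc_id to (score, source).

```python
def query_then_fetch(shards, size=2):
    """Rank on lightweight (score, doc_id) pairs, then fetch only the winners."""
    # Phase 1 (query): every shard reports only scores and ids, no documents.
    candidates = [(score, doc_id, shard_no)
                  for shard_no, shard in enumerate(shards)
                  for doc_id, (score, _source) in shard.items()]
    winners = sorted(candidates, key=lambda c: (-c[0], c[1]))[:size]
    # Phase 2 (fetch): only shards holding a winner are asked for documents.
    asked = sorted({shard_no for _, _, shard_no in winners})
    docs = [shards[shard_no][doc_id][1] for _, doc_id, shard_no in winners]
    return docs, asked

shards = [
    {"1": (2.0, {"title": "Star Wars"}), "2": (0.4, {"title": "Stardust"})},
    {"3": (1.5, {"title": "Star Trek"})},
    {"4": (0.1, {"title": "A Star Is Born"})},
]
docs, asked = query_then_fetch(shards, size=2)
```

Here the third shard takes part in the query phase but is never asked for a document, which is the saving over Query and Fetch when documents are large.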
  • 38. 38 REST API – Query DSL http://localhost:9200/imdb/movie/_search?q=(scarface+OR+star)+AND+year:[1981+TO+1984] Becomes… curl -XPOST 'localhost:9200/_search?pretty' -d '{ "query" : { "bool" : { "must" : [ { "query_string" : { "query" : "scarface OR star" } }, { "range" : { "year" : { "gte" : 1981, "lte" : 1984 } } } ] } } }'
  • 39. 39 REST API – Query DSL • Query String requests use Lucene query syntax • Limited • Instead use “match” query • Automatically builds a boolean query curl -XPOST 'localhost:9200/_search?pretty' -d '{ "query" : { "bool" : { "must" : [ { "match" : { "message" : "scarface star" } }, { "range" : { "year" : { "gte" : 1981 } } } ] …
  • 40. 40 REST API – Query DSL • Match Query { "match":{ "title":{ "type":"phrase", "query":"quick fox", "slop":1 } } } • Boolean Query • Must: document must match query • Must_not: document must not match query • Should: document doesn’t have to match • If it matches… higher score { "bool":{ "must":[ { "match":{ "color":"blue" } }, { "match":{ "title":"shirt" } } ], "must_not":[ { "match":{ "size":"xxl" } } ], "should":[ { "match":{ "textile":"cotton" } …
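The must / must_not / should rules can be modelled on plain dicts to see how they combine. This is a toy evaluator, not how ElasticSearch scores: `matches` and `bool_score` are hypothetical helpers, and the "score" is just a count of satisfied clauses.

```python
def matches(doc, clause):
    """A clause is {"match": {field: text}}; true if any query word is in the field."""
    field, text = next(iter(clause["match"].items()))
    words = str(doc.get(field, "")).lower().split()
    return any(word in words for word in text.lower().split())

def bool_score(doc, must=(), must_not=(), should=()):
    """Return None if the doc is excluded, otherwise a toy score."""
    if not all(matches(doc, c) for c in must):
        return None
    if any(matches(doc, c) for c in must_not):
        return None
    # Each matching "should" clause only raises the score.
    return len(must) + sum(matches(doc, c) for c in should)

query = {
    "must": [{"match": {"color": "blue"}}, {"match": {"title": "shirt"}}],
    "must_not": [{"match": {"size": "xxl"}}],
    "should": [{"match": {"textile": "cotton"}}],
}
cotton = {"color": "blue", "title": "cotton shirt", "size": "m", "textile": "cotton"}
wool = {"color": "blue", "title": "shirt", "size": "m", "textile": "wool"}
huge = {"color": "blue", "title": "shirt", "size": "xxl", "textile": "cotton"}
scores = [bool_score(d, **query) for d in (cotton, wool, huge)]
```

The wool shirt still matches because "should" is optional; it just ranks below the cotton one, while the xxl shirt is excluded outright.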
  • 41. 41 REST API – Query DSL • Range Query • Numeric / Date Types • Prefix/Wildcard Query • Match on partial terms • RegExp Query { "range":{ "founded_year":{ "gte":1990, "lt":2000 } } }
  • 42. 42 REST API – Query DSL • Geo_bbox • Bounding box filter • Geo_distance • Geo_distance_range { "query":{ "filtered":{ "query":{ "match_all":{ } }, "filter":{ "geo_bbox":{ "location":{ "top_left":{ "lat":40.73, "lon":-74.1 }, "bottom_right":{ "lat":40.717, "lon":-73.99 } … { "query":{ "filtered":{ "query":{ "match_all":{ } }, "filter":{ "geo_distance":{ "distance":"400km", "location":{ "lat":40.73, "lon":-74.1 } } …
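What the two geo filters test for can be written out directly. A rough sketch, assuming a spherical Earth and a box that does not cross the date line; the function names are hypothetical and ElasticSearch's own distance calculation may differ in detail.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation

def in_bounding_box(point, top_left, bottom_right):
    """True if point lies inside the box (no date-line handling)."""
    return (bottom_right["lat"] <= point["lat"] <= top_left["lat"]
            and top_left["lon"] <= point["lon"] <= bottom_right["lon"])

def haversine_km(a, b):
    """Great-circle distance between two lat/lon points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (a["lat"], a["lon"], b["lat"], b["lon"]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def within_distance(point, center, km):
    return haversine_km(point, center) <= km

top_left = {"lat": 40.73, "lon": -74.1}
bottom_right = {"lat": 40.717, "lon": -73.99}
center = {"lat": 40.73, "lon": -74.1}
boston = {"lat": 42.36, "lon": -71.06}
chicago = {"lat": 41.88, "lon": -87.63}
```

With the slide's values, a point at 40.72, -74.0 falls inside the box, and Boston is within 400km of the center while Chicago is not.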
  • 43. 43 REST API – Bulk Operations • Bulk API • Minimize round trips with index/delete ops • Individual response for every request action • In order • Failure of one action will not stop subsequent actions. • localhost:9200/_bulk { "delete" : { "_index" : "imdb", "_type" : "movie", "_id" : "2" } }\n { "index" : { "_index" : "imdb", "_type" : "actor", "_id" : "1" } }\n { "first_name" : "Tony", "last_name" : "Soprano" }\n ... { "update" : { "_index" : "imdb", "_type" : "movie", "_id" : "3" } }\n { "doc" : { "title" : "Blade Runner" } }\n
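The bulk body is newline-delimited JSON: one action line, optionally followed by a source line, each terminated by a newline. Building it by hand is easy to get wrong, so a small serializer helps. `bulk_body` is a hypothetical helper that only produces the request body and sends nothing.

```python
import json

def bulk_body(operations):
    """Serialize (action, metadata, source) triples as newline-delimited JSON.

    source is None for actions without a payload line, such as delete.
    """
    lines = []
    for action, metadata, source in operations:
        lines.append(json.dumps({action: metadata}))
        if source is not None:
            lines.append(json.dumps(source))
    # The final line needs a trailing newline as well.
    return "\n".join(lines) + "\n"

body = bulk_body([
    ("delete", {"_index": "imdb", "_type": "movie", "_id": "2"}, None),
    ("index", {"_index": "imdb", "_type": "actor", "_id": "1"},
     {"first_name": "Tony", "last_name": "Soprano"}),
    ("update", {"_index": "imdb", "_type": "movie", "_id": "3"},
     {"doc": {"title": "Blade Runner"}}),
])
```

The three actions produce five lines, since delete has no source line.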
  • 44. 44 Percolate API • Reversing Search • Store queries and filter (percolate) documents through them. • Useful for Alert/Monitoring systems curl -XPUT localhost:9200/_percolator/stocks/alert-on-nokia -d '{ "query" : { "bool" : { "must" : [ { "term" : { "company" : "NOK" }}, { "range" : { "value" : { "lt" : "2.5" }}} ] } } }' curl -XPUT localhost:9200/stocks/stock/1?percolate=* -d '{ "doc" : { "company" : "NOK", "value" : 2.4 } }'
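"Reversing search" means the queries are stored and the document is the input. A toy in-memory version makes the idea concrete; the `Percolator` class is an assumption for illustration, with stored queries modelled as plain Python predicates rather than Query DSL.

```python
class Percolator:
    """Toy reverse search: store named queries, then run documents through them."""

    def __init__(self):
        self._queries = {}

    def register(self, name, predicate):
        """Store a query under a name; predicate takes a doc and returns a bool."""
        self._queries[name] = predicate

    def percolate(self, doc):
        """Return the names of all stored queries that the document matches."""
        return sorted(name for name, pred in self._queries.items() if pred(doc))

percolator = Percolator()
percolator.register(
    "alert-on-nokia",
    lambda doc: doc.get("company") == "NOK"
    and doc.get("value", float("inf")) < 2.5,
)
matched = percolator.percolate({"company": "NOK", "value": 2.4})
```

An alerting system would register one query per alert and percolate each incoming document, acting on whatever names come back.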
  • 45. 45 Clients • Client list: http://www.elasticsearch.org/guide/clients/ • Java Client, JS, PHP, Perl, Python, Ruby • Spring Data: • Uses TransportClient • Implementation of ElasticsearchRepository aligns with generic Repository interfaces. • ElasticsearchCrudRepository extends PagingAndSortingRepository • https://github.com/spring-projects/spring-data-elasticsearch @Document(indexName = "book", type = "book", indexStoreType = "memory", shards = 1, replicas = 0, refreshInterval = "-1") public class Book { … } public interface ElasticSearchBookRepository extends ElasticsearchRepository<Book, String> { }
  • 46. 46 B’what about Mongo? • Mongo: • General purpose DB • ElasticSearch: • Distributed text search engine … that’s all I have to say about that.