BlaBlaCar Elastic Search Feedback

3/37
Nicolas Blanc - BlaBlArchitect
Sinfomic
(1999)
@thewhitegeek
(2001)
(2005)
(2008)
(2012)

5/37
3 000 000 MEMBERS
IN EUROPE

6/37
10 countries (previously 9)
● France
● Spain
● Italy
● UK
● Poland
● Portugal
● Netherlands
● Belgium
● Luxembourg
● Germany (NEW)

7/37
Growth
[Chart: growth curve from January 2008 to January 2013, with axis marks at 25 million and 50 million]

8/37
Infrastructure
● 2 front web servers
● 2 MySQL masters (+ 4 SSD-backed slaves)
● 1 private cloud (KVM + Open vSwitch)
    ● Redis
    ● Memcache
    ● RabbitMQ/workers
● 1 ElasticSearch cluster

9/37
Changing the Search Engine

10/37
What exists today? Why change?
MySQL database
● Relational DB (lots of joins needed)
● Plain SQL queries
● Home-made geographical search
Recent problems
● New features mean more complex queries
● Scalability: performance depends on the DB load

11/37
Initial requirements
Scalability
● A trip search needs to complete in less than 200 ms
● The system side of the solution must be easy to maintain
● It must be possible to cluster it (also to avoid a SPOF)
Low code impact on the existing application
● Same features as today (geographical search)
● Minimize the developers' work
● Add one missing feature: facets

12/37
Initial Competitors
SenseiDB

13/37
Why ElasticSearch
✔ Easiest clustering
✔ Good indexing performance
✔ Little code to write to use it
✔ Schemaless
✔ Based on Lucene
✔ Written in Java (we needed to code a grouping feature)

14/37
ElasticSearch won,
now let's migrate our search!

15/37
Changing our mindset
Object in a relational database
● Can be split across multiple tables
● Lots of information reachable through JOINs
Object in a document-oriented database
● Only one big index for these objects
● All the information needs to be in the object itself, not spread over multiple tables
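
To make the change of mindset concrete, here is a minimal Python sketch of denormalization: rows that would live in separate tables are merged into one self-contained document. The table layouts, column names and sample values are hypothetical, chosen only for illustration.

```python
# Hypothetical rows, as they might come back from three separate tables.
trip_row = {"id": 1234567, "user_id": 42, "vehicle_id": 7, "price": 25.0, "seats_left": 2}
user_row = {"id": 42, "ratings": 12, "blocked": False}
vehicle_row = {"id": 7, "model": "Renault Clio"}


def build_trip_document(trip, user, vehicle):
    """Merge the three rows into one flat document (no JOIN needed at search time)."""
    return {
        "price": trip["price"],
        "seats_left": trip["seats_left"],
        "userid": str(user["id"]),
        "user_ratings": user["ratings"],
        "is_user_blocked": user["blocked"],
        "vehicle": vehicle["model"],
    }


document = build_trip_document(trip_row, user_row, vehicle_row)
print(document)
```

The price of this approach is that a change in any source table (for example, a user being blocked) means the document has to be rebuilt and reindexed.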

17/37
Defining our objects well
We need to know what we want to search
● Searching trips (front-office usage)
● Searching members (back-office usage)
● Searching the FAQ (front-office usage)
Think of all the needed fields
● The ones used for queries
● The ones used for filters
● The ones used for facets

18/37
Defining the index well
System point of view
● Number of nodes in the cluster
● Number of shards
● Number of replicas
Application point of view
● Define the type and attributes of every field (mapping)
● Use parent/child or nested documents to improve indexing
● How to push documents from the DB?
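
As a rough illustration of the system-side choices, the sketch below builds the settings body one could send when creating an index, and computes how many shard copies the cluster has to host. `number_of_shards` and `number_of_replicas` are the standard ElasticSearch index settings; the values 5 and 1 are only example inputs, and the helper names are made up for this sketch.

```python
import json


def index_settings(shards, replicas):
    """Build the settings body for an index-creation request."""
    return {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}


def total_shard_copies(shards, replicas):
    """Each primary shard exists once, plus once per replica."""
    return shards * (1 + replicas)


body = index_settings(shards=5, replicas=1)
print(json.dumps(body, indent=2))
print(total_shard_copies(5, 1))
```

The number of shards is fixed once the index is created, while the number of replicas can be changed later, so the shard count deserves the most thought up front.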

19/37
Indexing: using a river or not?
River advantages
● Plugs directly into our source backend
● An ElasticSearch API exists to code a new one
River problems
● Not easy to add business logic on some fields
● Really hard when your DB is unconventional
● Needs a full reindex of all the documents

20/37
Indexing: our manual way
We wrote an asynchronous indexer
● Written in Java
● Applies business logic when fetching from the DB
● Fetches from multiple DBs/sources
● Uses the Java ES library
● Easy interface
    ● Send {"trip":1234567} and the server answers {"OK"}
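
The asynchronous pattern described above can be sketched as follows. This is not the actual indexer (which is written in Java and uses the Java ES library); it is a small Python illustration in which a queue decouples the caller from the indexing work. `fetch_trip` is a hypothetical stand-in for the SQL queries and business logic, and a plain dict stands in for the ElasticSearch index.

```python
import json
import queue
import threading


def fetch_trip(trip_id):
    """Hypothetical stand-in for the DB fetch plus business logic."""
    return {"id": trip_id, "seats_left": 2}


def run_indexer(messages, index):
    """Consume messages until a None sentinel arrives."""
    while True:
        raw = messages.get()
        if raw is None:
            break
        trip_id = json.loads(raw)["trip"]
        index[trip_id] = fetch_trip(trip_id)  # stand-in for the ES index call


messages = queue.Queue()
index = {}
worker = threading.Thread(target=run_indexer, args=(messages, index))
worker.start()

messages.put('{"trip":1234567}')  # the caller returns immediately
messages.put(None)                # stop the worker
worker.join()
print(index)
```

The caller only needs to know a trip ID; everything about how the document is built stays inside the indexer.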

22/37
Defining our Trip object well
Think of all the needed fields
● The ones used for queries
    ● Trip departure date, from where, to where, user ID
● The ones used for filters
    ● User ratings, price, vehicle, seats left, is user blocked
      (a blocked user is a user who performed a forbidden action on the website)
● The ones used for facets
    ● User ratings, price, vehicle

23/37
Defining our Trip index well
Think of all the system requirements
● The cluster has 2 nodes
● We keep the default configuration for shards/replicas
Think of the object mapping
● For each field:
    ● Define the type (string, long, geo_point, date, float, boolean)
    ● Define the scope (include_in_all)
    ● Define the analyzer (for type string)

24/37
Trip Mapping
{
  "trip": {
    "properties": {
      "is_user_blocked": {
        "type": "boolean",
        "include_in_all": false
      },
      "user_ratings": {
        "type": "long"
      },
      "from": {
        "type": "geo_point"
      },
      "price": {
        "type": "float",
        "include_in_all": false
      },
      "price_euro": {
        "type": "float",
        "include_in_all": false
      },
      "seats_left": {
        "type": "long"
      },
      "seats_offered": {
        "type": "long"
      },
      "to": {
        "type": "geo_point"
      },
      "trip_date": {
        "type": "date",
        "format": "dateOptionalTime"
      },
      "vehicle": {
        "type": "string"
      },
      "userid": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}
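
For illustration, a document that fits this mapping might look like the Python dict below. All values are invented sample data; the small `matches_mapping` helper is a hypothetical sanity check written for this sketch, not part of ElasticSearch.

```python
import json

# Invented sample values; the field names come from the mapping above.
sample_trip = {
    "is_user_blocked": False,
    "user_ratings": 12,
    "from": {"lat": 48.856614, "lon": 2.3522219},
    "to": {"lat": 45.764043, "lon": 4.835659},
    "price": 25.0,
    "price_euro": 25.0,
    "seats_left": 2,
    "seats_offered": 3,
    "trip_date": "2013-06-01T08:30:00",
    "vehicle": "Renault Clio",
    "userid": "42",
}

# Python types we would expect for each mapped ElasticSearch type.
EXPECTED_TYPES = {
    "is_user_blocked": bool, "user_ratings": int, "from": dict, "to": dict,
    "price": float, "price_euro": float, "seats_left": int,
    "seats_offered": int, "trip_date": str, "vehicle": str, "userid": str,
}


def matches_mapping(doc):
    """Return True if every mapped field is present with the expected type."""
    return all(isinstance(doc.get(name), kind) for name, kind in EXPECTED_TYPES.items())


print(json.dumps(sample_trip, indent=2))
print(matches_mapping(sample_trip))
```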

25/37
Indexing events well
Which modifications send a change event
● Any trip creation/deletion/modification
● Member modifications (blocked or not)
● New ratings from other members
● A seat has been reserved
● A member changes their vehicle
A change event is a call to the internal indexer
● Send '{"trip":123456}' to the indexer (create/update)
● Send '{"tripd":123456}' to the indexer (delete)
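
The two message shapes above could be decoded along these lines. This is only a sketch of the protocol as described on the slide; the function name and the returned action labels are assumptions made for the example.

```python
import json


def parse_event(raw):
    """Turn an indexer message into an (action, trip_id) pair."""
    message = json.loads(raw)
    if len(message) != 1:
        raise ValueError("expected exactly one key")
    (key, trip_id), = message.items()
    if key == "trip":
        return ("index", trip_id)   # create or update
    if key == "tripd":
        return ("delete", trip_id)
    raise ValueError("unknown event key: " + key)


print(parse_event('{"trip":123456}'))
print(parse_event('{"tripd":123456}'))
```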

26/37
Sample trip index query
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "and": [
          {
            "geo_distance": {
              "distance": "40.14937866995km",
              "from": {
                "lat": 48.856614,
                "lon": 2.3522219
              }
            }
          },
          {
            "geo_distance": {
              "distance": "40.14937866995km",
              "to": {
                "lat": 45.764043,
                "lon": 4.835659
              }
            }
          },
          {
            "range": {
              "price": {
                "from": 0,
                "include_lower": false
              }
            }
          }
        ]
      }
    }
  },
  "sort": [
    { "trip_date": { "order": "asc" } }
  ],
  "filter": {
    "term": { "is_user_blocked": false }
  },
  "from": 0,
  "size": 10
}
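
Since only the coordinates, radius and paging change from one search to the next, a query like this lends itself to a small builder. The Python sketch below reproduces the same structure; the function name and its parameters are hypothetical, and the query syntax is the one used on this slide (the `filtered` query and `and` filter of the ElasticSearch version in use at the time).

```python
def build_trip_query(from_point, to_point, distance_km, offset=0, size=10):
    """Build the trip search body for a departure point and an arrival point."""
    distance = "%skm" % distance_km
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "and": [
                        {"geo_distance": {"distance": distance, "from": from_point}},
                        {"geo_distance": {"distance": distance, "to": to_point}},
                        # Strictly positive prices only.
                        {"range": {"price": {"from": 0, "include_lower": False}}},
                    ]
                },
            }
        },
        "sort": [{"trip_date": {"order": "asc"}}],
        "filter": {"term": {"is_user_blocked": False}},
        "from": offset,
        "size": size,
    }


paris = {"lat": 48.856614, "lon": 2.3522219}
lyon = {"lat": 45.764043, "lon": 4.835659}
query = build_trip_query(paris, lyon, 40.14937866995)
```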

27/37
The Real World
A trip now has more than 30 fields
● (the FAQ has around 25 fields)
● (members even more...)
To build a trip document we need 3 different SQL queries
● (FAQ: 2 different SQL queries)
● (Member: 10 different SQL queries)
The Trip index has only 1 shard (because of grouping)

29/37
Preloaded Scripts
We use MVEL scripts to improve scoring
● They are not distributed across the cluster
● Each node needs to have the scripts
● A node restart is needed to add or modify them
Solution: Chef (tool from Opscode)
All node configurations are centralized in a Chef repository

30/37
Grouping documents
Home-made patches to ElasticSearch
(based on work by Martijn van Groningen for lusini.de)
Soon in ElasticSearch
(I really hope so)

31/37
Mapping modification
On a running index:
Changing a field's type is not allowed
Changing an analyzer is not allowed
Solution: index alias
1) Mapping change → create a new index
2) When the new index is up to date → switch the alias
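
Step 2 can be done in a single request to the `_aliases` endpoint, which accepts a list of `remove` and `add` actions applied together. The Python sketch below only builds that request body; the alias and index names (`trips`, `trips_v1`, `trips_v2`) are hypothetical.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases that moves an alias from one index to another."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = alias_swap_actions("trips", "trips_v1", "trips_v2")
print(json.dumps(body, indent=2))
```

Because the application always queries the alias, it never needs to know which physical index is currently behind it.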

32/37
I/O limits
We have only 2 nodes
● The Trip index is around 2 GB
● But there is only 1 shard for the Trip index
● Can index 100 trips/second on a busy evening
Solution: we installed Intel SSDs
(while waiting for the distributed grouping feature)

33/37
Choosing the analyzer
Some fields must not be analyzed
● For example, if you use ISO codes for countries
  (IT for Italy or DE for Germany are ignored in some cases)
A global analyzer has limits
● Accented characters from languages such as French, German or Spanish are not always parsed correctly
● One analyzer per country is difficult to implement in some cases
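
One plausible reason for country codes being ignored is stop-word removal: once lowercased, "IT" becomes the English word "it" and "DE" becomes the French or Spanish word "de". The Python sketch below is a deliberately simplified simulation of an analyzer (lowercasing plus stop-word filtering), not the real Lucene implementation, and the stop-word sets are short hand-picked excerpts.

```python
# Short, hand-picked excerpts of typical stop-word lists (assumption).
ENGLISH_STOPWORDS = {"a", "an", "and", "is", "it", "the", "to"}
FRENCH_STOPWORDS = {"de", "la", "le", "les", "un", "une"}


def analyze(text, stopwords):
    """Simplified analyzer: lowercase, split on whitespace, drop stop words."""
    return [token for token in text.lower().split() if token not in stopwords]


def not_analyzed(text):
    """A not_analyzed field keeps the value as one exact token."""
    return [text]


print(analyze("IT", ENGLISH_STOPWORDS))
print(analyze("DE", FRENCH_STOPWORDS))
print(not_analyzed("IT"))
```

In this simulation the analyzed country codes produce no tokens at all, so nothing is left to match, whereas the not_analyzed variant keeps the exact value.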

35/37
Using ElasticSearch to ease log analysis

36/37
By the way…
We're hiring!
Devs, HTML ninjas, leaders, …
Come and see me right now
… or send me your friends
(And we have beer, table football and an arcade cabinet)

37/37
Thank you!
Follow us!
@covoiturage
Apply now :
join@BlaBlaCar.com
