At bol.com, a leading ecommerce platform in the Netherlands, we have done extensive research into what it would take to use Elasticsearch as the main search provider. We will explain the specific challenges and requirements of running an Elasticsearch cluster at bol.com scale, and show how we used generated data to run performance and scalability tests on different ways to map a hierarchical data model onto Elasticsearch. We will describe the benefits and drawbacks of the different data model options, and their consequences for the design of the index and search applications.
2. Introduction
• Anne Veling
– Elasticsearch consultancy and custom training
– Performance and Stability Troubleshooting
– Software Architect, Team Lead
3. bol.com challenge
• Hierarchical data model, multiple levels
• High volume
– searches
– data changes
• Complex query requirements
– Both Product and Offer fields in query
– Facet on both levels
6. Test Data Creation
• Node.js Script creating random data
– Product
• Title: two random nouns from noun list
• Category: pick one out of 26 nouns
• Half have no offer, half have between 1 and 4
– Offer
• Random price between 1-20
• Seller: pick one out of 10k
• Stream in memory, flush out to disk in 3 flavors
– Each flavor keeping its own bulk size of 100k
– For 1M, 10M and 100M products
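The generator described above can be sketched roughly as follows. This is a Python stand-in for the Node.js script, not the original code: the noun list, the field names (`title`, `category`, `offers`, `sellerId`) and the helper names are all assumptions, and the parent metadata key follows the bulk format of 2015-era Elasticsearch releases. Only the nested and parent/child flavors are shown, because the slides do not spell out the shape of the flat document flavor.

```python
import json
import random

# Stand-in data; the real noun and category lists are not given in the slides.
NOUNS = ["lamp", "chair", "book", "phone", "table", "shoe", "lunchroom"]
CATEGORIES = [f"category-{i}" for i in range(26)]
SELLER_COUNT = 10_000


def make_product(product_id, rng):
    """Build one random product; half get no offers, the rest get 1 to 4."""
    offer_count = 0 if rng.random() < 0.5 else rng.randint(1, 4)
    offers = [
        {"id": f"{product_id}-{n}",
         "price": rng.randint(1, 20),
         "sellerId": rng.randint(1, SELLER_COUNT)}
        for n in range(offer_count)
    ]
    return {"id": str(product_id),
            "title": f"{rng.choice(NOUNS)} {rng.choice(NOUNS)}",
            "category": rng.choice(CATEGORIES),
            "offers": offers}


def nested_docs(product):
    """Nested flavor: one (action, source) pair per product, offers embedded."""
    source = {k: v for k, v in product.items() if k != "id"}
    yield {"index": {"_type": "product", "_id": product["id"]}}, source


def parent_child_docs(product):
    """Parent/child flavor: the product, then one child document per offer."""
    source = {k: v for k, v in product.items() if k not in ("id", "offers")}
    yield {"index": {"_type": "product", "_id": product["id"]}}, source
    for offer in product["offers"]:
        # "_parent" is the assumed metadata key for linking child to parent.
        action = {"index": {"_type": "offer", "_id": offer["id"],
                            "_parent": product["id"]}}
        yield action, {"price": offer["price"], "sellerId": offer["sellerId"]}


def bulk_batches(products, to_docs, batch_size):
    """Yield bulk bodies of at most batch_size documents, each newline-terminated."""
    lines, count = [], 0
    for product in products:
        for action, source in to_docs(product):
            lines.append(json.dumps(action))
            lines.append(json.dumps(source))
            count += 1
            if count == batch_size:
                yield "\n".join(lines) + "\n"
                lines, count = [], 0
    if lines:
        yield "\n".join(lines) + "\n"


rng = random.Random(42)  # fixed seed so runs are repeatable
products = [make_product(i, rng) for i in range(1000)]
nested_bodies = list(bulk_batches(products, nested_docs, batch_size=100))
child_bodies = list(bulk_batches(products, parent_child_docs, batch_size=100))
```

Each flavor counts its own documents toward the batch size, so the parent/child flavor produces more batches than the nested flavor for the same products.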
14. Indexing
• 1M product set, local naive
– 80s Document
– 41s Nested
– 64s Parent/Child
• ES index bottleneck:
– Your source system and latency: it can slurp it up faster than you can serve it
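To put the timings above in perspective, they can be converted to indexing throughput. This is plain arithmetic on the numbers from the slide. The rate is in products per second, not Elasticsearch documents per second, because the number of documents per product differs between the flavors.

```python
PRODUCTS = 1_000_000

# Indexing time in seconds for the 1M product set, taken from the slide.
index_seconds = {"document": 80, "nested": 41, "parent/child": 64}

# Products indexed per second for each flavor.
throughput = {flavor: PRODUCTS / seconds
              for flavor, seconds in index_seconds.items()}

for flavor, rate in throughput.items():
    print(f"{flavor}: {rate:,.0f} products/s")
```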
16. Use Cases
| | Use Case A | Use Case B | Use Case C |
|---|---|---|---|
| Product Search | Word in Title | Word in Title; ∃ DeliveryC = 0 | Word in Title; ∃ Price < P |
| Order By | Relevance | Relevance | (Lowest) Price |
| Display for top N products | Product Fields; Cheapest Offer fields | Product Fields; Correct Cheapest Offer fields | Product Fields; Cheapest Offer fields |
| Aggregate On | Category | Category | Category |
| | ∀ Offer SellerId | ∀ Correct Offers SellerId | ∀ Correct Offers SellerId |
| | ∀ Offer Price | ∀ Correct Offers Price | ∀ Correct Offers Price |
| | ∀ Offer DeliveryCode | ∀ Correct Offers DeliveryCode | ∀ Correct Offers DeliveryCode |
• Product
• Offer
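As an illustration of use case B in the nested model, the request body could look something like the sketch below. The field names (`title`, `category`, `offers.deliveryCode`, `offers.sellerId`, `offers.price`), the aggregation names and the function name are assumptions, not the mapping used in the tests. The structure uses the standard `nested` query, `inner_hits`, and `nested` plus `filter` aggregations, so that the facets only count the "correct" offers.

```python
import json


def use_case_b_query(word, delivery_code, top_n=10):
    """Build a search body: word in title, at least one offer with the delivery code."""
    offer_filter = {"term": {"offers.deliveryCode": delivery_code}}
    return {
        "size": top_n,
        "query": {
            "bool": {
                # Product-level condition, scored by relevance.
                "must": [{"match": {"title": word}}],
                # Offer-level condition: at least one matching nested offer.
                "filter": [{
                    "nested": {
                        "path": "offers",
                        "query": offer_filter,
                        # Return only the cheapest matching offer per product.
                        "inner_hits": {
                            "size": 1,
                            "sort": [{"offers.price": {"order": "asc"}}],
                        },
                    }
                }],
            }
        },
        "aggs": {
            # Product-level facet.
            "category": {"terms": {"field": "category"}},
            # Offer-level facets, restricted to the matching offers.
            "offers": {
                "nested": {"path": "offers"},
                "aggs": {
                    "correct_offers": {
                        "filter": offer_filter,
                        "aggs": {
                            "sellerId": {"terms": {"field": "offers.sellerId"}},
                            "price": {"terms": {"field": "offers.price"}},
                            "deliveryCode": {"terms": {"field": "offers.deliveryCode"}},
                        },
                    }
                },
            },
        },
    }


body = use_case_b_query("lunchroom", "tomorrow")
print(json.dumps(body, indent=2))
```

Use case A would drop the offer filter from both the query and the aggregations, and use case C would swap the term filter for a range filter on price.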
17. Use Cases
D: query B, roll up by family
• Families (with products with offers)
– with product.title:lunchroom
– filter by product.offer.deliveryCode:tomorrow
30. Results
[Chart: "1m tun 30102015 32 GB new queries". Use cases a, b, c, d for doc, nested and parentchild; y-axis 0 to 200, unit not shown]
[Chart: "10m tun 30102015 32 GB new queries". Use cases a, b, c, d for doc, nested and parentchild; y-axis 0 to 3500, unit not shown]
31. Conclusions
• Parent/Child has limitations
– Combining cross-level queries with aggregations in one go
• Doc not as fast as we’d expected
– Because we needed top_hits aggregation
• Elasticsearch scales predictably
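To show where `top_hits` comes in, here is a hedged sketch of a roll-up query for the doc model. It assumes the flat model stores one document per offer with the product fields copied in, which the slides do not spell out; the field names `productId`, `title` and `price` are hypothetical as well. Grouping offers back into products then takes a `terms` aggregation with a `top_hits` sub-aggregation.

```python
def cheapest_offer_per_product(word, top_n=10):
    """Build a search body that groups offer documents by product."""
    return {
        "size": 0,  # hits come from the aggregation, not the hit list
        "query": {"match": {"title": word}},
        "aggs": {
            "by_product": {
                # One bucket per product (assumed field name).
                "terms": {"field": "productId", "size": top_n},
                "aggs": {
                    "cheapest_offer": {
                        # Keep only the cheapest offer document in each bucket.
                        "top_hits": {
                            "size": 1,
                            "sort": [{"price": {"order": "asc"}}],
                        }
                    }
                },
            }
        },
    }


rollup_body = cheapest_offer_per_product("lunchroom")
```

Note that a `terms` aggregation orders buckets by document count by default, so ordering products by relevance or price would need extra work on top of this sketch.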
32. Conclusions
• For us, nested was the best solution
• What is yours?
• What are you searching for?
– What are the rows?
– What are the facets about?
33. Lessons Learned
• Testing the scalability of your data model
– Fast iterations early on
– Valuable insight into indexing and search requirements
• Data Modeling is hard
– Do it early
– Make it fun
34. Tech Lessons Learned
• Don’t forget to tune the ES cluster
– Configure memory ;)
• If the last line of a bulk file has no trailing \n, it gets ignored!
– count the differences
• 100k bulk files with .000 suffixes ought to be enough for everyone, right?
• Do not underestimate Sneakernet
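The trailing newline lesson is easy to guard against before sending a file. The sketch below uses hypothetical helper names and assumes a bulk body made up only of index actions, so every document takes exactly two lines (action and source). It counts the documents that are properly newline-terminated and repairs a body that lacks the final newline.

```python
def count_terminated_documents(body):
    """Count documents whose action and source lines both end in a newline."""
    return body.count("\n") // 2


def ensure_trailing_newline(body):
    """Return the body with a final newline, adding one only if it is missing."""
    return body if body.endswith("\n") else body + "\n"


# Two documents, but the last line is missing its newline.
broken = '{"index":{}}\n{"a":1}\n{"index":{}}\n{"a":2}'
fixed = ensure_trailing_newline(broken)

print(count_terminated_documents(broken))  # 1
print(count_terminated_documents(fixed))   # 2
```

Comparing this count with the number of documents the generator meant to write is one way to "count the differences" before indexing rather than after.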