Near RealTime search @Flipkart

Presented at Lucene/Solr Revolution 2016, Boston.


  1. OCTOBER 11-14, 2016 • BOSTON, MA
  2. Near Real Time Indexing: Building Real Time Search Index for E-Commerce
     Umesh Prasad, Tech Lead @ Flipkart
     Thejus V M, Data Architect @ Flipkart
  3. Agenda
     • Search @ Flipkart
     • Need for Real Time Search
     • SolrCloud Solution
     • Our approach
     • Q & A
  4. Traffic @ Flipkart
     • Peak Traffic
       – ~ 800K active users
       – ~ 160K requests per second
     • Search Traffic
       – ~ 40K searches per second (Service)
       – ~ 10K searches per second (Solr)
     • Latency
       – Median : 11 ms
       – 99th percentile : 1.1 seconds
  5. Search @ Flipkart
     • Catalogue
       – ~ 50 main categories
       – ~ 5000 sub-categories
       – ~ 231 million documents
       – ~ 90 million SKUs
       – ~ 160 million listings
     • E-commerce Marketplace
       – ~ 100K Sellers
       – Local Sellers
       – Regional Availability
       – Logistics Constraints
  6. E-commerce Search
     • Heavy usage of drill down filters
     • Heavy usage of faceting
     • Only top results matter
     • Results grouped/collapsed by products
     • Serviceability and delivery experience MATTERS
  7. Agenda
     • Search @ Flipkart
     • Need for Real Time Search
     • SolrCloud Solution
     • Our approach
     • Q & A
  8. Sorry, Stock Over !!?
  9. Damn !! Is Offer Over ??
  10. What !! All Steal Deals Gone ??
  11. Product / Listing: Important Attributes
      [Diagram] Services that supply attributes for a Product (aka SKU) and its Listings: Seller Rating Service, Catalogue Service, Promise Service, Availability Service, Offer Service, Pricing Service
  12. Summary : Lucene Document
      • Product/SKU (Parent Document)
        – Listing (Child Document)
      • Query : Mostly SKU Attributes (Free Text)
      • Filters : SKU + Listing Attributes (Drill Down)
      • Ranking : SKU + Listing Attributes (Explicit/Relevance)
      • Index Time Join aka Block Join (Best Performance)
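The parent/child layout on the summary slide can be illustrated with a toy model. This is a minimal sketch in plain Python, not Lucene code: it assumes the usual block-join convention that the listings of a product are stored contiguously with the product document last in the block, and all field names and values are hypothetical.

```python
# Toy model of an index-time (block) join; not Lucene code.
# Assumption: each block stores its listings first and the product (parent) last,
# so the parent of any listing is the first product doc that follows it.

docs = [
    {"type": "listing", "seller": "S1", "in_stock": False},  # doc 0
    {"type": "listing", "seller": "S2", "in_stock": True},   # doc 1
    {"type": "product", "title": "phone x"},                 # doc 2 (parent of 0 and 1)
    {"type": "listing", "seller": "S1", "in_stock": False},  # doc 3
    {"type": "product", "title": "phone y"},                 # doc 4 (parent of 3)
]

# Positions of all parent documents, in index order.
parent_ids = [i for i, d in enumerate(docs) if d["type"] == "product"]


def parent_of(child_id):
    """Return the first parent doc id that follows the child in the index."""
    return next(p for p in parent_ids if p > child_id)


def products_with_listing(predicate):
    """Join matching listings up to their products, de-duplicated, in doc order."""
    hits = {
        parent_of(i)
        for i, d in enumerate(docs)
        if d["type"] == "listing" and predicate(d)
    }
    return sorted(hits)


# Filter on a listing attribute, return products.
in_stock_products = products_with_listing(lambda d: d["in_stock"])
```

Because the join is resolved purely by document position, it is cheap at query time, but it also means the whole block has to stay contiguous whenever anything in it changes.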
  13. Out Of Stock, but Why Show?
      • Index has stale availability data
      • 234K products
  14. Challenge 1 : High Update Rates

      Signal             updates/sec (normal)   updates/sec (peak)   updates/hr
      text / catalogue   ~10                    ~100                 ~100K
      pricing            ~100                   ~1K                  ~10 million
      availability       ~100                   ~10K                 ~10 million
      offer              ~100                   ~10K                 ~10 million
      seller rating      ~10                    ~1K                  ~1 million
      signal 6           ~10                    ~100                 ~1 million
      signal 7           ~100                   ~10K                 ~10 million
      signal 8           ~100                   ~10K                 ~10 million
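To get a feel for the combined load, the per-second columns of the update-rate slide can simply be added up. The sketch below treats the slide's approximate ("~") figures as exact numbers, which is an assumption made only for the arithmetic.

```python
# Per-signal update rates from the slide, as (normal, peak) updates per second.
# The slide gives approximate values; they are treated as exact here.
rates = {
    "text / catalogue": (10, 100),
    "pricing": (100, 1_000),
    "availability": (100, 10_000),
    "offer": (100, 10_000),
    "seller rating": (10, 1_000),
    "signal 6": (10, 100),
    "signal 7": (100, 10_000),
    "signal 8": (100, 10_000),
}

normal_total = sum(normal for normal, _ in rates.values())  # 530
peak_total = sum(peak for _, peak in rates.values())        # 42200
```

Under that reading, the eight listed signals add up to roughly 500 updates per second normally and roughly 42K per second at peak.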
  15. Challenge 2 : Micro Services
      [Diagram] Catalogue, Pricing, Availability, Offers, ... feed the ingestion pipeline as separate update streams (Updates Stream 1, 2, 3); after change propagation, the Document Builder emits documents {L1,L2 … P1} to Solr/Lucene
      ● Lucene doesn’t support Partial Updates
      ● Update = Delete + Add
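The consequence of "Update = Delete + Add" for a block of listings plus product can be sketched with a toy index. This is an illustration only: `ToyBlockIndex` and its fields are hypothetical, and it models nothing more than append-only writes with a set of deleted positions.

```python
class ToyBlockIndex:
    """Append-only doc list plus a deleted set; a rough stand-in for segments."""

    def __init__(self):
        self.docs = []        # (product_id, doc) tuples, never modified in place
        self.deleted = set()  # positions marked as deleted
        self.writes = 0       # number of documents written so far

    def update_block(self, product_id, listings, product):
        # Delete: mark every document of the old block as deleted ...
        for pos, (pid, _) in enumerate(self.docs):
            if pid == product_id:
                self.deleted.add(pos)
        # ... Add: re-write the whole block, listings first, product last.
        for doc in listings + [product]:
            self.docs.append((product_id, doc))
            self.writes += 1

    def live_docs(self):
        return [d for pos, (_, d) in enumerate(self.docs) if pos not in self.deleted]


index = ToyBlockIndex()
product = {"id": "P1", "title": "phone x"}
listings = [{"id": "L1", "price": 100}, {"id": "L2", "price": 120}]
index.update_block("P1", listings, product)

# A single price change on L2 still rewrites all three documents of the block.
listings[1] = {"id": "L2", "price": 110}
index.update_block("P1", listings, product)
```

After the second call the toy index has written six documents and deleted three to reflect one changed field, which is the churn the slide is pointing at.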
  16. Agenda
      • Search @ Flipkart
      • Need for Real Time Search
      • SolrCloud Solution
      • Our approach
      • Q & A
  17. SolrCloud for NRT
      [Diagram] The ingestion pipeline sends a batch of documents to the shard leader, which keeps an update log (for document versioning) and forwards to the replica; auto commit and soft commit are shown at the leader; each of the six shard/replica pairs re-opens its searcher
  18. SolrCloud Evaluation
      • Update = Delete + Add
        – Block Join Index ⇒ Update Whole Block (Product + Listings)
      • Updated document gets streamed to all replicas in sync
        – Reduces indexing throughput
      • Soft commit is not free
        – Soft commit ⇒ In Memory Segment
        – Lots of merges
        – Huge document churn / deletes
        – All caches still need to be re-generated
        – Filter cache misses especially hurt performance
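The point about caches can be made concrete with a toy searcher. This is a simplified sketch, not Solr code: `ToySearcher` is hypothetical, it ignores autowarming, and it assumes only that each re-opened searcher starts with its own empty filter cache.

```python
class ToySearcher:
    """One point-in-time view of the index with its own filter cache."""

    def __init__(self, docs):
        self.docs = list(docs)   # snapshot taken when the searcher is opened
        self.filter_cache = {}
        self.cache_misses = 0

    def filter(self, field, value):
        key = (field, value)
        if key not in self.filter_cache:
            # A full scan stands in for the expensive rebuild of a filter.
            self.cache_misses += 1
            self.filter_cache[key] = {
                i for i, d in enumerate(self.docs) if d.get(field) == value
            }
        return self.filter_cache[key]


docs = [
    {"brand": "Apple", "available": True},
    {"brand": "Samsung", "available": True},
]
searcher = ToySearcher(docs)
searcher.filter("available", True)
searcher.filter("available", True)    # second call is served from the cache
misses_before_commit = searcher.cache_misses

docs[1] = {"brand": "Samsung", "available": False}
searcher = ToySearcher(docs)          # soft commit: new searcher, empty caches
searcher.filter("available", True)    # the same filter has to be recomputed
misses_after_commit = searcher.cache_misses
```

With frequent soft commits, every re-open pays that recomputation again for each hot filter.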
  19. Agenda
      • Search @ Flipkart
      • Need for Real Time Index
      • SolrCloud Solution
      • Our approach
      • Q & A
  20. Lucene Index / Lucene Segment
      • Documents
        – ProductA : brand : Apple, availability : T, price : 45000
        – ProductB : brand : Samsung, availability : T, price : 23000
        – ProductC : brand : Apple, availability : F, price : 5000
      • Document ID Mappings : 0 ProductA, 1 ProductB, 2 ProductC
      • Posting List (Inverted Index) : Terms → Sparse Bitsets
        – brand : Apple → 0, 2
        – brand : Samsung → 1
        – availability : T → 0, 1
      • DocValues (columnar data)
        – Price : 45000, 23000, 5000
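The three structures on the segment slide can be rebuilt from the slide's own example data. This is a minimal sketch using plain Python lists and dictionaries in place of Lucene's on-disk formats; the `search` helper is a hypothetical addition for illustration.

```python
from collections import defaultdict

# The three example documents from the slide, in doc-id order.
products = [
    ("ProductA", {"brand": "Apple", "availability": "T", "price": 45000}),
    ("ProductB", {"brand": "Samsung", "availability": "T", "price": 23000}),
    ("ProductC", {"brand": "Apple", "availability": "F", "price": 5000}),
]

# Document ID mapping: doc id -> product.
doc_id_to_product = [name for name, _ in products]

# Inverted index: term -> sorted list of doc ids containing it.
postings = defaultdict(list)
for doc_id, (_, fields) in enumerate(products):
    for field in ("brand", "availability"):
        postings[f"{field}:{fields[field]}"].append(doc_id)

# Forward (columnar) index: one price per doc id.
price_doc_values = [fields["price"] for _, fields in products]


def search(*terms):
    """AND together the posting lists of all given terms."""
    result = set(range(len(products)))
    for term in terms:
        result &= set(postings.get(term, []))
    return sorted(result)


available_apple = search("brand:Apple", "availability:T")
```

Matching walks the inverted index, while sorting and faceting read per-document values from the columnar side, which is why the two are kept as separate structures.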
  21. A Typical Search Flow
      [Diagram] Query → Query Rewrite → Matching → Ranking, Faceting, Stats → Results, plus Other Components
      • Lucene Segment : Inverted Index (Posting List), Forward Index (Doc Values)
      • NRT Store
      • Example query : "samsung mobiles", Offer : exchange offer, price desc
      • Rewritten as : category : mobiles, brand : samsung, Offer : exchange offer
  22. NRT Forward Index - Considerations
      ● Lookup efficiency
        – 50th percentile : ~10K matches
        – 99th percentile : ~1 million matches
      ● Data on Java heap
        – Memory efficiency
  23. NRT Forward Index - Naive Implementation
      [Diagram] Lookup Engine between a Lucene Segment and the NRT Forward Index
      • Lucene Segment : 0 ProductB, 1 ProductA, 2 ProductC, 3 ProductD
      • NRT Forward Index

        ProductId   Availability   Price
        ProductA    True           100
        ProductB    False          150
        ProductC    False          200
        ProductD    True           250

      • Example lookup : DocId : 3, field : price → ProductId(3) → <ProductD,price> → 250
      • Latency : ~10 secs for ~1 Million lookups
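The naive lookup path can be written out with the values from the slide. This is a sketch under assumptions: the slide does not say how the store is implemented, so a dictionary keyed by product id is used here as a stand-in.

```python
# Lucene segment: doc id -> product id (order taken from the slide).
segment_doc_to_product = ["ProductB", "ProductA", "ProductC", "ProductD"]

# NRT forward index keyed by product id (values taken from the slide's table).
# Assumption: a plain dictionary stands in for whatever store was really used.
nrt_store = {
    "ProductA": {"availability": True, "price": 100},
    "ProductB": {"availability": False, "price": 150},
    "ProductC": {"availability": False, "price": 200},
    "ProductD": {"availability": True, "price": 250},
}


def naive_lookup(doc_id, field):
    """Resolve doc id -> product id string -> field value, once per matched doc."""
    product_id = segment_doc_to_product[doc_id]  # fetch the string key
    return nrt_store[product_id][field]          # hash lookup on that key


price_of_doc_3 = naive_lookup(3, "price")
```

Every matched document costs a key fetch plus a hash lookup on a string, so a query with around a million matches repeats that work a million times.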
  24. NRT Store - Forward Index Optimized
      [Diagram] Lookup Engine between a Lucene Segment and the NRT Forward Index (Segment Independent)
      • Lucene Segment : 0 ProductB, 1 ProductA, 2 ProductC, 3 ProductD
      • DocId - NrtId : 0 → 3, 1 → 0, 2 → 1, 3 → 2
      • NRT Forward Index (Segment Independent), by NrtId : 0 ProductA, 1 ProductC, 2 ProductD, 3 ProductB
        – Price : 100, 200, 250, 150
        – Availability : T, F, F, T
        – Status : 01, 10, 01, 00
      • Example lookup : DocId : 3, Field : price → NrtId(3) = 2 → Price(2) = 250
      • Latency : ~100 ms for ~1 Million lookups
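The optimized layout replaces string keys with two array reads. The sketch below uses the DocId to NrtId mapping and the price column from the slide; the `apply_update` helper and the new price of 240 are hypothetical, added only to show an update that never touches the Lucene segment.

```python
from array import array

# Per-segment mapping from Lucene doc id to NRT id (values from the slide).
doc_to_nrt = array("i", [3, 0, 1, 2])

# Segment-independent price column, indexed by NRT id (values from the slide).
price_by_nrt = array("i", [100, 200, 250, 150])


def price(doc_id):
    """Two array reads: doc id -> NRT id -> price."""
    return price_by_nrt[doc_to_nrt[doc_id]]


def apply_update(nrt_id, new_price):
    """Hypothetical update path: overwrite one slot, no commit or reindex."""
    price_by_nrt[nrt_id] = new_price


before = price(3)      # doc 3 -> NRT id 2 -> 250, as on the slide
apply_update(2, 240)   # illustrative price change for ProductD
after = price(3)
```

Only the doc-id mapping depends on the segment; the value columns are shared, so an update is a single in-place write that the next lookup sees immediately.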
  25. NRT Store Filter - PostFilter
      • PostFilter(Price:[100 TO 150])
      • Lucene Segment : 0 ProductB, 1 ProductA, 2 ProductC, 3 ProductD
      • DocId - NrtId : 0 → 3, 1 → 0, 2 → 1, 3 → 2
      • NRT Forward Index (Segment Independent), by NrtId : 0 ProductA, 1 ProductC, 2 ProductD, 3 ProductB
        – Price : 100, 200, 250, 150
        – Availability : T, F, F, T
        – Status : 01, 10, 01, 00
      • Example : DocId : 3 → NrtId(3) = 2 → Price(2) = 250, outside the range → Don’t Delegate
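The post-filter step can be sketched as a check applied to each document that the main query has already matched. This is plain Python rather than Solr's collector API: "delegating" is modeled by yielding the doc id, and the mapping and prices are the ones from the slide.

```python
# Mapping and price column as on the slide.
doc_to_nrt = [3, 0, 1, 2]
price_by_nrt = [100, 200, 250, 150]


def post_filter(matched_doc_ids, low, high):
    """Yield only the matches whose current NRT price lies in [low, high]."""
    for doc_id in matched_doc_ids:
        current_price = price_by_nrt[doc_to_nrt[doc_id]]
        if low <= current_price <= high:
            yield doc_id  # "delegate": pass the doc on to the next stage
        # otherwise the doc is dropped ("don't delegate")


# PostFilter(Price:[100 TO 150]) applied to a query that matched all four docs.
kept = list(post_filter([0, 1, 2, 3], 100, 150))
```

Since the price is read at collection time, the filter always reflects the latest value, at the cost of one lookup per matched document.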
  26. NRT Filter : NRT Store - Invert index
      [Diagram] NRT Forward Store → NRT Inverter → NRT DocIdSet Cache
      • Lucene Segment : 0 ProductB, 1 ProductA, 2 ProductC, 3 ProductD
      • NRT DocIdSet Cache
        – Availability : T → 0, 3
        – Offer : O1 → 2, 3
      • Example : Offer:O1 DocIdSet
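Inverting a forward column into a cached doc-id set might look like the sketch below. The offer column is hypothetical, chosen so that Offer:O1 yields the doc ids 2 and 3 shown on the slide; the cache refresh policy is also an assumption, since the slide does not describe one.

```python
# Doc id -> NRT id mapping as on the earlier slides.
doc_to_nrt = [3, 0, 1, 2]

# Hypothetical offer column by NRT id, chosen so Offer:O1 matches docs 2 and 3.
offer_by_nrt = ["O2", "O1", "O1", "O2"]

doc_id_set_cache = {}


def invert(column, value):
    """Scan the forward column once and build a bitset of matching doc ids."""
    bits = 0
    for doc_id, nrt_id in enumerate(doc_to_nrt):
        if column[nrt_id] == value:
            bits |= 1 << doc_id
    return bits


def cached_filter(field, column, value):
    """Return the cached bitset for field:value, inverting on first use."""
    key = (field, value)
    if key not in doc_id_set_cache:
        doc_id_set_cache[key] = invert(column, value)
    return doc_id_set_cache[key]


def doc_ids(bits):
    """Expand a bitset into a sorted list of doc ids."""
    return [i for i in range(bits.bit_length()) if bits >> i & 1]


first = doc_ids(cached_filter("offer", offer_by_nrt, "O1"))

offer_by_nrt[2] = "O2"   # offer ends for NRT id 2 (doc 3)
stale = doc_ids(cached_filter("offer", offer_by_nrt, "O1"))   # cache not refreshed

doc_id_set_cache.clear()  # assumed periodic refresh of the cache
fresh = doc_ids(cached_filter("offer", offer_by_nrt, "O1"))
```

The cached set is fast to intersect with other filters, but between refreshes it can disagree with a direct lookup in the forward store, which is the trade-off between real-time and near-real-time filtering.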
  27. Solr Integration Points
      • ValueSources
      • Filtering
        – Custom Filter Implementation for cached DocIdSet
        – Custom PostFilter
      • Query
        – Wrapper over Filter
      • Custom FacetComponent
  28. Near Real Time Solr Architecture
      [Diagram labels] Catalogue, Pricing, Availability, Offers, Seller Quality, Others; Ingestion pipeline; Kafka; NRT Updates; Lucene Updates; Solr Master; Commit + Replicate + Reopen; Solr; Lucene; NRT Forward Index; NRT Inverted store; Matching; Ranking; Faceting; Redis; Bootstrap
  29. Accomplishments
      • Real time sorting
      • Real time filtering : PostFilter
        – Higher latency
      • Near real time filtering : cached DocIdSet
        – No consistency between lookup and filtering
      • Independent of Lucene commits
      • Query latency comparable to DocValues
        – Consistent 99% performance
  30. Accomplishments @ Flipkart
      ● Real time consumption for ~150 Signals
      ● Reduction in shown out of stock products by 2X
      ● Production instances of ~50K updates/second real time
  31. Thank you & Questions
