Consuming Real-Time Signals in Solr
Umesh Prasad
SDE 3 @ Flipkart
Flipkart’s Index
1. Data is organized in multiple indexes (Solr cores), holding a couple of million documents.
2. SKUs are documents.
3. Extensive use of facets and filters.
4. Not every search allows faceting.
Lots of custom components
1. Custom collectors (for blending results for diversity / personalization)
2. Custom query parsers (for heavily customized scoring)
3. Custom fields
Typical Ecommerce Document
● Catalogue data
○ Static
○ Largely textual
● Pricing related data
○ Dynamic
○ Faster moving
● Offers
○ Channel specific based on nature of event
● Availability
○ Dynamic
○ Faster moving
and more...
First Cut Integration
1. Catalogue Management System, aka CMS
a. Single source of truth for all systems
b. Merges data from multiple sources using joins and keeps the latest snapshot, keyed by product ID
c. Raises a notification whenever the data changes
[Diagram: the Catalogue Management System (static and dynamic data) sends notifications to the Data Import Handler (fetch, transform, dedup, update), which updates Solr. Sales signals and custom tags also appear as inputs.]
But...
1. Limitations
a. Too much data (more than 80% of it is of no interest to the search system).
b. The CMS has to keep data forever (it is the source of truth), but the search system doesn’t need to index every document (e.g. obsolete products), so many documents get dropped.
c. Merging becomes too much for the CMS and introduces lag.
2. DIH limitations
a. Single-threaded. (Multithreading had bugs and was removed in 4.x, SOLR-3262.)
b. Too many notifications from the CMS. Fetch, transform, compare, and discard still cost time, and being single-threaded doesn’t help.
c. Some signals are of interest only to the search system (normalized revenue, tag pages) but are difficult to integrate proactively.
So CMS is re-factored
[Diagram: CMS (service) and Dynamic Field 1 Service (service), each with a notification stream; dynamic sorting fields (sparse, but a lot of them) kept as a snapshot in a MySQL DB; Solr master, with the external field consumed through DIH; Solr slaves.]
Why are partial updates a challenge in Lucene?
1. Update
a. Lucene doesn’t support partial updates. They are hard to do with an inverted index, because all terms for the document need to be updated. There are lots of open tickets.
b. LUCENE-4272 (term vector based), LUCENE-3837, LUCENE-4258 (overlay segment based), Incremental Field Updates through Stacked Segments
c. Document @ t1 → Term vectors {T1, T2, T3, T4, T5}
d. Document @ t2 → Term vectors {T1, T4, T10}
e. An inverted index stores a posting list for each of its terms. These posting lists are quite sparse and are compressed using delta encoding for efficiency.
f. T1 → {1, 5, 7}, etc.
g. T2 → {2, 5, 6}
h. To support a partial update, the document has to be removed from the posting lists of all its previous terms. That is non-trivial, because it involves remembering and storing all terms for a given document.
i. So instead, Lucene and other inverted-index systems mark the old document as deleted in a separate data structure (live docs).
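As a rough illustration of item e, the sketch below delta-encodes a sorted posting list by storing gaps between consecutive document IDs. The class and method names (PostingDeltas, encode, decode) are hypothetical and are not Lucene APIs; Lucene's real codecs are considerably more elaborate.

```java
import java.util.Arrays;

class PostingDeltas {
    // Turn sorted doc IDs into gaps: the first ID, then each difference.
    static int[] encode(int[] sortedDocIds) {
        int[] gaps = new int[sortedDocIds.length];
        int previous = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            gaps[i] = sortedDocIds[i] - previous;
            previous = sortedDocIds[i];
        }
        return gaps;
    }

    // Rebuild the doc IDs with a running sum over the gaps.
    static int[] decode(int[] gaps) {
        int[] docIds = new int[gaps.length];
        int running = 0;
        for (int i = 0; i < gaps.length; i++) {
            running += gaps[i];
            docIds[i] = running;
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] t1 = {1, 5, 7}; // posting list for T1 from the slide
        System.out.println(Arrays.toString(encode(t1)));         // [1, 4, 2]
        System.out.println(Arrays.toString(decode(encode(t1)))); // [1, 5, 7]
    }
}
```

Removing document 5 from the middle of this list changes the gap that follows it ({1, 4, 2} becomes {1, 6}), so the list has to be re-encoded. That is one reason in-place removal from posting lists is unattractive.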
Why are partial updates a challenge in Lucene? (continued)
1. What this means is that an update is actually
a. Delete + add (regardless of which attribute changed).
b. Deleted documents are compacted by a background merge thread.
2. Updates become visible only after a commit.
a. A soft commit creates a new segment in memory.
b. A hard commit does an fsync to the directory.
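The delete + add behaviour can be sketched with a toy in-memory index. This is a simplified model, not Lucene code: TinyIndex and its methods are hypothetical, there are no segments or commits, and each call is assumed to pass distinct terms.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TinyIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private final Map<String, Integer> latestDocForKey = new HashMap<>();
    private final BitSet liveDocs = new BitSet();
    private int nextDocId = 0;

    // "Update" = mark the old doc as deleted, then add a brand-new doc.
    int update(String key, String... terms) {
        Integer old = latestDocForKey.get(key);
        if (old != null) {
            liveDocs.clear(old); // posting lists are left untouched
        }
        int docId = nextDocId++;
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
        liveDocs.set(docId);
        latestDocForKey.put(key, docId);
        return docId;
    }

    // Search walks the posting list and skips docs that are no longer live.
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        for (int docId : postings.getOrDefault(term, Collections.<Integer>emptyList())) {
            if (liveDocs.get(docId)) {
                hits.add(docId);
            }
        }
        return hits;
    }

    // Length of the stored posting list, including deleted docs.
    int rawPostingLength(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptyList()).size();
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.update("sku-1", "T1", "T2", "T3", "T4", "T5"); // document @ t1
        index.update("sku-1", "T1", "T4", "T10");            // document @ t2
        System.out.println(index.search("T1"));
        System.out.println(index.search("T2"));
        System.out.println(index.rawPostingLength("T1"));
    }
}
```

In this model the stale entry for the first version stays in the posting lists until something rewrites them, which is the role the background merge plays in the slide.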
But do we need to re-index a document? Let’s evaluate
1. Lucene might hold 3 kinds of data
a. Data used for actual search (analyzed, converted into tokens)
b. Data used for plain filtering (not analyzed, e.g. price, discount)
c. Data used for ranking (e.g. relevancy signals, and there can be a lot of them)
2. Searchable attributes ⇒ need to be inverted ⇒ slow changing.
a. The pipeline can be spam filtering → text cleaning → duplicate detection → NLP → entity extraction, etc.
3. Facetable/filterable attributes ⇒ little analysis ⇒ numeric or tags, usually with enumerated values
a. Can be dynamic
b. Can be governed by policies and business constraints
But do we need to re-index a document? Let’s evaluate (continued)
1. Ranking signals ⇒ need to be row-oriented (looked up by document, not by term).
a. Can be batch updates (e.g. category-specific ranks, ratings) or real-time updates (e.g. availability).
b. Lucene actually un-inverts such fields using FieldCache.
c. Doc values were introduced to manage the cost of FieldCache and to provide better updatability.
d. Updatable NumericDocValues (LUCENE-5189, since 4.6), updatable binary doc values (LUCENE-5513, since 4.8)
e. Solr still doesn’t have updatable doc values. A Jira ticket (SOLR-5944) is open, but there are issues around the update/write-ahead logs.
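To make "un-inverting" in item b concrete, the sketch below turns a term → documents mapping into a document → value array, which is the shape a ranking function wants. It only illustrates the idea: Uninvert is a hypothetical class, not the FieldCache implementation, and it assumes each document holds at most one value for the field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class Uninvert {
    // postings: field value -> doc IDs holding that value.
    // Result: array indexed by doc ID (null where the doc has no value).
    static String[] uninvert(Map<String, int[]> postings, int maxDoc) {
        String[] valueByDoc = new String[maxDoc];
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int docId : entry.getValue()) {
                valueByDoc[docId] = entry.getKey();
            }
        }
        return valueByDoc;
    }

    public static void main(String[] args) {
        Map<String, int[]> availability = new LinkedHashMap<>();
        availability.put("in_stock", new int[]{0, 2});
        availability.put("out_of_stock", new int[]{1});
        String[] byDoc = uninvert(availability, 4);
        for (int doc = 0; doc < byDoc.length; doc++) {
            System.out.println(doc + " -> " + byDoc[doc]);
        }
    }
}
```

Building this array costs a full pass over every posting list of the field, which hints at why un-inverting on demand is expensive and why a structure stored per document at index time is attractive.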
First Approach: Leverage Updatable Numeric DocValues
1. Solr limitation: easily overcome in a master-slave model by plugging in your own update chain and accessing IndexWriter directly.
2. But:
a. You need a commit for doc values updates to become visible. (Not real time!)
b. Filtering on doc values is inefficient, especially on numeric fields.
c. Making it work in SolrCloud is non-trivial. For details, see SOLR-5944.
d. Doc values are dense and updates are not stacked. Every commit dumps the full view of each modified field’s doc values (optimizing for search performance). (http://shaierera.blogspot.in/2014/04/updatable-docvalues-under-hood.html)
e. But what if we had doc values for 500 fields across millions of docs?
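Item e can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes 2 million documents and 8 bytes per value, both invented for illustration; it treats the result as a worst case where every field is touched, and real files may be smaller because numeric doc values are stored in packed form. DenseRewriteCost is a hypothetical helper.

```java
class DenseRewriteCost {
    // Worst-case bytes written per commit if each touched field is dumped
    // in full: one value per document, whether or not that document changed.
    static long bytesRewritten(long maxDoc, int fieldsTouched, int bytesPerValue) {
        return maxDoc * fieldsTouched * bytesPerValue;
    }

    public static void main(String[] args) {
        long maxDoc = 2_000_000L; // assumption: "a couple of million" documents
        int bytesPerValue = 8;    // assumption: one uncompressed long per doc
        System.out.println(bytesRewritten(maxDoc, 1, bytesPerValue));   // 16000000
        System.out.println(bytesRewritten(maxDoc, 500, bytesPerValue)); // 8000000000
    }
}
```

Under these assumptions a single touched field costs about 16 MB per commit, and touching all 500 fields approaches 8 GB, regardless of how few documents actually changed.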
First Approach: Leverage Updatable Numeric DocValues (continued)
1. Commit caveats:
a. Soft commits are NOT FREE. A soft commit in Solr = IndexWriter.getReader() in Lucene = flush + open. NRTCachingDirectory caches the small segments produced and makes soft commits cheaper. Details can be found in McCandless’s post.
b. Solr invalidates all caches on every commit, and they have to be regenerated. Some caches, like filterCache, have a huge impact on performance. Warming them up can itself take 2-3 minutes at times.
c. Warm-up puts memory pressure on the JVM and causes spikes in allocations. Some caches, like documentCache, can’t be warmed up at all.
d. More commits ⇒ more segments ⇒ more merges
Second Approach: NRT Store and Value Sources
http://lucene.apache.org/core/4_10_0/queries/org/apache/lucene/queries/function/ValueSource.html
- abstract FunctionValues getValues(Map context, AtomicReaderContext readerContext)
Gets the values for this reader and the context that was previously passed to createWeight()
http://lucene.apache.org/core/4_10_0/queries/org/apache/lucene/queries/function/FunctionValues.htm
FunctionValues
- boolean exists(int doc) : Returns true if there is a value for this document
- double doubleVal(int doc)
Value sources allowed us to plug external data sources right into Solr. The external data need not be part of the index itself, but it must be cheap to retrieve, because the lookup is called millions of times inside a tight loop.
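A minimal sketch of the idea, under stated assumptions: the class below mirrors only the shape of the exists(int doc) and doubleVal(int doc) methods listed above and does not extend Lucene's FunctionValues. ExternalSignalValues is a hypothetical name, and the sketch assumes that a mapping from document ID to array slot is maintained elsewhere. Values live in an in-memory array that a background consumer can update without any commit.

```java
import java.util.concurrent.atomic.AtomicLongArray;

class ExternalSignalValues {
    // Bit pattern of the canonical NaN, used here to mean "no value yet".
    // Consequence: NaN itself cannot be stored as a signal value.
    private static final long MISSING = Double.doubleToLongBits(Double.NaN);

    private final AtomicLongArray bits;

    ExternalSignalValues(int maxDoc) {
        bits = new AtomicLongArray(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            bits.set(doc, MISSING);
        }
    }

    // Called by the signal consumer; visible to readers immediately.
    void update(int doc, double value) {
        bits.set(doc, Double.doubleToLongBits(value));
    }

    // True if a value has been recorded for this document.
    boolean exists(int doc) {
        return bits.get(doc) != MISSING;
    }

    // Array read plus bit conversion; returns 0.0 when nothing is recorded.
    double doubleVal(int doc) {
        long raw = bits.get(doc);
        return raw == MISSING ? 0.0 : Double.longBitsToDouble(raw);
    }

    public static void main(String[] args) {
        ExternalSignalValues availability = new ExternalSignalValues(3);
        availability.update(1, 4.5);
        System.out.println(availability.exists(0) + " " + availability.doubleVal(0));
        System.out.println(availability.exists(1) + " " + availability.doubleVal(1));
    }
}
```

A flat array keeps each lookup to a single memory read, which matters when the method sits inside the scoring loop. A remote call per document would not meet that budget.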
The Challenge
1. Entries in Solr caches have no expiry time, and there is no way to invalidate individual entries.
2. Solution: get rid of the query cache altogether. But we still have the filterCache.
3. So now matching and scoring have to be really fast.
a. Calls to the value source need to be extremely fast. We optimized them until they were as fast as accessing doc values.
b. The ranking functions themselves have a cost. Some of the optimizations involved reducing the cost of the math functions they call.
So the learnings
1. Understand your data, its change rate, and what you want to do with it.
2. Solr and Lucene have really good abstractions around both indexing and querying. Both provide a lot of hooks and plugins. Think them through and take advantage of them.
3. Experiment, profile, and benchmark. Delve into the APIs and internals.
4. The experts do help. The insights that doc values are dense and that soft commits are not free came directly from discussions with Shalin.
5. Learnt the hard way: it is really difficult to keep an inverted index in sync. We actually built a Lucene codec that built and updated the inverted index in Redis.

Consuming RealTime Signals in Solr

  • 1. Consuming Real time Signals in Solr Umesh Prasad SDE 3 @ Flipkart
  • 2. Flipkart’s Index Flipkart’s Index 1. Data organized in multiple indexes/Solr cores. Couple of millions of documents. 2. SKUs are documents. 3. Data organized in multiple solr cores. 4. Extensive use of facets and filters. 5. All search doesn’t allow faceting. Lots of custom components 1. Custom collectors ( for enabling blending of results for diversity / personalization ) 2. Custom Query parsers ( for enabling really customized scoring) 3. Custom fields
  • 3. Typical Ecommerce Document
      ● Catalogue data
         ○ Static
         ○ Largely textual
      ● Pricing related data
         ○ Dynamic
         ○ Faster moving
      ● Offers
         ○ Channel specific, based on nature of event
      ● Availability
         ○ Dynamic
         ○ Faster moving
      and more...
  • 4. First Cut Integration
      1. Catalogue Management System, aka CMS
         a. Single source of truth for all systems.
         b. Merges data from multiple sources, doing joins, and keeps the latest snapshot, keyed by Product Id.
         c. Raises a notification whenever the data changes.
      [Diagram labels: Catalogue Management System (static and dynamic); Notification; Data Import Handler (Fetch, Transform, Dedup, Update); SOLR; Sales Signals, Custom tags]
  • 5. But ….
      1. Limitations
         a. Too much data (and more than 80% of it is of no interest to the search system).
         b. CMS has to keep data forever (remember, it is the source of truth). But the search system doesn’t need to index all documents (e.g. obsolete products). So lots of drops.
         c. Merging becomes too much for CMS and introduces lag.
      2. DIH Limitations
         a. Single threaded. (Multithreaded mode had bugs and was removed in 4.x, SOLR-3262.)
         b. Too many notifications from CMS. Fetch, transform, compare, discard still costs, and being single threaded doesn’t help.
         c. Some signals are of interest to the search system only (normalized revenue, tag pages), but are difficult to integrate proactively.
  • 6. So CMS is re-factored
      [Diagram labels: CMS (service); Dynamic Field 1 Service (service); Notification stream; Notification stream; dynamic sorting fields (sparse, but a lot of them) (MySQL DB); Snapshot; SOLR Master; External Field, consumed through DIH; Solr Slaves]
  • 7. Why are partial updates a challenge in Lucene?
      1. Update
         a. Lucene doesn’t support partial updates. They are tough to do with an inverted index, because all terms for that document need to be updated. There are lots of open tickets.
         b. LUCENE-4272 (term vector based), LUCENE-3837, LUCENE-4258 (overlay segment based), Incremental Field Updates through Stacked Segments
         c. Document @ t1 → Term vectors {T1, T2, T3, T4, T5}
         d. Document @ t2 → Term vectors {T1, T4, T10}
         e. The inverted index actually stores a posting list for each of its terms. These posting lists are quite sparse and are compressed using delta encoding for efficiency.
         f. T1 → {1, 5, 7} etc.
         g. T2 → {2, 5, 6}
         h. To support a partial update, the document has to be removed from the posting lists of all its previous terms. That is non-trivial, because it involves remembering and storing all terms for a given document.
         i. So instead, Lucene and other inverted index systems mark the old document as deleted in another data structure (live docs).
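The delta encoding mentioned in item (e) of the slide above can be sketched as follows. This is a minimal illustration only: `PostingDeltas` is a hypothetical helper, not a Lucene class, and Lucene's real postings formats are more elaborate. It stores each doc ID as the gap from the previous one, which is why removing a single document from the middle of a list means re-encoding what follows it.

```java
/** Hypothetical helper: gap (delta) encoding of a sorted posting list. */
class PostingDeltas {

    /** Replaces each doc ID with its distance from the previous doc ID. */
    static int[] encode(int[] sortedDocIds) {
        int[] gaps = new int[sortedDocIds.length];
        int previous = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            gaps[i] = sortedDocIds[i] - previous;
            previous = sortedDocIds[i];
        }
        return gaps;
    }

    /** Rebuilds the doc IDs by summing the gaps from the start of the list. */
    static int[] decode(int[] gaps) {
        int[] docIds = new int[gaps.length];
        int running = 0;
        for (int i = 0; i < gaps.length; i++) {
            running += gaps[i];
            docIds[i] = running;
        }
        return docIds;
    }
}
```

With the slide's example, `PostingDeltas.encode(new int[] {1, 5, 7})` yields the gaps `{1, 4, 2}`. Dropping doc 5 from that list changes the gap stored for doc 7 as well, so the entry cannot simply be erased in place.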
  • 8. Why are partial updates a challenge in Lucene?
      1. What it means is that an update is actually
         a. Delete + Add (regardless of which attribute changed).
         b. Deleted documents are compacted by a background merge thread.
      2. Updates become visible only after a commit
         a. A soft commit will create a new segment in memory.
         b. A hard commit will do an fsync to the directory.
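The "Delete + Add" behaviour can be sketched with a toy index. `TinyIndex` and its methods are hypothetical names used only for illustration, and the sketch ignores segments, commits and merges. An update never touches the old posting lists; it clears the old document's bit in a live-docs bitset and appends a brand new document.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy index: updates are "mark old doc deleted, add new doc". */
class TinyIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private final Map<String, Integer> latestDocForKey = new HashMap<>();
    private final BitSet liveDocs = new BitSet();
    private int nextDocId = 0;

    /** Adds or updates the document identified by key; returns the new doc ID. */
    int update(String key, String... terms) {
        Integer oldDocId = latestDocForKey.get(key);
        if (oldDocId != null) {
            liveDocs.clear(oldDocId); // delete: postings are left untouched
        }
        int docId = nextDocId++;
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
        liveDocs.set(docId); // add: the whole document is indexed again
        latestDocForKey.put(key, docId);
        return docId;
    }

    /** Posting list as stored, including documents marked as deleted. */
    List<Integer> rawPostings(String term) {
        return postings.getOrDefault(term, new ArrayList<>());
    }

    /** Posting list filtered through live docs, as a search would see it. */
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        for (int docId : rawPostings(term)) {
            if (liveDocs.get(docId)) {
                hits.add(docId);
            }
        }
        return hits;
    }
}
```

Indexing a document with terms T1..T5 and then updating it to {T1, T4, T10} leaves doc 0 in the stored posting list of T2, but a search for T2 no longer returns it. Physically removing such entries is the job the slide assigns to the background merge.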
  • 9. But do we need to re-index a document? Let’s evaluate
      1. Lucene might hold 3 kinds of data
         a. Data used for actual search (analyzed, converted into tokens)
         b. Data used for plain filtering (not analyzed, e.g. price, discount)
         c. Data used for ranking (e.g. relevancy signals, and there can be a lot of them)
      2. Searchable attributes ⇒ need to be inverted ⇒ slow changing.
         a. The pipeline can be spam filtering → text cleaning → duplicate detection → NLP → entity extraction, etc.
      3. Facetable/filterable attributes ⇒ little analysis ⇒ numeric or tags, usually with enumerated values
         a. Can be dynamic.
         b. Can be governed by policies and business constraints.
  • 10. But do we need to re-index a document? Let’s evaluate
      1. Ranking signals ⇒ need to be row oriented.
         a. Can be batch updates (e.g. category specific ranks, ratings) or real time updates (e.g. availability).
         b. Lucene actually un-inverts such fields using FieldCache.
         c. Doc values were introduced to manage the cost of FieldCache and to better support updatability.
         d. Updatable NumericDocValues (LUCENE-5189, since 4.6), updatable binary doc values (LUCENE-5513, since 4.8).
         e. Solr still doesn’t have updatable doc values. A Jira ticket is open, but there are issues around update/write-ahead logs (SOLR-5944).
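What "un-inverting" in item (b) means can be sketched in a few lines. This is an illustration under simplifying assumptions, not FieldCache itself: `Uninverter` is a hypothetical name and the field is assumed to hold at most one term per document. It turns a term → documents mapping into a per-document array, which is the access pattern that ranking needs.

```java
import java.util.Map;

/** Hypothetical sketch: build a doc -> value array from term -> docs postings. */
class Uninverter {

    /**
     * Walks every posting list once and records, for each document,
     * the term it contains. Documents without a term stay null.
     */
    static String[] uninvert(Map<String, int[]> termToDocs, int maxDoc) {
        String[] valueByDoc = new String[maxDoc];
        for (Map.Entry<String, int[]> entry : termToDocs.entrySet()) {
            for (int docId : entry.getValue()) {
                valueByDoc[docId] = entry.getKey();
            }
        }
        return valueByDoc;
    }
}
```

The cost of this walk is paid at search time, which is the cost the slide says doc values were introduced to manage: they keep the per-document view as part of the index instead of rebuilding it in memory.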
  • 11. First Approach: Leverage Updatable Numeric DocValues
      1. Solr limitation: easily overcome in a master-slave model by plugging in your own update chain and accessing IndexWriter directly.
      2. But:
         a. You need a commit for doc value updates to be reflected. (Not real time!)
         b. Filtering on doc values is inefficient, especially on numeric fields.
         c. Making it work in SolrCloud is non-trivial. For details please see SOLR-5944.
         d. Doc values are dense. Updates are not stacked. Lucene always dumps the full view of the modified field's doc values on every commit (optimizing for search performance). (http://shaierera.blogspot.in/2014/04/updatable-docvalues-under-hood.html)
         e. But what if we had doc values for 500 fields across millions of docs?
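A back-of-the-envelope sketch of the concern in items (d) and (e): because the full column of a modified field is rewritten on commit, the amount written depends on the index size and the number of fields touched, not on how many documents changed. The figures used in the usage note (2 million documents, 8 bytes per value) are assumptions for illustration, not numbers from the talk, and Lucene packs values more compactly in practice. `DenseRewriteCost` is a hypothetical name.

```java
/** Hypothetical estimate of bytes rewritten when dense doc value columns are dumped. */
class DenseRewriteCost {

    /**
     * Upper-bound estimate: every modified field is rewritten for every document,
     * regardless of how many documents actually received a new value.
     */
    static long bytesRewritten(long maxDoc, long modifiedFields, long bytesPerValue) {
        return maxDoc * modifiedFields * bytesPerValue;
    }
}
```

Under those assumptions, `DenseRewriteCost.bytesRewritten(2_000_000L, 500L, 8L)` comes to 8,000,000,000 bytes if all 500 fields are touched between two commits, even if only a handful of documents changed in each field.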
  • 12. First Approach: Leverage Updatable Numeric DocValues
      1. Commit caveats:
         a. Soft commits are NOT FREE. A soft commit in Solr = IndexWriter.getReader() in Lucene = flush + open. There is NRTCachingDirectory, which caches the small segments produced and makes soft commits cheaper. Details can be found in McCandless’s post.
         b. Solr invalidates all caches on every commit, and they have to be regenerated. Some caches, like filterCache, have a huge impact on performance. Warming them up can itself take 2-3 minutes at times.
         c. Warmup puts memory pressure on the JVM and causes spikes in allocations. Some caches, like documentCache, can’t even be warmed up.
         d. More commits ⇒ more segments ⇒ more merges
  • 13. 2nd Approach: NRT Store and Value Sources
      http://lucene.apache.org/core/4_10_0/queries/org/apache/lucene/queries/function/ValueSource.html
      - abstract FunctionValues getValues(Map context, AtomicReaderContext readerContext): Gets the values for this reader and the context that was previously passed to createWeight()
      http://lucene.apache.org/core/4_10_0/queries/org/apache/lucene/queries/function/FunctionValues.htm
      FunctionValues
      - boolean exists(int doc): Returns true if there is a value for this document
      - double doubleVal(int doc)
      Value sources allowed us to plug external data sources right inside Solr. This external data need not be part of the index itself, but it should be easily retrievable, because the value source is called millions of times, right inside a loop.
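The idea on the slide above can be sketched without Lucene on the classpath. Everything below is a hypothetical stand-in: `SignalValues` only mirrors the two `FunctionValues` methods quoted on the slide, and `NrtSignalStore` and `Ranker` are illustrative names, not the store Flipkart built. The sketch assumes one global doc ID space, whereas in Lucene the values are obtained per segment reader and doc IDs are relative to that reader. The point it shows is that a write to the store is visible to the very next scoring call, with no commit in between.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Stand-in for the exists/doubleVal pair of FunctionValues quoted on the slide. */
interface SignalValues {
    boolean exists(int doc);
    double doubleVal(int doc);
}

/** Hypothetical in-memory signal store, readable while it is being updated. */
class NrtSignalStore implements SignalValues {
    // The NaN bit pattern marks "no value"; storing NaN therefore reads as missing.
    private static final long MISSING = Double.doubleToRawLongBits(Double.NaN);
    private final AtomicLongArray bits;

    NrtSignalStore(int maxDoc) {
        bits = new AtomicLongArray(maxDoc);
        for (int i = 0; i < maxDoc; i++) {
            bits.set(i, MISSING);
        }
    }

    /** Real-time update: visible to readers immediately, no commit involved. */
    void put(int doc, double value) {
        bits.set(doc, Double.doubleToRawLongBits(value));
    }

    @Override
    public boolean exists(int doc) {
        return bits.get(doc) != MISSING;
    }

    @Override
    public double doubleVal(int doc) {
        long raw = bits.get(doc); // single read, so a concurrent put cannot tear it
        return raw == MISSING ? 0.0 : Double.longBitsToDouble(raw);
    }
}

/** Hypothetical scoring step that would sit inside the per-document loop. */
class Ranker {
    static double score(SignalValues signal, int doc, double textScore) {
        return signal.exists(doc) ? textScore * signal.doubleVal(doc) : textScore;
    }
}
```

A flat array lookup is used here because the slide stresses that the call sits inside a loop executed millions of times; anything involving hashing, boxing or I/O per document would likely dominate query time.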
  • 14. The Challenge
      1. Entries in Solr caches have no expiry time, and there is no way to invalidate individual entries.
      2. Solution: get rid of the query cache altogether. But we still have filterCache.
      3. So now matching and scoring had to be really fast.
         a. Calls to the value source need to be extremely fast. We have optimized them so that they are as fast as accessing doc values.
         b. The cost of the ranking functions themselves. Some of the optimizations involved reducing the cost of the Math functions themselves.
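The slide does not say which Math optimizations were made. One common way to reduce the cost of a math function inside a scoring loop, shown here purely as an assumed example, is to precompute it for the small integer inputs that occur most often. `LogTable` is a hypothetical name.

```java
/** Hypothetical lookup table that avoids recomputing Math.log1p for small counts. */
class LogTable {
    private final double[] table;

    /** Precomputes log1p(i) for every i in [0, size). */
    LogTable(int size) {
        table = new double[size];
        for (int i = 0; i < size; i++) {
            table[i] = Math.log1p(i);
        }
    }

    /** Array read for covered inputs; falls back to Math.log1p outside the table. */
    double log1p(int count) {
        return (count >= 0 && count < table.length) ? table[count] : Math.log1p(count);
    }
}
```

Whether such a table actually helps depends on the JVM and the access pattern, which is in line with the talk's advice to profile and benchmark before keeping an optimization.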
  • 15. So the learnings
      1. Understand your data, its change rate, and what you want to do with it.
      2. Solr / Lucene have really good abstractions around both indexing and querying. Both provide you with a lot of hooks and plugins. Think them through and take advantage of them.
      3. Experiment, profile and benchmark. Delve into the APIs and internals.
      4. The experts do help. The insights that doc values are dense and that soft commits are not free came directly out of discussions with Shalin.
      5. Learnt the hard way: it is really difficult to keep an inverted index in sync. We actually built a Lucene codec (which built and updated the inverted index in Redis).