Spell Correction Systems for E-commerce engines
Anjan Goswami, HuiZhong Duan (Search Science, WalmartLabs)
The Spell correction problem
Rich literature [KCG90, Pet80].
Active research area [CB04].
Combination of NLP, Machine Learning [DH11, BB01, LDZ12] and
Systems problems [Kuk92].
Spell correction for e-commerce
Critical site feature for e-commerce.
Impact of ML-based spell correction:
Adds revenue.
Reduces bounce rate.
Reduces null results.
Departments such as pharmacy can see large revenue gains from spell
correction.
Spell correction for e-commerce
The science is the same as in any other large-scale spell correction system.
Demand and supply side corpus.
Conversion focus.
User Interfaces.
Spell correction Evaluation
Accuracy for misspelled queries.
Accuracy for correctly spelled queries.
Business metrics.
Coverage.
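The accuracy and coverage metrics above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the `evaluate` function, the (query, gold, predicted) triple format, and the definition of coverage as the share of misspelled queries for which the system proposed any change are all assumptions made for this example. Business metrics need live traffic and are not covered.

```python
def evaluate(examples):
    """examples: list of (query, gold, predicted) triples (hypothetical format)."""
    misspelled = [(q, g, p) for q, g, p in examples if q != g]
    correct = [(q, g, p) for q, g, p in examples if q == g]

    def accuracy(rows):
        # Fraction of rows where the system output equals the gold query.
        return sum(p == g for _, g, p in rows) / len(rows) if rows else 0.0

    # Assumed definition: share of misspelled queries where the system changed anything.
    coverage = (
        sum(p != q for q, _, p in misspelled) / len(misspelled) if misspelled else 0.0
    )
    return {
        "accuracy_misspelled": accuracy(misspelled),
        "accuracy_correct": accuracy(correct),
        "coverage": coverage,
    }


# Made-up examples for illustration only.
examples = [
    ("covr", "cover", "cover"),                      # misspelled, fixed
    ("visio tv", "vizio tv", "visio tv"),            # misspelled, missed
    ("life of pi", "life of pi", "life of pie"),     # correct, over-corrected
    ("iphone case", "iphone case", "iphone case"),   # correct, left alone
]
metrics = evaluate(examples)
```

Reporting the two accuracies separately keeps over-correction of valid queries visible instead of letting it hide inside one aggregate number.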
The problem
Error statistics
Approximately 26% of web queries contain a spelling error [JM].
E-commerce query data can be expected to be similar.
Error Types
Typographic errors: Covr ← Cover
Cognitive errors: Visio Tv ← Vizio Tv
Non-English word errors: X345678 ← X345677
Contextual errors: life of Pie ← Life of Pi
Challenges
General Challenges
Large candidate pool: queries
Open dictionary: all terms are feasible
Efficiency: happens before search is executed
User behavior: query formulation is different from typical writing
Devices: different devices may cause different types of typos
Under-correction: even if a term is a correctly spelled word, it may
still need correction
Over-correction: a term that doesn't appear correct could still be a
good search term
Languages: Different languages have different challenges.
Query Spelling Challenges
Special Challenges (and Opportunities) in e-Commerce
optimization target: linguistic correctness or conversion?
unique dictionary: model numbers, etc.
high cost for over-correction
availability of inventory data
availability of conversion data
General problems
Error modeling
Candidate generation
Ranking and selection of the best candidate.
Modeling
A Noisy Channel Framework
Given user input query q, for every candidate correction c, compute the
conditional probability p(c|q):
$$p(c \mid q) = \frac{p(q \mid c) \cdot p(c)}{p(q)} \propto p(q \mid c) \cdot p(c) \qquad (1)$$
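Since p(q) is the same for every candidate, selecting the best correction reduces to an argmax over p(q|c) · p(c). A minimal sketch, assuming toy probability tables (`SOURCE`, `ERROR`) whose values are made up for illustration; the function names are hypothetical. Scores are summed in log space to avoid underflow.

```python
import math


def best_correction(query, candidates, log_source, log_error):
    """Return the candidate c maximizing log p(q|c) + log p(c)."""
    return max(candidates, key=lambda c: log_error(query, c) + log_source(c))


# Toy, made-up probabilities for illustration only.
SOURCE = {"cover": 0.6, "cove": 0.3, "covr": 0.1}              # p(c)
ERROR = {
    ("covr", "cover"): 0.2,                                    # p(q|c)
    ("covr", "cove"): 0.05,
    ("covr", "covr"): 0.7,
}


def log_source(c):
    return math.log(SOURCE[c])


def log_error(q, c):
    return math.log(ERROR[(q, c)])


best = best_correction("covr", ["cover", "cove", "covr"], log_source, log_error)
```

Keeping the query itself among the candidates lets the model decide to leave a query unchanged.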
Modeling
A Noisy Channel Framework (cont.)
Source model p(c)
Captures: how likely the user is to intend query c in the first place
Typically: language model
Rationale: common phrases have high probabilities
Error model p(q|c)
Captures: how likely c is to be misspelled as q
Straightforward model: edit distance
Rationale: a misspelled query should not be too different from the
intended query
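The straightforward error model can be sketched with the standard Levenshtein distance. Turning the distance into an unnormalized log-score via a decay parameter `alpha` is an assumption made for this example; the slides do not specify how the distance maps to p(q|c).

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # substitute, or match at zero cost
                )
            )
        prev = curr
    return prev[-1]


def log_error_score(q, c, alpha=1.0):
    """Assumed mapping: unnormalized log p(q|c) decays linearly with distance."""
    return -alpha * edit_distance(q, c)
```

For example, `edit_distance("covr", "cover")` is 1, so "cover" scores higher than any candidate two or more edits away.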
Modeling
A Noisy Channel Framework (cont.)
More on Source model p(c)
Linguistic correctness is important
Should also reflect query popularity
In e-Commerce, we also need to consider query conversion, and query
revenue
Modeling
A Noisy Channel Framework (cont.)
Language Model
n-gram language model: data sparsity increases as n goes up
backoff to, or interpolation with, lower-order n-grams is necessary
smoothing is important
Good-Turing smoothing: use the counts of items seen once to estimate
the probability of unseen items
Additive smoothing: add a pseudo-count to terms/phrases
Kneser-Ney smoothing: a smart way of combining backoff and interpolation
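Two of the ideas above, additive smoothing and interpolation with a lower-order model, can be combined in a small bigram language model over queries. This is a sketch under stated assumptions: the class name `BigramLM`, the interpolation weight `lam`, and the pseudo-count `alpha` are hypothetical, and the training queries are made up. It does not implement Good-Turing or Kneser-Ney smoothing.

```python
import math
from collections import Counter


class BigramLM:
    def __init__(self, queries, lam=0.7, alpha=1.0):
        self.lam, self.alpha = lam, alpha
        self.unigrams, self.bigrams = Counter(), Counter()
        for q in queries:
            tokens = ["<s>"] + q.split() + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.total = sum(self.unigrams.values())
        self.vocab = len(self.unigrams) + 1  # one extra slot for unseen words

    def p_unigram(self, w):
        # Additive smoothing: every word gets a pseudo-count of alpha.
        return (self.unigrams[w] + self.alpha) / (self.total + self.alpha * self.vocab)

    def p_bigram(self, prev, w):
        return (self.bigrams[(prev, w)] + self.alpha) / (
            self.unigrams[prev] + self.alpha * self.vocab
        )

    def log_prob(self, query):
        tokens = ["<s>"] + query.split() + ["</s>"]
        total = 0.0
        for prev, w in zip(tokens, tokens[1:]):
            # Interpolate the bigram estimate with the unigram estimate.
            p = self.lam * self.p_bigram(prev, w) + (1 - self.lam) * self.p_unigram(w)
            total += math.log(p)
        return total


lm = BigramLM(["life of pi", "life of pi", "life of pie"])
```

Because of the pseudo-counts, an unseen query such as "life of pii" still gets a finite, though lower, log-probability.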
Modeling
A Noisy Channel Framework (cont.)
More on Error model p(q|c)
Weighted edit model is better: p(a → e) > p(a → n)
Context matters: p(a → e | context = "be...")
Multi-word errors need to be considered: p("gopro" | "go pro"); these
can be modeled by an HMM, a joint sequence model, etc.
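A weighted edit model can be sketched by giving each character substitution its own cost, so that likely confusions are cheaper than unlikely ones. The costs in `SUB_COST` are made up for illustration (in practice they might be negative log probabilities estimated from query logs), and this sketch ignores context and multi-word errors.

```python
def weighted_edit_distance(source, target, sub_cost, indel_cost=1.0):
    """Edit distance where each substitution (x, y) has its own cost."""
    prev = [j * indel_cost for j in range(len(target) + 1)]
    for i, cs in enumerate(source, start=1):
        curr = [i * indel_cost]
        for j, ct in enumerate(target, start=1):
            # Unlisted substitutions fall back to a default cost of 1.0.
            sub = 0.0 if cs == ct else sub_cost.get((cs, ct), 1.0)
            curr.append(
                min(
                    prev[j] + indel_cost,      # delete cs
                    curr[j - 1] + indel_cost,  # insert ct
                    prev[j - 1] + sub,         # substitute cs -> ct
                )
            )
        prev = curr
    return prev[-1]


# Hypothetical costs: a/e and s/z confusions are cheaper than arbitrary ones.
SUB_COST = {("a", "e"): 0.3, ("e", "a"): 0.3, ("s", "z"): 0.4, ("z", "s"): 0.4}
```

With these costs, "bat" is closer to "bet" than to "bnt", mirroring p(a → e) > p(a → n).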
Modeling
A Noisy Channel Framework (cont.)
Hierarchical Error models
Character level error model
p(a → e | context = "be...")
generalizes well
less accurate
Syllable level error model
Word level error model
p(pi → pie | context = "life of ...")
sparse data
more accurate
Phrase level error model
...
Modeling
Discriminative Models
Why?
Noisy channel model is a generative framework
Multiplying the components is difficult because the probabilities are
estimated in different ways
It is unclear how to merge several signals into one probability estimate
(e.g. linguistic correctness vs. popularity vs. revenue)
Other heuristic and domain-specific features cannot be subsumed in the
noisy channel model
Modeling
Discriminative Models (cont.)
How?
Learn to score each ⟨q, c⟩ pair so that the best correction has the highest score
Challenges
Obtaining large scale training data: text parsing, human annotation
Learning methods
Classification
Learning to Rank
Structural learning
Efficiency: use the noisy channel model to retrieve a handful of candidates
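Scoring ⟨q, c⟩ pairs discriminatively can be sketched with a linear model over hand-picked features, trained with perceptron-style updates. The perceptron is a stand-in chosen for brevity, not a method named in the slides, and the features, the `POPULARITY` counts, and the training data are all made up for illustration. The candidate lists are assumed to come from a cheaper first stage.

```python
import math


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# Hypothetical query-log counts.
POPULARITY = {"vizio tv": 900, "visio tv": 40, "video tv": 5, "cover": 500, "cove": 30}


def features(q, c):
    return [
        -float(edit_distance(q, c)),         # closeness to the typed query
        math.log(1 + POPULARITY.get(c, 0)),  # candidate popularity
        1.0 if q == c else 0.0,              # "leave the query unchanged" indicator
    ]


def score(w, q, c):
    return sum(wi * fi for wi, fi in zip(w, features(q, c)))


def rerank(w, q, candidates):
    return max(candidates, key=lambda c: score(w, q, c))


def train(data, epochs=10):
    """data: list of (query, candidates, gold). Perceptron-style updates."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for q, cands, gold in data:
            pred = rerank(w, q, cands)
            if pred != gold:
                # Move weights toward the gold features, away from the wrong prediction.
                fg, fp = features(q, gold), features(q, pred)
                w = [wi + g - p for wi, g, p in zip(w, fg, fp)]
    return w


DATA = [
    ("visio tv", ["visio tv", "vizio tv", "video tv"], "vizio tv"),
    ("cover", ["cover", "cove"], "cover"),
]
weights = train(DATA)
```

Signals such as conversion or revenue can be added as further entries in `features` without having to express them as probabilities.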
Modeling
Discriminative Models (cont.)
Discriminative models such as SVMs are typically used to rerank the
spelling candidates.
Deep neural networks have shown recent successes.
Modeling
Systems for Spelling Correction
Modeling
Candidate generation for Spelling Correction
Given a word, find all neighboring words within edit distance k.
Given a word, find potential close matches using a hashing trick.
Generate candidates by using heuristic rules for common errors.
N-gram based techniques.
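One possible reading of the hashing approach is a deletion-neighborhood index: every vocabulary term is stored under each string obtained by deleting up to k characters, and a query word is looked up under its own deletions. This is an assumption about what the slides mean, and the function names and vocabulary are hypothetical. The lookup returns a superset of the true neighbors (a transposition also matches, for instance), so candidates would still be checked with a real edit distance afterwards.

```python
from collections import defaultdict
from itertools import combinations


def deletes(word, k):
    """All strings obtained by deleting up to k characters from word."""
    out = {word}
    for n in range(1, min(k, len(word)) + 1):
        for idx in combinations(range(len(word)), n):
            drop = set(idx)
            out.add("".join(ch for i, ch in enumerate(word) if i not in drop))
    return out


def build_index(vocabulary, k=1):
    """Map each deletion variant to the vocabulary terms that produce it."""
    index = defaultdict(set)
    for term in vocabulary:
        for key in deletes(term, k):
            index[key].add(term)
    return index


def candidates(word, index, k=1):
    """Vocabulary terms sharing at least one deletion variant with word."""
    found = set()
    for key in deletes(word, k):
        found |= index.get(key, set())
    return found


index = build_index(["cover", "cove", "vizio", "x345677"])
```

Here `candidates("covr", index)` returns both "cover" and "cove"; ranking them is left to the scoring models.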
Modeling
Candidate generation scaling up
Distributed implementation.
Hashing tricks.
Modeling
Spell correction for E-commerce
UI for the spell correction.
Input data: Whether to include item titles or not?
Impact of autocorrection on conversion.
References
[BB01] Michele Banko and Eric Brill, Scaling to very very large corpora for
natural language disambiguation, Proceedings of the 39th Annual
Meeting on Association for Computational Linguistics, Association for
Computational Linguistics, 2001, pp. 26–33.
[CB04] Silviu Cucerzan and Eric Brill, Spelling correction as an iterative
process that exploits the collective knowledge of web users, EMNLP,
vol. 4, 2004, pp. 293–300.
[DH11] Huizhong Duan and Bo-June Paul Hsu, Online spelling correction for
query completion, Proceedings of the 20th international conference on
World wide web, ACM, 2011, pp. 117–126.
[JM] Daniel Jurafsky and James H Martin, Speech and language processing:
An introduction to natural language processing, computational
linguistics, and speech recognition.
[KCG90] Mark D Kernighan, Kenneth W Church, and William A Gale, A
spelling correction program based on a noisy channel model,
Proceedings of the 13th conference on Computational
linguistics-Volume 2, Association for Computational Linguistics, 1990,
pp. 205–210.
[Kuk92] Karen Kukich, Techniques for automatically correcting words in text,
ACM Computing Surveys (CSUR) 24 (1992), no. 4, 377–439.
[LDZ12] Yanen Li, Huizhong Duan, and ChengXiang Zhai, A generalized
hidden Markov model with discriminative training for query spelling
correction, Proceedings of the 35th international ACM SIGIR
conference on Research and development in information retrieval,
ACM, 2012, pp. 611–620.
[Pet80] James L Peterson, Computer programs for detecting and correcting
spelling errors, Communications of the ACM 23 (1980), no. 12,
676–687.

Contenu connexe

Tendances

Training Week: Create a Knowledge Graph: A Simple ML Approach
Training Week: Create a Knowledge Graph: A Simple ML Approach Training Week: Create a Knowledge Graph: A Simple ML Approach
Training Week: Create a Knowledge Graph: A Simple ML Approach
Neo4j
 

Tendances (20)

Deep learning for real life applications
Deep learning for real life applicationsDeep learning for real life applications
Deep learning for real life applications
 
Causality without headaches
Causality without headachesCausality without headaches
Causality without headaches
 
natural language processing help at myassignmenthelp.net
natural language processing  help at myassignmenthelp.netnatural language processing  help at myassignmenthelp.net
natural language processing help at myassignmenthelp.net
 
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
Rishabh Mehrotra - Recommendations in a Marketplace: Personalizing Explainabl...
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Natural language processing
Natural language processing Natural language processing
Natural language processing
 
Practical sentiment analysis
Practical sentiment analysisPractical sentiment analysis
Practical sentiment analysis
 
Engagement, Metrics & Personalisation at Scale
Engagement, Metrics &  Personalisation at ScaleEngagement, Metrics &  Personalisation at Scale
Engagement, Metrics & Personalisation at Scale
 
presentation.pdf
presentation.pdfpresentation.pdf
presentation.pdf
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
 
Beyond Churn Prediction : An Introduction to uplift modeling
Beyond Churn Prediction : An Introduction to uplift modelingBeyond Churn Prediction : An Introduction to uplift modeling
Beyond Churn Prediction : An Introduction to uplift modeling
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
NAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITIONNAMED ENTITY RECOGNITION
NAMED ENTITY RECOGNITION
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
 
Foundation Models in Recommender Systems
Foundation Models in Recommender SystemsFoundation Models in Recommender Systems
Foundation Models in Recommender Systems
 
Artificial Intelligence: Natural Language Processing
Artificial Intelligence: Natural Language ProcessingArtificial Intelligence: Natural Language Processing
Artificial Intelligence: Natural Language Processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Training Week: Create a Knowledge Graph: A Simple ML Approach
Training Week: Create a Knowledge Graph: A Simple ML Approach Training Week: Create a Knowledge Graph: A Simple ML Approach
Training Week: Create a Knowledge Graph: A Simple ML Approach
 
Terminology Management
Terminology ManagementTerminology Management
Terminology Management
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniques
 

En vedette

Lewisham chswg terms of reference (1)
Lewisham chswg terms of reference (1)Lewisham chswg terms of reference (1)
Lewisham chswg terms of reference (1)
Felicia Samuel
 
$$$$Rafael e luiz$$$$
$$$$Rafael e luiz$$$$$$$$Rafael e luiz$$$$
$$$$Rafael e luiz$$$$
rafaella1997
 
Innovative Strategies
Innovative StrategiesInnovative Strategies
Innovative Strategies
rohtashmal
 
AncientEgyptPsychiatry
AncientEgyptPsychiatryAncientEgyptPsychiatry
AncientEgyptPsychiatry
Sandra Knecht
 
Marriott International Capstone Research Paper
Marriott International Capstone Research PaperMarriott International Capstone Research Paper
Marriott International Capstone Research Paper
Natalia Poplawska
 
Thèse "Comment une marque peut intégrer une dimension émotionnelle grâce à la...
Thèse "Comment une marque peut intégrer une dimension émotionnelle grâce à la...Thèse "Comment une marque peut intégrer une dimension émotionnelle grâce à la...
Thèse "Comment une marque peut intégrer une dimension émotionnelle grâce à la...
Laurence Thébault
 

En vedette (17)

Lewisham chswg terms of reference (1)
Lewisham chswg terms of reference (1)Lewisham chswg terms of reference (1)
Lewisham chswg terms of reference (1)
 
$$$$Rafael e luiz$$$$
$$$$Rafael e luiz$$$$$$$$Rafael e luiz$$$$
$$$$Rafael e luiz$$$$
 
Art fx programme_20h_blender
Art fx programme_20h_blenderArt fx programme_20h_blender
Art fx programme_20h_blender
 
Innovative Strategies
Innovative StrategiesInnovative Strategies
Innovative Strategies
 
Sergio Baonza Presentacion.
Sergio Baonza Presentacion.Sergio Baonza Presentacion.
Sergio Baonza Presentacion.
 
From Billions to Trillions - A report on Uganda's SDGs strategy
From Billions to Trillions - A report on Uganda's SDGs strategyFrom Billions to Trillions - A report on Uganda's SDGs strategy
From Billions to Trillions - A report on Uganda's SDGs strategy
 
AncientEgyptPsychiatry
AncientEgyptPsychiatryAncientEgyptPsychiatry
AncientEgyptPsychiatry
 
Presentation restaurant de la fin du monde
Presentation restaurant de la fin du mondePresentation restaurant de la fin du monde
Presentation restaurant de la fin du monde
 
Marketing : 25 utilisations de la réalité virtuelle par les marques !
Marketing : 25 utilisations de la réalité virtuelle par les marques !Marketing : 25 utilisations de la réalité virtuelle par les marques !
Marketing : 25 utilisations de la réalité virtuelle par les marques !
 
Så här hjäper vi ungdomar till sysselsättning!
Så här hjäper vi ungdomar till sysselsättning!Så här hjäper vi ungdomar till sysselsättning!
Så här hjäper vi ungdomar till sysselsättning!
 
SGF Veg Restaurant Presentation
SGF Veg Restaurant PresentationSGF Veg Restaurant Presentation
SGF Veg Restaurant Presentation
 
Underwriting
UnderwritingUnderwriting
Underwriting
 
Dominic Kniveton - Embracing uncertainty
Dominic Kniveton - Embracing uncertaintyDominic Kniveton - Embracing uncertainty
Dominic Kniveton - Embracing uncertainty
 
The Race
The RaceThe Race
The Race
 
Marriott International Capstone Research Paper
Marriott International Capstone Research PaperMarriott International Capstone Research Paper
Marriott International Capstone Research Paper
 
Thèse "Comment une marque peut intégrer une dimension émotionnelle grâce à la...
Thèse "Comment une marque peut intégrer une dimension émotionnelle grâce à la...Thèse "Comment une marque peut intégrer une dimension émotionnelle grâce à la...
Thèse "Comment une marque peut intégrer une dimension émotionnelle grâce à la...
 
MobiliteaTime #10 : Apple Pay & Apple Wallet
MobiliteaTime #10 : Apple Pay & Apple Wallet MobiliteaTime #10 : Apple Pay & Apple Wallet
MobiliteaTime #10 : Apple Pay & Apple Wallet
 

Similaire à Spelling correction systems for e-commerce platforms

taghelper-final.doc
taghelper-final.doctaghelper-final.doc
taghelper-final.doc
butest
 
Iterative usability evaluation of DSLs
Iterative usability evaluation of DSLsIterative usability evaluation of DSLs
Iterative usability evaluation of DSLs
Ankica Barisic
 
PPT SLIDES
PPT SLIDESPPT SLIDES
PPT SLIDES
butest
 
PPT SLIDES
PPT SLIDESPPT SLIDES
PPT SLIDES
butest
 
How can text-mining leverage developments in Deep Learning? Presentation at ...
How can text-mining leverage developments in Deep Learning?  Presentation at ...How can text-mining leverage developments in Deep Learning?  Presentation at ...
How can text-mining leverage developments in Deep Learning? Presentation at ...
jcscholtes
 
Text, Tags and Thumbnails: Latest Trends in Bioscience Literature Search
Text, Tags and Thumbnails:Latest Trends in Bioscience Literature SearchText, Tags and Thumbnails:Latest Trends in Bioscience Literature Search
Text, Tags and Thumbnails: Latest Trends in Bioscience Literature Search
marti_hearst
 
Two Level Disambiguation Model for Query Translation
Two Level Disambiguation Model for Query TranslationTwo Level Disambiguation Model for Query Translation
Two Level Disambiguation Model for Query Translation
IJECEIAES
 

Similaire à Spelling correction systems for e-commerce platforms (20)

2-Chapter Two-N-gram Language Models.ppt
2-Chapter Two-N-gram Language Models.ppt2-Chapter Two-N-gram Language Models.ppt
2-Chapter Two-N-gram Language Models.ppt
 
Maxmizing Profits with the Improvement in Product Composition - ICIEOM - Mer...
Maxmizing Profits with the Improvement in Product Composition  - ICIEOM - Mer...Maxmizing Profits with the Improvement in Product Composition  - ICIEOM - Mer...
Maxmizing Profits with the Improvement in Product Composition - ICIEOM - Mer...
 
taghelper-final.doc
taghelper-final.doctaghelper-final.doc
taghelper-final.doc
 
Iterative usability evaluation of DSLs
Iterative usability evaluation of DSLsIterative usability evaluation of DSLs
Iterative usability evaluation of DSLs
 
Re2018 Semios for Requirements
Re2018 Semios for RequirementsRe2018 Semios for Requirements
Re2018 Semios for Requirements
 
PPT SLIDES
PPT SLIDESPPT SLIDES
PPT SLIDES
 
PPT SLIDES
PPT SLIDESPPT SLIDES
PPT SLIDES
 
Binary search query classifier
Binary search query classifierBinary search query classifier
Binary search query classifier
 
Learning Content and Usage Factors Simultaneously
Learning Content and Usage Factors SimultaneouslyLearning Content and Usage Factors Simultaneously
Learning Content and Usage Factors Simultaneously
 
Question Answering System using machine learning approach
Question Answering System using machine learning approachQuestion Answering System using machine learning approach
Question Answering System using machine learning approach
 
Question Retrieval in Community Question Answering via NON-Negative Matrix Fa...
Question Retrieval in Community Question Answering via NON-Negative Matrix Fa...Question Retrieval in Community Question Answering via NON-Negative Matrix Fa...
Question Retrieval in Community Question Answering via NON-Negative Matrix Fa...
 
Machine translation from English to Hindi
Machine translation from English to HindiMachine translation from English to Hindi
Machine translation from English to Hindi
 
Semantic Relatedness for All (Languages): A Comparative Analysis of Multiling...
Semantic Relatedness for All (Languages): A Comparative Analysis of Multiling...Semantic Relatedness for All (Languages): A Comparative Analysis of Multiling...
Semantic Relatedness for All (Languages): A Comparative Analysis of Multiling...
 
How can text-mining leverage developments in Deep Learning? Presentation at ...
How can text-mining leverage developments in Deep Learning?  Presentation at ...How can text-mining leverage developments in Deep Learning?  Presentation at ...
How can text-mining leverage developments in Deep Learning? Presentation at ...
 
Evaluation of hindi english mt systems, challenges and solutions
Evaluation of hindi english mt systems, challenges and solutionsEvaluation of hindi english mt systems, challenges and solutions
Evaluation of hindi english mt systems, challenges and solutions
 
Question answering
Question answeringQuestion answering
Question answering
 
A Multilingual Spam Reviews Detection Based on Pre-Trained Word Embedding and...
A Multilingual Spam Reviews Detection Based on Pre-Trained Word Embedding and...A Multilingual Spam Reviews Detection Based on Pre-Trained Word Embedding and...
A Multilingual Spam Reviews Detection Based on Pre-Trained Word Embedding and...
 
Supervised Approach to Extract Sentiments from Unstructured Text
Supervised Approach to Extract Sentiments from Unstructured TextSupervised Approach to Extract Sentiments from Unstructured Text
Supervised Approach to Extract Sentiments from Unstructured Text
 
Text, Tags and Thumbnails: Latest Trends in Bioscience Literature Search
Text, Tags and Thumbnails:Latest Trends in Bioscience Literature SearchText, Tags and Thumbnails:Latest Trends in Bioscience Literature Search
Text, Tags and Thumbnails: Latest Trends in Bioscience Literature Search
 
Two Level Disambiguation Model for Query Translation
Two Level Disambiguation Model for Query TranslationTwo Level Disambiguation Model for Query Translation
Two Level Disambiguation Model for Query Translation
 

Plus de Anjan Goswami

Learning to Diversify for E-commerce Search with Multi-Armed Bandit}
Learning to Diversify for E-commerce Search with Multi-Armed Bandit}Learning to Diversify for E-commerce Search with Multi-Armed Bandit}
Learning to Diversify for E-commerce Search with Multi-Armed Bandit}
Anjan Goswami
 
Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation A...
Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation A...Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation A...
Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation A...
Anjan Goswami
 
Assessing product image quality for online shopping
Assessing product image quality for online shoppingAssessing product image quality for online shopping
Assessing product image quality for online shopping
Anjan Goswami
 
Clustering
ClusteringClustering
Clustering
Anjan Goswami
 

Plus de Anjan Goswami (8)

Learning to Diversify for E-commerce Search with Multi-Armed Bandit}
Learning to Diversify for E-commerce Search with Multi-Armed Bandit}Learning to Diversify for E-commerce Search with Multi-Armed Bandit}
Learning to Diversify for E-commerce Search with Multi-Armed Bandit}
 
Discovery In Commerce Search
Discovery In Commerce SearchDiscovery In Commerce Search
Discovery In Commerce Search
 
Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation A...
Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation A...Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation A...
Machine-Learned Ranking Algorithms for E-commerce Search and Recommendation A...
 
Controlled Experiments for Decision-Making in e-Commerce Search
Controlled Experiments for Decision-Making in e-Commerce SearchControlled Experiments for Decision-Making in e-Commerce Search
Controlled Experiments for Decision-Making in e-Commerce Search
 
Reputation systems
Reputation systemsReputation systems
Reputation systems
 
Topic Models Based Understanding of Supply and Demand Side of an eCommerce En...
Topic Models Based Understanding of Supply and Demand Side of an eCommerce En...Topic Models Based Understanding of Supply and Demand Side of an eCommerce En...
Topic Models Based Understanding of Supply and Demand Side of an eCommerce En...
 
Assessing product image quality for online shopping
Assessing product image quality for online shoppingAssessing product image quality for online shopping
Assessing product image quality for online shopping
 
Clustering
ClusteringClustering
Clustering
 

Dernier

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Dernier (20)

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Spelling correction systems for e-commerce platforms

  • 1. Spell Correction Systems for E-commerce engines Anjan Goswami HuiZhong Duan Anjan Goswami, HuiZhong Duan (Search Science, WalmartLabs)Spell correction Systems 1 / 31
  • 2. The Spell correction problem Rich literature [KCG90, Pet80]. Active research area [CB04]. Combination of NLP, Machine Learning [DH11, BB01, LDZ12] and Systems problems [Kuk92]. Anjan Goswami, HuiZhong Duan (Search Science, WalmartLabs)Spell correction Systems 2 / 31
  • 3. Spell correction for e-commerce Critical site feature for e-commerce. Impact of ML based spell correction Adds revenue. Reduces bounce rate. Reduces null Results. Departments such as pharmacy can have huge gain in revenue with Spell Correction. Anjan Goswami, HuiZhong Duan (Search Science, WalmartLabs)Spell correction Systems 3 / 31
  • 4. Spell correction for e-commerce Science part is same as any other large scale spell correction systems. Demand and supply side corpus. Conversion focus. User Interfaces. Anjan Goswami, HuiZhong Duan (Search Science, WalmartLabs)Spell correction Systems 4 / 31
  • 5. Spell correction Evaluation Accuracy for misspelled queries. Accuracy for correctly spelled queries. Business metrics. Coverage. Anjan Goswami, HuiZhong Duan (Search Science, WalmartLabs)Spell correction Systems 5 / 31
  • 6. The problem
  • 7. The problem
  • 8. The problem
  • 9. The problem
  • 10. The problem
  • 11. Error statistics: Approximately 26% of web queries contain spelling errors [JM]. E-commerce query data can be expected to be similar.
  • 12. Error types: Typographic errors: Covr ← Cover. Cognitive errors: Visio Tv ← Vizio Tv. Non-English word errors: X345678 ← X345677. Contextual errors: life of Pie ← Life of Pi.
  • 13. Challenges: General challenges. Large candidate pool: queries. Open dictionary: all terms are feasible. Efficiency: correction happens before the search is executed. User behavior: query formulation differs from typical writing. Devices: different devices may cause different types of typos. Under-correction: even a term in correct form may need correction. Over-correction: a term that does not appear correct could still be a good search term. Languages: different languages pose different challenges.
  • 14. Query spelling challenges: Special challenges (and opportunities) in e-commerce. Optimization target: linguistic correctness or conversion? Unique dictionary: model numbers, etc. High cost of over-correction. Availability of inventory data. Availability of conversion data.
  • 15. General problems: Error modeling. Candidate generation. Ranking and selection of the best candidate.
  • 16. Modeling: A Noisy Channel Framework. Given a user input query q, for every candidate correction c, compute the conditional probability p(c|q): p(c|q) = p(q|c) · p(c) / p(q) ∝ p(q|c) · p(c) (1)
  • 17. Modeling: A Noisy Channel Framework (cont.). Source model p(c). Captures: how likely the user is to pick query c in the first place. Typically: a language model. Rationale: common phrases have high probabilities. Error model p(q|c). Captures: how likely c is to be misspelled as q. Straightforward model: edit distance. Rationale: a misspelled query should not be too different from the intended query.
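
A minimal Python sketch of the noisy channel scoring described on slides 16 and 17 follows. It is an illustration under stated assumptions, not the authors' system: `QUERY_COUNTS` is a made-up query log standing in for the source model, and the error model is an unnormalized exp(-alpha · edit distance), with `alpha` a hypothetical tuning parameter.

```python
import math

# Hypothetical query log counts; the source model p(c) is their relative frequency.
QUERY_COUNTS = {"cover": 50, "cove": 5, "over": 20, "vizio tv": 30}


def edit_distance(a, b):
    """Plain Levenshtein distance (unit-cost insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def log_source(c):
    """log p(c): how likely the user is to issue query c at all."""
    return math.log(QUERY_COUNTS[c] / sum(QUERY_COUNTS.values()))


def log_error(q, c, alpha=2.0):
    """log p(q|c) up to a constant: each edit costs alpha (an assumption)."""
    return -alpha * edit_distance(q, c)


def correct(q):
    """Return the candidate c maximizing log p(q|c) + log p(c)."""
    return max(QUERY_COUNTS, key=lambda c: log_source(c) + log_error(q, c))


best = correct("covr")
```

Scoring in log space turns the product in equation (1) into a sum and avoids underflow. Here "cover" and "cove" are both one edit away from "covr", so the source model breaks the tie in favor of the more popular query.
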
  • 18. Modeling: A Noisy Channel Framework (cont.). More on the source model p(c). Linguistic correctness is important. It should also reflect query popularity. In e-commerce, we also need to consider query conversion and query revenue.
  • 19. Modeling: A Noisy Channel Framework (cont.). Language model. n-gram language model: data sparsity grows as n goes up; backoff to, or interpolation with, lower-order n-grams is necessary; smoothing is important. Good-Turing smoothing: use frequency-1 items to estimate the probability of frequency-0 items. Additive smoothing: add a pseudo-count to terms/phrases. Kneser-Ney smoothing: a smart way to combine backoff and interpolation.
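
Two of the ideas on slide 19, additive smoothing and interpolation with a lower-order model, can be sketched as below. The toy `QUERIES` corpus, the pseudo-count `k`, and the interpolation weight `lam` are assumptions for illustration; Good-Turing and Kneser-Ney are not implemented here.

```python
from collections import Counter

# Hypothetical query corpus.
QUERIES = ["life of pi", "life of pi dvd", "slice of life", "pie pan"]

unigrams, bigrams = Counter(), Counter()
for query in QUERIES:
    tokens = ["<s>"] + query.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

VOCAB_SIZE = len(unigrams) + 1  # +1 reserves probability mass for unseen words


def p_unigram(word, k=1.0):
    """Additive smoothing: add pseudo-count k to every word, seen or not."""
    total = sum(unigrams.values())
    return (unigrams[word] + k) / (total + k * VOCAB_SIZE)


def p_bigram(word, prev, lam=0.7):
    """Interpolate the maximum-likelihood bigram with the smoothed unigram."""
    prev_count = unigrams[prev]
    ml = bigrams[(prev, word)] / prev_count if prev_count else 0.0
    return lam * ml + (1 - lam) * p_unigram(word)


p_pi = p_bigram("pi", "of")    # bigram seen in the corpus
p_pie = p_bigram("pie", "of")  # bigram unseen, word seen
```

The unseen bigram "of pie" still gets nonzero probability through the unigram term, while the observed "of pi" scores much higher, which is the signal a source model needs for a contextual error such as life of Pie ← Life of Pi.
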
  • 20. Modeling: A Noisy Channel Framework (cont.). More on the error model p(q|c). A weighted edit model is better: p(a → e) > p(a → n). Context matters: p(a → e | context = "be..."). Multi-word errors need to be considered: p("gopro" | "go pro") can be modeled by an HMM, a joint sequence model, etc.
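
The weighted edit model on slide 20 can be approximated by a Levenshtein computation whose substitution costs depend on the character pair. In this sketch the `SUB_COST` table and its values are hypothetical (in practice they would be estimated from observed misspellings), costs play the role of negative log probabilities, and context and multi-word errors are not modeled.

```python
# Hypothetical substitution costs: confusable pairs are cheaper than the default 1.0.
SUB_COST = {
    ("a", "e"): 0.4, ("e", "a"): 0.4,
    ("s", "z"): 0.5, ("z", "s"): 0.5,
}


def weighted_edit_distance(source, target, sub_cost=SUB_COST, indel_cost=1.0):
    """Levenshtein distance with per-pair substitution costs."""
    prev = [j * indel_cost for j in range(len(target) + 1)]
    for i, cs in enumerate(source, 1):
        cur = [i * indel_cost]
        for j, ct in enumerate(target, 1):
            sub = 0.0 if cs == ct else sub_cost.get((cs, ct), 1.0)
            cur.append(min(
                prev[j] + indel_cost,     # delete cs
                cur[j - 1] + indel_cost,  # insert ct
                prev[j - 1] + sub,        # substitute cs -> ct (or match)
            ))
        prev = cur
    return prev[-1]


likely = weighted_edit_distance("bat", "bet")    # a -> e, a cheap confusion
unlikely = weighted_edit_distance("bat", "bnt")  # a -> n, default cost
```

With these costs a → e is cheaper than a → n, mirroring p(a → e) > p(a → n) on the slide.
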
  • 21. Modeling: A Noisy Channel Framework (cont.). Hierarchical error models. Character-level error model, e.g. p(a → e | context = "be..."): generalizes well, less accurate. Syllable-level error model. Word-level error model, e.g. p(pi → pie | context = "life of ..."): sparse data, more accurate. Phrase-level error model. ...
  • 22. Modeling: Discriminative Models. Why? The noisy channel model is a generative framework. Multiplication is difficult because the probabilities are estimated in different ways. How to merge signals into one probability estimate is unknown (e.g. linguistic correctness vs. popularity vs. revenue). There are other heuristic and domain-specific features that cannot be subsumed in the noisy channel model.
  • 23. Modeling: Discriminative Models (cont.). How? Learn to score each <q, c> pair so that the best correction has the highest score. Challenges. Obtaining large-scale training data: text parsing, human annotation. Learning methods: classification, learning to rank, structural learning. Efficiency: use the noisy channel model to retrieve a handful of candidates.
  • 24. Modeling: Discriminative Models (cont.). Typically, discriminative models such as SVMs can also be used to rerank the spelling candidates. Recent successes with deep neural networks.
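
The idea of learning to score <q, c> pairs can be illustrated with a small perceptron ranker, swapped in here for the SVM and learning-to-rank methods the slides mention because it fits in a few lines. Everything in the sketch is hypothetical: the three features (negative edit distance, log popularity, conversion rate), their values, and the `TRAIN` examples.

```python
def score(weights, feats):
    """Linear score of one <q, c> pair."""
    return sum(w * f for w, f in zip(weights, feats))


def best_candidate(weights, candidates):
    """Candidate with the highest score; candidates maps c -> feature list."""
    return max(candidates, key=lambda c: score(weights, candidates[c]))


def train_perceptron(examples, epochs=10):
    """On each mistake, move the weights toward the gold candidate's features."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        mistakes = 0
        for gold, candidates in examples:
            guess = best_candidate(weights, candidates)
            if guess != gold:
                mistakes += 1
                weights = [w + g - p for w, g, p in
                           zip(weights, candidates[gold], candidates[guess])]
        if mistakes == 0:
            break
    return weights


# Hypothetical data: (gold, {candidate: [-edit_distance, log_popularity, conversion]}).
TRAIN = [
    ("vizio tv", {"visio tv": [0.0, 1.7, 0.01], "vizio tv": [-1.0, 3.0, 0.08],
                  "video tv": [-2.0, 2.3, 0.02]}),
    ("life of pi", {"life of pie": [0.0, 1.3, 0.00], "life of pi": [-1.0, 2.9, 0.05]}),
    ("cover", {"cover": [0.0, 3.1, 0.06], "clover": [-1.0, 2.5, 0.03]}),
    ("x345677", {"x345677": [0.0, 1.0, 0.09], "x345678": [-1.0, 1.5, 0.02]}),
]

weights = train_perceptron(TRAIN)
```

The last training example is a model number typed correctly whose neighbor is more popular. Fitting it forces a positive weight on staying close to the typed query, so the learned score trades popularity off against the cost of over-correction instead of multiplying separately estimated probabilities.
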
  • 25. Modeling: Systems for Spelling Correction.
  • 26. Modeling: Candidate generation for spelling correction. Given a word, find all neighboring words within edit distance k. Given a word, find potential close matches using the hashing trick. Generate candidates using heuristic rules for common errors. N-gram based techniques.
  • 27. Modeling: Candidate generation, scaling up. Distributed implementation. Hashing tricks.
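
One possible reading of the hashing approach to candidate generation is a deletion index: every dictionary term is stored under each string obtained by deleting one character, and a query is looked up under its own deletions. The slides do not say which hashing scheme is meant, so this is an assumption, and the `DICTIONARY` below is made up.

```python
from collections import defaultdict


def deletes(word):
    """The word itself plus every string formed by deleting one character."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}


def build_index(dictionary):
    """Map each deletion key to the dictionary terms that produce it."""
    index = defaultdict(set)
    for term in dictionary:
        for key in deletes(term):
            index[key].add(term)
    return index


def candidates(query, index):
    """Terms sharing a deletion key with the query."""
    found = set()
    for key in deletes(query):
        found |= index.get(key, set())
    return found


# Hypothetical dictionary of known query terms.
DICTIONARY = ["cover", "cove", "over", "clover", "vizio"]
INDEX = build_index(DICTIONARY)
matches = candidates("covr", INDEX)
```

The lookup costs a handful of hash probes per query instead of a scan over the dictionary. It returns every term within edit distance 1 and some at distance 2 (such as "over" for "covr"), so a real edit-distance check or the ranking model is still needed afterwards. The index is a plain key-value map, which also makes it straightforward to shard by key for a distributed implementation.
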
  • 28. Modeling: Spell correction for e-commerce. UI for spell correction. Input data: whether to include item titles or not? Impact of autocorrection on conversion.
  • 29. References I: [BB01] Michele Banko and Eric Brill, Scaling to very very large corpora for natural language disambiguation, Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2001, pp. 26–33. [CB04] Silviu Cucerzan and Eric Brill, Spelling correction as an iterative process that exploits the collective knowledge of web users, EMNLP, vol. 4, 2004, pp. 293–300. [DH11] Huizhong Duan and Bo-June Paul Hsu, Online spelling correction for query completion, Proceedings of the 20th international conference on World wide web, ACM, 2011, pp. 117–126. [JM] Daniel Jurafsky and James H Martin, Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition.
  • 30. References II: [KCG90] Mark D Kernighan, Kenneth W Church, and William A Gale, A spelling correction program based on a noisy channel model, Proceedings of the 13th conference on Computational linguistics-Volume 2, Association for Computational Linguistics, 1990, pp. 205–210. [Kuk92] Karen Kukich, Techniques for automatically correcting words in text, ACM Computing Surveys (CSUR) 24 (1992), no. 4, 377–439. [LDZ12] Yanen Li, Huizhong Duan, and ChengXiang Zhai, A generalized hidden Markov model with discriminative training for query spelling correction, Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, ACM, 2012, pp. 611–620.
  • 31. References III: [Pet80] James L Peterson, Computer programs for detecting and correcting spelling errors, Communications of the ACM 23 (1980), no. 12, 676–687.