This is a presentation on building a scalable, machine-learned spell correction system for an e-commerce site. Most of the techniques also apply to any large consumer site.
Spelling correction systems for e-commerce platforms
1. Spell Correction Systems for E-commerce engines
Anjan Goswami HuiZhong Duan
Anjan Goswami, HuiZhong Duan (Search Science, WalmartLabs)
2. The Spell correction problem
Rich literature [KCG90, Pet80].
Active research area [CB04].
Combination of NLP, Machine Learning [DH11, BB01, LDZ12] and
Systems problems [Kuk92].
3. Spell correction for e-commerce
Critical site feature for e-commerce.
Impact of ML-based spell correction
Adds revenue.
Reduces bounce rate.
Reduces null results.
Departments such as pharmacy can see large revenue gains from
spell correction.
4. Spell correction for e-commerce
The science is the same as in any other large-scale spell correction system.
Demand and supply side corpus.
Conversion focus.
User Interfaces.
5. Spell correction Evaluation
Accuracy for misspelled queries.
Accuracy for correctly spelled queries.
Business metrics.
Coverage.
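The offline metrics above could be computed roughly as in the following sketch. The record layout (query, gold correction, system output) and the definition of coverage used here (the share of misspelled queries for which the system proposes any change) are assumptions made for illustration; the slides do not define them.

```python
def evaluate(records):
    """Compute simple offline spelling metrics.

    records: list of (query, gold, predicted) triples (hypothetical layout).
    A query counts as correctly spelled when gold == query.
    """
    misspelled = [r for r in records if r[0] != r[1]]
    clean = [r for r in records if r[0] == r[1]]

    def rate(hits, total):
        # Avoid division by zero on empty slices.
        return hits / total if total else 0.0

    return {
        # Misspelled queries that were fixed to the gold correction.
        "misspelled_accuracy": rate(
            sum(p == g for _, g, p in misspelled), len(misspelled)),
        # Correctly spelled queries that were left untouched.
        "clean_accuracy": rate(
            sum(p == q for q, _, p in clean), len(clean)),
        # Assumed definition: misspelled queries where any change was proposed.
        "coverage": rate(
            sum(p != q for q, _, p in misspelled), len(misspelled)),
    }
```

Reporting the two accuracies separately keeps over-correction of good queries visible instead of hiding it inside one aggregate number.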
11. Error statistics
Approximately 26% of web queries contain spelling errors [JM].
E-commerce data can be expected to be similar.
12. Error Types
Typographic errors: Covr ← Cover
Cognitive errors: Visio Tv ← Vizio Tv
Non-English word errors: X345678 ← X345677
Contextual errors: life of Pie ← Life of Pi
13. Challenges
General Challenges
Large candidate pool: queries
Open dictionary: all terms are feasible
Efficiency: happens before search is executed
User behavior: query formulation is different from typical writing
Devices: different devices may cause different types of typos
Under-correction: even if a term is correctly spelled, it may still need
correction
Over-correction: a term that doesn't appear correct could still be
a good search term
Languages: Different languages have different challenges.
14. Query Spelling Challenges
Special Challenges (and Opportunities) in e-Commerce
optimization target: linguistic correctness or conversion?
unique dictionary: model numbers, etc.
high cost for over-correction
availability of inventory data
availability of conversion data
15. General problems
Error modeling
Candidate generation
Ranking and selection of the best candidate.
16. Modeling
A Noisy Channel Framework
Given user input query q, for every candidate correction c, compute the
conditional probability p(c|q)
$$p(c \mid q) = \frac{p(q \mid c)\, p(c)}{p(q)} \propto p(q \mid c)\, p(c) \qquad (1)$$
17. Modeling
A Noisy Channel Framework (cont.)
Source model p(c)
Captures: how likely the user is to pick query c in the first place
Typically: language model
Rationale: common phrases have high probabilities
Error model p(q|c)
Captures: how likely c is misspelled as q
Straightforward model: edit distance
Rationale: a misspelled query should not be too different from the
intended query
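As a minimal sketch of how the source model and error model combine, the snippet below ranks candidates by log p(q|c) + log p(c). It assumes a whole-query frequency table with add-one smoothing as the source model and a constant per-edit probability as the error model; the names `log_source`, `log_error`, `best_correction` and the value 0.01 are illustrative choices, not taken from the slides.

```python
import math


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def log_source(c, query_counts, total):
    """log p(c): add-one smoothed query frequency (assumed source model)."""
    return math.log((query_counts.get(c, 0) + 1) / (total + len(query_counts) + 1))


def log_error(q, c, per_edit_log_prob=math.log(0.01)):
    """log p(q|c): each edit is assumed to cost the same fixed log-probability."""
    return edit_distance(q, c) * per_edit_log_prob


def best_correction(q, candidates, query_counts):
    """Return the candidate c maximizing log p(q|c) + log p(c)."""
    total = sum(query_counts.values())
    return max(candidates,
               key=lambda c: log_error(q, c) + log_source(c, query_counts, total))
```

With a query log in which "vizio tv" is far more frequent than "visio tv", the popular query can win even though it is one edit away from the input, which is the behavior the framework is meant to produce.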
18. Modeling
A Noisy Channel Framework (cont.)
More on Source model p(c)
Linguistic correctness is important
Should also reflect query popularity
In e-commerce, we also need to consider query conversion and query
revenue
19. Modeling
A Noisy Channel Framework (cont.)
Language Model
n-gram language model: data sparsity increases as n goes up
backoff to, or interpolation with, lower-order n-grams is necessary
smoothing is important
Good-Turing smoothing: use 1-frequency items to estimate 0-frequency
probabilities
Additive smoothing: add a pseudo-count to terms/phrases
Kneser-Ney smoothing: a smart way of combining backoff and interpolation
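To make the interplay of interpolation and additive smoothing concrete, here is a small sketch of a bigram language model over queries. It interpolates the maximum-likelihood bigram estimate with an additively smoothed unigram estimate; the class name, the interpolation weight `lam`, and the pseudo-count `alpha` are assumptions for illustration. It is not Kneser-Ney or Good-Turing smoothing.

```python
import math
from collections import Counter


class InterpolatedBigramLM:
    """Bigram model interpolated with an additively smoothed unigram model."""

    def __init__(self, queries, lam=0.7, alpha=1.0):
        self.lam = lam        # weight on the bigram estimate (assumed value)
        self.alpha = alpha    # additive pseudo-count for unigrams
        self.unigrams = Counter()
        self.contexts = Counter()   # how often each token precedes another
        self.bigrams = Counter()
        for query in queries:
            tokens = ["<s>"] + query.lower().split()
            self.unigrams.update(tokens[1:])
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.total = sum(self.unigrams.values())
        self.vocab_size = len(self.unigrams) + 1   # +1 slot for unseen words

    def p_unigram(self, word):
        return (self.unigrams[word] + self.alpha) / (
            self.total + self.alpha * self.vocab_size)

    def p_bigram(self, prev, word):
        seen = self.contexts[prev]
        ml = self.bigrams[(prev, word)] / seen if seen else 0.0
        # Interpolation keeps the probability non-zero for unseen bigrams.
        return self.lam * ml + (1 - self.lam) * self.p_unigram(word)

    def log_prob(self, query):
        tokens = ["<s>"] + query.lower().split()
        return sum(math.log(self.p_bigram(a, b))
                   for a, b in zip(tokens, tokens[1:]))
```

Because the unigram term is always positive, an unseen query still gets a finite log-probability, so it can be compared against seen candidates.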
20. Modeling
A Noisy Channel Framework (cont.)
More on Error model p(q|c)
A weighted edit model is better: p(a → e) > p(a → n)
Context matters: p(a → e | context = "be...")
Multi-word errors need to be considered: p("gopro" | "go pro"); these can
be modeled by an HMM, a joint sequence model, etc.
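A weighted edit model can be sketched as a Levenshtein dynamic program whose substitution cost depends on the character pair. In the sketch below, costs play the role of negative log-probabilities, so a likelier confusion such as a → e gets a lower cost than a → n. The `sub_cost` table and its values are hypothetical; in practice they would be estimated from observed query corrections. Context-dependent and multi-word errors are not modeled here.

```python
def weighted_edit_distance(source, target, sub_cost=None, indel_cost=1.0):
    """Edit distance with per-character-pair substitution costs.

    sub_cost maps (intended_char, typed_char) to a cost; pairs that are
    missing from the table fall back to a cost of 1.0.
    """
    sub_cost = sub_cost or {}
    prev = [j * indel_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        curr = [i * indel_cost]
        for j, t in enumerate(target, start=1):
            if s == t:
                replace = prev[j - 1]                       # exact match is free
            else:
                replace = prev[j - 1] + sub_cost.get((s, t), 1.0)
            curr.append(min(prev[j] + indel_cost,           # delete s
                            curr[j - 1] + indel_cost,       # insert t
                            replace))
        prev = curr
    return prev[-1]
```

For example, with `sub_cost = {("a", "e"): 0.3}`, turning "bat" into "bet" costs less than turning "bat" into "bnt", which mirrors p(a → e) > p(a → n).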
21. Modeling
A Noisy Channel Framework (cont.)
Hierarchical Error models
Character level error model
p(a → e | context = "be...")
generalizes well
less accurate
Syllable level error model
Word level error model
p(pi → pie | context = "life of ...")
sparse data
more accurate
Phrase level error model
...
22. Modeling
Discriminative Models
Why?
The noisy channel model is a generative framework
Multiplying the component probabilities is problematic because they are
estimated in different ways
It is unclear how to merge several signals into one probability estimate
(e.g. linguistic correctness vs. popularity vs. revenue)
There are other heuristic features and domain-specific features that
cannot be subsumed into the noisy channel model
23. Modeling
Discriminative Models (cont.)
How?
Learn to score each (q, c) pair so that the best correction has the highest score
Challenges
Obtaining large scale training data: text parsing, human annotation
Learning methods
Classification
Learning to Rank
Structural learning
Efficiency: use the noisy channel model to retrieve a handful of candidates
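One simple instance of the learning-to-rank option is a pairwise perceptron over hand-built features of each (q, c) pair. The sketch below assumes each candidate has already been turned into a numeric feature vector (for instance negative edit distance and log popularity); the feature choice, function names, and hyperparameters are illustrative, and the slides do not say which learner was actually used.

```python
def score(weights, features):
    """Linear score of one (q, c) feature vector."""
    return sum(w * f for w, f in zip(weights, features))


def train_pairwise_perceptron(examples, n_features, epochs=10, lr=0.1):
    """Learn weights so the good candidate outscores the bad one.

    examples: list of (good_features, bad_features) pairs for the same query.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for good, bad in examples:
            # Update only when the pair is ranked incorrectly (or tied).
            if score(weights, good) <= score(weights, bad):
                weights = [w + lr * (g - b)
                           for w, g, b in zip(weights, good, bad)]
    return weights


def rerank(weights, candidates):
    """candidates: dict mapping candidate string -> feature vector."""
    return sorted(candidates,
                  key=lambda c: score(weights, candidates[c]),
                  reverse=True)
```

In a pipeline of this shape, the candidate dictionary passed to `rerank` would hold only the handful of candidates retrieved by the noisy channel model, which keeps the discriminative step cheap.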
24. Modeling
Discriminative Models (cont.)
Discriminative models such as SVMs can also be used to
rerank the spelling candidates.
There have been recent successes with deep neural networks.
25. Modeling
Systems for Spelling Correction
26. Modeling
Candidate generation for Spelling Correction
Given a word, find all neighboring words within edit distance k.
Given a word, find potential close matches using a hashing trick.
Generate candidates using heuristic rules for common errors.
N-gram-based techniques.
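One possible reading of the hashing approach is a deletion index: every dictionary term is stored under all strings obtained by deleting up to k of its characters, and a query word is looked up under its own deletion variants. The slides do not specify which hashing scheme is meant, so this is an assumed technique, and all function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def deletion_variants(word, k):
    """All strings obtained by deleting up to k characters from word."""
    variants = {word}
    for n in range(1, min(k, len(word)) + 1):
        for positions in combinations(range(len(word)), n):
            drop = set(positions)
            variants.add("".join(ch for i, ch in enumerate(word)
                                 if i not in drop))
    return variants


def build_index(vocabulary, k=1):
    """Map each deletion variant to the dictionary terms that produce it."""
    index = defaultdict(set)
    for term in vocabulary:
        for variant in deletion_variants(term, k):
            index[variant].add(term)
    return index


def candidates(word, index, k=1):
    """Terms sharing at least one deletion variant with word."""
    found = set()
    for variant in deletion_variants(word, k):
        found |= index.get(variant, set())
    return found
```

Terms that share a deletion variant can be up to 2k edits apart, so the result is a superset of the true distance-k neighbors and would normally be filtered with an exact edit distance check before ranking.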
28. Modeling
Spell correction for E-commerce
UI for spell correction.
Input data: whether or not to include item titles.
Impact of autocorrection on conversion.
29. Modeling
References I
[BB01] Michele Banko and Eric Brill, Scaling to very very large corpora for
natural language disambiguation, Proceedings of the 39th Annual
Meeting on Association for Computational Linguistics, Association for
Computational Linguistics, 2001, pp. 26–33.
[CB04] Silviu Cucerzan and Eric Brill, Spelling correction as an iterative
process that exploits the collective knowledge of web users, EMNLP,
vol. 4, 2004, pp. 293–300.
[DH11] Huizhong Duan and Bo-June Paul Hsu, Online spelling correction for
query completion, Proceedings of the 20th international conference on
World wide web, ACM, 2011, pp. 117–126.
[JM] Daniel Jurafsky and James H Martin, Speech and language processing:
An introduction to natural language processing, computational
linguistics, and speech recognition.
30. Modeling
References II
[KCG90] Mark D Kernighan, Kenneth W Church, and William A Gale, A
spelling correction program based on a noisy channel model,
Proceedings of the 13th conference on Computational
linguistics-Volume 2, Association for Computational Linguistics, 1990,
pp. 205–210.
[Kuk92] Karen Kukich, Techniques for automatically correcting words in text,
ACM Computing Surveys (CSUR) 24 (1992), no. 4, 377–439.
[LDZ12] Yanen Li, Huizhong Duan, and ChengXiang Zhai, A generalized
hidden Markov model with discriminative training for query spelling
correction, Proceedings of the 35th international ACM SIGIR
conference on Research and development in information retrieval,
ACM, 2012, pp. 611–620.
31. Modeling
References III
[Pet80] James L Peterson, Computer programs for detecting and correcting
spelling errors, Communications of the ACM 23 (1980), no. 12,
676–687.