Grokking TechTalk #35: Efficient spellchecking

09/2019
1
CPU- and memory-efficient spellchecker implementation at Tiki

3
Results for “ipohne” without spellchecker

4
Results for “ipohne” with spellchecker

5
General approach
words, result = tokenize(query), []
for w in words:
    candidates = generate_candidates(w)
    # Fall back to the original word if no candidate scores above zero.
    best_c, best_score = w, 0.0
    for c in candidates:
        score = spellchecker_score(w, c)
        if score > best_score:
            best_c, best_score = c, score
    result.append(best_c)

6
Generate candidates
Generate all possible similar words:
- Need to define a measure of similarity - we use Damerau-Levenshtein distance
- It allows insertions, deletions, substitutions and transpositions of symbols
- We limit maximum allowed distance depending on the length of the word
- Then simply generate every edit of the 4 possible types (CPU-intensive)
- We will optimize this approach later
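The brute-force generation described above can be sketched as follows. This is only an illustration: the function name `edits1` and the small `ALPHABET` are assumptions, and a real Vietnamese alphabet with tone marks is considerably larger.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: illustrative alphabet only

def edits1(word, alphabet=ALPHABET):
    """Return every string one Damerau-Levenshtein edit away from word.

    The result may contain the word itself (a letter substituted by itself).
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutions = [l + ch + r[1:] for l, r in splits if r for ch in alphabet]
    inserts = [l + ch + r for l, r in splits for ch in alphabet]
    return set(deletes + transposes + substitutions + inserts)

print("iphone" in edits1("ipohne"))  # True (one transposition)
```

Applying the function again to every result yields the distance-2 candidates, which is where the CPU cost comes from.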
Examples of Damerau-Levenshtein distance:
- distance(nguyễn, nguyên) = 1 (one substitution)
- distance(nguyễn, nguyeenx) = 3 (one substitution, two insertions)
- distance(behaivour, behaviour) = 1 (one transposition)
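The examples above can be checked with a short distance function. The sketch below implements the restricted variant of Damerau-Levenshtein (optimal string alignment), which is an assumption on our side; the slides do not say which variant is used. The name `dl_distance` is hypothetical.

```python
import unicodedata

def dl_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    # Normalize so that an accented letter such as "ễ" counts as one symbol.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    # d[i][j] = distance between the first i symbols of a and the first j of b.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(dl_distance("nguyễn", "nguyên"))        # 1
print(dl_distance("nguyễn", "nguyeenx"))      # 3
print(dl_distance("behaivour", "behaviour"))  # 1
```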

7
Spellchecker score
“Noisy channel” model:
- Bayesian formula: P(c|w) = P(w|c) * P(c) / P(w)
- We need to find the candidate c that maximizes P(c|w)
- This simplifies to maximizing P(w|c) * P(c), because P(w) is the same for all candidates
Used probabilities:
- P(c|w) - probability that c was intended when w was observed
- P(w|c) - probability that w is observed as a misspelling of c - error model
- P(c) - probability of observing c - language model
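As a minimal sketch of the selection step, assuming both models are available as plain dictionaries. The function name `best_candidate` and all probability values below are hypothetical and chosen only for illustration.

```python
def best_candidate(w, candidates, error_model, language_model):
    """Return the candidate c that maximizes P(w|c) * P(c)."""
    def score(c):
        # P(w) is dropped: it is identical for every candidate.
        return error_model.get((w, c), 0.0) * language_model.get(c, 0.0)
    return max(candidates, key=score)

# Hypothetical numbers, for illustration only.
error_model = {("ipohne", "iphone"): 0.01, ("ipohne", "phone"): 0.002}
language_model = {"iphone": 0.003, "phone": 0.004}
print(best_candidate("ipohne", ["iphone", "phone"], error_model, language_model))  # iphone
```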

8
Building the language model
N-gram model:
- Building a 2-gram dictionary
- Remove 2-grams below a certain threshold
Used data:
- All product contents on Tiki
- All Tiki search queries for a year
- Some randomly crawled texts from the Vietnamese Web
- Total: 5.5 GB gzipped

9
Building the language model (example)
Data (queries on Tiki):
máy rửa mặt
máy rửa mắt
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy xay sinh tố
máy sấy tóc
...
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy rửa mắt
máy xay sinh tố
máy sấy tóc
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố

10
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
Language model:
410 <
410 >
410 máy
410 < máy
205 máy rửa
100 máy sấy
105 máy xay
105 tóc >
100 sấy tóc
5 xay tóc
105 tóc
...
We simply count every single word and every pair of adjacent words in the counted queries and write the counts into the language model.
This lets us calculate the probability of observing a word either without context or with one word of context before or after it.
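The counting step can be sketched as below. The function name `build_language_model` is hypothetical, and the threshold-based pruning of rare 2-grams mentioned earlier is left out for brevity.

```python
from collections import Counter

def build_language_model(counted_queries):
    """Count single words and adjacent word pairs, weighted by query count."""
    model = Counter()
    for query, count in counted_queries.items():
        tokens = ["<"] + query.split() + [">"]  # "<" and ">" mark query boundaries
        for token in tokens:
            model[token] += count
        for pair in zip(tokens, tokens[1:]):
            model[" ".join(pair)] += count
    return model

counted_queries = {
    "máy rửa mặt": 200,
    "máy rửa mắt": 5,
    "máy sấy tóc": 100,
    "máy xay tóc": 5,
    "máy xay sinh tố": 100,
}
model = build_language_model(counted_queries)
print(model["máy"], model["máy rửa"], model["xay tóc"])  # 410 205 5
```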

11
Language model:
410 <
410 >
410 máy
410 < máy
205 máy rửa
100 máy sấy
105 máy xay
105 tóc >
100 sấy tóc
5 xay tóc
105 tóc
...
Query: máy => “< máy >”
P(máy) = 0.5 * (P(< máy) + P(máy >))
       = 0.5 * (410/410 + 0/410) = 0.5
Query: máy xay tóc
P(xay) = 0.5 * (P(máy xay) + P(xay tóc))
       = 0.5 * (105/410 + 5/105) ~ 0.15
P(sấy) = 0.5 * (P(máy sấy) + P(sấy tóc))
       = 0.5 * (100/410 + 100/105) ~ 0.60
Here the language model suggests that “sấy” is much more likely than “xay” in this context.
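The arithmetic on this slide can be reproduced with a few lines of Python. This is a sketch under the assumption that each pair count is divided by the count of the neighbouring context word, as in the numbers shown; the function name `context_probability` is hypothetical.

```python
def context_probability(model, prev_word, word, next_word):
    """Average of P(word | previous word) and P(word | next word)."""
    def ratio(pair_count, context_count):
        return pair_count / context_count if context_count else 0.0
    left = ratio(model.get(f"{prev_word} {word}", 0), model.get(prev_word, 0))
    right = ratio(model.get(f"{word} {next_word}", 0), model.get(next_word, 0))
    return 0.5 * (left + right)

# Counts copied from the language model table on this slide.
model = {"<": 410, ">": 410, "máy": 410, "tóc": 105, "< máy": 410,
         "máy sấy": 100, "máy xay": 105, "sấy tóc": 100, "xay tóc": 5}
print(round(context_probability(model, "<", "máy", ">"), 2))      # 0.5
print(round(context_probability(model, "máy", "xay", "tóc"), 2))  # 0.15
print(round(context_probability(model, "máy", "sấy", "tóc"), 2))  # 0.6
```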

12
Building the error model
Automatic extraction of P(w|c):
- Extract triplets (w1, w2, w3) from our text set
- Group triplets by (w1, *, w3) and sort them by descending popularity
- Remove groups below a certain popularity threshold
- Remove samples whose w2 words are too far from each other (using Damerau-Levenshtein distance)
- Remove samples whose popularity is comparable to the most popular sample in the group (these are likely valid words rather than misspellings)
- Write the w2 words from all remaining samples into the error model mapping as triplets of (observed word, intended word, count)
Used data:
- Same as for the language model

13
Building the error model (example)
Data (queries on Tiki):
máy rửa mặt
máy rửa mắt
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy xay sinh tố
máy sấy tóc
...
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy rửa mắt
máy xay sinh tố
máy sấy tóc
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố

14
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
Triplets:
205 < máy rửa
200 rửa mặt >
5 rửa mắt >
100 máy sấy tóc
5 máy xay tóc
5 máy rửa mắt
105 < máy xay
100 sinh tố >
...
We count all possible triplets from our counted
queries data.

15
Triplets (grouped):
rửa * >
    200 rửa mặt >
    5 rửa mắt >
máy * tóc
    100 máy sấy tóc
    5 máy xay tóc
máy * sinh
    100 máy xay sinh
sinh * >
    100 sinh tố >
...
Error model:
200 mặt mặt
5 mắt mặt
100 sấy sấy
5 xay sấy
100 xay xay
100 tố tố
...
Format: count observed_word intended_word
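The extraction steps can be sketched end to end as below. All names and all three thresholds (`min_group`, `max_dist`, `max_ratio`) are assumptions for illustration; the slides do not give the real values. The distance function is the restricted (optimal string alignment) variant of Damerau-Levenshtein.

```python
import unicodedata
from collections import Counter, defaultdict

def osa_distance(a, b):
    """Restricted Damerau-Levenshtein distance between two strings."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def build_error_model(counted_queries, min_group=10, max_dist=2, max_ratio=0.2):
    """Return a Counter mapping (observed_word, intended_word) to a count."""
    # Group the middle word of every triplet by its (left, right) context.
    groups = defaultdict(Counter)
    for query, count in counted_queries.items():
        t = ["<"] + unicodedata.normalize("NFC", query).split() + [">"]
        for w1, w2, w3 in zip(t, t[1:], t[2:]):
            groups[(w1, w3)][w2] += count
    model = Counter()
    for middles in groups.values():
        if sum(middles.values()) < min_group:
            continue  # context too rare to trust
        (intended, top_count), *rest = middles.most_common()
        model[(intended, intended)] += top_count
        for observed, count in rest:
            close = osa_distance(observed, intended) <= max_dist
            rare = count <= max_ratio * top_count
            if close and rare:  # similar spelling, far less popular
                model[(observed, intended)] += count
    return model

counted_queries = {"máy rửa mặt": 200, "máy rửa mắt": 5, "máy sấy tóc": 100,
                   "máy xay tóc": 5, "máy xay sinh tố": 100}
error_model = build_error_model(counted_queries)
```

On the example data this yields, among others, the six entries shown in the error model above.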

16
Query: kem rửa mắt
P(mắt|mắt) = 0/5 = 0.0 - the number of times “mắt” was intended when “mắt” was observed in the error model, divided by the total number of times “mắt” was observed in the error model.
P(mắt|mặt) = 5/5 = 1.0 - likewise, the number of times “mặt” was intended when “mắt” was observed in the error model, divided by the total number of times “mắt” was observed in the error model.
This means that, according to the error model built on our data, “mắt” is extremely likely to be a misspelling of “mặt”.
Error model:
200 mặt mặt
5 mắt mặt
100 sấy sấy
5 xay sấy
100 xay xay
100 tố tố
...
Format: count observed_word intended_word
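The two divisions on this slide can be written as a small helper. This is a sketch that follows the normalization used in the example (dividing by the total count of the observed word); the name `error_probability` is hypothetical.

```python
def error_probability(error_model, observed, intended):
    """Share of the sightings of `observed` that were attributed to `intended`."""
    total = sum(n for (obs, _), n in error_model.items() if obs == observed)
    return error_model.get((observed, intended), 0) / total if total else 0.0

# (observed_word, intended_word) -> count, copied from the error model above.
error_model = {("mặt", "mặt"): 200, ("mắt", "mặt"): 5, ("sấy", "sấy"): 100,
               ("xay", "sấy"): 5, ("xay", "xay"): 100, ("tố", "tố"): 100}
print(error_probability(error_model, "mắt", "mắt"))            # 0.0
print(error_probability(error_model, "mắt", "mặt"))            # 1.0
print(round(error_probability(error_model, "xay", "sấy"), 3))  # 0.048
```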

17
Quality optimizations
Idea:
- The language model matters more when more context is available
- Instead of P(w|c)*P(c) use P(w|c)*pow(P(c),lambda)
- Lambda depends on the length of the available context
Results:
- Using a bigger lambda for a longer context => better test results (the idea works!)
- For bigger N-grams, machine learning is needed to optimize the lambdas
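A minimal sketch of the weighted score follows. The lambda values, the probabilities and the name `weighted_score` are all hypothetical; they only illustrate how a bigger lambda lets the language model override the error model when more context is available.

```python
def weighted_score(p_error, p_language, context_length, lambdas=(0.5, 1.0, 1.5)):
    """P(w|c) * P(c) ** lambda, with lambda chosen by available context length."""
    lam = lambdas[min(context_length, len(lambdas) - 1)]
    return p_error * p_language ** lam

# Candidate A: favoured by the error model. Candidate B: favoured by the language model.
a = {"p_error": 0.9, "p_language": 0.15}
b = {"p_error": 0.3, "p_language": 0.6}
for context_length in (0, 2):
    score_a = weighted_score(context_length=context_length, **a)
    score_b = weighted_score(context_length=context_length, **b)
    print(context_length, "A" if score_a > score_b else "B")
# Prints "0 A" and then "2 B".
```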

18
Performance optimizations
Important fact:
It is possible to prove that if the Damerau-Levenshtein distance(w, c) = N, then w and c can always be reduced to the same string by deleting no more than N single characters from each of them. Examples below:
distance(iphone, iphobee) = 2 (one insertion, one substitution)
iphone -> iphoe VS iphobee -> iphoee -> iphoe (match!)
distance(iphone, pihoone) = 2 (one transposition, one insertion)
iphone -> ihone VS pihoone -> ihoone -> ihone (match!)
Let’s use it to optimize candidate generation!
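The two examples can be verified by enumerating the delete variants of each word and intersecting the sets. The helper name `deletes` is hypothetical.

```python
from itertools import combinations

def deletes(word, max_deletes):
    """All strings obtainable by deleting at most max_deletes characters."""
    result = set()
    for k in range(min(max_deletes, len(word)) + 1):
        # Choose which character positions to keep, preserving their order.
        for kept in combinations(range(len(word)), len(word) - k):
            result.add("".join(word[i] for i in kept))
    return result

print("iphoe" in deletes("iphone", 2) & deletes("iphobee", 2))  # True
print("ihone" in deletes("iphone", 2) & deletes("pihoone", 2))  # True
```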

19
Performance optimizations
Problem 1 - generating candidates is CPU-intensive:
- Precompute a “deletes” dictionary
- Use only delete operations, on both sides
- Need to verify the real distance afterwards (a delete match can mean a distance of up to 2N, but we need at most N)
- Fast, but requires RAM
Problem 2 - the “deletes” dictionary requires a lot of RAM:
- Use different data compression techniques
- From what we’ve tried, Judy dynamic arrays work best
- We decreased RAM requirements from 10.5 GB to 2.3 GB
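The precomputed “deletes” dictionary and the distance double-check can be sketched as follows. This uses a plain Python dict rather than Judy arrays, so it illustrates only the lookup logic and not the memory savings; all function names and the tiny vocabulary are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def deletes(word, n):
    """All strings obtainable by deleting at most n characters from word."""
    return {"".join(word[i] for i in kept)
            for k in range(min(n, len(word)) + 1)
            for kept in combinations(range(len(word)), len(word) - k)}

def osa_distance(a, b):
    """Restricted Damerau-Levenshtein distance, used to verify candidates."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def build_index(vocabulary, n):
    """Precompute: delete variant -> dictionary words that produce it."""
    index = defaultdict(set)
    for word in vocabulary:
        for key in deletes(word, n):
            index[key].add(word)
    return index

def lookup(word, index, n):
    """Candidates sharing a delete variant, filtered by the real distance."""
    found = {c for key in deletes(word, n) for c in index.get(key, ())}
    scored = [(osa_distance(word, c), c) for c in found]
    return sorted((dist, c) for dist, c in scored if dist <= n)

index = build_index(["iphone", "phone", "ipad"], 2)
print(lookup("ipohne", index, 2))  # [(1, 'iphone'), (2, 'phone')]
```

At query time only the deletes of the input word are generated, so no alphabet-sized loops are needed; the cost is moved into the precomputed index, which is what drives the RAM usage.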

20
Testing results
Testing set:
- 5,000 random queries, 10,000 misspelled queries
- Suggestions collected through Google API and then manually checked
- Only one marker per query
Results:
- Slightly (10-12%) worse than Google, which is acceptable given the RAM requirements
- The A/B test shows a 3-9% increase in purchases

21
Future plans
Implementation:
- Use 3-gram data (still trying to keep it RAM-optimal)
Testing:
- Use multi-marker test set
- Properly handle cases when spellchecker returns multiple variants
