PO Department
PEOPLE OPERATION’S
MONTHLY UPDATE
09/2019
1
CPU and memory efficient
spellchecker implementation in TIKI
2
Results for “iphone”
3
Results for “ipohne” without spellchecker
4
Results for “ipohne” with spellchecker
5
General approach
words, result = tokenize(query), []
for w in words:
    candidates = generate_candidates(w)
    # Keep the original word unless some candidate scores above zero.
    best_c, best_score = w, 0.0
    for c in candidates:
        score = spellchecker_score(w, c)
        if score > best_score:
            best_c, best_score = c, score
    result.append(best_c)
6
Generate candidates
Generate all possible similar words:
- Need to define a measure of similarity - we use Damerau-Levenshtein distance
- It allows insertions, deletions, substitutions and transpositions of symbols
- We limit the maximum allowed distance depending on the length of the word
- Then we simply generate every edit of the 4 possible types (CPU-intensive)
- We will optimize this approach later
Examples of Damerau-Levenshtein distance:
- distance(nguyễn, nguyên) = 1 (one substitution)
- distance(nguyễn, nguyeenx) = 3 (one substitution, two insertions)
- distance(behaivour, behaviour) = 1 (one transposition)
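The three example distances can be checked with a short sketch. It assumes the restricted (optimal string alignment) variant of Damerau-Levenshtein and NFC-normalized input, so that a letter such as "ễ" counts as one symbol. The name `dl_distance` is illustrative and not taken from the talk.

```python
import unicodedata


def dl_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    # Assumption: compare NFC-normalized text so each accented letter is one symbol.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev2 = []                       # row i-2, only read for transpositions
    prev = list(range(len(b) + 1))   # row i-1
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # transposition
        prev2, prev = prev, cur
    return prev[len(b)]


print(dl_distance("nguyễn", "nguyên"))        # 1
print(dl_distance("nguyễn", "nguyeenx"))      # 3
print(dl_distance("behaivour", "behaviour"))  # 1
```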
7
Spellchecker score
“Noisy channel” model:
- Bayesian formula: P(c|w) = P(w|c) * P(c) / P(w)
- Need to find candidate c which maximizes P(c|w)
- Can simplify to P(w|c) * P(c) because P(w) is constant for all candidates
Used probabilities:
- P(c|w) - probability of c being intended when w was observed
- P(w|c) - probability of the word w to be a misspelling of c - error model
- P(c) - probability to observe c - language model
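A minimal sketch of the selection rule: since P(w) is the same for every candidate, it is enough to compare P(w|c) * P(c). The function name and all probabilities below are made up for illustration and are not values from the talk.

```python
def best_candidate(observed, candidates, error_prob, language_prob):
    """Pick the candidate c that maximizes P(w|c) * P(c)."""
    return max(candidates, key=lambda c: error_prob(observed, c) * language_prob(c))


# Made-up probabilities, for illustration only.
P_ERROR = {("ipohne", "iphone"): 0.8, ("ipohne", "ipohne"): 0.2}
P_LANG = {"iphone": 0.01, "ipohne": 0.0001}

best = best_candidate(
    "ipohne",
    ["iphone", "ipohne"],
    error_prob=lambda w, c: P_ERROR.get((w, c), 0.0),
    language_prob=lambda c: P_LANG.get(c, 0.0),
)
print(best)  # iphone
```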
8
Building the language model
N-gram model:
- Building a 2-gram dictionary
- Remove 2-grams below a certain threshold
Used data:
- All product contents on Tiki
- All Tiki search queries for a year
- Some randomly crawled texts from the Vietnamese Web
- Total: 5.5 GB gzipped
9
Building the language model (example)
Data (queries on Tiki):
máy rửa mặt
máy rửa mắt
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy xay sinh tố
máy sấy tóc
...
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy rửa mắt
máy xay sinh tố
máy sấy tóc
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
10
Building the language model (example)
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
Language model:
410 <
410 >
410 máy
410 < máy
205 máy rửa
100 máy sấy
105 máy xay
105 tóc >
100 sấy tóc
5 xay tóc
105 tóc
...
We count every single word and every word pair
in the counted queries and write them into the
language model (< and > mark the start and end
of a query). This lets us calculate the probability
of observing a word without context, or with
a context of 1 word before or after it.
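The counting step can be sketched as follows, using the counted queries from the slide. The function name and the `min_bigram_count` parameter (standing in for the unspecified 2-gram threshold) are assumptions.

```python
from collections import Counter

COUNTED_QUERIES = {
    "máy rửa mặt": 200,
    "máy rửa mắt": 5,
    "máy sấy tóc": 100,
    "máy xay tóc": 5,
    "máy xay sinh tố": 100,
}


def build_language_model(counted_queries, min_bigram_count=1):
    """Count single words and adjacent word pairs, weighted by query count."""
    unigrams, bigrams = Counter(), Counter()
    for query, n in counted_queries.items():
        tokens = ["<"] + query.split() + [">"]  # query boundary markers
        for token in tokens:
            unigrams[token] += n
        for pair in zip(tokens, tokens[1:]):
            bigrams[pair] += n
    # Hypothetical threshold: drop 2-grams seen fewer than min_bigram_count times.
    bigrams = Counter({p: n for p, n in bigrams.items() if n >= min_bigram_count})
    return unigrams, bigrams


unigrams, bigrams = build_language_model(COUNTED_QUERIES)
print(unigrams["máy"], bigrams[("máy", "rửa")], bigrams[("xay", "tóc")])  # 410 205 5
```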
11
Building the language model (example)
Language model:
410 <
410 >
410 máy
410 < máy
205 máy rửa
100 máy sấy
105 máy xay
105 tóc >
100 sấy tóc
5 xay tóc
105 tóc
...
Query: máy => “< máy >”
P(máy) = 0.5 * (P(< máy) + P(máy >))
= 0.5 * (410/410 + 0/410) = 0.5
Query: máy xay tóc
P(xay) = 0.5 * (P(máy xay) + P(xay tóc))
= 0.5 * (105/410 + 5/105) ~ 0.15
P(sấy) = 0.5 * (P(máy sấy) + P(sấy tóc))
= 0.5 * (100/410 + 100/105) ~ 0.60
The language model suggests that “sấy” is about
four times more likely than “xay” in this context.
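The calculation can be reproduced with a small sketch. Judging by the denominators on the slide, each 2-gram count is divided by the count of the neighbouring context word; that reading, and the function name, are assumptions. Evaluating the formula as written gives roughly 0.15 for “xay” and 0.60 for “sấy”.

```python
# Counts copied from the language model listing.
UNIGRAMS = {"<": 410, ">": 410, "máy": 410, "tóc": 105}
BIGRAMS = {
    ("<", "máy"): 410,
    ("máy", "rửa"): 205,
    ("máy", "sấy"): 100,
    ("máy", "xay"): 105,
    ("tóc", ">"): 105,
    ("sấy", "tóc"): 100,
    ("xay", "tóc"): 5,
}


def context_probability(left, word, right):
    """Average of the left-context and right-context 2-gram estimates."""
    p_left = BIGRAMS.get((left, word), 0) / UNIGRAMS[left]
    p_right = BIGRAMS.get((word, right), 0) / UNIGRAMS[right]
    return 0.5 * (p_left + p_right)


print(context_probability("<", "máy", ">"))                # 0.5
print(round(context_probability("máy", "xay", "tóc"), 2))  # 0.15
print(round(context_probability("máy", "sấy", "tóc"), 2))  # 0.6
```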
12
Building the error model
Automatic extraction of P(w|c):
- Extract triplets (w1, w2, w3) from our text set
- Group triplets by (w1, *, w3) and sort each group by descending popularity
- Remove groupings below a certain popularity threshold
- Remove samples whose w2 words are too far from each other (by
Damerau-Levenshtein distance), since they are unlikely to be misspellings
- Remove samples whose popularity is comparable to the most popular sample in the
grouping, since they are probably valid alternatives rather than misspellings
- Write the w2 words of all remaining samples into the error model mapping as triplets of
(observed word, intended word, count)
Used data:
- Same as for the language model
13
Building the error model (example)
Data (queries on Tiki):
máy rửa mặt
máy rửa mắt
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy xay sinh tố
máy sấy tóc
...
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy rửa mắt
máy xay sinh tố
máy sấy tóc
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
14
Building the error model (example)
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
Triplets:
205 < máy rửa
200 rửa mặt >
5 rửa mắt >
100 máy sấy tóc
5 máy xay tóc
200 máy rửa mặt
5 máy rửa mắt
105 < máy xay
100 sinh tố >
...
We count all possible triplets from our counted
queries data.
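Triplet counting can be sketched in a few lines, again using the counted queries from the slide. The function name is illustrative.

```python
from collections import Counter

COUNTED_QUERIES = {
    "máy rửa mặt": 200,
    "máy rửa mắt": 5,
    "máy sấy tóc": 100,
    "máy xay tóc": 5,
    "máy xay sinh tố": 100,
}


def count_triplets(counted_queries):
    """Count every window of three consecutive tokens, weighted by query count."""
    triplets = Counter()
    for query, n in counted_queries.items():
        tokens = ["<"] + query.split() + [">"]  # query boundary markers
        for triplet in zip(tokens, tokens[1:], tokens[2:]):
            triplets[triplet] += n
    return triplets


triplets = count_triplets(COUNTED_QUERIES)
print(triplets[("<", "máy", "rửa")])  # 205
print(triplets[("rửa", "mắt", ">")])  # 5
```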
15
Building the error model (example)
Triplets (grouped):
rửa * >
200 rửa mặt >
5 rửa mắt >
máy * tóc
100 máy sấy tóc
5 máy xay tóc
máy * sinh
100 máy xay sinh
sinh * >
100 sinh tố >
...
Error model:
200 mặt mặt
5 mắt mặt
100 sấy sấy
5 xay sấy
100 xay xay
100 tố tố
...
Format:
count
observed_word
intended_word
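The grouping and filtering steps that produce this error model might look roughly like the sketch below. Several details are assumptions rather than the talk's implementation: the most popular w2 in each group is treated as the intended word, plain Levenshtein distance stands in for Damerau-Levenshtein, and the thresholds `min_group_count`, `max_distance` and `max_ratio` are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict

COUNTED_QUERIES = {
    "máy rửa mặt": 200, "máy rửa mắt": 5, "máy sấy tóc": 100,
    "máy xay tóc": 5, "máy xay sinh tố": 100,
}


def nfc(text):
    """Normalize so that each accented letter is a single symbol."""
    return unicodedata.normalize("NFC", text)


def levenshtein(a, b):
    """Plain Levenshtein distance, a simple stand-in for Damerau-Levenshtein."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_error_model(counted_queries, min_group_count=1, max_distance=2, max_ratio=0.2):
    # Count triplets, grouped by their outer words (w1, *, w3).
    groups = defaultdict(Counter)
    for query, n in counted_queries.items():
        tokens = ["<"] + nfc(query).split() + [">"]
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            groups[(w1, w3)][w2] += n
    model = Counter()  # (observed word, intended word) -> count
    for samples in groups.values():
        if sum(samples.values()) < min_group_count:
            continue  # grouping is too rare
        (top, top_count), *rest = samples.most_common()
        model[(top, top)] += top_count  # most popular w2 maps to itself
        for w2, n in rest:
            if levenshtein(w2, top) > max_distance:
                continue  # too different to be a misspelling
            if n > max_ratio * top_count:
                continue  # too popular to be a misspelling
            model[(w2, top)] += n
    return model


model = build_error_model(COUNTED_QUERIES)
print(model[(nfc("mắt"), nfc("mặt"))], model[(nfc("xay"), nfc("sấy"))])  # 5 5
```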
16
Building the error model (example)
Query: kem rửa mắt
P(mắt|mắt) = 0/5 = 0.0 - we divide the number of
times “mắt” was intended when “mắt” was observed
in the error model by the total number of times
“mắt” was observed in the error model.
P(mắt|mặt) = 5/5 = 1.0 - again, we divide the
number of times “mặt” was intended when “mắt”
was observed by the total number of times “mắt”
was observed in the error model.
This means that, according to the error model built
on our data, “mắt” is extremely likely to be a
misspelling of “mặt”.
Error model:
200 mặt mặt
5 mắt mặt
100 sấy sấy
5 xay sấy
100 xay xay
100 tố tố
...
Format:
count
observed_word
intended_word
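The lookup described on this slide can be sketched as follows: the count for the (observed, intended) pair is divided by the total count of the observed word. The function name is illustrative, and the counts are copied from the error model listing.

```python
# (observed word, intended word) -> count, copied from the error model listing.
ERROR_MODEL = {
    ("mặt", "mặt"): 200,
    ("mắt", "mặt"): 5,
    ("sấy", "sấy"): 100,
    ("xay", "sấy"): 5,
    ("xay", "xay"): 100,
    ("tố", "tố"): 100,
}


def error_probability(observed, intended):
    """Share of the observed word's occurrences where `intended` was meant."""
    total = sum(n for (o, _), n in ERROR_MODEL.items() if o == observed)
    if total == 0:
        return 0.0  # word never observed in the error model
    return ERROR_MODEL.get((observed, intended), 0) / total


print(error_probability("mắt", "mắt"))  # 0.0
print(error_probability("mắt", "mặt"))  # 1.0
```

In the same data, “xay” was observed 105 times, so `error_probability("xay", "sấy")` is 5/105 and `error_probability("xay", "xay")` is 100/105.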
17
Quality optimizations
Idea:
- The language model matters more when more context is available
- Instead of P(w|c)*P(c), use P(w|c)*pow(P(c),lambda)
- Lambda depends on the length of the available context
Results:
- Using a bigger lambda for longer context => better test results (the idea works!)
- For bigger N-grams, machine learning is needed to optimize the lambdas
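A small sketch of the weighted score shows how lambda shifts the balance between the two models. The lambda values and probabilities are hypothetical, chosen only to make the effect visible.

```python
def weighted_score(p_error, p_language, lam):
    """P(w|c) * P(c)^lambda: a larger lambda gives the language model more say."""
    return p_error * p_language ** lam


# Hypothetical lambdas, keyed by the number of context words available.
LAMBDA_BY_CONTEXT = {0: 0.5, 1: 1.0, 2: 2.0}

# Made-up (P(w|c), P(c)) pairs for two competing candidates.
CANDIDATE_A = (0.9, 0.1)  # favoured by the error model
CANDIDATE_B = (0.3, 0.6)  # favoured by the language model

for context_words, lam in LAMBDA_BY_CONTEXT.items():
    score_a = weighted_score(*CANDIDATE_A, lam)
    score_b = weighted_score(*CANDIDATE_B, lam)
    winner = "A" if score_a > score_b else "B"
    print(context_words, winner)
```

With no context (lambda 0.5) candidate A wins on the strength of the error model. With one or two context words the language model dominates and candidate B wins.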
18
Performance optimizations
Important fact:
It can be proven that if the Damerau-Levenshtein distance(w, c) = N, then for any w
and c there is a combination of at most N single-character deletes on each side
that turns both words into the same string. Examples below:
distance(iphone, iphobee) = 2 (one insertion, one substitution)
iphone -> iphoe VS iphobee -> iphoee -> iphoe (match!)
distance(iphone, pihoone) = 2 (one transposition, one insertion)
iphone -> ihone VS pihoone -> ihoone -> ihone (match!)
Let’s use this to optimize candidate generation!
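The two examples can be verified by enumerating delete variants. This is a minimal sketch, and the helper names are illustrative.

```python
def deletes(word, max_deletes):
    """All strings obtainable from `word` by removing up to `max_deletes` characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_deletes):
        # Remove one more character from every string produced so far.
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results


def share_delete_variant(a, b, n):
    """True if a and b can meet after at most n deletes on each side."""
    return bool(deletes(a, n) & deletes(b, n))


print("iphoe" in deletes("iphone", 2) & deletes("iphobee", 2))  # True
print("ihone" in deletes("iphone", 2) & deletes("pihoone", 2))  # True
print(share_delete_variant("iphone", "samsung", 2))             # False
```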
19
Performance optimizations
Problem 1 - generating candidates is CPU-intensive:
- Precompute a “deletes” dictionary
- Use only delete operations on both sides
- Need to double-check the distance (a match can be up to 2N apart, but we only want N)
- Fast, but requires RAM
Problem 2 - the “deletes” dictionary requires RAM:
- Use different data compression techniques
- Of what we’ve tried, Judy dynamic arrays work best
- We decreased RAM requirements from 10.5 GB to 2.3 GB
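A minimal sketch of the “deletes” dictionary idea: every vocabulary word is indexed under all of its delete variants, the query word is expanded the same way, and each hit is re-checked against the real distance. It uses plain Python dicts and sets rather than Judy arrays, and the restricted (optimal string alignment) distance. The function names and the tiny vocabulary are illustrative.

```python
from collections import defaultdict


def deletes(word, max_deletes):
    """All strings obtainable from `word` by removing up to `max_deletes` characters."""
    results, frontier = {word}, {word}
    for _ in range(max_deletes):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    prev2, prev = [], list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # transposition
        prev2, prev = prev, cur
    return prev[len(b)]


def build_deletes_index(vocabulary, max_deletes):
    """Precompute: delete variant -> set of vocabulary words producing it."""
    index = defaultdict(set)
    for word in vocabulary:
        for variant in deletes(word, max_deletes):
            index[variant].add(word)
    return index


def lookup(word, index, max_distance):
    """Candidates sharing a delete variant, filtered by the real distance."""
    found = set()
    for variant in deletes(word, max_distance):
        for candidate in index.get(variant, ()):
            # Sharing a variant only bounds the distance by 2N, so verify.
            if osa_distance(word, candidate) <= max_distance:
                found.add(candidate)
    return found


index = build_deletes_index(["iphone", "ipad", "phone"], max_deletes=1)
print(lookup("ipohne", index, max_distance=1))  # {'iphone'}
```

The verification step matters: “xab” and “aby” both reduce to “ab” after one delete, yet their distance is 2, so a lookup with N = 1 must reject that pair.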
20
Testing results
Testing set:
- 5,000 random queries, 10,000 misspelled queries
- Suggestions collected through Google API and then manually checked
- Only one marker per query
Results:
- Slightly (10-12%) worse than Google (acceptable for such RAM requirements)
- The A/B test shows a 3-9% increase in purchases
21
Future plans
Implementation:
- Use 3-gram data (still trying to keep it RAM-optimal)
Testing:
- Use multi-marker test set
- Properly handle cases when spellchecker returns multiple variants
Thank you!
22

Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 

Dernier (20)

NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 

Grokking TechTalk #35: Efficient spellchecking

  • 1. CPU and memory efficient spellchecker implementation in TIKI (PO Department, People Operation’s Monthly Update, 09/2019)
  • 3. Results for “ipohne” without spellchecker
  • 4. Results for “ipohne” with spellchecker
  • 5. General approach
      words, result = (tokenize(query), [])
      for w in words:
          candidates = generate_candidates(w)
          # start from the original word so that it is kept when no candidate scores above 0
          best_c, best_score = (w, 0.)
          for c in candidates:
              score = spellchecker_score(w, c)
              if score > best_score:
                  best_c, best_score = (c, score)
          result.append(best_c)
  • 6. Generate candidates
      Generate all possible similar words:
      - Need to define a measure of similarity: we use Damerau-Levenshtein distance
      - It allows insertions, deletions, substitutions and transpositions of symbols
      - We limit the maximum allowed distance depending on the length of the word
      - Then just generate all edits of the 4 possible types (CPU-intensive)
      - We will optimize this approach later
      Examples of Damerau-Levenshtein distance:
      - distance(nguyễn, nguyên) = 1 (one substitution)
      - distance(nguyễn, nguyeenx) = 3 (one substitution, two insertions)
      - distance(behaivour, behaviour) = 1 (one transposition)
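The distance examples on this slide can be checked with a short sketch. It assumes the restricted variant of Damerau-Levenshtein distance (optimal string alignment), which the slides do not specify, and it normalizes the input to NFC so that each accented Vietnamese letter counts as one symbol. The function name `osa_distance` is illustrative, not taken from the talk.

```python
import unicodedata


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    # Assumption: compare precomposed characters, so "ễ" is a single symbol.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i symbols of a
    for j in range(cols):
        d[0][j] = j  # insert all j symbols of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # transposition of two adjacent symbols
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("nguyễn", "nguyên"))
print(osa_distance("nguyễn", "nguyeenx"))
print(osa_distance("behaivour", "behaviour"))
```

The three calls reproduce the distances 1, 3 and 1 listed on the slide.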
  • 7. Spellchecker score
      “Noisy channel” model:
      - Bayes' formula: P(c|w) = P(w|c) * P(c) / P(w)
      - Need to find the candidate c which maximizes P(c|w)
      - Can simplify to P(w|c) * P(c) because P(w) is constant for all candidates
      Used probabilities:
      - P(c|w): probability that c was intended when w was observed
      - P(w|c): probability that the word w is a misspelling of c (error model)
      - P(c): probability to observe c (language model)
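As a minimal sketch of the noisy channel selection, the function below picks the candidate with the largest P(w|c) * P(c). The probability tables are made-up toy numbers for illustration only, and the names `best_candidate`, `ERROR_PROB` and `LANG_PROB` are hypothetical.

```python
def best_candidate(w, candidates, error_prob, lang_prob):
    """Return the candidate c maximizing P(w|c) * P(c).

    P(w) is the same for every candidate, so it is left out of the score.
    """
    best_c, best_score = w, 0.0
    for c in candidates:
        score = error_prob.get((w, c), 0.0) * lang_prob.get(c, 0.0)
        if score > best_score:
            best_c, best_score = c, score
    return best_c


# Toy numbers (assumptions): keys of ERROR_PROB are (observed, intended).
ERROR_PROB = {("ipohne", "iphone"): 0.2, ("ipohne", "ipohne"): 0.8}
LANG_PROB = {"iphone": 0.01, "ipohne": 0.00001}

print(best_candidate("ipohne", ["ipohne", "iphone"], ERROR_PROB, LANG_PROB))
```

With these numbers the language model outweighs the error model, so "iphone" is selected even though keeping the observed word has the higher error-model probability.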
  • 8. Building the language model
      N-gram model:
      - Building a 2-gram dictionary
      - Remove 2-grams below a certain threshold
      Used data:
      - All product contents on Tiki
      - All Tiki search queries for a year
      - Some randomly crawled texts from the Vietnamese Web
      - Total: 5.5 GB, gzipped
  • 9. Building the language model (example)
      Data (queries on Tiki):
        máy rửa mặt
        máy rửa mắt
        máy sấy tóc
        máy xay tóc
        máy xay sinh tố
        máy rửa mắt
        máy xay sinh tố
        máy sấy tóc
        ...
        máy sấy tóc
        máy xay tóc
        máy xay sinh tố
        máy rửa mắt
        máy rửa mắt
        máy xay sinh tố
        máy sấy tóc
      Counted queries:
        200  máy rửa mặt
          5  máy rửa mắt
        100  máy sấy tóc
          5  máy xay tóc
        100  máy xay sinh tố
  • 10. Building the language model (example)
      Counted queries:
        200  máy rửa mặt
          5  máy rửa mắt
        100  máy sấy tóc
          5  máy xay tóc
        100  máy xay sinh tố
      Language model:
        410  <
        410  >
        410  máy
        410  < máy
        205  máy rửa
        100  máy sấy
        105  máy xay
        105  tóc >
        100  sấy tóc
          5  xay tóc
        105  tóc
        ...
      We count all single words and all pairs of adjacent words in the counted queries and write the counts into the language model. “<” and “>” mark the start and the end of a query. This lets us calculate the probability of observing a word either without context or with a context of one word before or after it.
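The counting step can be sketched as follows. This is an illustration under stated assumptions rather than the implementation from the talk: the function name `build_language_model`, the space-joined key format and the `min_count` pruning parameter are all hypothetical.

```python
from collections import Counter


def build_language_model(counted_queries, min_count=1):
    """Count 1-grams and 2-grams, with "<" and ">" as query boundaries."""
    lm = Counter()
    for query, n in counted_queries.items():
        tokens = ["<"] + query.split() + [">"]
        for token in tokens:
            lm[token] += n
        for left, right in zip(tokens, tokens[1:]):
            lm[f"{left} {right}"] += n
    # Drop rare 2-grams (keys containing a space); 1-grams are always kept.
    return {k: v for k, v in lm.items() if " " not in k or v >= min_count}


COUNTED_QUERIES = {
    "máy rửa mặt": 200,
    "máy rửa mắt": 5,
    "máy sấy tóc": 100,
    "máy xay tóc": 5,
    "máy xay sinh tố": 100,
}

lm = build_language_model(COUNTED_QUERIES)
for key in ["<", "máy", "< máy", "máy rửa", "máy xay", "tóc >", "xay tóc"]:
    print(lm[key], key)
```

Run on the counted queries from the slide, this yields the same counts as the language model listing, for example 410 for "máy" and 105 for "máy xay".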
  • 11. Building the language model (example)
      Language model:
        410  <
        410  >
        410  máy
        410  < máy
        205  máy rửa
        100  máy sấy
        105  máy xay
        105  tóc >
        100  sấy tóc
          5  xay tóc
        105  tóc
        ...
      Query: máy => “< máy >”
        P(máy) = 0.5 * (P(< máy) + P(máy >)) = 0.5 * (410/410 + 0/410) = 0.5
      Query: máy xay tóc
        P(xay) = 0.5 * (P(máy xay) + P(xay tóc)) = 0.5 * (105/410 + 5/105) ~ 0.15
        P(sấy) = 0.5 * (P(máy sấy) + P(sấy tóc)) = 0.5 * (100/410 + 100/105) ~ 0.60
      Each 2-gram count is divided by the count of the context word (for example, 105/410 is the count of “máy xay” over the count of “máy”). The language model here suggests that the probability to see “sấy” in this context is higher than the probability to see “xay”.
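The context probability on this slide can be evaluated with a few lines of code. The sketch hard-codes the counts from the language model listing; the function name `context_probability` and the dictionary layout are assumptions made for the example, and the context words are assumed to be present in the model.

```python
# Counts copied from the language model listing on the slide.
LM = {
    "<": 410, ">": 410, "máy": 410, "tóc": 105,
    "< máy": 410, "máy rửa": 205, "máy sấy": 100, "máy xay": 105,
    "tóc >": 105, "sấy tóc": 100, "xay tóc": 5,
}


def context_probability(word, left, right, lm=LM):
    """Average of the left-context and right-context 2-gram probabilities."""
    # Each 2-gram count is divided by the count of the context word.
    p_left = lm.get(f"{left} {word}", 0) / lm[left]
    p_right = lm.get(f"{word} {right}", 0) / lm[right]
    return 0.5 * (p_left + p_right)


print(round(context_probability("máy", "<", ">"), 2))
print(round(context_probability("xay", "máy", "tóc"), 2))
print(round(context_probability("sấy", "máy", "tóc"), 2))
```

Evaluating the formula as written, including the 0.5 factor, gives about 0.15 for “xay” and about 0.60 for “sấy”, so “sấy” remains the more likely word in this context.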
  • 12. Building the error model
      Automatic extraction of P(w|c):
      - Extract triplets (w1, w2, w3) from our set of texts
      - Group triplets by (w1, *, w3) and sort by descending popularity
      - Remove groupings below a certain threshold
      - Remove samples where the w2 words are too far from each other (using Damerau-Levenshtein distance)
      - Remove samples with popularity comparable to the most popular sample in the grouping (these are likely valid alternatives rather than misspellings)
      - Write the w2 words from all remaining samples into the error model mapping as triplets of (observed word, intended word, count)
      Used data:
      - Same as for the language model
  • 13. Building the error model (example)
      Data (queries on Tiki):
        máy rửa mặt
        máy rửa mắt
        máy sấy tóc
        máy xay tóc
        máy xay sinh tố
        máy rửa mắt
        máy xay sinh tố
        máy sấy tóc
        ...
        máy sấy tóc
        máy xay tóc
        máy xay sinh tố
        máy rửa mắt
        máy rửa mắt
        máy xay sinh tố
        máy sấy tóc
      Counted queries:
        200  máy rửa mặt
          5  máy rửa mắt
        100  máy sấy tóc
          5  máy xay tóc
        100  máy xay sinh tố
  • 14. Building the error model (example)
      Counted queries:
        200  máy rửa mặt
          5  máy rửa mắt
        100  máy sấy tóc
          5  máy xay tóc
        100  máy xay sinh tố
      Triplets:
        205  < máy rửa
        200  rửa mặt >
          5  rửa mắt >
        100  máy sấy tóc
          5  máy xay tóc
        200  máy rửa mặt
          5  máy rửa mắt
        105  < máy xay
        100  sinh tố >
        ...
      We count all possible triplets from our counted queries data.
  • 15. Building the error model (example)
      Triplets (grouped):
        rửa * >
          200  rửa mặt >
            5  rửa mắt >
        máy * tóc
          100  máy sấy tóc
            5  máy xay tóc
        máy * sinh
          100  máy xay sinh
        sinh * >
          100  sinh tố >
        ...
      Error model:
        200  mặt  mặt
          5  mắt  mặt
        100  sấy  sấy
          5  xay  sấy
        100  xay  xay
        100  tố   tố
        ...
      Format: count observed_word intended_word
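The grouping step above can be sketched as follows. This is a simplified illustration, not the pipeline from the talk: it treats the most popular middle word of each (w1, *, w3) group as the intended word, uses a hypothetical `max_ratio` parameter for the "comparable popularity" filter, and leaves out both the group-size threshold and the Damerau-Levenshtein distance filter.

```python
from collections import defaultdict


def build_error_model(triplets, max_ratio=0.1):
    """Map (observed, intended) -> count from counted (w1, w2, w3) triplets."""
    groups = defaultdict(lambda: defaultdict(int))
    for (w1, w2, w3), n in triplets.items():
        groups[(w1, w3)][w2] += n

    model = defaultdict(int)
    for middles in groups.values():
        # Assumption: the most popular middle word is the intended one.
        intended, top = max(middles.items(), key=lambda kv: kv[1])
        for observed, n in middles.items():
            # Keep the intended word itself and the rare variants only;
            # variants about as popular as the top one are dropped.
            if observed == intended or n <= max_ratio * top:
                model[(observed, intended)] += n
    return dict(model)


TRIPLETS = {
    ("rửa", "mặt", ">"): 200,
    ("rửa", "mắt", ">"): 5,
    ("máy", "sấy", "tóc"): 100,
    ("máy", "xay", "tóc"): 5,
    ("máy", "xay", "sinh"): 100,
    ("sinh", "tố", ">"): 100,
}

for (observed, intended), n in build_error_model(TRIPLETS).items():
    print(n, observed, intended)
```

On the grouped triplets from the slide this reproduces the six error model entries shown there, including the rare "mắt" mapped to the intended "mặt".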
  • 16. Building the error model (example)
      Error model:
        200  mặt  mặt
          5  mắt  mặt
        100  sấy  sấy
          5  xay  sấy
        100  xay  xay
        100  tố   tố
        ...
      Format: count observed_word intended_word
      Query: kem rửa mắt
        P(mắt|mắt) = 0/5 = 0.0
          We divide the number of times “mắt” was intended when “mắt” was observed in the error model by the total number of times “mắt” was observed in the error model.
        P(mắt|mặt) = 5/5 = 1.0
          Again, we divide the number of times “mặt” was intended when “mắt” was observed in the error model by the total number of times “mắt” was observed in the error model.
      This means that, according to the error model built on our data, it is extremely likely for “mắt” to be a misspelling of “mặt”.
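The two probabilities on this slide follow directly from the error model counts. A minimal sketch, with the counts copied from the slide and a hypothetical function name `error_probability`:

```python
# (observed, intended) -> count, copied from the error model on the slide.
ERROR_MODEL = {
    ("mặt", "mặt"): 200,
    ("mắt", "mặt"): 5,
    ("sấy", "sấy"): 100,
    ("xay", "sấy"): 5,
    ("xay", "xay"): 100,
    ("tố", "tố"): 100,
}


def error_probability(observed, intended, model=ERROR_MODEL):
    """Share of the sightings of `observed` where `intended` was meant."""
    total = sum(n for (o, _), n in model.items() if o == observed)
    if total == 0:
        return 0.0  # the observed word never appears in the error model
    return model.get((observed, intended), 0) / total


print(error_probability("mắt", "mắt"))
print(error_probability("mắt", "mặt"))
print(round(error_probability("xay", "sấy"), 3))
```

The first two calls give 0.0 and 1.0 as on the slide. The third shows a less extreme case: "xay" was observed 105 times in total and was a misspelling of "sấy" in 5 of them.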
  • 17. Quality optimizations
      Idea:
      - Language model is more important in bigger context
      - Instead of P(w|c)*P(c) use P(w|c)*pow(P(c),lambda)
      - Lambda depends on the length of available context
      Results:
      - Using bigger lambda for longer context => better test result (idea works!)
      - For bigger N-grams, need to use machine learning to optimize lambdas
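The effect of lambda can be seen in a small sketch. All probabilities and lambda values below are made up for illustration; the talk does not give the values actually used.

```python
def weighted_score(p_error, p_lang, lam):
    """Noisy channel score with the language model weighted by lambda."""
    return p_error * p_lang ** lam


# Toy numbers (assumptions): (P(w|c), P(c)) for two competing candidates.
CANDIDATES = {
    "keep": (0.9, 0.01),    # likely per the error model, rare in the language model
    "correct": (0.1, 0.5),  # unlikely per the error model, frequent in the language model
}


def pick(lam):
    return max(CANDIDATES, key=lambda c: weighted_score(*CANDIDATES[c], lam))


print(pick(0.5))
print(pick(2.0))
```

With a small lambda the error model dominates and the observed word is kept. With a larger lambda, as one might use when more context is available, the language model dominates and the correction wins.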
  • 18. Performance optimizations
      Important fact:
      It is possible to prove that if the Damerau-Levenshtein distance(w, c) = N, then for any w and c we can find a combination of no more than N single-character deletes on each side which leads to the same string.
      Examples:
        distance(iphone, iphobee) = 2 (one insertion, one substitution)
          iphone -> iphoe VS iphobee -> iphoee -> iphoe (match!)
        distance(iphone, pihoone) = 2 (one transposition, one insertion)
          iphone -> ihone VS pihoone -> ihoone -> ihone (match!)
      Let’s use it to optimize candidate generation!
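The two examples above can be verified by generating every string reachable with at most N deletes and intersecting the sets. A minimal sketch; the helper name `deletes` is illustrative.

```python
from itertools import combinations


def deletes(word, max_deletes):
    """All strings obtained by deleting up to max_deletes characters from word."""
    out = {word}
    for k in range(1, min(max_deletes, len(word)) + 1):
        for positions in combinations(range(len(word)), k):
            dropped = set(positions)
            out.add("".join(ch for i, ch in enumerate(word) if i not in dropped))
    return out


print("iphoe" in deletes("iphone", 2) & deletes("iphobee", 2))
print("ihone" in deletes("iphone", 2) & deletes("pihoone", 2))
```

Both checks print True: each pair of words at distance 2 shares a string that is reachable with at most two deletes from either side.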
  • 19. Performance optimizations
      Problem 1: generating candidates is CPU-intensive:
      - Precompute a “deletes” dictionary
      - Use only delete operations from both sides
      - Need to double-check the distance (it can be up to 2N, but we need N)
      - Fast, but requires RAM
      Problem 2: having a “deletes” dictionary requires RAM:
      - Use different data compression techniques
      - From what we’ve tried, Judy dynamic arrays work the best
      - We decreased RAM requirements from 10.5 GB to 2.3 GB
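The precomputed "deletes" dictionary from Problem 1 can be sketched as below. This is an in-memory illustration with plain Python dicts and sets, not the Judy-array storage mentioned on the slide. The distance check assumes the restricted (optimal string alignment) variant of Damerau-Levenshtein distance, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def deletes(word, n):
    """All strings obtained by deleting up to n characters from word."""
    out = {word}
    for k in range(1, min(n, len(word)) + 1):
        for pos in combinations(range(len(word)), k):
            out.add("".join(ch for i, ch in enumerate(word) if i not in pos))
    return out


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein distance (assumed variant)."""
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def build_index(vocabulary, n):
    """Precompute: deleted form -> dictionary words that produce it."""
    index = defaultdict(set)
    for word in vocabulary:
        for key in deletes(word, n):
            index[key].add(word)
    return index


def candidates(query, index, n):
    """Look up the deletes of the query, then double-check the real distance."""
    found = set()
    for key in deletes(query, n):
        found |= index.get(key, set())
    # Words matched through deletes can be up to 2N edits away, so filter.
    return {c for c in found if osa_distance(query, c) <= n}


index = build_index(["iphone", "ipad", "phone"], 2)
print(sorted(candidates("ipohne", index, 2)))
print(candidates("abc", build_index(["bcd"], 1), 1))
```

The second call shows why the distance must be double-checked: "abc" and "bcd" both reduce to "bc" with one delete each, but their distance is 2, so "bcd" is rejected for N = 1.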
  • 20. Testing results
      Testing set:
      - 5,000 random queries, 10,000 misspelled queries
      - Suggestions collected through Google API and then manually checked
      - Only one marker per query
      Results:
      - Slightly (10-12%) worse than Google (OK for such RAM requirements)
      - An A/B test shows a 3-9% increase in purchases
  • 21. Future plans
      Implementation:
      - Use 3-gram data (still trying to keep it RAM-optimal)
      Testing:
      - Use a multi-marker test set
      - Properly handle cases when the spellchecker returns multiple variants