4. Ultimate Goal of Spelling Correction
Reducing spelling errors while the user types the same way
as before
Reducing spelling errors that occur at borders between keys
2016-06-14 4
5. Cause of Spelling Error
Differences among individuals' touch distributions
The difference between a key’s area of recognition and an
individual’s touch distribution
6. Review
Machine Learning
Learn through training data
Supervised Learning
Knowing a user's intention is the key to spelling correction
Supervised model
- Refined input & answer information
7. Review (Cont’d)
Problem
Difficult to differentiate which key the user pressed when he or she
presses the border between keys
Other Algorithms
By tracking backspace
- Inferring the answer information
- Learning through supervised learning
Low accuracy
8. Semi-supervised Learning
Supervised learning
A small amount of labeled data (the answer information)
Unsupervised learning
A large amount of unlabeled data (the distribution of pressed keys)
A model that can learn without the answer information when
a user presses the borders between keys
9. Clustering Algorithm
Grouping similar objects into the same group
Distribution-based clustering
Gaussian mixture models
- Using the Expectation-Maximization algorithm
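The Gaussian-mixture-plus-EM idea can be sketched in pure Python for a single touch axis. A minimal sketch; the key positions, spread, and sample counts below are illustrative assumptions, not measured touch data:

```python
import math
import random

def em_two_gaussians(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture to xs with the EM algorithm."""
    mu = [min(xs), max(xs)]        # initialize means at the extremes
    var = [1.0, 1.0]
    w = [0.5, 0.5]                 # mixture weights
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [w[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-((x - mu[k]) ** 2) / (2 * var[k]))
                 for k in range(2)]
            s = p[0] + p[1]
            resp.append((p[0] / s, p[1] / s))
        # M-step: re-estimate weights, means, and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return mu, var, w

random.seed(0)
# hypothetical touch x-coordinates around two key centers ('f' near 10, 'g' near 20)
xs = ([random.gauss(10, 1.5) for _ in range(200)]
      + [random.gauss(20, 1.5) for _ in range(200)])
mu, var, w = em_two_gaussians(xs)
print(sorted(round(m, 2) for m in mu))
```

The recovered means land near the two key centers, which is what lets the model describe each user's personal touch distribution per key.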
10. Clustering Algorithm (Cont’d)
Data near the key center
The user intended that key
Used directly to train the model
Data on key borders
Fed into the clustering algorithm
- Widens a key's area of recognition
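The split above between trusted near-center touches and ambiguous border touches can be sketched as follows. This is a simplified 1-D sketch; the key coordinates, radius threshold, and touch points are all hypothetical:

```python
# Hypothetical 1-D keyboard: nominal key centers (illustrative values)
KEY_CENTERS = {"f": 10.0, "g": 20.0}
CENTER_RADIUS = 3.0   # touches this close to a center are trusted as labeled data

def split_touches(touches):
    """Separate near-center (labeled) touches from border (unlabeled) ones."""
    labeled, unlabeled = [], []
    for x in touches:
        key = min(KEY_CENTERS, key=lambda k: abs(x - KEY_CENTERS[k]))
        if abs(x - KEY_CENTERS[key]) <= CENTER_RADIUS:
            labeled.append((x, key))   # user intended that key: train directly
        else:
            unlabeled.append(x)        # border touch: left for clustering
    return labeled, unlabeled

def personalized_centers(labeled):
    """Shift each key's recognition center toward the user's own touch mean."""
    sums, counts = {}, {}
    for x, key in labeled:
        sums[key] = sums.get(key, 0.0) + x
        counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

touches = [9.2, 10.5, 11.1, 19.4, 20.8, 14.9, 15.2]   # last two sit on the border
labeled, unlabeled = split_touches(touches)
centers = personalized_centers(labeled)
# border touches are then assigned to the nearest personalized center,
# effectively widening a key's area of recognition toward the user's habits
resolved = [min(centers, key=lambda k: abs(x - centers[k])) for x in unlabeled]
print(labeled, unlabeled, resolved)
```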
16. Problems or Limitations
Not possible to suggest corrections on a contextual basis
When the data set is small: high error rate if false data is
mistakenly input
18. SwiftKey
Natural Language Processing (NLP) for predictions and
spelling corrections
Retroactive correction
19. NLP – Types of Errors
Non-word error (NWE)
bannana → banana
Real-word error (RWE)
Typographical
- two → tow
Cognitive
- two → too
21. Candidate Generation
Words with similar spelling
Words with similar pronunciation (for RWE)
The word itself (for RWE)
22. Candidate Generation
Words with similar spelling
Smallest edit distance between words, where the possible letter edits are
Deletion
Insertion
Substitution
Reversal (Transposition)
80% to 95% of errors are within edit distance 1
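The edit-distance-1 candidate set can be generated directly by enumerating every deletion, insertion, substitution, and reversal. A minimal sketch in the style of Norvig's well-known spell corrector; the tiny dictionary is an assumption for the example:

```python
import string

def edits1(word):
    """All strings at edit distance 1: deletions, insertions,
    substitutions, and reversals (adjacent transpositions)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    inserts = [L + c + R for L, R in splits for c in letters]
    substitutes = [L + c + R[1:] for L, R in splits if R for c in letters]
    reversals = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    return set(deletes + inserts + substitutes + reversals)

def candidates(typo, dictionary):
    """Dictionary words within edit distance 1 of the typo."""
    return sorted(edits1(typo) & dictionary)

# a tiny hypothetical dictionary covering the slide's example
WORDS = {"actress", "cress", "caress", "access", "across", "acres"}
print(candidates("acress", WORDS))
# -> ['access', 'acres', 'across', 'actress', 'caress', 'cress']
```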
23. Candidate Generation
Example (Jurafsky 2012)

Typo   | Candidate | ti | ci | Type
acress | actress   | -  | t  | Deletion
acress | cress     | a  | -  | Insertion
acress | caress    | ac | ca | Reversal
acress | access    | r  | c  | Substitution
acress | across    | e  | o  | Substitution
acress | acres     | s  | -  | Insertion
acress | acres     | s  | -  | Insertion
24. Candidate Selection
Select the candidate for which the following is greatest:

P(candidate | typo) = P(typo | candidate) P(candidate) / P(typo)   (Bayes' Theorem)
∝ P(typo | candidate) P(candidate)

P(typo) is the same for every candidate, so it can be dropped.
P(typo | candidate): Error Model
P(candidate): Language Model
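Candidate selection is then just an argmax over the product of the two models. A toy sketch for the typo "acress"; the probability values below are illustrative assumptions, not real corpus estimates:

```python
error_model = {          # P(typo | candidate): chance this edit produced "acress"
    "actress": 1.17e-4,
    "cress":   1.44e-6,
    "caress":  1.64e-6,
    "access":  2.09e-7,
    "across":  9.30e-6,
    "acres":   3.21e-5,
}
language_model = {       # P(candidate): how common the word is in English text
    "actress": 2.69e-5,
    "cress":   1.20e-7,
    "caress":  1.70e-6,
    "access":  9.16e-4,
    "across":  2.99e-4,
    "acres":   3.18e-5,
}

def best_correction(cands):
    """argmax of P(typo | c) * P(c); the constant P(typo) has been dropped."""
    return max(cands, key=lambda c: error_model[c] * language_model[c])

print(best_correction(error_model))
# -> actress
```

Note how a candidate can win either by being an easy typo to make (error model) or by being a common word (language model); the product balances the two.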
25. Candidate Selection
Language Model
Unigram Model
P(candidate)
The ratio of the frequency of candidate to the total count of words in
the training set
n-gram Model
P(candidate | word1, ..., word n-1)
The frequency of candidate conditioned on the n-1 preceding words
in the training set
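Both estimates are simple frequency ratios over a training corpus. A minimal sketch for the unigram and bigram (n = 2) cases; the tiny corpus is a made-up example:

```python
from collections import Counter

corpus = ("the actress ran across the stage "
          "the actress won an award the award show ran late").split()

# Unigram model: P(w) = count(w) / total word count
unigram = Counter(corpus)
total = len(corpus)

def p_unigram(w):
    return unigram[w] / total

# Bigram model: P(w | prev) = count(prev, w) / count(prev)
bigram = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, prev):
    return bigram[(prev, w)] / unigram[prev] if unigram[prev] else 0.0

print(p_unigram("the"), p_bigram("actress", "the"))
# -> 0.25 0.5
```

In practice the counts come from a large text corpus and are smoothed so that unseen n-grams do not get probability zero.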
Uses Natural Language Processing (NLP) to predict suggestions.
The same algorithm is also used to correct spelling errors.
Retroactively corrects words by selecting the best candidate out of a list of suggestions.
Spelling errors can be classified into two types: non-word errors and real-word errors.
The difference between them is whether the misspelled word is in the dictionary or not.
To correct a non-word error, you first have to detect it.
As mentioned before, if the word is not in the dictionary, then it is indeed a non-word error. So in this case, the bigger the dictionary, the better the detection.
Next, you generate a list of candidates.
And finally, out of the candidates, you select the one which is the best.
In the step of generating candidates we make a list of words that includes the following:
words with similar spelling, words with similar pronunciation, and the word itself.
The last two are for real word errors.
In the case of words with similar spelling, we would find words in the dictionary that have the minimal edit distance to the misspelled word.
The edit distance between two words is the total count of deletions, insertions, substitutions, and reversals (transpositions) needed to turn one into the other.
It is statistically known that more than 80 percent of errors are within an edit distance of 1, and almost all are within 2.
So a simplified spell checker program would generate a list of words with an edit distance of 1.
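The distance just described (deletions, insertions, substitutions, plus adjacent transpositions) can be computed with a small dynamic-programming table. This is the optimal-string-alignment variant of Damerau-Levenshtein distance, offered as a sketch:

```python
def edit_distance(a, b):
    """Minimum number of deletions, insertions, substitutions, and
    adjacent transpositions (reversals) turning a into b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(edit_distance("acress", "actress"),   # deletion of t   -> 1
      edit_distance("acress", "caress"),    # reversal ac/ca  -> 1
      edit_distance("two", "too"))          # substitution    -> 1
```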
Here is an example of the typo "acress".
We have the candidates actress, cress, caress, access, across, and acres at an edit distance of 1, plus the word acress itself.
We can see the types in this column, along with ti and ci, whose purpose we will explain later.
Language Model: "how likely is candidate to appear in an English text?"
Error Model: "how likely is it that the author would type typo by mistake when candidate was intended?"