This document discusses natural language processing and text segmentation. It introduces ELUTE (Essential Libraries and Utilities of Text Engineering) and some of its Chinese language processing tools. It then discusses word segmentation algorithms like maximum matching, hidden Markov models, and conditional random fields. Finally, it talks about building language models and the importance of having a large corpus to train models on.
3. (lib)TaBE
• Traditional Chinese Word Segmentation
• with Big5 encoding
• Traditional Chinese Syllable-to-Word Conversion
• with Big5 encoding
• for the bo-po-mo-fo (Zhuyin) transcription system
9. Heuristic Rules*
• Maximum matching -- Simple vs. Complex: 下雨天真正討厭 ("rainy days are really annoying"; see the sketch after this slide)
• 下雨 天真 正 討厭 (rain / naive / just / annoying) vs. 下雨天 真正 討厭 (rainy day / really / annoying)
• Maximum average word length
• 國際化 ("internationalization")
• Minimum variance of word lengths
• 研究 生命 起源 ("study the origin of life")
• Maximum degree of morphemic freedom of single-character word
• 主要 是 因為 (mainly / is / because, "mainly because")
* Refer to MMSEG by C. H. Tsai: http://technology.chtsai.org/mmseg/
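A minimal sketch of the simple (greedy) variant, assuming a toy dictionary and function name invented for this example; MMSEG's complex variant instead scores candidate three-word chunks by the four rules listed above.

# Simple forward maximum matching: at each position, greedily take the
# longest dictionary word.  This toy dictionary deliberately lacks 下雨天,
# so the greedy result is the "simple" segmentation from the slide.
DICTIONARY = {"下雨", "天真", "真正", "討厭"}   # toy dictionary, illustration only
MAX_WORD_LEN = 3

def simple_maximum_matching(text):
    words, i = [], 0
    while i < len(text):
        for n in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in DICTIONARY:   # single char is the fallback
                words.append(candidate)
                i += n
                break
    return words

print(simple_maximum_matching("下雨天真正討厭"))   # ['下雨', '天真', '正', '討厭']

With a fuller dictionary and the chunk-scoring rules, the complex variant arrives at the slide's other segmentation, 下雨天 / 真正 / 討厭.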
10. Graphical Models
• Markov chain family
• Statistical Language Model (SLM)
• Hidden Markov Model (HMM)
• Exponential models
• Maximum Entropy (ME)
• Conditional Random Fields (CRF)
• Applications
• Probabilistic Context-Free Grammar (PCFG) Parser
• Head-driven Phrase Structure Grammar (HPSG) Parser
• Link Grammar Parser
13. The Italian Who Went to Malta
• One day ima gonna Malta to bigga hotel.
• Ina morning I go down to eat breakfast.
• I tella waitress I wanna two pissis toasts.
• She brings me only one piss.
• I tella her I want two piss. She say go to the toilet.
• I say, you no understand, I wanna piss onna my plate.
• She say you better no piss onna plate, you sonna ma bitch.
• I don’t even know the lady and she call me sonna ma bitch!
14. P(“I want to piss”) > P(“I want two pieces”)
For that Malta waitress, at least.
15. Do the Math
• Conditional probability:
• P(A|B) = P(A, B) / P(B)
• Bayes’ theorem:
• P(i|o) = P(o|i) P(i) / P(o)
• Information theory:
• Noisy channel model
• Î = argmaxi P(i|o) = argmaxi P(o|i) P(i)
• Language model: P(i)
[Diagram: I → Noisy channel p(o|i) → O → Decoder → Î]
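To make the decoding step concrete, here is a tiny sketch assuming a toy candidate set; the probabilities are invented for illustration, not taken from the slides. The decoder simply picks the i that maximizes P(o|i) · P(i).

# Noisy-channel decoding over a toy candidate set: choose the source
# sentence i that maximizes P(o|i) * P(i).  All probabilities are made up.
candidates = {
    # i                    (P(i) from the language model, P(o|i) from the channel)
    "I want two pieces":   (1e-7, 0.4),
    "I want to piss":      (1e-9, 0.5),
}

def decode(candidates):
    return max(candidates, key=lambda i: candidates[i][0] * candidates[i][1])

print(decode(candidates))   # "I want two pieces" wins: the language model P(i) dominates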
16. Shannon’s Game
• Predict next word by history
• P(wn | w1 … wn-1)
• Maximum Likelihood Estimation
• P(wn | w1 … wn-1) = C(w1 … wn) / C(w1 … wn-1)
• C(w1…wn) : Frequency of n-gram w1…wn
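A minimal sketch of the maximum likelihood estimate above, using a toy corpus invented for illustration (real counts would come from a large segmented corpus):

from collections import Counter

# Toy corpus; the n-gram probabilities come straight from these counts.
corpus = "i want two pieces i want to go i want two pieces".split()
bigram_counts  = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def p_mle(prev_word, word):
    # P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(p_mle("want", "two"))   # 2/3 with this toy corpus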
17. Once in a Blue Moon
• A cat has seen...
• 10 sparrows
• 4 barn swallows
• 1 Chinese Bulbul
• 1 Pacific Swallow
• How likely is it that the next bird is unseen?
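One common way to answer this question, assuming the intended tool is the Good-Turing estimate (the slide itself does not name it), is to take the fraction of the sample seen exactly once: P(next bird is of an unseen species) ≈ N1/N. The cat has seen N = 16 birds, of which N1 = 2 species (the Chinese Bulbul and the Pacific Swallow) were seen exactly once, so the estimate is 2/16 = 1/8.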
19. But I’ve seen a moon and I’m blue
• Simple linear interpolation
• PLi(wn|wn-2 , wn-1) = λ1P1(wn) + λ2P2(wn|wn-1) + λ3P3(wn|wn-1 , wn-2) (see the sketch below)
• 0 ≤ λi ≤ 1, Σi λi = 1
• Katz’s back-off
• Back-off through progressively shorter histories.
• Pbo(wi|wi-(n-1) … wi-1) =
• d · C(wi-(n-1) … wi) / C(wi-(n-1) … wi-1), if C(wi-(n-1) … wi) > k
• α(wi-(n-1) … wi-1) · Pbo(wi|wi-(n-2) … wi-1), otherwise
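A minimal sketch of the simple linear interpolation above, with fixed weights chosen only for illustration; in practice the λi are tuned on held-out data (e.g. with EM), and interpolation is often combined with a back-off scheme such as Katz’s.

# Simple linear interpolation of unigram, bigram, and trigram estimates.
def interpolate(p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    l1, l2, l3 = lambdas          # 0 <= lambda_i <= 1 and they sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Hypothetical MLE estimates for P(wn), P(wn|wn-1), and P(wn|wn-2, wn-1):
print(interpolate(p_uni=0.001, p_bi=0.05, p_tri=0.0))   # ≈ 0.0152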
20. Good Luck!
• Place a bet remotely on a horse race among 8 horses by sending encoded messages.
• Past bet distribution
• horse 1: 1/2
• horse 2: 1/4
• horse 3: 1/8
• horse 4: 1/16
• the rest (4 horses): 1/64 each
Photo: Foreversoul, http://flickr.com/photos/foreversouls/ (CC BY-NC-ND)
21. 3 bits? No, only 2!
0, 10, 110, 1110, 111100, 111101, 111110, 111111
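A worked check of this slide’s claim: with the biased distribution from the previous slide, the expected code length is (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/16)·4 + 4·(1/64)·6 = 0.5 + 0.5 + 0.375 + 0.25 + 0.375 = 2 bits, which is exactly the entropy of that distribution, while the uniform case needs log2(8) = 3 bits per bet.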
27. And My Suggestions
• Convenient API
• Plain text I/O (in UTF-8)
• More linguistic information
• Algorithm: CRF
• Corpus: we need YOU!
• Flexible to different applications
• Composite, Iterator, and Adapter Patterns
• IDL support
• SWIG
• Open Source
• Open Corpus, too
Maximum matching can also be “backward.”
Consider what happens if we try to diff and merge the forward and backward maximum matching results (see the sketch below)...
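A minimal sketch of that backward pass, mirroring the earlier forward sketch and using the same toy dictionary (both the dictionary and the function name are assumptions for illustration):

# Backward maximum matching: scan from the end of the string, greedily
# taking the longest dictionary word that ends at the current position.
def backward_maximum_matching(text, dictionary, max_word_len=3):
    words, j = [], len(text)
    while j > 0:
        for n in range(min(max_word_len, j), 0, -1):
            candidate = text[j - n:j]
            if n == 1 or candidate in dictionary:   # single char is the fallback
                words.append(candidate)
                j -= n
                break
    return list(reversed(words))

print(backward_maximum_matching("下雨天真正討厭", {"下雨", "天真", "真正", "討厭"}))
# ['下雨', '天', '真正', '討厭']: it disagrees with the forward result
# ['下雨', '天真', '正', '討厭'], and that disagreement flags the ambiguous span.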
Since we are not native speakers of English, this is also a problem for us.
Oh, we’ve got a problem again!
Shannon’s noisy channel modeled a real-world problem at Bell Labs.
It is concerned not only with the error rate of “decoding” but also with the efficiency of “encoding.”
This matches Zipf’s Law naturally.
Zipf’s Law, however, is EMPIRICAL, not theoretical.
On average, cross-entropy corresponds to the bit rate needed to encode messages over the noisy channel, and perplexity to the number of branches (candidates) at each step.
8 horses are equally likely: 000, 001, 010, 011, 100, 101, 110, 111
8 horses are biased: 0, 10, 110, 1110, 111100, 111101, 111110, 111111.
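As a worked illustration of the cross-entropy/perplexity note above: under the optimal code, the biased horse distribution has cross-entropy 2 bits, so its perplexity is 2^2 = 4; on average the choice is as hard as picking among 4 equally likely horses, versus 8 (2^3) in the uniform case.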