This document discusses natural language processing and text segmentation. It introduces ELUTE (Essential Libraries and Utilities of Text Engineering) and some of its Chinese language processing tools. It then discusses word segmentation algorithms like maximum matching, hidden Markov models, and conditional random fields. Finally, it talks about building language models and the importance of having a large corpus to train models on.
3. (lib)TaBE
• Traditional Chinese Word Segmentation
• with Big5 encoding
• Traditional Chinese Syllable-to-Word Conversion
• with Big5 encoding
• for the bo-po-mo-fo (Zhuyin) transcription system
9. Heuristic Rules*
• Maximum matching -- Simple vs. Complex: 下雨天真正討厭 ("rainy days are really annoying"; see the sketch after this slide)
• 下雨 天真 正 討厭 (rain / naive / just / annoying) vs. 下雨天 真正 討厭 (rainy day / really / annoying)
• Maximum average word length
• 國際化 ("internationalization")
• Minimum variance of word lengths
• 研究 生命 起源 ("study the origin of life")
• Maximum degree of morphemic freedom of single-character word
• 主要 是 因為 (mainly / is / because, "mainly because")
* Refer to MMSEG by C. H. Tsai: http://technology.chtsai.org/mmseg/
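A minimal sketch of the simple (greedy) variant, assuming a toy dictionary and function name invented for this example; MMSEG's complex variant instead scores candidate three-word chunks by the four rules listed above.

# Simple forward maximum matching: at each position, greedily take the
# longest dictionary word.  This toy dictionary deliberately lacks 下雨天,
# so the greedy result is the "simple" segmentation from the slide.
DICTIONARY = {"下雨", "天真", "真正", "討厭"}   # toy dictionary, illustration only
MAX_WORD_LEN = 3

def simple_maximum_matching(text):
    words, i = [], 0
    while i < len(text):
        for n in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in DICTIONARY:   # single char is the fallback
                words.append(candidate)
                i += n
                break
    return words

print(simple_maximum_matching("下雨天真正討厭"))   # ['下雨', '天真', '正', '討厭']

With a fuller dictionary and the chunk-scoring rules, the complex variant arrives at the slide's other segmentation, 下雨天 / 真正 / 討厭.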
10. Graphical Models
• Markov chain family
• Statistical Language Model (SLM)
• Hidden Markov Model (HMM)
• Exponential models
• Maximum Entropy (ME)
• Conditional Random Fields (CRF)
• Applications
• Probabilistic Context-Free Grammar (PCFG) Parser
• Head-driven Phrase Structure Grammar (HPSG) Parser
• Link Grammar Parser
13. The Italian Who Went to Malta
• One day ima gonna Malta to bigga hotel.
• Ina morning I go down to eat breakfast.
• I tella waitress I wanna two pissis toasts.
• She brings me only one piss.
• I tella her I want two piss. She say go to the toilet.
• I say, you no understand, I wanna piss onna my plate.
• She say you better no piss onna plate, you sonna ma bitch.
• I don’t even know the lady and she call me sonna ma bitch!
14. P(“I want to piss”) > P(“I want two pieces”)
For that Malta waitress, at least.
15. Do the Math
• Conditional probability:
• P(A|B) = P(A, B) / P(B)
• Bayes’ theorem:
• P(i|o) = P(o|i) P(i) / P(o)
• Information theory:
• Noisy channel model
• Î = argmaxi P(i|o) = argmaxi P(o|i) P(i)
• Language model: P(i)
[Diagram: I → Noisy channel p(o|i) → O → Decoder → Î]
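To make the decoding step concrete, here is a tiny sketch assuming a toy candidate set; the probabilities are invented for illustration, not taken from the slides. The decoder simply picks the i that maximizes P(o|i) · P(i).

# Noisy-channel decoding over a toy candidate set: choose the source
# sentence i that maximizes P(o|i) * P(i).  All probabilities are made up.
candidates = {
    # i                    (P(i) from the language model, P(o|i) from the channel)
    "I want two pieces":   (1e-7, 0.4),
    "I want to piss":      (1e-9, 0.5),
}

def decode(candidates):
    return max(candidates, key=lambda i: candidates[i][0] * candidates[i][1])

print(decode(candidates))   # "I want two pieces" wins: the language model P(i) dominates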
16. Shannon’s Game
• Predict next word by history
• P(wn | w1 … wn-1)
• Maximum Likelihood Estimation
• P(wn | w1 … wn-1) = C(w1 … wn) / C(w1 … wn-1)
• C(w1…wn) : Frequency of n-gram w1…wn
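A minimal sketch of the maximum likelihood estimate above, using a toy corpus invented for illustration (real counts would come from a large segmented corpus):

from collections import Counter

# Toy corpus; the n-gram probabilities come straight from these counts.
corpus = "i want two pieces i want to go i want two pieces".split()
bigram_counts  = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def p_mle(prev_word, word):
    # P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(p_mle("want", "two"))   # 2/3 with this toy corpus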
17. Once in a Blue Moon
• A cat has seen...
• 10 sparrows
• 4 barn swallows
• 1 Chinese Bulbul
• 1 Pacific Swallow
• How likely is it that the next bird is unseen?
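One common way to answer this question, assuming the intended tool is the Good-Turing estimate (the slide itself does not name it), is to take the fraction of the sample seen exactly once: P(next bird is of an unseen species) ≈ N1/N. The cat has seen N = 16 birds, of which N1 = 2 species (the Chinese Bulbul and the Pacific Swallow) were seen exactly once, so the estimate is 2/16 = 1/8.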
19. But I’ve seen a moon and I’m blue
• Simple linear interpolation
• PLi(wn|wn-2 , wn-1) = λ1P1(wn) + λ2P2(wn|wn-1) + λ3P3(wn|wn-1 , wn-2) (see the sketch below)
• 0 ≤ λi ≤ 1, Σi λi = 1
• Katz’s back-off
• Back-off through progressively shorter histories.
• Pbo(wi|wi-(n-1) … wi-1) =
• d · C(wi-(n-1) … wi) / C(wi-(n-1) … wi-1), if C(wi-(n-1) … wi) > k
• α(wi-(n-1) … wi-1) · Pbo(wi|wi-(n-2) … wi-1), otherwise
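A minimal sketch of the simple linear interpolation above, with fixed weights chosen only for illustration; in practice the λi are tuned on held-out data (e.g. with EM), and interpolation is often combined with a back-off scheme such as Katz’s.

# Simple linear interpolation of unigram, bigram, and trigram estimates.
def interpolate(p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    l1, l2, l3 = lambdas          # 0 <= lambda_i <= 1 and they sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Hypothetical MLE estimates for P(wn), P(wn|wn-1), and P(wn|wn-2, wn-1):
print(interpolate(p_uni=0.001, p_bi=0.05, p_tri=0.0))   # ≈ 0.0152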
20. Good Luck!
• Place a bet remotely on a horse race among 8 horses by sending encoded messages.
• Past bet distribution
• horse 1: 1/2
• horse 2: 1/4
• horse 3: 1/8
• horse 4: 1/16
• the rest (4 horses): 1/64 each
Photo: Foreversoul, http://flickr.com/photos/foreversouls/ (CC BY-NC-ND)
21. 3 bits? No, only 2!
0, 10, 110, 1110, 111100, 111101, 111110, 111111
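A worked check of this slide’s claim: with the biased distribution from the previous slide, the expected code length is (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/16)·4 + 4·(1/64)·6 = 0.5 + 0.5 + 0.375 + 0.25 + 0.375 = 2 bits, which is exactly the entropy of that distribution, while the uniform case needs log2(8) = 3 bits per bet.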
27. And My Suggestions
• Convenient API
• Plain text I/O (in UTF-8)
• More linguistic information
• Algorithm: CRF
• Corpus: we need YOU!
• Flexible to different applications
• Composite, Iterator, and Adapter Patterns
• IDL support
• SWIG
• Open Source
• Open Corpus, too
Maximum matching can also be “backward.”
Consider what happens if we try to diff and merge the forward and backward maximum matching results (see the sketch below)...
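A minimal sketch of that backward pass, mirroring the earlier forward sketch and using the same toy dictionary (both the dictionary and the function name are assumptions for illustration):

# Backward maximum matching: scan from the end of the string, greedily
# taking the longest dictionary word that ends at the current position.
def backward_maximum_matching(text, dictionary, max_word_len=3):
    words, j = [], len(text)
    while j > 0:
        for n in range(min(max_word_len, j), 0, -1):
            candidate = text[j - n:j]
            if n == 1 or candidate in dictionary:   # single char is the fallback
                words.append(candidate)
                j -= n
                break
    return list(reversed(words))

print(backward_maximum_matching("下雨天真正討厭", {"下雨", "天真", "真正", "討厭"}))
# ['下雨', '天', '真正', '討厭']: it disagrees with the forward result
# ['下雨', '天真', '正', '討厭'], and that disagreement flags the ambiguous span.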
Since we are not native speakers of English, this is also a problem for us.
Oh, we’ve got a problem again!
Shannon’s noisy channel modeled a real-world problem at Bell Labs.
It is concerned not only with the error rate of “decoding” but also with the efficiency of “encoding.”
This matches Zipf’s Law naturally.
Zipf’s Law, however, is EMPIRICAL, not theoretical.
On average, cross-entropy corresponds to the bit rate needed to encode messages over the noisy channel, and perplexity to the number of branches (candidates) at each step.
8 horses are equally likely: 000, 001, 010, 011, 100, 101, 110, 111
8 horses are biased: 0, 10, 110, 1110, 111100, 111101, 111110, 111111.
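As a worked illustration of the cross-entropy/perplexity note above: under the optimal code, the biased horse distribution has cross-entropy 2 bits, so its perplexity is 2^2 = 4; on average the choice is as hard as picking among 4 equally likely horses, versus 8 (2^3) in the uniform case.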