Language modeling plays a critical role in many
natural language processing (NLP) tasks such as text prediction,
machine translation and speech recognition. Traditional
statistical language models (e.g. n-gram models) can only offer
words that have been seen before and cannot capture long-range
word context. Neural language models provide a promising way to
overcome this shortcoming of statistical language models. This paper
investigates recurrent neural network (RNN) language models
for Vietnamese at the character and syllable levels. Experiments
were conducted on a large dataset of 24M syllables, constructed
from 1,500 movie subtitles. The experimental results show that
our RNN-based language models yield reasonable performance
on the movie subtitle dataset. Concretely, our models outperform
n-gram language models in terms of perplexity score.
A Vietnamese Language Model Based on Recurrent Neural Network
1. A Vietnamese Language Model Based on Recurrent Neural Network
Viet-Trung Tran, Kiem-Hieu Nguyen, Duc-Hanh Bui
Hanoi University of Science and Technology
Friday, October 7, 16
3. Statistical language model
A probability distribution of word sequence
E.g. “go to the airport”
? = P(“airport”|“go to the”)
Applications:
Spelling checkers, smart keyboards
Enhance speech recognition and machine translation
LABAN KEY
5. Common approach
N-gram language model
Katz's back-off: estimates the conditional probability of a word given its history in the n-gram
When a trigram is unavailable -> back off to the bigram -> unigram
Source: https://en.wikipedia.org/wiki/Katz%27s_back-off_model
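The back-off chain on this slide can be sketched in a few lines of Python. This is a simplified illustration, not Katz's full method: it uses plain relative frequencies at every level, whereas Katz's back-off also discounts the higher-order counts and rescales the lower-order estimate with a back-off weight. The toy corpus and the names `ngram_counts` and `backoff_prob` are assumptions made for the example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def backoff_prob(word, history, counts):
    """Estimate P(word | history), shortening the history until a match is found.

    Simplification: no discounting and no back-off weights, so the values
    are not guaranteed to sum to 1 across the vocabulary.
    """
    history = tuple(history)[-2:]          # a trigram model keeps two words of history
    while history:
        n = len(history) + 1
        seen = counts[n][history + (word,)]
        if seen > 0:
            return seen / counts[n - 1][history]
        history = history[1:]              # drop the oldest word and back off
    total = sum(counts[1].values())
    return counts[1][(word,)] / total      # unigram estimate

# Hypothetical toy corpus, tokenized on whitespace.
corpus = "go to the airport . go to the school . walk to the school".split()
counts = {n: ngram_counts(corpus, n) for n in (1, 2, 3)}

p_airport = backoff_prob("airport", ["go", "to", "the"], counts)  # trigram found
p_walk = backoff_prob("walk", ["go", "to", "the"], counts)        # falls back to unigram
```

Here "airport" is scored from the trigram "to the airport", while "walk" never follows "to the" or "the" in the corpus, so its estimate falls through to the unigram frequency.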
6. N-gram language model
Only see a few words back
Only predict words seen in the same context
7. Deep learning for NLP
Word embedding
(Socher et al., 2013a; Mikolov et al., 2013b)
8. Recurrent neural network for text
Input: go to the
Output: to the school
Probability(school | go to the)
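The input/output pairing above (the output is the input shifted by one word) can be illustrated with a minimal recurrent forward pass. The sketch below assumes a plain Elman-style cell with random, untrained weights, so the probabilities it prints are meaningless beyond showing the mechanics; the slides do not specify the cell type or sizes used by the authors, and `TinyRNN`, `hidden_size` and the five-word vocabulary are hypothetical.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

class TinyRNN:
    """Untrained Elman-style RNN; weights are random, for illustration only."""

    def __init__(self, vocab, hidden_size=8, seed=0):
        rng = random.Random(seed)
        self.vocab = vocab
        self.hidden_size = hidden_size

        def rand(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w_xh = rand(hidden_size, len(vocab))    # input -> hidden
        self.w_hh = rand(hidden_size, hidden_size)   # hidden -> hidden (recurrence)
        self.w_hy = rand(len(vocab), hidden_size)    # hidden -> output scores

    def next_word_probs(self, words):
        """Read the words one by one and return P(next word | all words so far)."""
        h = [0.0] * self.hidden_size
        for word in words:
            x = [1.0 if w == word else 0.0 for w in self.vocab]  # one-hot input
            pre = [a + b for a, b in zip(matvec(self.w_xh, x), matvec(self.w_hh, h))]
            h = [math.tanh(p) for p in pre]
        return dict(zip(self.vocab, softmax(matvec(self.w_hy, h))))

vocab = ["go", "to", "the", "school", "airport"]
inputs = ["go", "to", "the"]
targets = ["to", "the", "school"]   # the inputs shifted one step to the left
probs = TinyRNN(vocab).next_word_probs(inputs)
print(probs["school"])              # P(school | go to the) under random weights
```

Training would adjust the three weight matrices so that, at each step, the probability assigned to the corresponding entry of `targets` increases.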
9. RNN vs. N-gram
Variable-length word context vs. fixed n-gram context
Personalization through continuous learning
More meaningful text suggestions
Naturally supports phrase and term suggestions
10. RNN for Vietnamese language model
Character level language model
{previous characters} -> next characters
Syllable level language model
{previous syllables} -> next syllables
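The two granularities differ only in how the text is split into tokens. The sketch below shows one way to build {previous tokens} -> next token pairs at both levels; it assumes syllables are separated by whitespace, as in written Vietnamese, and the helper names and context sizes are hypothetical rather than taken from the authors' setup.

```python
def syllable_tokens(text):
    """Split on whitespace: each written Vietnamese syllable becomes one token."""
    return text.split()

def char_tokens(text):
    """Every character, including spaces and accented letters, is one token."""
    return list(text)

def training_pairs(tokens, context_size):
    """Build (previous tokens, next token) pairs with a bounded context window."""
    return [
        (tokens[max(0, i - context_size):i], tokens[i])
        for i in range(1, len(tokens))
    ]

text = "chào buổi sáng"
syllable_pairs = training_pairs(syllable_tokens(text), context_size=2)
char_pairs = training_pairs(char_tokens(text), context_size=4)

for context, target in syllable_pairs:
    print(context, "->", target)
```

The character-level model sees a much smaller vocabulary but needs many more steps to cover the same sentence, which is the usual trade-off between the two levels.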
16. Conclusion
First neural language model for Vietnamese
Largest experimental dataset
Future work
Word embedding
Neural net compression
Conversational neural machine translation
17. Thank you for your attention
18. Conversational
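Chú hoài linh đẹp trai. Chú hoài linh (Uncle Hoài Linh is handsome. Uncle Hoài Linh)
Chào buổi sáng (Good morning)
chị hát hay wa!! nghe thick a. (You sing so well!! I love listening.)
chị khởi my ơi e rất la hâm mộ (Khởi My, I really admire you)
chú hoài linh thật đẹp zai và chú Trấn thành đẹp qá (Uncle Hoài Linh is really handsome and Uncle Trấn Thành is so handsome)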
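19. lịch sử ghi nhớ năm 1979 (history remembers the year 1979)
tại hội nghị, đồng chí Phạm Ngọc Thủy Võ Văn Kiệt (at the conference, comrade Phạm Ngọc Thủy Võ Văn Kiệt)
tại hội nghị, đồng chí Hồ Chí Minh nói (at the conference, comrade Hồ Chí Minh said)
tại hội nghị, đồng chí Võ Nguyên Giáp và đồng chí Hồ Chí Minh đã ngồi ở (at the conference, comrade Võ Nguyên Giáp and comrade Hồ Chí Minh sat at)
tại đại hội Đảng lần thứ nhất vào năm 1945, (at the first Party congress in 1945,)
Ngay từ những ngày đầu, Đúng như nhận xét của Giáo sư Nguyễn Văn Linh (Right from the early days, Just as Professor Nguyễn Văn Linh observed)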