A Vietnamese Language Model Based on Recurrent Neural Network


Language modeling plays a critical role in many natural language processing (NLP) tasks such as text prediction, machine translation, and speech recognition. Traditional statistical language models (e.g. n-gram models) can only offer words that have been seen before and cannot capture long word contexts. Neural language models offer a promising way to overcome this shortcoming. This paper investigates Recurrent Neural Network (RNN) language models for Vietnamese at the character and syllable levels. Experiments were conducted on a large dataset of 24M syllables, constructed from 1,500 movie subtitles. The experimental results show that our RNN-based language models yield reasonable performance on the movie subtitle dataset. Concretely, our models outperform n-gram language models in terms of perplexity.
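Perplexity, the metric used for the comparison above, is the exponential of the average negative log-probability that a model assigns to each token of held-out text; lower is better. A minimal sketch, assuming the per-token probabilities are already available (the function name and the toy values are illustrative and not taken from the paper):

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence given the model probability of each token.

    token_probs: iterable of P(token_i | history_i), each in (0, 1].
    """
    probs = list(token_probs)
    if not probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token (natural log).
    avg_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_nll)

# Toy values only: a model that assigns 0.25 to every token is as
# uncertain as a uniform choice among four options.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Because the average is taken per token, perplexities of character-level and syllable-level models are not directly comparable unless they are normalized to the same unit.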


A Vietnamese Language Model Based on Recurrent Neural Network
Viet-Trung Tran, Kiem-Hieu Nguyen, Duc-Hanh Bui
Hanoi University of Science and Technology
Friday, October 7, 2016

Outline
Statistical language model
Current state of the art
RNN for Vietnamese language model
Experimental results
Conclusion

Statistical language model
A probability distribution over word sequences
E.g. “go to the airport”
P(“airport” | “go to the”) = ?
Applications:
Spelling checkers, smart keyboards
Enhancing speech recognition and machine translation
LABAN KEY
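The quantity on this slide, P(“airport” | “go to the”), can be estimated from data as a ratio of counts. A minimal sketch, assuming a tiny made-up corpus and whitespace tokenization (both are illustrative assumptions, not the authors' setup):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def conditional_prob(tokens, history, word):
    """Maximum-likelihood estimate of P(word | history) from raw counts."""
    history = tuple(history)
    full = ngram_counts(tokens, len(history) + 1)
    # Count the history only where some word follows it, so that the
    # estimates over all possible next words sum to 1.
    seen_history = sum(c for gram, c in full.items() if gram[:-1] == history)
    if seen_history == 0:
        return 0.0  # history never seen: the plain estimate is undefined
    return full[history + (word,)] / seen_history

# Toy corpus, invented for illustration.
corpus = "go to the airport . go to the school . go to the airport".split()
p_airport = conditional_prob(corpus, ["go", "to", "the"], "airport")
```

In this toy corpus "go to the" is followed by "airport" twice and by "school" once, so the estimate is 2/3. Any word that never follows the history gets probability 0, which is the weakness that smoothing and back-off address.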

Challenges
Meaningful
Grammatically correct
Understandable
Context-aware
E.g. “I am from Vietnam. My mother tongue is Vietnamese.”
Out-of-vocabulary words
Slang, abbreviations, etc.

Common approach
N-gram language model
Katz's back-off: estimates the conditional probability of a word given its history in the n-gram
When the trigram is unavailable -> back off to the bigram -> unigram
Source: https://en.wikipedia.org/wiki/Katz%27s_back-off_model
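The back-off chain on this slide (trigram, then bigram, then unigram) can be sketched as follows. This is a deliberately simplified variant and not Katz's method itself: it skips the discounting step and uses a fixed penalty `alpha` (an assumed value), so it returns scores rather than normalized probabilities. The function names and the toy corpus are hypothetical.

```python
from collections import Counter

def build_counts(tokens, max_n=3):
    """Map each order n (1..max_n) to a Counter of its n-grams."""
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in range(1, max_n + 1)
    }

def backoff_score(counts, history, word, alpha=0.4):
    """Score `word` after `history`, backing off to shorter histories.

    Simplified on purpose: no discounting and a fixed penalty `alpha`,
    so the results are scores, not probabilities.
    """
    max_n = max(counts)
    history = tuple(history)
    # Keep at most max_n - 1 words of history.
    history = history[max(0, len(history) - (max_n - 1)):]
    penalty = 1.0
    while history:
        n = len(history) + 1
        gram_count = counts[n][history + (word,)]
        if gram_count > 0:
            return penalty * gram_count / counts[n - 1][history]
        # N-gram unseen: drop the oldest history word and apply the penalty.
        history = history[1:]
        penalty *= alpha
    # Unigram level: relative frequency of the word.
    total = sum(counts[1].values())
    return penalty * counts[1][(word,)] / total if total else 0.0

# Toy corpus, invented for illustration.
corpus = "go to the airport . go to the school".split()
counts = build_counts(corpus)
seen = backoff_score(counts, ["to", "the"], "school")      # trigram was seen
unseen = backoff_score(counts, ["from", "the"], "school")  # falls back to the bigram
```

Katz's actual estimator additionally discounts the counts of seen n-grams and redistributes the freed probability mass to the lower-order model, which keeps the result a proper distribution.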

N-gram language model
Only sees a few words back
Only predicts words seen in the same context

Deep learning for NLP
Word embedding
(Socher et al., 2013a)
(Mikolov et al., 2013b)

Recurrent neural network for text
Input: go to the
Output: to the school
Probability: P(school | go to the)
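The input and output on this slide differ by a one-token shift: at each position the network is trained to predict the token that follows. A minimal sketch of building such a pair, assuming whitespace tokenization (the helper name is hypothetical):

```python
def make_training_pair(tokens):
    """Return (inputs, targets) where targets is inputs shifted left by one."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    return tokens[:-1], tokens[1:]

inputs, targets = make_training_pair("go to the school".split())
# inputs  == ['go', 'to', 'the']
# targets == ['to', 'the', 'school']
# At step t the network reads inputs[t] together with its hidden state,
# which summarizes inputs[:t], and is trained to give targets[t] a high
# probability.
```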

RNN vs. N-gram
Foldable word context vs. fixed n-gram context
Personalization through continuous learning
More meaningful text suggestions
Naturally supports phrase and term suggestions

RNN for Vietnamese language model
Character-level language model
{previous characters} -> next character
Syllable-level language model
{previous syllables} -> next syllable
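In written Vietnamese, syllables are separated by spaces, so the two granularities on this slide correspond to two ways of splitting the same text. A minimal sketch, assuming plain whitespace splitting with no punctuation handling or normalization (the authors' actual preprocessing is not described in the slides; all names are hypothetical):

```python
def char_tokens(text):
    """Character-level units; spaces are kept as tokens."""
    return list(text)

def syllable_tokens(text):
    """Syllable-level units, assuming syllables are whitespace-separated."""
    return text.split()

def next_token_examples(tokens):
    """Return (previous tokens, next token) pairs for every position."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

text = "chào buổi sáng"
chars = char_tokens(text)
syllables = syllable_tokens(text)
syllable_pairs = next_token_examples(syllables)
```

The character-level model has a small vocabulary but needs many steps per sentence; the syllable-level model sees shorter sequences at the cost of a much larger vocabulary.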

LSTM cell
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
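For reference, the standard LSTM cell equations, in the notation of the cited post, are given below. This is the common formulation; the slides do not say which variant the authors used. Here $x_t$ is the input at step $t$, $h_t$ the hidden state, $C_t$ the cell state, $\sigma$ the logistic sigmoid, and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
```

The forget gate $f_t$ and input gate $i_t$ decide how much of the old cell state to keep and how much of the candidate $\tilde{C}_t$ to add, which is what lets the cell carry information across many steps.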

Stacking multiple layers

Experiments
1,500 movies - 2,056,308 sentences

Experimental results

Conclusion
First neural language model for Vietnamese
Largest experimental dataset
Future work
Word embedding
Neural net compression
Conversational neural machine translation

Thank you for your attention

Conversational
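Chú hoài linh đẹp trai. Chú hoài linh (Uncle Hoài Linh is handsome. Uncle Hoài Linh)
Chào buổi sáng (Good morning)
chị hát hay wa!! nghe thick a. (you sing so well!! I love listening to it.)
chị khởi my ơi e rất la hâm mộ (Khởi My, I am a really big fan)
chú hoài linh thật đẹp zai và chú Trấn thành đẹp qá (Uncle Hoài Linh is really handsome and Uncle Trấn Thành is so handsome)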

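lịch sử ghi nhớ năm 1979 (history remembers the year 1979)
tại hội nghị, đồng chí Phạm Ngọc Thủy Võ Văn Kiệt (at the conference, comrade Phạm Ngọc Thủy Võ Văn Kiệt)
tại hội nghị, đồng chí Hồ Chí Minh nói (at the conference, comrade Hồ Chí Minh said)
tại hội nghị, đồng chí Võ Nguyên Giáp và đồng chí Hồ Chí Minh đã ngồi ở (at the conference, comrade Võ Nguyên Giáp and comrade Hồ Chí Minh sat at)
tại đại hội Đảng lần thứ nhất vào năm 1945, (at the first Party congress in 1945,)
Ngay từ những ngày đầu, Đúng như nhận xét của Giáo sư Nguyễn Văn Linh (Right from the early days, Just as Professor Nguyễn Văn Linh remarked)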

