Khmer ASR system
Sethserey SAM
sam.sethserey@itc.edu.kh
Part I: ASR in general
o    Definition
o    Types of ASR
o    ASR flow chart
o    Data requirements
o    Performance of ASR systems
o    Fundamental methods for creating an ASR system

What is an ASR system?
o  ASR: automatic speech recognition
o  An ASR system is a tool that converts an audio stream
   containing speech into text.

[Figure: speech input -> ASR system -> text output; candidate outputs shown: "Seven", "Seven days", "Zaven", ...]

ASR: what is it for?
o  ASR systems can improve daily life (work,
   business, communication, etc.)

Typology of ASR systems
o  Speaker-dependent vs. speaker-independent
o  Language constraints:
   -  isolated word recognition
   -  connected word recognition
   -  keyword spotting
   -  continuous speech recognition
o  Vocabulary size: small (100 words), medium (5,000 words), large (50,000 words)
o  Robustness constraints:
   -  laboratory (office) conditions: imposed
   -  microphone, channel noise …

Levels of complexity
[Figure not preserved in this transcript]

ASR flow chart
[Figure: spoken input "seven" -> signal processing (digitizing and feature extraction) -> decoding/searching -> text hypotheses ("Seven", "Seven days", "Zaven", ...); the two stages together form the ASR system]

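The signal-processing stage of the flow chart can be illustrated with a small sketch. The slides do not describe the front end, so the 16 kHz sampling rate, 25 ms frames, 10 ms hop and the log-energy feature below are assumptions chosen for illustration only; practical systems usually compute richer features per frame.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping fixed-length frames."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def log_energy(frame, floor=1e-10):
    """Log of the frame energy; the floor avoids log(0) on silent frames."""
    energy = sum(x * x for x in frame)
    return math.log(max(energy, floor))

# Assumed settings (not from the slides): 16 kHz audio, 25 ms frames, 10 ms hop.
SAMPLE_RATE = 16000
FRAME_LEN = SAMPLE_RATE * 25 // 1000  # 400 samples
HOP = SAMPLE_RATE * 10 // 1000        # 160 samples

# One second of a synthetic 440 Hz tone stands in for recorded speech.
signal = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
features = [log_energy(f) for f in frame_signal(signal, FRAME_LEN, HOP)]
print(len(features), "frames")
```

Each frame yields one feature value here; the decoder in the second stage would consume such a sequence of per-frame features.
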
ASR data requirements
o  Training the acoustic model (AM) and the language model (LM)
   requires a large amount of data (text and audio).

[Figure: three resources: audio + transcription data, pronunciation dictionary, text data]

ASR performance
o    English ASR system evaluations at the National Institute of
     Standards and Technology (NIST)

[Figure not preserved in this transcript]

Causes of ASR errors
[Figure: example utterance "seven"]

o  Current ASR systems for continuous speech cannot
   reach a word error rate (WER) of 0%. Why?
   -  The acoustic model is affected by speaker characteristics and
      the environment: gender, age, emotion, pitch, accent,
      physical state, channel noise, etc.
   -  The lexical model is affected by incorrect word
      pronunciations.
   -  The language model is affected by incorrect word usage
      and grammar mistakes.

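Since WER is the metric used throughout these slides, a minimal sketch of how it is usually computed may help: the word-level edit distance (substitutions, deletions, insertions) between the reference and the recognizer output, divided by the number of reference words. The function name and the whitespace tokenization are assumptions for illustration; for Khmer, the text would first need to be segmented into words.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("seven days", "zaven days"))  # one substitution in two words
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.
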
Three fundamental methods for
creating a new ASR system

o  Enough training data -> bootstrapping
o  Small amount of data -> adaptation
o  No data -> cross-language transfer

Part II:
Khmer language & its processing
o  Khmer language
o  Why research on Khmer ASR?




Khmer Language
o  Official language of Cambodia
o  Spoken by more than 15 million people
o  An atonal language
o  Writing system:
   -  33 consonants, 23 dependent vowels
   -  14 independent vowels, 13 diacritics and various signs
   -  No explicit word boundaries

Why research on Khmer ASR?
o  An under-resourced language
   -  Lack of text and speech data in digital form
   -  Lack of linguistic documents (both soft and hard copies)
o  No explicit word segmentation
   -  Automatic word segmentation is needed
   -  State-of-the-art segmentation methods use
      *  hand-crafted lexicons, word frequencies,
      *  optimization criteria …
o  Other under-resourced, unsegmented languages in the region:
   Burmese, Lao, Thai, Vietnamese

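To make the combination of a lexicon, word frequencies and an optimization criterion concrete, here is a minimal sketch of one common baseline: choose the split into lexicon words that maximizes the sum of unigram log-probabilities, found by dynamic programming. This is not the method used in the project; the toy lexicon, its counts and the Latin-script input standing in for unsegmented Khmer text are all invented for illustration.

```python
import math

def segment(text, word_counts):
    """Return the lexicon-word split of `text` with the highest total log-probability,
    or None if the lexicon cannot cover the text."""
    total = sum(word_counts.values())
    max_word_len = max(len(w) for w in word_counts)
    # best[i] = (score, start index of the last word) for the best split of text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in word_counts and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_counts[word] / total)
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[len(text)][0] == -math.inf:
        return None
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Hypothetical lexicon with made-up frequency counts.
toy_counts = {"seven": 50, "days": 30, "day": 40, "s": 1, "se": 1, "ven": 1}
print(segment("sevendays", toy_counts))
```

With these counts the frequent words win over splits that use rare fragments. A text containing a character sequence outside the lexicon returns None, which is where a real system would need an unknown-word strategy.
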
Part III:
    Khmer ASR at a glance
o  Corpus
   -  Speech corpus setup
   -  Text corpus setup
   -  General overview
o  Current ASR system
o  Future work

Corpus: Speech corpus setup
o  Two types of corpus:
   -  Small transcribed corpus (2007-2008)
      *  Transcribed manually by engineering students at ITC
      *  Only 6 hours of transcribed signal
      *  Nature: radio broadcasts (poor quality) downloaded from
         Radio Australia, Radio Free Asia and Voice of America

   -  Large transcribed corpus (2011)
      *  Text and the corresponding speech were already available
      *  Students helped verify the transcriptions
      *  21 hours of transcribed signal
      *  Nature: read speech from newspapers

Corpus: Text corpus setup
o  Retrieving text from the Web is becoming a common approach
o  Well-selected, content-rich websites vs. crawling the Web
o  Adapting ClipsTextTk, an open-source tool for corpus creation, to the
   Khmer language:
   -  Conversion from legacy character encodings to Unicode
   -  Automatic segmentation
   -  Conversion of special signs and numbers to text
   -  Normalization of word spelling
o  Text corpus obtained from 5 sites:
   -  2,5000 HTML pages retrieved
   -  After processing: 0.5 M sentences, 15 M words
   -  Duration: November 2007 to January 2008

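As an illustration of the number-conversion step listed above, the sketch below shows only a possible first stage: mapping Khmer digits (Unicode U+17E0 to U+17E9) to ASCII digits so that a later stage, not shown, could spell the number out in words. It is an assumption about how such a step might look, not the behavior of ClipsTextTk, and the function name is hypothetical.

```python
# Khmer digits occupy the Unicode range U+17E0 (zero) to U+17E9 (nine).
KHMER_TO_ASCII_DIGITS = {0x17E0 + d: str(d) for d in range(10)}

def normalize_digits(text):
    """Replace Khmer digits with ASCII digits, leaving all other characters untouched."""
    return text.translate(KHMER_TO_ASCII_DIGITS)

# "\u17e2\u17e0\u17e0\u17e7" is the year 2007 written with Khmer digits.
print(normalize_digits("\u17e2\u17e0\u17e0\u17e7"))
```

Keeping the mapping in a single translation table makes the step cheap to apply to every line of a multi-million-word corpus.
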
Corpus overview
o  Description of the Khmer ASR corpus

 Type                       Small corpus                   Large corpus
 Signal                     ~6 h of transcribed            ~20 h of transcribed
 (acoustic model)           signal (radio)                 signal (read speech)
 Text                       0.5 million sentences,         To be improved
 (language model)           ~15.5 million words
 Pronunciation dictionary   ~20,000 words                  To be improved
 (lexical model)

Current ASR system

                                             Word error rate (%)
 Continuous ASR   Training and          Context-dependent   Context-dependent
 system           testing corpus        (8 Gaussians)       (16 Gaussians)
 Khmer ASR v1     - LM: 15.5M words     42.5                40.3
                  - Training AM: 5 h
                  - Testing: 172 p
 Khmer ASR v2     - LM: 15M words       36.4                35
                  - Training AM: 20 h
                  - Testing: 290 p

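The gain from v1 to v2 is easier to judge as a relative WER reduction. The short sketch below computes it from the figures in the table; the helper name is hypothetical. Because the two versions were scored on different test sets (172 p and 290 p), the result should be read as indicative rather than as a controlled comparison.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative reduction in percent: the share of the baseline WER that was removed."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# WER values (%) copied from the results table: (Khmer ASR v1, Khmer ASR v2).
results = {
    "8 Gaussians": (42.5, 36.4),
    "16 Gaussians": (40.3, 35.0),
}
for setup, (v1_wer, v2_wer) in results.items():
    reduction = relative_wer_reduction(v1_wer, v2_wer)
    print(f"{setup}: {reduction:.1f}% relative WER reduction")
```

Both configurations show a relative reduction in the range of 13% to 15%, alongside the increase in acoustic training data from 5 h to 20 h.
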
Future Work
o  Collect more text data for the language
   model
o  Next challenge: how to improve Khmer ASR
   for speaker-independent recognition
   and across different environments?

THANK YOU!!




