SlideShare une entreprise Scribd logo
1  sur  8
Télécharger pour lire hors ligne
Journal of Chinese Language and Computing 15 (4): (203-210)




    The Research and Development of Computer Aided
     Contemporary Uighur Language Tagging System
                   Yusup Abaidula,Rezwangul, Abdiryim Sali
             Department of Computer Science, School of Physi-Math Information
               Xinjiang Normal University, Urumqi, Xinjiang, China 830054
                                 yusup2002@sohu.com

______________________________________________________________________

Abstract

The research on the contemporary Uyghur information processing in the 20th century can be
dated back to the beginning of 1980‘s. There are a lot of achievements up to now ,for
example, fundamental theory and basic utility are established, various corpus are built, code
standard is designed, Uyghur grammatical attribution research is carried out。Because the
Uyghur language has its own nature,the Uyghur information processing has its own
characteristics and methods 。 This paper briefly states the research work on Uyghur
morphological analyzer and the computer-aided contemporary Uyghur corpus processing
system1。

Keywords

Uighur language, Text processing, Tagging system, Corpus
______________________________________________________________________


1. Preface

Contemporary Uyghur information processing includes the input, output, recognition, and
understanding of this language. In the course of analyzing the Uyghur characters, words,
sentences and paragraphs, computer is used to process the Uyghur phonetics, morphology,
semantics etc. Uyghur language has its own character set, so they have its own nature and
needs different processing methods. From the 1980’s to now, a lot of work on the character
level process of the language has been done. Now, more researches are focus on the
processing of words and sentences. We also had some progress on this direction.


1
    This research is sponsored by the programs No. 60163002 and No. 60463005, National
Natural Science Foundation of China (NSFC), and the Startup Fund of the Special Training
Program for the Minority Talents by the Science Department of Xinjiang Uyghur
Autonomous Region.
204                                             Yusup Abaidula,Rezwangul, Abdiryim Sali




2. The subsystem for Uighur file format conversion

The first work of this research is to prepare the electronic text data for contemporary Uyghur
tagging corpus. This work is done with the support from some publishers, such as “Xinjiang
Daily”, Xinjiang Science and Technology publisher, Xinjiang Education Publisher, Xinjiang
People’s Publisher. They provided us compiled electronic Uyghur language materials. Since
these text materials are only suitable for DOS system, not for Windows. So we have to
convert the format of the text. Our system has a module to for this conversion. By using this
system, we carried out the format transformation of the Uyghur materials, and worked out a
corpus of more than 8 million words.

These materials cover almost all the words in the text book of the elementary, junior, high,
technical secondary schools and university as well as words used in the fields of law,
literature, health, technology (especially the agriculture technology), history, and economics.
This contemporary Uyghur language corpus is the basis for our further information
processing. The information processing not only needs to summarize and define surface
structure rules but also deep structure analysis. Next step, we are going to do statistics and
processing for 4 million phrases. We divided the corpus into four parts, which are religious
20%, popular science 20%, political commentary 20%, and literature 40%.


3. Uighur word Tagging Standard

The most important work in setting up Uyghur tagging corpus is to create Uyghur tag set and
standard. In word tagging, we started with a small tagging set. From the viewpoint of
computer linguistics, and by reference to Contemporary Uyghur grammar, we added some
tagging symbols. Finally we have 24 tagging types.

The tagging symbols of noun are up to 31 ,for example:
      •   Personal name      Nk
      •   Place name         Ny

      •   Organ name         Nt
      •   Time name          Nw

According to the semantic and grammatical function, verbs are classified as personal,
non-personal and auxiliary. According to the tense of the verb, it is divided into more than 80
types, for example:
     • Verb                   V
      •   Personal verb        V1
      •   Imperative verb      V101

In Uyghur language, non-personal verb differs from personal verb in their grammatical
function. The grammatical function of the non-personal verb is similar to the nouns, the
adjectives and the adverbs. If the grammatical function is similar to the nouns, it is called
gerund. If the grammatical function is similar to the adjectives, it is called verbal. If the
Computer-Aided Contemporary Uighur Language Tagging System                                  205




grammatical function is similar to the adverbs, it is called adverbial. According to their basic
function, gerund is tagged by Vn, verbal is tagged by Va, adverbial is tagged by Vd.
    • Non-personal verb          V2
    •    Verbal                  V21
    •    Perfect participle      V211

In Uyghur information processing, we created a small tagging set [9] and used it in the
tagging of the Uyghur words.


4. Uighur Electronic Information Dictionaries

Our Uyghur Electronic Information Dictionary includes word-root dictionary, phrase
dictionary and suffix dictionary. The vocabulary in the word-root dictionary phrase
dictionary has about 60,000 items. Word dictionary includes 25,000 words. Every word at
present has 16 attributes. In the word-root dictionary, the word-root structure and word-root
format is as in Table 1.




                          Table 1. Example of word-root dictionary
206                                             Yusup Abaidula,Rezwangul, Abdiryim Sali




                          Table 2 Example of the suffix dictionary

The structure of the machine electronic phrase dictionary and the format of the phrase is
similar to those of word-root dictionary. Electronic suffix dictionary at present has 130,000
suffixes. Every suffix has 16 attributes. The electronic suffix dictionary and the format of
suffix is as in Table 2:


5. The subsystem of the Uighur corpus management

The Uyghur corpus is not only a collection of text. It also contains a lot of original and
structured texts as part of the corpus. Some structure information is saved in the corpus, for
example, chapter name, paragraph name and its location. The function of this information is
not only for searching text according to certain standard, but also a part of the corpus.


6. Uighur word tagging and segmentation rules

The words segmentation rules we used follow Contemporary Uyghur theography
dictionary .At this point, the purpose of the Contemporary Uyghur word segmentation is to
do word tagging. We produced a contemporary Uyghur electronic dictionary. This dictionary
includes word-root dictionary and word -suffix dictionary, and is being used as a reference
for segmenting word-root and suffix.

The segmented units are basically word-roots, word-affixes and word- suffixes. They are the
basic units that are used in the information processing, and it has certain semantic and
grammatical functions.

According to the statistics on the corpus, we found out the rules for forming words. IN the
following examples: U stands for sentence word, A stands for word-root, B stands for word-
affix, C stands for word- suffix . The rules of words forming are as in Figure 1.
Computer-Aided Contemporary Uighur Language Tagging System                                    207




                            Figure 1. The rules for forming words


On the basis of above-mentioned concepts, we have the following four rules for sentence
elements formation, they are:




                           Figure 2. The rules for forming sentence.


7. The sub system for the character statistics

Character frequency statistics subsystem is composed of three modules: text reading module,
statistic computing module, and statistic data management module. The language of our
program is Borland C++ Builder, the background database is Microsoft access 2000.When
we analyzed the statistic results, we found that statistics effectively reflected the frequency of
Uyghur letters in the Uyghur text. Through statistics of 4 million words, we obtained the
frequency of the contemporary Uyghur characters. The frequency of contemporary Uyghur
character is as shown in Figure 3.
208                                             Yusup Abaidula,Rezwangul, Abdiryim Sali




                  Figure 3. Frequency of contemporary Uighur characters

We can see, the most frequent one appeared 2842094 times, which is about 17% of all the
occurrences. The second one appeared 1547472 times, which 9% of the whole occureances.
The third one is about 7%. The fourth one about 6%.The least used one appeared 2223 times,
which is 0.0129%. The second least appeared 11079 times, or 0.064%. The third least used
one appeared 84583, or 0.49%.


8. The subsystem of the syllable statistics

Syllable frequency statistic subsystem is composed of three modules: text reading module,
syllable statistic compute module, and syllable statistic data management module. The
program language we used is Borland C++ Builder, the background database is Microsoft
access2000. Trough statistics of the 4 million words’ real corpus , we found 7000 syllables in
Uyghur. By analyzing the statistic results, we found the frequency of the 7000 syllables in
Uyghur. Among them, 6065 syllables appeared more than one time. 3730 syllables appeared
over three times. 1682 syllables appeared 100 times. 138 syllables appeared 1000 times, and
9 syllables appeared over 110000 times. The 9 most frequent syllables and their frequencies
are as shown in the Figure 4..




                   Figure 4. Frequency of contemporary Uighur syllables


9. The subsystem of the word statistics

In this real corpus of 4 million words, we found 240,000 unique words. At the same time,
among this real corpus, we found that the words that appeared more than two times occupied
44%. The rest appeared only once.
Computer-Aided Contemporary Uighur Language Tagging System                              209




The original electronic dictionary has about 120,000 word-roots, but in the real corpus of 4
million words, 140,000 word-root appeared. Only about 20%of the 10,000 suffixes that we
had appeared two or more times. The rest in this corpus appeared only once or did not appear
at all. This explains that only 20% suffixes are commonly used in the written language. The
rest are used in spoken language.


10. The sub system of Uighur tagging

When building the Uyghur corpus, we also developed an automatic word tagging system
based on the Uyghur information electronic dictionary, and the Uyghur suffix structure. In
addition, to improve the correctness of the electronic word tagging, we created a tool to
manually correct the corpus. Through these methods, we can guarantee the correctness and
consistance of tagged results. The function of these auxiliary tools includes tagging,
restricting the dictionaries and online checking of the compatible words (antonyms or
synonymous). It also makes corrections using some rules. Nowadays, the subsystem of the
Uyghur tagging is under testing.




11. Conculusion

Through nearly 20 years of development, Xinjiang minority nationality information
processing work obtained some promising results. However, compared to other languages , it
still needs a lot of work, especially on the basic theory studies and basic application
technique development. As far as Uyghur grammar and semantic research of information
processing are concerned, it is still in the primary stage. It still does not have its own
theoretical system. Also, this is a very big project that needs too much investment. This is
also the main obstacle in the other minority nationalities information processing research.
Apart from these, Uyghur information processing also has a lot of work to be done. For
example: Uyghur grammar information dictionary, reference database for Uyghur phonetics,
linguistics studies of Uyghur corpus, Uyghur grammar attribution studies of information
processing, Uyghur semantic studies of information processing, Uyghur machine translation
etc. I hope the research of the Uyghur language information processing will get more and
more support from researchers and officials.




References

[1] Yusup Aibaidulla and Kim-Teng Lua, The development of Tagged Uyghur Corpus,
    Proceedings of PACLIC17, 1-3 October 2003, Sentosa, Singapore, P228-234.
[2] Yusup Aibaidulla etc. Contemporary Uighur Corpus Managing, China Artificial
    Intelligent Developing 2003, Beijing Post University Press, P1007-1010.
[3] Yusup Aibaidulla, Semantic Problem Solution Research of Uighur Sentence Grammar
    Analyzer, Computer Application and Software, Vol 4. 2002, P59-62.
210                                        Yusup Abaidula,Rezwangul, Abdiryim Sali




[4] Yusup Ebeydulla, Progress In System Design of Contemporary Uyghur Corpus. Journal
    of Chinese Language and Computing, Volume 13, 2003, P283—301.
[5] Yusup Ebeydulla, An Uyghur Syntax Parser. Journal of Chinese Language and
    Computing. Volume 13, 2003. P273-282.
[6] Yusup Ebeydulla , Askhar, The transform Subsystem of Uighur File Style of the
    Contemporary Uighur Corpus System, Journal of Xinjiang Normal University(Natural
    Science Version)vol. 1, 2003, P16-19.
[7] Yusup Abaidula,Abdiryim Sali, Discussion on the Corpus Linguistics, Journal of
    Xinjiang Normal University(Social Science Version), Vol. 4, 2003, P76-79.
[8] Yusup Abaydul Research on System of Contemporary Uyghur Word Frequency Statistics
    and High Frequency Words, Procceedings of the International Conference on Chinese
    Computing 2005, 21-23 March 2005 Singapore.
[9] Xinjiang Normal University Information processing with contemporary Uyghur words
    tagging sets standard version 1.0.

Contenu connexe

Tendances

CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...kevig
 
A Computational Model of Yoruba Morphology Lexical Analyzer
A Computational Model of Yoruba Morphology Lexical AnalyzerA Computational Model of Yoruba Morphology Lexical Analyzer
A Computational Model of Yoruba Morphology Lexical AnalyzerWaqas Tariq
 
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...IJCI JOURNAL
 
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...ijnlc
 
Development of text to speech system for yoruba language
Development of text to speech system for yoruba languageDevelopment of text to speech system for yoruba language
Development of text to speech system for yoruba languageAlexander Decker
 
Design of A Spell Corrector For Hausa Language
Design of A Spell Corrector For Hausa LanguageDesign of A Spell Corrector For Hausa Language
Design of A Spell Corrector For Hausa LanguageWaqas Tariq
 
Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Daniel Adenew
 
Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar named entity corpus and its use in syllable-based neural named entity...Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar named entity corpus and its use in syllable-based neural named entity...IJECEIAES
 
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATANAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATAijnlc
 
Natural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and DifficultiesNatural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and Difficultiesijtsrd
 
Optical character recognition for Ge'ez characters
Optical character recognition for Ge'ez charactersOptical character recognition for Ge'ez characters
Optical character recognition for Ge'ez charactershadmac
 
Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...
Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...
Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...iosrjce
 
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...Syeful Islam
 
Nonparametric Bayesian Word Discovery for Symbol Emergence in Robotics
Nonparametric Bayesian Word Discovery for Symbol Emergence in RoboticsNonparametric Bayesian Word Discovery for Symbol Emergence in Robotics
Nonparametric Bayesian Word Discovery for Symbol Emergence in RoboticsTadahiro Taniguchi
 
Symbol Emergence in Robotics: Language Acquisition via Real-world Sensorimoto...
Symbol Emergence in Robotics: Language Acquisition via Real-world Sensorimoto...Symbol Emergence in Robotics: Language Acquisition via Real-world Sensorimoto...
Symbol Emergence in Robotics: Language Acquisition via Real-world Sensorimoto...Tadahiro Taniguchi
 

Tendances (16)

CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
 
A Computational Model of Yoruba Morphology Lexical Analyzer
A Computational Model of Yoruba Morphology Lexical AnalyzerA Computational Model of Yoruba Morphology Lexical Analyzer
A Computational Model of Yoruba Morphology Lexical Analyzer
 
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
A SURVEY OF LANGUAGE-DETECTION, FONTDETECTION AND FONT-CONVERSION SYSTEMS FOR...
 
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
 
Development of text to speech system for yoruba language
Development of text to speech system for yoruba languageDevelopment of text to speech system for yoruba language
Development of text to speech system for yoruba language
 
Design of A Spell Corrector For Hausa Language
Design of A Spell Corrector For Hausa LanguageDesign of A Spell Corrector For Hausa Language
Design of A Spell Corrector For Hausa Language
 
Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...
 
Cf32516518
Cf32516518Cf32516518
Cf32516518
 
Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar named entity corpus and its use in syllable-based neural named entity...Myanmar named entity corpus and its use in syllable-based neural named entity...
Myanmar named entity corpus and its use in syllable-based neural named entity...
 
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATANAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
 
Natural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and DifficultiesNatural Language Processing Theory, Applications and Difficulties
Natural Language Processing Theory, Applications and Difficulties
 
Optical character recognition for Ge'ez characters
Optical character recognition for Ge'ez charactersOptical character recognition for Ge'ez characters
Optical character recognition for Ge'ez characters
 
Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...
Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...
Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...
 
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
 
Nonparametric Bayesian Word Discovery for Symbol Emergence in Robotics
Nonparametric Bayesian Word Discovery for Symbol Emergence in RoboticsNonparametric Bayesian Word Discovery for Symbol Emergence in Robotics
Nonparametric Bayesian Word Discovery for Symbol Emergence in Robotics
 
Symbol Emergence in Robotics: Language Acquisition via Real-world Sensorimoto...
Symbol Emergence in Robotics: Language Acquisition via Real-world Sensorimoto...Symbol Emergence in Robotics: Language Acquisition via Real-world Sensorimoto...
Symbol Emergence in Robotics: Language Acquisition via Real-world Sensorimoto...
 

En vedette

History of Cipher System
History of Cipher SystemHistory of Cipher System
History of Cipher SystemAsad Ali
 
Cryptography and Network Security
Cryptography and Network SecurityCryptography and Network Security
Cryptography and Network SecurityPa Van Tanku
 
SCRABBLE - MAPEH 8 (Physical Education 3rd Quarter)
SCRABBLE - MAPEH 8 (Physical Education 3rd Quarter)SCRABBLE - MAPEH 8 (Physical Education 3rd Quarter)
SCRABBLE - MAPEH 8 (Physical Education 3rd Quarter)Carlo Luna
 
Present Simple And Adverbs Of Frequency
Present Simple And Adverbs Of FrequencyPresent Simple And Adverbs Of Frequency
Present Simple And Adverbs Of FrequencyCalisto y Melibea
 

En vedette (6)

History of Cipher System
History of Cipher SystemHistory of Cipher System
History of Cipher System
 
Ch02...1
Ch02...1Ch02...1
Ch02...1
 
Cryptography and Network Security
Cryptography and Network SecurityCryptography and Network Security
Cryptography and Network Security
 
Cryptography and Network Security William Stallings Lawrie Brown
Cryptography and Network Security William Stallings Lawrie BrownCryptography and Network Security William Stallings Lawrie Brown
Cryptography and Network Security William Stallings Lawrie Brown
 
SCRABBLE - MAPEH 8 (Physical Education 3rd Quarter)
SCRABBLE - MAPEH 8 (Physical Education 3rd Quarter)SCRABBLE - MAPEH 8 (Physical Education 3rd Quarter)
SCRABBLE - MAPEH 8 (Physical Education 3rd Quarter)
 
Present Simple And Adverbs Of Frequency
Present Simple And Adverbs Of FrequencyPresent Simple And Adverbs Of Frequency
Present Simple And Adverbs Of Frequency
 

Similaire à JCLC_V15_N4_3

Outlining Bangla Word Dictionary for Universal Networking Language
Outlining Bangla Word Dictionary for Universal Networking  LanguageOutlining Bangla Word Dictionary for Universal Networking  Language
Outlining Bangla Word Dictionary for Universal Networking LanguageIOSR Journals
 
An Unsupervised Approach to Develop Stemmer
An Unsupervised Approach to Develop StemmerAn Unsupervised Approach to Develop Stemmer
An Unsupervised Approach to Develop Stemmerkevig
 
Dictionary Entries for Bangla Consonant Ended Roots in Universal Networking L...
Dictionary Entries for Bangla Consonant Ended Roots in Universal Networking L...Dictionary Entries for Bangla Consonant Ended Roots in Universal Networking L...
Dictionary Entries for Bangla Consonant Ended Roots in Universal Networking L...Waqas Tariq
 
Design Analysis Rules to Identify Proper Noun from Bengali Sentence for Univ...
Design Analysis Rules to Identify Proper Noun  from Bengali Sentence for Univ...Design Analysis Rules to Identify Proper Noun  from Bengali Sentence for Univ...
Design Analysis Rules to Identify Proper Noun from Bengali Sentence for Univ...Syeful Islam
 
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorWaqas Tariq
 
American Standard Sign Language Representation Using Speech Recognition
American Standard Sign Language Representation Using Speech RecognitionAmerican Standard Sign Language Representation Using Speech Recognition
American Standard Sign Language Representation Using Speech Recognitionpaperpublications3
 
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET Journal
 
Car-Following Parameters by Means of Cellular Automata in the Case of Evacuation
Car-Following Parameters by Means of Cellular Automata in the Case of EvacuationCar-Following Parameters by Means of Cellular Automata in the Case of Evacuation
Car-Following Parameters by Means of Cellular Automata in the Case of EvacuationCSCJournals
 
Diacritic Oriented Arabic Information Retrieval System
Diacritic Oriented Arabic Information Retrieval SystemDiacritic Oriented Arabic Information Retrieval System
Diacritic Oriented Arabic Information Retrieval SystemCSCJournals
 
A New Approach: Automatically Identify Proper Noun from Bengali Sentence for ...
A New Approach: Automatically Identify Proper Noun from Bengali Sentence for ...A New Approach: Automatically Identify Proper Noun from Bengali Sentence for ...
A New Approach: Automatically Identify Proper Noun from Bengali Sentence for ...Syeful Islam
 
IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...
IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...
IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...IRJET Journal
 
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...Syeful Islam
 
IRJET- Communication Aid for Deaf and Dumb People
IRJET- Communication Aid for Deaf and Dumb PeopleIRJET- Communication Aid for Deaf and Dumb People
IRJET- Communication Aid for Deaf and Dumb PeopleIRJET Journal
 
Survey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionarySurvey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionaryEditor IJMTER
 
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISHDICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISHcscpconf
 
Dictionary based concept mining an application for turkish
Dictionary based concept mining  an application for turkishDictionary based concept mining  an application for turkish
Dictionary based concept mining an application for turkishcsandit
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...ijaia
 
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...gerogepatton
 

Similaire à JCLC_V15_N4_3 (20)

Outlining Bangla Word Dictionary for Universal Networking Language
Outlining Bangla Word Dictionary for Universal Networking  LanguageOutlining Bangla Word Dictionary for Universal Networking  Language
Outlining Bangla Word Dictionary for Universal Networking Language
 
An Unsupervised Approach to Develop Stemmer
An Unsupervised Approach to Develop StemmerAn Unsupervised Approach to Develop Stemmer
An Unsupervised Approach to Develop Stemmer
 
Dictionary Entries for Bangla Consonant Ended Roots in Universal Networking L...
Dictionary Entries for Bangla Consonant Ended Roots in Universal Networking L...Dictionary Entries for Bangla Consonant Ended Roots in Universal Networking L...
Dictionary Entries for Bangla Consonant Ended Roots in Universal Networking L...
 
Design Analysis Rules to Identify Proper Noun from Bengali Sentence for Univ...
Design Analysis Rules to Identify Proper Noun  from Bengali Sentence for Univ...Design Analysis Rules to Identify Proper Noun  from Bengali Sentence for Univ...
Design Analysis Rules to Identify Proper Noun from Bengali Sentence for Univ...
 
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
 
American Standard Sign Language Representation Using Speech Recognition
American Standard Sign Language Representation Using Speech RecognitionAmerican Standard Sign Language Representation Using Speech Recognition
American Standard Sign Language Representation Using Speech Recognition
 
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language ModelsIRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
 
Car-Following Parameters by Means of Cellular Automata in the Case of Evacuation
Car-Following Parameters by Means of Cellular Automata in the Case of EvacuationCar-Following Parameters by Means of Cellular Automata in the Case of Evacuation
Car-Following Parameters by Means of Cellular Automata in the Case of Evacuation
 
Diacritic Oriented Arabic Information Retrieval System
Diacritic Oriented Arabic Information Retrieval SystemDiacritic Oriented Arabic Information Retrieval System
Diacritic Oriented Arabic Information Retrieval System
 
A New Approach: Automatically Identify Proper Noun from Bengali Sentence for ...
A New Approach: Automatically Identify Proper Noun from Bengali Sentence for ...A New Approach: Automatically Identify Proper Noun from Bengali Sentence for ...
A New Approach: Automatically Identify Proper Noun from Bengali Sentence for ...
 
IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...
IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...
IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...
 
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
 
Accessing database using nlp
Accessing database using nlpAccessing database using nlp
Accessing database using nlp
 
IRJET- Communication Aid for Deaf and Dumb People
IRJET- Communication Aid for Deaf and Dumb PeopleIRJET- Communication Aid for Deaf and Dumb People
IRJET- Communication Aid for Deaf and Dumb People
 
Survey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionarySurvey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse Dictionary
 
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISHDICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
DICTIONARY-BASED CONCEPT MINING: AN APPLICATION FOR TURKISH
 
G1803013542
G1803013542G1803013542
G1803013542
 
Dictionary based concept mining an application for turkish
Dictionary based concept mining  an application for turkishDictionary based concept mining  an application for turkish
Dictionary based concept mining an application for turkish
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
 
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
 

Plus de tughchi

Resume_Mehmud
Resume_MehmudResume_Mehmud
Resume_Mehmudtughchi
 
Prayer Profile - Uighur
Prayer Profile - UighurPrayer Profile - Uighur
Prayer Profile - Uighurtughchi
 
ug_ozumge_xitap
ug_ozumge_xitapug_ozumge_xitap
ug_ozumge_xitaptughchi
 
Microsoft Word - mso4F
Microsoft Word - mso4FMicrosoft Word - mso4F
Microsoft Word - mso4Ftughchi
 
hon-TohtiTunyaz
hon-TohtiTunyazhon-TohtiTunyaz
hon-TohtiTunyaztughchi
 
9- oghuz 60-62
9- oghuz 60-629- oghuz 60-62
9- oghuz 60-62tughchi
 
Uyghurs And Nowruz
Uyghurs And NowruzUyghurs And Nowruz
Uyghurs And Nowruztughchi
 
parchilar-ijadiyet
parchilar-ijadiyetparchilar-ijadiyet
parchilar-ijadiyettughchi
 
Ghalip_barat_ark eserliri_1
Ghalip_barat_ark eserliri_1Ghalip_barat_ark eserliri_1
Ghalip_barat_ark eserliri_1tughchi
 
_ghalib ark eserliri
_ghalib ark  eserliri_ghalib ark  eserliri
_ghalib ark eserliritughchi
 
adil tuniyaz dastanliri
adil tuniyaz dastanliriadil tuniyaz dastanliri
adil tuniyaz dastanliritughchi
 
15句让女生甜到晕的情话_doc
15句让女生甜到晕的情话_doc15句让女生甜到晕的情话_doc
15句让女生甜到晕的情话_doctughchi
 
ug_iman_ajizliqining_sawapliri
ug_iman_ajizliqining_sawapliriug_iman_ajizliqining_sawapliri
ug_iman_ajizliqining_sawapliritughchi
 
en_The_Prohibition_of_Music
en_The_Prohibition_of_Musicen_The_Prohibition_of_Music
en_The_Prohibition_of_Musictughchi
 
tb51pisk
tb51pisktb51pisk
tb51pisktughchi
 
ug_quran_sunnette_maxtalghan_ilimlar
ug_quran_sunnette_maxtalghan_ilimlarug_quran_sunnette_maxtalghan_ilimlar
ug_quran_sunnette_maxtalghan_ilimlartughchi
 
Bir_Oqung
Bir_OqungBir_Oqung
Bir_Oqungtughchi
 

Plus de tughchi (20)

Resume_Mehmud
Resume_MehmudResume_Mehmud
Resume_Mehmud
 
noruz
noruznoruz
noruz
 
rom1_ug
rom1_ugrom1_ug
rom1_ug
 
Prayer Profile - Uighur
Prayer Profile - UighurPrayer Profile - Uighur
Prayer Profile - Uighur
 
ug_ozumge_xitap
ug_ozumge_xitapug_ozumge_xitap
ug_ozumge_xitap
 
Microsoft Word - mso4F
Microsoft Word - mso4FMicrosoft Word - mso4F
Microsoft Word - mso4F
 
hon-TohtiTunyaz
hon-TohtiTunyazhon-TohtiTunyaz
hon-TohtiTunyaz
 
9- oghuz 60-62
9- oghuz 60-629- oghuz 60-62
9- oghuz 60-62
 
Uyghurs And Nowruz
Uyghurs And NowruzUyghurs And Nowruz
Uyghurs And Nowruz
 
parchilar-ijadiyet
parchilar-ijadiyetparchilar-ijadiyet
parchilar-ijadiyet
 
Ghalip_barat_ark eserliri_1
Ghalip_barat_ark eserliri_1Ghalip_barat_ark eserliri_1
Ghalip_barat_ark eserliri_1
 
_ghalib ark eserliri
_ghalib ark  eserliri_ghalib ark  eserliri
_ghalib ark eserliri
 
adil tuniyaz dastanliri
adil tuniyaz dastanliriadil tuniyaz dastanliri
adil tuniyaz dastanliri
 
15句让女生甜到晕的情话_doc
15句让女生甜到晕的情话_doc15句让女生甜到晕的情话_doc
15句让女生甜到晕的情话_doc
 
ug_iman_ajizliqining_sawapliri
ug_iman_ajizliqining_sawapliriug_iman_ajizliqining_sawapliri
ug_iman_ajizliqining_sawapliri
 
en_The_Prohibition_of_Music
en_The_Prohibition_of_Musicen_The_Prohibition_of_Music
en_The_Prohibition_of_Music
 
beller
bellerbeller
beller
 
tb51pisk
tb51pisktb51pisk
tb51pisk
 
ug_quran_sunnette_maxtalghan_ilimlar
ug_quran_sunnette_maxtalghan_ilimlarug_quran_sunnette_maxtalghan_ilimlar
ug_quran_sunnette_maxtalghan_ilimlar
 
Bir_Oqung
Bir_OqungBir_Oqung
Bir_Oqung
 

Dernier

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Dernier (20)

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

JCLC_V15_N4_3

  • 1. Journal of Chinese Language and Computing 15 (4): (203-210) The Research and Development of Computer Aided Contemporary Uighur Language Tagging System Yusup Abaidula,Rezwangul, Abdiryim Sali Department of Computer Science, School of Physi-Math Information Xinjiang Normal University, Urumqi, Xinjiang, China 830054 yusup2002@sohu.com ______________________________________________________________________ Abstract The research on the contemporary Uyghur information processing in the 20th century can be dated back to the beginning of 1980‘s. There are a lot of achievements up to now ,for example, fundamental theory and basic utility are established, various corpus are built, code standard is designed, Uyghur grammatical attribution research is carried out。Because the Uyghur language has its own nature,the Uyghur information processing has its own characteristics and methods 。 This paper briefly states the research work on Uyghur morphological analyzer and the computer-aided contemporary Uyghur corpus processing system1。 Keywords Uighur language, Text processing, Tagging system, Corpus ______________________________________________________________________ 1. Preface Contemporary Uyghur information processing includes the input, output, recognition, and understanding of this language. In the course of analyzing the Uyghur characters, words, sentences and paragraphs, computer is used to process the Uyghur phonetics, morphology, semantics etc. Uyghur language has its own character set, so they have its own nature and needs different processing methods. From the 1980’s to now, a lot of work on the character level process of the language has been done. Now, more researches are focus on the processing of words and sentences. We also had some progress on this direction. 1 This research is sponsored by the programs No. 60163002 and No. 60463005, National Natural Science Foundation of China (NSFC), and the Startup Fund of the Special Training Program for the Minority Talents by the Science Department of Xinjiang Uyghur Autonomous Region.
  • 2. 204 Yusup Abaidula,Rezwangul, Abdiryim Sali 2. The subsystem for Uighur file format conversion The first work of this research is to prepare the electronic text data for contemporary Uyghur tagging corpus. This work is done with the support from some publishers, such as “Xinjiang Daily”, Xinjiang Science and Technology publisher, Xinjiang Education Publisher, Xinjiang People’s Publisher. They provided us compiled electronic Uyghur language materials. Since these text materials are only suitable for DOS system, not for Windows. So we have to convert the format of the text. Our system has a module to for this conversion. By using this system, we carried out the format transformation of the Uyghur materials, and worked out a corpus of more than 8 million words. These materials cover almost all the words in the text book of the elementary, junior, high, technical secondary schools and university as well as words used in the fields of law, literature, health, technology (especially the agriculture technology), history, and economics. This contemporary Uyghur language corpus is the basis for our further information processing. The information processing not only needs to summarize and define surface structure rules but also deep structure analysis. Next step, we are going to do statistics and processing for 4 million phrases. We divided the corpus into four parts, which are religious 20%, popular science 20%, political commentary 20%, and literature 40%. 3. Uighur word Tagging Standard The most important work in setting up Uyghur tagging corpus is to create Uyghur tag set and standard. In word tagging, we started with a small tagging set. From the viewpoint of computer linguistics, and by reference to Contemporary Uyghur grammar, we added some tagging symbols. Finally we have 24 tagging types. The tagging symbols of noun are up to 31 ,for example: • Personal name Nk • Place name Ny • Organ name Nt • Time name Nw According to the semantic and grammatical function, verbs are classified as personal, non-personal and auxiliary. According to the tense of the verb, it is divided into more than 80 types, for example: • Verb V • Personal verb V1 • Imperative verb V101 In Uyghur language, non-personal verb differs from personal verb in their grammatical function. The grammatical function of the non-personal verb is similar to the nouns, the adjectives and the adverbs. If the grammatical function is similar to the nouns, it is called gerund. If the grammatical function is similar to the adjectives, it is called verbal. If the
  • 3. Computer-Aided Contemporary Uighur Language Tagging System 205 grammatical function is similar to the adverbs, it is called adverbial. According to their basic function, gerund is tagged by Vn, verbal is tagged by Va, adverbial is tagged by Vd. • Non-personal verb V2 • Verbal V21 • Perfect participle V211 In Uyghur information processing, we created a small tagging set [9] and used it in the tagging of the Uyghur words. 4. Uighur Electronic Information Dictionaries Our Uyghur Electronic Information Dictionary includes word-root dictionary, phrase dictionary and suffix dictionary. The vocabulary in the word-root dictionary phrase dictionary has about 60,000 items. Word dictionary includes 25,000 words. Every word at present has 16 attributes. In the word-root dictionary, the word-root structure and word-root format is as in Table 1. Table 1. Example of word-root dictionary
  • 4. 206 Yusup Abaidula,Rezwangul, Abdiryim Sali Table 2 Example of the suffix dictionary The structure of the machine electronic phrase dictionary and the format of the phrase is similar to those of word-root dictionary. Electronic suffix dictionary at present has 130,000 suffixes. Every suffix has 16 attributes. The electronic suffix dictionary and the format of suffix is as in Table 2: 5. The subsystem of the Uighur corpus management The Uyghur corpus is not only a collection of text. It also contains a lot of original and structured texts as part of the corpus. Some structure information is saved in the corpus, for example, chapter name, paragraph name and its location. The function of this information is not only for searching text according to certain standard, but also a part of the corpus. 6. Uighur word tagging and segmentation rules The words segmentation rules we used follow Contemporary Uyghur theography dictionary .At this point, the purpose of the Contemporary Uyghur word segmentation is to do word tagging. We produced a contemporary Uyghur electronic dictionary. This dictionary includes word-root dictionary and word -suffix dictionary, and is being used as a reference for segmenting word-root and suffix. The segmented units are basically word-roots, word-affixes and word- suffixes. They are the basic units that are used in the information processing, and it has certain semantic and grammatical functions. According to the statistics on the corpus, we found out the rules for forming words. IN the following examples: U stands for sentence word, A stands for word-root, B stands for word- affix, C stands for word- suffix . The rules of words forming are as in Figure 1.
  • 5. Computer-Aided Contemporary Uighur Language Tagging System 207 Figure 1. The rules for forming words On the basis of above-mentioned concepts, we have the following four rules for sentence elements formation, they are: Figure 2. The rules for forming sentence. 7. The sub system for the character statistics Character frequency statistics subsystem is composed of three modules: text reading module, statistic computing module, and statistic data management module. The language of our program is Borland C++ Builder, the background database is Microsoft access 2000.When we analyzed the statistic results, we found that statistics effectively reflected the frequency of Uyghur letters in the Uyghur text. Through statistics of 4 million words, we obtained the frequency of the contemporary Uyghur characters. The frequency of contemporary Uyghur character is as shown in Figure 3.
  • 6. 208 Yusup Abaidula,Rezwangul, Abdiryim Sali Figure 3. Frequency of contemporary Uighur characters We can see, the most frequent one appeared 2842094 times, which is about 17% of all the occurrences. The second one appeared 1547472 times, which 9% of the whole occureances. The third one is about 7%. The fourth one about 6%.The least used one appeared 2223 times, which is 0.0129%. The second least appeared 11079 times, or 0.064%. The third least used one appeared 84583, or 0.49%. 8. The subsystem of the syllable statistics Syllable frequency statistic subsystem is composed of three modules: text reading module, syllable statistic compute module, and syllable statistic data management module. The program language we used is Borland C++ Builder, the background database is Microsoft access2000. Trough statistics of the 4 million words’ real corpus , we found 7000 syllables in Uyghur. By analyzing the statistic results, we found the frequency of the 7000 syllables in Uyghur. Among them, 6065 syllables appeared more than one time. 3730 syllables appeared over three times. 1682 syllables appeared 100 times. 138 syllables appeared 1000 times, and 9 syllables appeared over 110000 times. The 9 most frequent syllables and their frequencies are as shown in the Figure 4.. Figure 4. Frequency of contemporary Uighur syllables 9. The subsystem of the word statistics In this real corpus of 4 million words, we found 240,000 unique words. At the same time, among this real corpus, we found that the words that appeared more than two times occupied 44%. The rest appeared only once.
  • 7. Computer-Aided Contemporary Uighur Language Tagging System 209 The original electronic dictionary has about 120,000 word-roots, but in the real corpus of 4 million words, 140,000 word-root appeared. Only about 20%of the 10,000 suffixes that we had appeared two or more times. The rest in this corpus appeared only once or did not appear at all. This explains that only 20% suffixes are commonly used in the written language. The rest are used in spoken language. 10. The sub system of Uighur tagging When building the Uyghur corpus, we also developed an automatic word tagging system based on the Uyghur information electronic dictionary, and the Uyghur suffix structure. In addition, to improve the correctness of the electronic word tagging, we created a tool to manually correct the corpus. Through these methods, we can guarantee the correctness and consistance of tagged results. The function of these auxiliary tools includes tagging, restricting the dictionaries and online checking of the compatible words (antonyms or synonymous). It also makes corrections using some rules. Nowadays, the subsystem of the Uyghur tagging is under testing. 11. Conculusion Through nearly 20 years of development, Xinjiang minority nationality information processing work obtained some promising results. However, compared to other languages , it still needs a lot of work, especially on the basic theory studies and basic application technique development. As far as Uyghur grammar and semantic research of information processing are concerned, it is still in the primary stage. It still does not have its own theoretical system. Also, this is a very big project that needs too much investment. This is also the main obstacle in the other minority nationalities information processing research. Apart from these, Uyghur information processing also has a lot of work to be done. For example: Uyghur grammar information dictionary, reference database for Uyghur phonetics, linguistics studies of Uyghur corpus, Uyghur grammar attribution studies of information processing, Uyghur semantic studies of information processing, Uyghur machine translation etc. I hope the research of the Uyghur language information processing will get more and more support from researchers and officials. References [1] Yusup Aibaidulla and Kim-Teng Lua, The development of Tagged Uyghur Corpus, Proceedings of PACLIC17, 1-3 October 2003, Sentosa, Singapore, P228-234. [2] Yusup Aibaidulla etc. Contemporary Uighur Corpus Managing, China Artificial Intelligent Developing 2003, Beijing Post University Press, P1007-1010. [3] Yusup Aibaidulla, Semantic Problem Solution Research of Uighur Sentence Grammar Analyzer, Computer Application and Software, Vol 4. 2002, P59-62.
  • 8. 210 Yusup Abaidula,Rezwangul, Abdiryim Sali [4] Yusup Ebeydulla, Progress In System Design of Contemporary Uyghur Corpus. Journal of Chinese Language and Computing, Volume 13, 2003, P283—301. [5] Yusup Ebeydulla, An Uyghur Syntax Parser. Journal of Chinese Language and Computing. Volume 13, 2003. P273-282. [6] Yusup Ebeydulla , Askhar, The transform Subsystem of Uighur File Style of the Contemporary Uighur Corpus System, Journal of Xinjiang Normal University(Natural Science Version)vol. 1, 2003, P16-19. [7] Yusup Abaidula,Abdiryim Sali, Discussion on the Corpus Linguistics, Journal of Xinjiang Normal University(Social Science Version), Vol. 4, 2003, P76-79. [8] Yusup Abaydul Research on System of Contemporary Uyghur Word Frequency Statistics and High Frequency Words, Procceedings of the International Conference on Chinese Computing 2005, 21-23 March 2005 Singapore. [9] Xinjiang Normal University Information processing with contemporary Uyghur words tagging sets standard version 1.0.