SlideShare a Scribd company logo
1 of 6
Download to read offline
Tamil Shallow Parser using Machine Learning
                         Approach

           Dhanalakshmi V1, Anand Kumar M1, Soman K P1 and Rajendran S2
                                 1Computational   Engineering and Networking
                              Amrita Vishwa Vidyapeetham Coimbatore, India
                         {m_anandkumar,v_dhanalakshmi, kp_soman} @cb.amrita.edu
                                     2Tamil   University, Thanjavur, India




Abstract
This paper presents the Shallow Parser for Tamil using machine learning approach. Tamil Shallow
Parser is an important module in Machine Translation from Tamil to any other language. It is also a
key component in all NLP applications. It is used to understand natural language by machine and also
useful for second language learners. The Tamil Shallow Parser was developed using the new and state
of the art machine learning approach. The POS Tagger, Chunker, Morphological Analyzer and
Dependency Parser were built for implementing the Tamil Shallow Parser. The above modules gives
an encouraging result.

Introduction
Partial or Shallow Parsing is the task of recovering a limited amount of syntactic information from a
natural language sentence. A full parser often provides more information than needed and sometimes
it may also give less information. For example, in Information Retrieval, it may be enough to find
simple NPs (Noun Phrases) and VPs (Verb Phrases). In Information Extraction, Summary Generation,
and Question Answering System, information about special syntactico-semantic relations such as
subject, object, location, time, etc, are needed than elaborate configurational syntactic analyses. In full
parsing, grammar and search strategies are used to assign a complete syntactic structure to sentences.
The main problem here is to select the most possible syntactic analysis to be obtained from thousands
of possible analyses a typical parser with a sophisticated grammar may return. This complexity of the
task makes machine learning an attractive option in comparison to the handcrafted rules.

Methodology
Machine learning approach is applied here to develop the shallow parser for Tamil. Part of speech
tagger for Tamil has been generated using Support Vector Machine approach [Dhanalakshmi V e.tal.,
2009]. A novel approach using machine learning has been built for developing morphological analyzer
for Tamil [Anand kumar M e.tal., 2009]. Tamil Chunker has been developed using CRF++ tool
[Dhanalakshmi V e.tal., 2009]. And finally, Tamil Dependency parser, which is used to find syntactico-
semantic relations such as subject, object, location, time, etc, is built using MALT Parser
[Dhanalakshmi V e.tal., 2011].




                                                      175
General Framework and Modules

     •   The general block diagram for Tamil Shallow parser is given in Figure 1.



                                               Input
                                              Sentenc



                                            Tokenization




                                           POS Tagging



                              Chunking                      Morphological
                                                              Analyzer



                                         Format Conversion


                                           MALT
                                           Parser for
                                           Relation



                                             Shallow

                                             Parsed

                         Figure.1. General Framework for Tamil Shallow Parser

•    Tamil Part-of-Speech Tagger [Dhanalakshmi V e.tal., 2009]: The Part of Speech (POS) tagging is
     the process of labeling a part of speech or other lexical class marker (noun, verb, adjective, etc.)
     to each and every word in a sentence. POS tagger was developed for Tamil language using
     SVMTool [Jes´us Gim´enez and Llu´ıs M`arquez, 2004].

•    Tamil Morphological Analyzer [Anand Kumar M e.tal., 2009]: Morphological Analysis is the
     process of breaking down morphologically complex words into their constituent morphemes. It
     is the primary step for word formation analysis of any language. Morphological Analyzer was
     developed using a novel machine learning approach and was implemented using SVMTool.

•    Tamil Chunker [Dhanalakshmi V e.tal., 2009]: Chunks are normally taken to be non recursive
     correlated group of words. Chunker divides a sentence into its major-non-overlapping phrases


                                                 176
(noun phrase, verb phrase, etc.) and attaches a label to each chunk. Chunker for Tamil language
      was developed using CRF++ Tool[Sha F and Pereira F, 2003].
•     Tamil Dependency Parser for Relation finding [Dhanalakshmi V e.tal., 2011]: Given the POS
      tag, Morphological information and chunks in a sentence, this decides which relations they
      have with the main verb (subject, object, location, etc.). Dependency parser was developed for
      Tamil language using Malt Parser tool [Joakim Nivre and Johan Hall, 2005].
Dependency Parsing using Malt Parser
MALT Parser Tool is used for dependency parsing, which uses supervised machine learning
algorithm. Using this tool dependency relations and position of the head are obtained for Tamil
sentence. There are 10 tuples used in the training data that can be user define. For Tamil dependency
parsing, the following features are defined and others are set as NULL and are mentioned as ‘_’ in the
training data format.

        WordID: Position of each word in the input sentence.

        Words: Each word in the input sentence.

        CPos Tag and Pos Tag: Defines the Parts Of Speech of each word.

        Head: The position of the parent of each word.

        Lemma: The lemma of the word.

        Morph Features The Morphological features of the word.

        Chunk The chunk information of the word.

        Dependency Relation: The terminology given for each parent – child relation.

        Sample Training Data

              1 அவ      _ <PRP> <PRP> 8 <N.SUB> _ _

              2     ைடகைள _ <NN> <NN> 3 <D.OBJ> _ _

              3 வா கி _ <VNAV> <VNAV> 4 <ATT> _ _

              4 சைம       _ <VNAV> <VNAV> 6 <VNAV.MOD>_ _

              5த        _ <NN> <NN> 6 <NST.MOD> _ _

              6 ேபா      _ <VNAV> <VNAV> 8 <V.COMP> _ _

              7 உன       _ <PRP> <PRP> 8 <I.OBJ> _ _

              8 ெகா     கி   றா   _ <VF> <VF> 0 <ROOT> _ _

              9 . <DOT> <DOT> 8 <SYM> _ _

For Tamil language, a corpus of three thousand sentences is annotated with dependency relations and
labels using the customized tag set (Table.1). The corpus is trained using the MALT Parser tool which
generates a model. Using this model the new input sentences are tested.



                                                  177
S.No          Tags           Description               S.No            Tags       Description
        1        ROOT           Head word                        5        NST-MOD    Spatial Time
                                                                                     Modifier
        2        N-SUB          Subject                          6        SYM        Symbols
        3        D-OBJ          Direct Object                    7        X          Others
        4        I-OBJ          Indirect Object

                                      Table.1 Shallow Dependency Tagset

Application of Shallow Parser
Shallow parsers were used in Verbmobil project [Wahlster W, 2000], to add robustness to a large
speech-to-speech translation system. Shallow parsers are also typically used to reduce the search space
for full-blown, `deep' parsers [Collins, 1999]. Yet another application of shallow parsing is question-
answering on the World Wide Web, where there is a need to efficiently process large quantities of ill-
formed documents [Buchholz and Daelemans, 2001] and more generally, all text mining applications,
e.g. in biology [Sekimizu et al., 1998].

The developed Tamil Shallow Parser can be used to develop the following systems for Tamil
language.

    •   Information extraction and retrieval system for Tamil.

    •   Simple Tamil Machine Translation system.

    •   Tamil Grammar checker.

    •   Automatic Tamil Sentence Structure Analyzer.

    •   Language based educational exercises for Tamil language learners.

Conclusion
Shallow Parsing has proved to be a useful technology for written and spoken language domains. Full
parsing is expensive, and is not very robust. Partial parsing has proved to be much faster and more
robust. Dependency parser is better suited than phrase structure parser for languages with free or
flexible word order like Tamil. Fully functional Shallow Parser for Tamil gives reliable results. The
Shallow Parser system developed for Tamil is an important tool for Machine Translation between
Tamil and other languages.

References
            Anand kumar M, Dhanalakshmi V , Soman K P and Rajendran S (2009) , “A Novel Approach
            for Tamil Morphological Analyzer”, Proceedings of the 8th Tamil Internet Conference 2009,
            Cologne, Germany.

            Buchholz Sabine and Daelemans Walter (2001), “Complex Answers: A Case Study using a
            WWW Question Answering System”, Natural Language Engineering.

            Collins M (1999), “Head-Driven Statistical Models for Natural Language Parsing”, Ph.D
            Thesis, University of Pennsylvania.

                                                    178
Dhanalakshmi V, Anand Kumar M, Vijaya M S, Loganathan R, Soman K P, Rajendran S
(2008), “Tamil Part-of-Speech tagger based on SVMTool”, Proceedings of the COLIPS
International Conference on natural language processing(IALP), Chiang Mai, Thailand.

Dhanalakshmi V, Anand kumar M, Soman K P and Rajendran S (2009), “POS Tagger and
Chunker for Tamil Language”, Proceedings of the 8th Tamil Internet Conference, Cologne,
Germany.

Dhanalakshmi V, Anand Kumar M, Rekha R U, Soman K.P and Rajendran S (2011), “Data
driven Dependency Parser for Tamil and Malayalam” NCILC-2011, Cochin University of
Science & Technology, India.

Jes´us Gim´enez and Llu´ıs M`arquez.(2004) SVMTool: A general pos tagger generator based on
support vector machines.In Proceedings of the 4th LREC Conference, 2004.

Joakim Nivre and Johan Hall, MaltParser: A language-independent system for data-driven
dependency parsing. In Proceedings of the Fourth Workshop on Treebanks and Linguistic
Theories (TLT), 2005.

Sekimizu T, Park H and Tsujii J (1998), “Identifying the interaction between genes and gene
products based on frequently seen verbs in Medline abstracts”, Genome Informatics,
Universal Academy Press.

Sha F and Pereira F (2003), “Shallow Parsing with Conditional Random Fields”, Proceedings
of Human Language Technology Coference’2003, Canada.

Wahlster W (2000), “VERBMOBIL: Foundations of Speech-to-Speech Translation”, Springer-
Verlag.




                                        179

More Related Content

What's hot

A New Approach to Parts of Speech Tagging in Malayalam
A New Approach to Parts of Speech Tagging in MalayalamA New Approach to Parts of Speech Tagging in Malayalam
A New Approach to Parts of Speech Tagging in Malayalam
ijcsit
 

What's hot (17)

SMT3
SMT3SMT3
SMT3
 
A Review on a web based Punjabi t o English Machine Transliteration System
A Review on a web based Punjabi t o English Machine Transliteration SystemA Review on a web based Punjabi t o English Machine Transliteration System
A Review on a web based Punjabi t o English Machine Transliteration System
 
C7 agramakirshnan2
C7 agramakirshnan2C7 agramakirshnan2
C7 agramakirshnan2
 
**JUNK** (no subject)
**JUNK** (no subject)**JUNK** (no subject)
**JUNK** (no subject)
 
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
 
Implementation of English-Text to Marathi-Speech (ETMS) Synthesizer
Implementation of English-Text to Marathi-Speech (ETMS) SynthesizerImplementation of English-Text to Marathi-Speech (ETMS) Synthesizer
Implementation of English-Text to Marathi-Speech (ETMS) Synthesizer
 
Implementation of Text To Speech for Marathi Language Using Transcriptions Co...
Implementation of Text To Speech for Marathi Language Using Transcriptions Co...Implementation of Text To Speech for Marathi Language Using Transcriptions Co...
Implementation of Text To Speech for Marathi Language Using Transcriptions Co...
 
Presentation1
Presentation1Presentation1
Presentation1
 
Hindi –tamil text translation
Hindi –tamil text translationHindi –tamil text translation
Hindi –tamil text translation
 
Verb based manipuri sentiment analysis
Verb based manipuri sentiment analysisVerb based manipuri sentiment analysis
Verb based manipuri sentiment analysis
 
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
 
A New Approach to Parts of Speech Tagging in Malayalam
A New Approach to Parts of Speech Tagging in MalayalamA New Approach to Parts of Speech Tagging in Malayalam
A New Approach to Parts of Speech Tagging in Malayalam
 
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
 
Ijartes v1-i1-002
Ijartes v1-i1-002Ijartes v1-i1-002
Ijartes v1-i1-002
 
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCEDETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
DETERMINING CUSTOMER SATISFACTION IN-ECOMMERCE
 
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTK
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTKPUNJABI SPEECH SYNTHESIS SYSTEM USING HTK
PUNJABI SPEECH SYNTHESIS SYSTEM USING HTK
 
BERT introduction
BERT introductionBERT introduction
BERT introduction
 

Viewers also liked

Viewers also liked (10)

Complex predicate meghaditya
Complex predicate meghadityaComplex predicate meghaditya
Complex predicate meghaditya
 
Transform your State \/ Err
Transform your State \/ ErrTransform your State \/ Err
Transform your State \/ Err
 
Domain Cartridge: Unsupervised Framework for Shallow Domain Ontology Construc...
Domain Cartridge: Unsupervised Framework for Shallow Domain Ontology Construc...Domain Cartridge: Unsupervised Framework for Shallow Domain Ontology Construc...
Domain Cartridge: Unsupervised Framework for Shallow Domain Ontology Construc...
 
OpenNLP demo
OpenNLP demoOpenNLP demo
OpenNLP demo
 
Compiler unit 2&3
Compiler unit 2&3Compiler unit 2&3
Compiler unit 2&3
 
Lexical analyzer
Lexical analyzerLexical analyzer
Lexical analyzer
 
Role-of-lexical-analysis
Role-of-lexical-analysisRole-of-lexical-analysis
Role-of-lexical-analysis
 
The sixth sense technology complete ppt
The sixth sense technology complete pptThe sixth sense technology complete ppt
The sixth sense technology complete ppt
 
Deep C
Deep CDeep C
Deep C
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should Know
 

Similar to D3 dhanalakshmi

5215ijcseit01
5215ijcseit015215ijcseit01
5215ijcseit01
ijcsit
 

Similar to D3 dhanalakshmi (20)

Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)
 
Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)Named Entity Recognition using Hidden Markov Model (HMM)
Named Entity Recognition using Hidden Markov Model (HMM)
 
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
 
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUECOMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
 
Parsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function TaggingParsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function Tagging
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
 
Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...
Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...
Duration for Classification and Regression Treefor Marathi Textto- Speech Syn...
 
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMARSYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
 
5215ijcseit01
5215ijcseit015215ijcseit01
5215ijcseit01
 
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMARSYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
 
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMARSYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
 
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMARSYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
SYLLABLE-BASED SPEECH RECOGNITION SYSTEM FOR MYANMAR
 
Improving a Lightweight Stemmer for Gujarati Language
Improving a Lightweight Stemmer for Gujarati LanguageImproving a Lightweight Stemmer for Gujarati Language
Improving a Lightweight Stemmer for Gujarati Language
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translation
 
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORKSENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK
 
Sentiment Analysis In Myanmar Language Using Convolutional Lstm Neural Network
Sentiment Analysis In Myanmar Language Using Convolutional Lstm Neural NetworkSentiment Analysis In Myanmar Language Using Convolutional Lstm Neural Network
Sentiment Analysis In Myanmar Language Using Convolutional Lstm Neural Network
 
I1 geetha3 revathi
I1 geetha3 revathiI1 geetha3 revathi
I1 geetha3 revathi
 
Survey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageSurvey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi Language
 
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCESSTATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES
 

More from Jasline Presilda (20)

I6 mala3 sowmya
I6 mala3 sowmyaI6 mala3 sowmya
I6 mala3 sowmya
 
I5 geetha4 suraiya
I5 geetha4 suraiyaI5 geetha4 suraiya
I5 geetha4 suraiya
 
I4 madankarky3 subalalitha
I4 madankarky3 subalalithaI4 madankarky3 subalalitha
I4 madankarky3 subalalitha
 
I3 madankarky2 karthika
I3 madankarky2 karthikaI3 madankarky2 karthika
I3 madankarky2 karthika
 
I2 madankarky1 jharibabu
I2 madankarky1 jharibabuI2 madankarky1 jharibabu
I2 madankarky1 jharibabu
 
Hari tamil-complete details
Hari tamil-complete detailsHari tamil-complete details
Hari tamil-complete details
 
H4 neelavathy
H4 neelavathyH4 neelavathy
H4 neelavathy
 
H3 anuraj
H3 anurajH3 anuraj
H3 anuraj
 
H2 gunasekaran
H2 gunasekaranH2 gunasekaran
H2 gunasekaran
 
H1 iniya nehru
H1 iniya nehruH1 iniya nehru
H1 iniya nehru
 
G3 chandrakala
G3 chandrakalaG3 chandrakala
G3 chandrakala
 
G2 selvakumar
G2 selvakumarG2 selvakumar
G2 selvakumar
 
G1 nmurugaiyan
G1 nmurugaiyanG1 nmurugaiyan
G1 nmurugaiyan
 
Front matter
Front matterFront matter
Front matter
 
F2 pvairam sarathy
F2 pvairam sarathyF2 pvairam sarathy
F2 pvairam sarathy
 
F1 ferdinjoe
F1 ferdinjoeF1 ferdinjoe
F1 ferdinjoe
 
Emerging
EmergingEmerging
Emerging
 
E3 ilangkumaran
E3 ilangkumaranE3 ilangkumaran
E3 ilangkumaran
 
E2 tamilselvan
E2 tamilselvanE2 tamilselvan
E2 tamilselvan
 
E1 geetha2 karthikeyan
E1 geetha2 karthikeyanE1 geetha2 karthikeyan
E1 geetha2 karthikeyan
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

D3 dhanalakshmi

  • 1.
  • 2. Tamil Shallow Parser using Machine Learning Approach Dhanalakshmi V1, Anand Kumar M1, Soman K P1 and Rajendran S2 1Computational Engineering and Networking Amrita Vishwa Vidyapeetham Coimbatore, India {m_anandkumar,v_dhanalakshmi, kp_soman} @cb.amrita.edu 2Tamil University, Thanjavur, India Abstract This paper presents the Shallow Parser for Tamil using machine learning approach. Tamil Shallow Parser is an important module in Machine Translation from Tamil to any other language. It is also a key component in all NLP applications. It is used to understand natural language by machine and also useful for second language learners. The Tamil Shallow Parser was developed using the new and state of the art machine learning approach. The POS Tagger, Chunker, Morphological Analyzer and Dependency Parser were built for implementing the Tamil Shallow Parser. The above modules gives an encouraging result. Introduction Partial or Shallow Parsing is the task of recovering a limited amount of syntactic information from a natural language sentence. A full parser often provides more information than needed and sometimes it may also give less information. For example, in Information Retrieval, it may be enough to find simple NPs (Noun Phrases) and VPs (Verb Phrases). In Information Extraction, Summary Generation, and Question Answering System, information about special syntactico-semantic relations such as subject, object, location, time, etc, are needed than elaborate configurational syntactic analyses. In full parsing, grammar and search strategies are used to assign a complete syntactic structure to sentences. The main problem here is to select the most possible syntactic analysis to be obtained from thousands of possible analyses a typical parser with a sophisticated grammar may return. This complexity of the task makes machine learning an attractive option in comparison to the handcrafted rules. Methodology Machine learning approach is applied here to develop the shallow parser for Tamil. Part of speech tagger for Tamil has been generated using Support Vector Machine approach [Dhanalakshmi V e.tal., 2009]. A novel approach using machine learning has been built for developing morphological analyzer for Tamil [Anand kumar M e.tal., 2009]. Tamil Chunker has been developed using CRF++ tool [Dhanalakshmi V e.tal., 2009]. And finally, Tamil Dependency parser, which is used to find syntactico- semantic relations such as subject, object, location, time, etc, is built using MALT Parser [Dhanalakshmi V e.tal., 2011]. 175
  • 3. General Framework and Modules • The general block diagram for Tamil Shallow parser is given in Figure 1. Input Sentenc Tokenization POS Tagging Chunking Morphological Analyzer Format Conversion MALT Parser for Relation Shallow Parsed Figure.1. General Framework for Tamil Shallow Parser • Tamil Part-of-Speech Tagger [Dhanalakshmi V e.tal., 2009]: The Part of Speech (POS) tagging is the process of labeling a part of speech or other lexical class marker (noun, verb, adjective, etc.) to each and every word in a sentence. POS tagger was developed for Tamil language using SVMTool [Jes´us Gim´enez and Llu´ıs M`arquez, 2004]. • Tamil Morphological Analyzer [Anand Kumar M e.tal., 2009]: Morphological Analysis is the process of breaking down morphologically complex words into their constituent morphemes. It is the primary step for word formation analysis of any language. Morphological Analyzer was developed using a novel machine learning approach and was implemented using SVMTool. • Tamil Chunker [Dhanalakshmi V e.tal., 2009]: Chunks are normally taken to be non recursive correlated group of words. Chunker divides a sentence into its major-non-overlapping phrases 176
  • 4. (noun phrase, verb phrase, etc.) and attaches a label to each chunk. Chunker for Tamil language was developed using CRF++ Tool[Sha F and Pereira F, 2003]. • Tamil Dependency Parser for Relation finding [Dhanalakshmi V e.tal., 2011]: Given the POS tag, Morphological information and chunks in a sentence, this decides which relations they have with the main verb (subject, object, location, etc.). Dependency parser was developed for Tamil language using Malt Parser tool [Joakim Nivre and Johan Hall, 2005]. Dependency Parsing using Malt Parser MALT Parser Tool is used for dependency parsing, which uses supervised machine learning algorithm. Using this tool dependency relations and position of the head are obtained for Tamil sentence. There are 10 tuples used in the training data that can be user define. For Tamil dependency parsing, the following features are defined and others are set as NULL and are mentioned as ‘_’ in the training data format. WordID: Position of each word in the input sentence. Words: Each word in the input sentence. CPos Tag and Pos Tag: Defines the Parts Of Speech of each word. Head: The position of the parent of each word. Lemma: The lemma of the word. Morph Features The Morphological features of the word. Chunk The chunk information of the word. Dependency Relation: The terminology given for each parent – child relation. Sample Training Data 1 அவ _ <PRP> <PRP> 8 <N.SUB> _ _ 2 ைடகைள _ <NN> <NN> 3 <D.OBJ> _ _ 3 வா கி _ <VNAV> <VNAV> 4 <ATT> _ _ 4 சைம _ <VNAV> <VNAV> 6 <VNAV.MOD>_ _ 5த _ <NN> <NN> 6 <NST.MOD> _ _ 6 ேபா _ <VNAV> <VNAV> 8 <V.COMP> _ _ 7 உன _ <PRP> <PRP> 8 <I.OBJ> _ _ 8 ெகா கி றா _ <VF> <VF> 0 <ROOT> _ _ 9 . <DOT> <DOT> 8 <SYM> _ _ For Tamil language, a corpus of three thousand sentences is annotated with dependency relations and labels using the customized tag set (Table.1). The corpus is trained using the MALT Parser tool which generates a model. Using this model the new input sentences are tested. 177
  • 5. S.No Tags Description S.No Tags Description 1 ROOT Head word 5 NST-MOD Spatial Time Modifier 2 N-SUB Subject 6 SYM Symbols 3 D-OBJ Direct Object 7 X Others 4 I-OBJ Indirect Object Table.1 Shallow Dependency Tagset Application of Shallow Parser Shallow parsers were used in Verbmobil project [Wahlster W, 2000], to add robustness to a large speech-to-speech translation system. Shallow parsers are also typically used to reduce the search space for full-blown, `deep' parsers [Collins, 1999]. Yet another application of shallow parsing is question- answering on the World Wide Web, where there is a need to efficiently process large quantities of ill- formed documents [Buchholz and Daelemans, 2001] and more generally, all text mining applications, e.g. in biology [Sekimizu et al., 1998]. The developed Tamil Shallow Parser can be used to develop the following systems for Tamil language. • Information extraction and retrieval system for Tamil. • Simple Tamil Machine Translation system. • Tamil Grammar checker. • Automatic Tamil Sentence Structure Analyzer. • Language based educational exercises for Tamil language learners. Conclusion Shallow Parsing has proved to be a useful technology for written and spoken language domains. Full parsing is expensive, and is not very robust. Partial parsing has proved to be much faster and more robust. Dependency parser is better suited than phrase structure parser for languages with free or flexible word order like Tamil. Fully functional Shallow Parser for Tamil gives reliable results. The Shallow Parser system developed for Tamil is an important tool for Machine Translation between Tamil and other languages. References Anand kumar M, Dhanalakshmi V , Soman K P and Rajendran S (2009) , “A Novel Approach for Tamil Morphological Analyzer”, Proceedings of the 8th Tamil Internet Conference 2009, Cologne, Germany. Buchholz Sabine and Daelemans Walter (2001), “Complex Answers: A Case Study using a WWW Question Answering System”, Natural Language Engineering. Collins M (1999), “Head-Driven Statistical Models for Natural Language Parsing”, Ph.D Thesis, University of Pennsylvania. 178
  • 6. Dhanalakshmi V, Anand Kumar M, Vijaya M S, Loganathan R, Soman K P, Rajendran S (2008), “Tamil Part-of-Speech tagger based on SVMTool”, Proceedings of the COLIPS International Conference on natural language processing(IALP), Chiang Mai, Thailand. Dhanalakshmi V, Anand kumar M, Soman K P and Rajendran S (2009), “POS Tagger and Chunker for Tamil Language”, Proceedings of the 8th Tamil Internet Conference, Cologne, Germany. Dhanalakshmi V, Anand Kumar M, Rekha R U, Soman K.P and Rajendran S (2011), “Data driven Dependency Parser for Tamil and Malayalam” NCILC-2011, Cochin University of Science & Technology, India. Jes´us Gim´enez and Llu´ıs M`arquez.(2004) SVMTool: A general pos tagger generator based on support vector machines.In Proceedings of the 4th LREC Conference, 2004. Joakim Nivre and Johan Hall, MaltParser: A language-independent system for data-driven dependency parsing. In Proceedings of the Fourth Workshop on Treebanks and Linguistic Theories (TLT), 2005. Sekimizu T, Park H and Tsujii J (1998), “Identifying the interaction between genes and gene products based on frequently seen verbs in Medline abstracts”, Genome Informatics, Universal Academy Press. Sha F and Pereira F (2003), “Shallow Parsing with Conditional Random Fields”, Proceedings of Human Language Technology Coference’2003, Canada. Wahlster W (2000), “VERBMOBIL: Foundations of Speech-to-Speech Translation”, Springer- Verlag. 179