               Compiling Apertium dictionaries with HFST
    leveraging generalised compilation formulas to get more and better end
                 applications with fewer language descriptions

                                  Tommi A Pirinen, Francis Tyers
                                  tommi.pirinen@helsinki.fi

                                 University of Helsinki, Universitat d’Alacant


                                              May 22, 2012




    Tommi A Pirinen (Helsinki)        Compiling apertium monodix with HFST           May 22, 2012       1/8
Outline




1    Introduction
2    Benefits of this work
3    Conclusion




Finite-state automata, HFST and Apertium

     - Finite-state automata are an efficient way to encode dictionaries,
       morphological analysers, etc.
     - HFST stands for Helsinki Finite-State Technology. It consists of a
       library that works as a compatibility layer between different
       open-source finite-state implementations:
            SFST
            OpenFST
            Foma
     - It also includes a set of finite-state tools built on top of the
       library, and a set of end products that use the automata in
       real-world applications (sold separately)
     - HFST is still a research project in a computational linguistics
       research group, not in computer science or engineering
     - Apertium is a machine-translation platform that uses finite-state
       dictionaries
Compiling Apertium dictionaries with HFST: rationale

     - “just an engineering exercise”
     - getting all language descriptions to compile natively in HFST (as
       opposed to converting compiled automata)
     - using existing (and future) HFST algorithms to improve the resulting
       automata
     - using bits of linguistic information to get better auxiliary automata
       for HFST end applications: data that may not be possible to induce
       from converted compiled automata
     - possibility to integrate more complex features of finite-state
       morphology in Apertium dictionaries (morphophonetics, reduplication,
       etc.) that may be supported by other HFST tools
     - this paper fits nicely in my PhD thesis under “State of the art in
       language models”

Examples of immediate benefits to dictionary writers




     - A lot of current work in building NLP software involves managing
       huge amounts of lexical data
     - ...like generating different language models in different morphology
       programming formalisms: Apertium, hunspell, Xerox tools
     - getting native and uniform compilation formulas for all of them lets
       you write dictionaries once and use them everywhere
     - or pick and mix tools and features from different formalisms
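
To make "write once, use everywhere" concrete, here is a minimal sketch that reads one simplified monodix-style source and derives two things from it: an analyser mapping and a plain word list for a spell-checker. The dictionary content is invented for illustration, the alphabet and symbol definitions a real monodix carries are left out, and enumerating forms in Python only stands in for compiling them into an automaton. It is not the HFST compilation formula.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written dictionary in the style of an Apertium monodix.
# Entries and paradigm are illustrative, not taken from a real language pair.
MONODIX = """
<dictionary>
  <pardefs>
    <pardef n="house__n">
      <e><p><l></l><r><s n="n"/><s n="sg"/></r></p></e>
      <e><p><l>s</l><r><s n="n"/><s n="pl"/></r></p></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="house"><i>house</i><par n="house__n"/></e>
    <e lm="cat"><i>cat</i><par n="house__n"/></e>
  </section>
</dictionary>
"""


def side_text(node):
    """Render an <l> or <r> element as text, writing <s n="x"/> as <x>."""
    if node is None:
        return ""
    out = node.text or ""
    for child in node:
        if child.tag == "s":
            out += "<" + child.get("n") + ">"
        out += child.tail or ""
    return out


def expand(xml_source):
    """Return (surface form, analysis) pairs for every entry in the section."""
    root = ET.fromstring(xml_source)
    paradigms = {}
    for pardef in root.iter("pardef"):
        paradigms[pardef.get("n")] = [
            (side_text(e.find("p/l")), side_text(e.find("p/r")))
            for e in pardef.findall("e")
        ]
    pairs = []
    for entry in root.find("section").findall("e"):
        stem = entry.findtext("i", default="")
        for left, right in paradigms[entry.find("par").get("n")]:
            pairs.append((stem + left, stem + right))
    return pairs


# One source, two applications.
analyser = {}
for surface, analysis in expand(MONODIX):
    analyser.setdefault(surface, []).append(analysis)
spelling_word_list = sorted(analyser)

print(analyser["houses"])     # prints ['house<n><pl>']
print(spelling_word_list)     # prints ['cat', 'cats', 'house', 'houses']
```

The same pairs could feed any other formalism's compiler. The point is only that nothing in the source had to be written twice.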




Examples of additional applications that can be generated
from apertium dictionaries with this work



     - Spell-checkers! A basic spell-checker with a generic edit-distance
       suggestion generator can be generated automatically, and used in
       the majority of current open-source software without any extra
       effort
     - Predictive text entry for mobiles, such as T9, XT9, possibly Swype
       and keyboards as well
     - Morphological analysers, lemmatisers, segmenters, tokenisers, etc.,
       obviously
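
As a rough illustration of the first item, the sketch below pairs a word list with a generic edit-distance-1 suggestion generator. It assumes the lexicon is a plain Python set (the four words are invented) and counts a swap of adjacent letters as one edit. A finite-state spell-checker would instead encode both the lexicon and the error model as automata, so treat this as a brute-force stand-in for the idea, not as the HFST implementation.

```python
def edits1(word, alphabet):
    """All strings one edit away from word: delete, swap, replace, insert."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + swaps + replaces + inserts)


def suggest(word, lexicon):
    """Return the word itself if it is known, else known words one edit away."""
    if word in lexicon:
        return [word]
    # The alphabet is taken from the lexicon, so no language-specific setup.
    alphabet = sorted({ch for entry in lexicon for ch in entry})
    return sorted(edits1(word, alphabet) & lexicon)


lexicon = {"cat", "cats", "house", "houses"}   # illustrative word list
print(suggest("hous", lexicon))   # prints ['house']
print(suggest("cta", lexicon))    # prints ['cat']
```

Because the error model is generic, the only language-specific input is the word list, which is exactly what a compiled dictionary already provides.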




Examples of benefits that come for free: automatic
optimisation


     - depending on the library / end format you choose for compiled
       dictionaries, you get speed–space tradeoffs (or improvements in
       both)
     - This is work in progress, but once done it can be used in all
       dictionaries without modifications to sources
     - automatic flag diacritic induction
     - hyperminimisation
     - all this can be based on things like finding homomorphic components
       in the finite-state automaton
     - the linguistic concepts present in source code but missing from the
       compiled automaton should prove very useful here!
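
The simplest instance of an optimisation that needs no change to the sources is ordinary minimisation, sketched below for an acyclic word-list automaton. This is plain minimisation, not hyperminimisation or flag diacritic induction; the `State` class and the four-word list are assumptions made for the example. States that accept the same set of suffixes are merged, which here means the shared "-at(s)" tails.

```python
class State:
    """One automaton state: a finality flag and labelled outgoing arcs."""

    def __init__(self):
        self.final = False
        self.arcs = {}


def build_trie(words):
    """Build an unminimised acyclic automaton (a trie) from a word list."""
    root = State()
    for word in words:
        state = root
        for ch in word:
            state = state.arcs.setdefault(ch, State())
        state.final = True
    return root


def count_states(root):
    """Count the distinct states reachable from root."""
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.arcs.values())
    return len(seen)


def minimise(state, registry):
    """Merge states that accept the same suffixes (acyclic automata only)."""
    for ch, target in list(state.arcs.items()):
        state.arcs[ch] = minimise(target, registry)
    # Children are already canonical, so their identities define this state.
    signature = (state.final,
                 tuple(sorted((ch, id(t)) for ch, t in state.arcs.items())))
    return registry.setdefault(signature, state)


def accepts(root, word):
    """Check whether the automaton accepts word."""
    state = root
    for ch in word:
        if ch not in state.arcs:
            return False
        state = state.arcs[ch]
    return state.final


words = ["cat", "cats", "hat", "hats"]
root = build_trie(words)
before = count_states(root)
root = minimise(root, {})
after = count_states(root)
print(before, after)   # prints 9 5
```

The accepted language is unchanged, so every application built on the automaton benefits without anyone touching the dictionary source.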


What now?

     - The reference material for the article is in our svn,
       http://hfst.svn.sf.net/svnroot/hfst/trunk/lrec-2011-apertium,
       which includes compilation of spell-checkers for most Apertium
       dictionaries
     - what do we do to remove duplicate work, duplicate versions of
       dictionaries, conversion scripts...
     - more compilers? Conversion scripts? New programming languages?
       New “standards” that everyone will use?
     - I’ll throw you this: I need more linguistic data and less
       engineering in the language model implementations to compile more
       applications from one source dictionary. Example: the LR/RL concept
       in Apertium and asymmetric flags in Xerox FSM describe things from
       an engineering-hack point of view; had the description called the
       entry a substandard or dialectal word form, it would already be
       usable in all applications!
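
The last point can be sketched as follows. In a monodix, an entry restricted with LR is used only for analysis and one restricted with RL only for generation. The `register` field below is hypothetical, standing in for the linguistic label the slide asks for, and the two entries are invented. With only the direction restriction a spell-checker has to guess why the entry was restricted; with the label it can decide for itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    surface: str
    analysis: str
    restriction: Optional[str] = None  # "LR", "RL" or None, as in monodix
    register: Optional[str] = None     # hypothetical label, e.g. "substandard"


# Illustrative entries: "thru" should be understood but never produced.
entries = [
    Entry("through", "through<pr>"),
    Entry("thru", "through<pr>", restriction="LR", register="substandard"),
]


def analyser_forms(entries):
    """Analysis uses unrestricted entries and those marked LR."""
    return [e.surface for e in entries if e.restriction != "RL"]


def generator_forms(entries):
    """Generation uses unrestricted entries and those marked RL."""
    return [e.surface for e in entries if e.restriction != "LR"]


def speller_suggestions(entries):
    """A speller reading the linguistic label: never suggest substandard forms."""
    return [e.surface for e in entries if e.register != "substandard"]


print(analyser_forms(entries))        # prints ['through', 'thru']
print(generator_forms(entries))       # prints ['through']
print(speller_suggestions(entries))   # prints ['through']
```

The first two functions only know a compilation direction. The third works from a statement about the language, which is the kind of information every application can reuse.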

 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Dernier (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Compiling Apertium Dictionaries with HFST

  • 1. Compiling Apertium dictionaries with HFST: leveraging generalised compilation formulas to get more and better end applications with fewer language description
      Tommi A Pirinen, Francis Tyers (tommi.pirinen@helsinki.fi)
      University of Helsinki, Universitat d’Alacant
      May 22, 2012
      Tommi A Pirinen (Helsinki) Compiling apertium monodix with HFST May 22, 2012
  • 2. Outline
      1 Introduction
      2 Benefits of this work
      3 Conclusion
  • 3. Finite-state automata and HFST and apertium
      • Finite-state automata are one efficient way to encode dictionaries, morphological analysers etc.
      • HFST stands for Helsinki Finite-State Technology. It consists of a library working as a compatibility layer between different open-source finite-state implementations: SFST, OpenFST, Foma
      • Also a set of finite-state tools built on top of the library, and a set of end products using the automata in real-world applications (sold separately)
      • HFST is still a research project in a computational linguistics research group, not computer science or engineering
      • apertium is a machine-translation platform that uses finite-state dictionaries
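The first point on slide 3, that a finite-state automaton can encode a dictionary, can be illustrated with a minimal sketch. This is plain Python and does not use the HFST library or its API; the function names `build_acceptor` and `accepts` are hypothetical, and the automaton is an unminimised trie-shaped acceptor rather than anything HFST would actually produce.

```python
def build_acceptor(words):
    """Build a trie-shaped deterministic acceptor for a word list.

    Returns (transitions, finals), where transitions maps
    (state, symbol) -> state and state 0 is the start state.
    """
    transitions, finals, next_state = {}, set(), 1
    for word in words:
        state = 0
        for symbol in word:
            key = (state, symbol)
            if key not in transitions:
                # Allocate a fresh state for an unseen prefix.
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        finals.add(state)  # the state reached at the end of a word is final
    return transitions, finals


def accepts(automaton, word):
    """Return True if the automaton accepts the word."""
    transitions, finals = automaton
    state = 0
    for symbol in word:
        state = transitions.get((state, symbol))
        if state is None:  # no transition: the word is not in the dictionary
            return False
    return state in finals
```

Lookup time depends only on the length of the word, not on the size of the dictionary, which is one reason the representation is considered efficient. A real morphological analyser would be a transducer (pairing surface forms with analyses) and would be minimised.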
  • 4. Compiling apertium dictionaries with HFST: rationale
      • “just an engineering exercise”
      • getting all language descriptions to compile natively in HFST (as opposed to converting compiled automata)
      • using existing (and future) HFST algorithms to improve the resulting automata
      • using bits of linguistic information to get better auxiliary automata for HFST end applications: data that may not be possible to induce from converted compiled automata
      • possibility to integrate more complex features of finite-state morphology in apertium dictionaries (morphophonetics, reduplication etc.) that may be supported by other HFST tools
      • this paper fits nicely in my PhD thesis under “State of the art in language models”
  • 5. Examples of immediate benefits to dictionary writers
      • A lot of current work in building NLP software involves management of huge amounts of lexical data
      • ...like generating different language models in different morphology programming formalisms: apertium, hunspell, xerox tools
      • getting native and uniform compilation formulas for all lets you write dictionaries once and use everywhere
      • or pick and mix tools and features from different formalisms
  • 6. Examples of additional applications that can be generated from apertium dictionaries with this work
      • Spell-checkers! A basic spell-checker with a generic edit distance suggestion generator can be automatically generated, and used in the majority of current open-source software without any extra effort
      • Predictive text entry for mobiles, such as T9, XT9, possibly swype and keyboard as well
      • Morphological analysers, lemmatisers, segmenters, tokenisers, etc., obviously
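The "generic edit distance suggestion generator" on slide 6 can be sketched as follows. This is an illustration only: in HFST the lexicon and the error model would both be automata, whereas here the lexicon is a plain Python set and the names `edits1` and `suggest` are hypothetical. The edit operations include transposition, so the distance is Damerau-Levenshtein distance 1, and the default alphabet (lowercase ASCII) is an assumption.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return all strings within one edit of word.

    Edits are deletion, transposition of adjacent symbols,
    substitution and insertion.
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, lexicon):
    """Return the word itself if it is known, else known words one edit away."""
    if word in lexicon:
        return [word]
    return sorted(edits1(word) & set(lexicon))
```

Because the suggestion step needs nothing but the set of valid word forms, any dictionary that compiles to an acceptor can in principle be turned into a basic spell-checker this way.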
  • 7. Examples of benefits that come for free: automatic optimisation
      • depending on library / end format you choose for compiled dictionaries, you get speed–space tradeoffs (or improvements in both)
      • This is work-in-progress, but once done it can be used in all dictionaries without modifications to sources
          • automatic flag diacritic induction
          • hyperminimisation
      • all this can be based on things like finding homomorphic components from the finite-state automaton
      • the linguistic concepts present in source code but missing from the compiled automaton should prove very useful here!
  • 8. What now?
      • The reference material for the article is in our svn, http://hfst.svn.sf.net/svnroot/hfst/trunk/lrec-2011-apertium, and includes compilation of spell-checkers for most apertium dictionaries
      • what do we do to remove duplicate work, duplicate versions of dictionaries, conversion scripts... more compilers? Conversion scripts? New programming languages? New “standards” that everyone will use?
      • I’ll throw you this: I need more linguistic data and less engineering in the language model implementations to compile more applications from one source dictionary.
      • Example: the LR/RL concept in apertium, or asymmetric flags in Xerox FSM, is an engineering hack from this point of view; had the description instead called the entry a substandard or dialectal word form, it would already be usable in all applications!
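The LR/RL concept mentioned on slide 8 can be illustrated with a small sketch. In an apertium monodix, an entry marked r="LR" is used only in the left-to-right (analysis) direction and one marked r="RL" only in the right-to-left (generation) direction. The dictionary fragment below is invented for illustration and is not taken from any real language pair; the function names are hypothetical, and the reader handles only simple entries with a single pair, ignoring paradigms and everything else a real compiler handles.

```python
import xml.etree.ElementTree as ET

# Hypothetical monodix fragment: "color" is analysed but never generated.
DIX = """
<section id="main" type="standard">
  <e><p><l>colour</l><r>colour<s n="n"/></r></p></e>
  <e r="LR"><p><l>color</l><r>colour<s n="n"/></r></p></e>
</section>
"""


def read_pairs(xml_text):
    """Return (surface, lexical, restriction) for each simple <e> entry."""
    pairs = []
    for e in ET.fromstring(xml_text).iter("e"):
        left, right = e.find("p/l"), e.find("p/r")
        tags = "".join("<%s>" % s.get("n") for s in right.iter("s"))
        pairs.append((left.text or "", (right.text or "") + tags, e.get("r")))
    return pairs


def analyser(pairs):
    """Map surface form -> lexical forms, skipping generation-only entries."""
    table = {}
    for surface, lexical, restriction in pairs:
        if restriction != "RL":
            table.setdefault(surface, []).append(lexical)
    return table


def generator(pairs):
    """Map lexical form -> surface forms, skipping analysis-only entries."""
    table = {}
    for surface, lexical, restriction in pairs:
        if restriction != "LR":
            table.setdefault(lexical, []).append(surface)
    return table
```

The restriction tells a compiler which direction to drop the entry from, but not why. That is the point of the slide: a linguistic label such as "dialectal" would let each application (analyser, generator, spell-checker) make its own decision about the form.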