Apertium is a free and open source MT platform, where both the linguistic data and engines are under free licences. Constraint Grammar is used for pre-disambiguation in several language pairs.
Constraint Grammar and Apertium
1. CG in Apertium
Kevin Brubeck Unhammer
University of Bergen, Norway
14th May 2009
2. What is Apertium?
An Open Source Machine Translation platform
both source code and data have Free / Open Source licences
Modular
stand-alone programs communicate through standard Unix pipes
particular language pairs need not use all modules!
Developed by universities, companies and independent
(volunteer and paid) developers
3. History of Apertium
Initially developed for closely related languages (Portuguese ↔
Spanish ↔ Catalan) by the Transducens group at the Universitat
d’Alacant
Later extended to allow more distant language pairs
Now also involves various companies in Spain, the universities of
Vigo, Reykjavík, Oviedo, Barcelona (Pompeu Fabra), etc.
4. Language pairs
“Stable”: Spanish ↔ Catalan, Spanish ← Romanian, French ↔
Catalan, Occitan ↔ Catalan, English ↔ Galician, Occitan ↔
Spanish, Spanish ↔ Portuguese, English ↔ Catalan, English ↔
Spanish, English → Esperanto, Spanish ↔ Galician, French ↔
Spanish, Esperanto ← Spanish, Welsh → English, Esperanto ←
Catalan, Portuguese ↔ Catalan, Portuguese ↔ Galician,
Basque → Spanish
Other pairs being developed (Spanish ↔ Asturian, Icelandic ↔
English, Swedish ↔ Danish, Nynorsk ↔ Bokmål, . . . )
6. Modules
Morphological dictionaries
lttoolbox: XML format, compiles to FSTs
Fast (seems to perform 5x faster than SFST)
one dictionary gives both analysis and generation
CG pre-disambiguation
Statistical disambiguation (HMM)
Bilingual dictionary for lexical transfer
Shallow syntactic transfer rules
Local re-ordering (nom adj → adj nom)
Chunking (adj adj nom → SN[adj adj nom])
Insertions, deletions and substitutions of lexical units and chunks
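The local re-ordering step can be illustrated with a toy sketch. It is not Apertium's transfer rule format (those rules are written in XML); the (lemma, tag) pairs, the function name and the Spanish sample phrase are assumptions made only for illustration.

```python
# Toy illustration of a local re-ordering rule (nom adj -> adj nom).
# Lexical units are simplified to (lemma, tag) pairs; real Apertium
# transfer rules are written in XML and work on full tag sequences.

def reorder_nom_adj(units):
    """Swap every 'nom' that is immediately followed by an 'adj'."""
    out = []
    i = 0
    while i < len(units):
        if (i + 1 < len(units)
                and units[i][1] == "nom"
                and units[i + 1][1] == "adj"):
            # Pattern matched: emit the adjective first, then the noun.
            out.extend([units[i + 1], units[i]])
            i += 2
        else:
            out.append(units[i])
            i += 1
    return out

# Hypothetical sample input: "el gato negro"
phrase = [("el", "det"), ("gato", "nom"), ("negro", "adj")]
print(reorder_nom_adj(phrase))
# -> [('el', 'det'), ('negro', 'adj'), ('gato', 'nom')]
```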
8. The Apertium Stream Format
Simple example from Norwegian Bokmål
“lese en” (‘read a/one’)
Morphological analysis gives:
^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>
/ene<vblex><imp>/en<det><ind><mf><sg>$
After CG:
^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>
/en<det><ind><mf><sg>$
Formatting information (like HTML tags) is saved in superblanks
making document and web translation easy
original:
Kva er det du <em>seier</em>?
deformatted:
Kva er det du[ <em>]seier[</em>]?
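The superblank idea above can be approximated in a few lines. This is a rough sketch, not the actual Apertium deformatter: it assumes that every HTML tag, together with adjacent whitespace, becomes one bracketed superblank, and it ignores the escaping of literal brackets that a real deformatter has to handle. The function names are hypothetical.

```python
import re

# An HTML tag plus any whitespace directly around it.
TAG_WITH_SPACE = re.compile(r"\s*<[^>]+>\s*")
# A bracketed superblank, as produced by deformat() below.
SUPERBLANK = re.compile(r"\[([^\]]*)\]")

def deformat(html):
    """Wrap each tag (and adjacent whitespace) in [ ] so later modules skip it."""
    return TAG_WITH_SPACE.sub(lambda m: "[" + m.group(0) + "]", html)

def reformat(text):
    """Undo deformat(): unwrap the superblanks again."""
    return SUPERBLANK.sub(lambda m: m.group(1), text)

original = "Kva er det du <em>seier</em>?"
print(deformat(original))            # Kva er det du[ <em>]seier[</em>]?
print(reformat(deformat(original)))  # Kva er det du <em>seier</em>?
```

Because the translation modules pass superblanks through untouched, the markup can be restored around the translated words at the end of the pipeline.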
10. The platform provides
a language-independent machine translation engine
tools to manage the linguistic data necessary to build a machine
translation system for a given language pair
little programming knowledge required to get started
graphical user interfaces that show each step in the translation
process
many more advanced tools (e.g. for merging or sorting
dictionaries)
linguistic data for a growing number of language pairs
also usable for other NLP purposes (spelling & grammar checking,
...)
11. CG in Apertium
Used after morphological analysis for pre-disambiguation in
Nynorsk ↔ Bokmål, Welsh ↔ English, Breton ↔ French, Irish ↔
Scottish Gaelic
Apertium’s own statistical disambiguator makes a choice if CG
doesn’t completely disambiguate
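The division of labour between CG and the statistical disambiguator can be sketched as follows, using the readings of "en" from the "lese en" example. The constraint shown is a made-up rule, not one taken from the Norwegian grammar, and the fallback is a plain preference table standing in for Apertium's HMM tagger; both are assumptions for illustration only.

```python
# Readings are (lemma, tags) pairs, as in the "lese en" example.

def apply_constraints(readings, previous_tags):
    """Hypothetical rule: drop imperative readings right after an infinitive."""
    if "inf" in previous_tags:
        kept = [r for r in readings if "imp" not in r[1]]
        # As in CG, never remove the last remaining reading.
        if kept:
            return kept
    return readings

def choose(readings, score):
    """Fallback: pick the best-scoring reading that CG left in place."""
    return max(readings, key=score)

en_readings = [
    ("en", ["num", "sg", "mf"]),
    ("ene", ["vblex", "imp"]),
    ("en", ["det", "ind", "mf", "sg"]),
]

after_cg = apply_constraints(en_readings, previous_tags=["vblex", "inf"])

# Hypothetical preference table standing in for the HMM's probabilities.
PREFERENCE = {"det": 2, "num": 1}
best = choose(after_cg, lambda r: PREFERENCE.get(r[1][0], 0))

print(after_cg)  # two readings remain: CG did not fully disambiguate
print(best)      # the fallback makes the final choice
```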
12. CG in Apertium
Norwegian CG is from the Oslo-Bergen Tagger (GPL)
Sámi giellatekno provides Free grammars for Sámi languages
and Faroese
Irish grammar mostly converted manually from the An Gramadóir
project (GPL)
Other grammars made solely by Apertium members
13. Some statistics
Language  Sections  Rules  Sets  Tags
Welsh            2     98   141   128
Breton           4    121   125   154
Irish            1    285   298   292
Table: Rule counts for some of the CG grammars in Apertium
14. Same concepts apply between modules
CG         Apertium/lttoolbox      Apertium stream format
wordform   surface form            books
baseform   lemma                   book
cohort     ambiguous lexical unit  ^books/book<n><pl>/book<vblex><pres><p3><sg>$
reading    analysis                /book<n><pl>/
Table: Terminology differences
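The correspondence in the table can be made concrete with a small parser that reads the stream format into CG-style cohorts and readings. This is a minimal sketch: the class and function names are hypothetical, and it ignores superblanks and escaped characters, which a real reader has to handle.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Reading:          # Apertium: analysis
    baseform: str       # Apertium: lemma
    tags: list

@dataclass
class Cohort:           # Apertium: (ambiguous) lexical unit
    wordform: str       # Apertium: surface form
    readings: list = field(default_factory=list)

# One lexical unit: ^surface/analysis1/analysis2...$
UNIT_RE = re.compile(r"\^([^/$]*)/([^$]*)\$")
TAG_RE = re.compile(r"<([^>]+)>")

def parse_stream(text):
    """Turn Apertium stream format text into a list of Cohort objects."""
    cohorts = []
    for surface, analyses in UNIT_RE.findall(text):
        cohort = Cohort(wordform=surface)
        for analysis in analyses.split("/"):
            baseform = analysis.split("<", 1)[0]
            cohort.readings.append(Reading(baseform, TAG_RE.findall(analysis)))
        cohorts.append(cohort)
    return cohorts

cohorts = parse_stream("^books/book<n><pl>/book<vblex><pres><p3><sg>$")
for cohort in cohorts:
    print(cohort.wordform)
    for reading in cohort.readings:
        print("   ", reading.baseform, reading.tags)
```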
15. Same format readable by all modules
Both SFST/HFST and vislcg3 read and write the Apertium stream
format.
Example from the Open Morphology of Finnish, output by the
Apertium reader in SFST/HFST:
^kaikki/kaikki<noun><7><a><sg><nom>$
^ihmiset/ihminen<noun><38><pl><acc>/ihminen<noun><38><pl><nom>$
^syntyvät/syntyä<verb><52><j><act><pcpva><pl><acc>
/syntyä<verb><52><j><act><pcpva><pl><nom>
/syntyä<verb><52><j><act><indv><pres><pl3>$
^vapaina/vapaa<noun><17><pl><ess>$ ^ja/*ja$
^tasavertaisina/*tasavertaisina$
^arvoltaan/arvo<noun><1><sg><abl><pl3>/arvo<noun><1><sg><abl><sg3>$
^ja/*ja$
^oikeuksiltaan/oikeus<noun><40><pl><abl><pl3>/oikeus<noun><40><pl><abl><sg3>$
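In the output above, forms the analyser does not know are passed through with a leading asterisk (^ja/*ja$). A short sketch of how one might count coverage from such output follows; the function name is hypothetical and the regular expression ignores superblanks and escaped characters.

```python
import re

# One lexical unit: ^surface/analyses$
UNIT_RE = re.compile(r"\^([^/$]*)/([^$]*)\$")

def coverage(stream):
    """Return (number of analysed units, list of unknown surface forms)."""
    units = UNIT_RE.findall(stream)
    # Unknown words carry a single "analysis" that starts with '*'.
    unknown = [surface for surface, analyses in units
               if analyses.startswith("*")]
    return len(units) - len(unknown), unknown

sample = ("^vapaina/vapaa<noun><17><pl><ess>$ ^ja/*ja$ "
          "^tasavertaisina/*tasavertaisina$")
known, unknown = coverage(sample)
print(known, unknown)  # 1 ['ja', 'tasavertaisina']
```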
16. Why Apertium
Rule-based MT
most languages of the world have little freely available textual
data, let alone parallel corpora for SMT purposes; Apertium is
thus suitable for marginalised languages
Rule-based systems are linguistically interesting, and provide test
beds for linguistic theory
Reuse and Interoperability
Monolingual dictionaries and constraint grammars are directly
reusable for new language pairs
apertium-dixtools: generates new language pairs from existing
ones
vislcg3 reads and outputs the Apertium stream format, as do
Stuttgart/Helsinki Finite State Tools
Free licences allow other systems to use Apertium data and tools
17. Why Apertium
Open Source + fairly simple learning curve = great potential for
contributors
E.g. Jacob Nordfalk: entered Apertium last fall, had English →
Esperanto pair by March 2009
Very helpful and accessible community
18. Future work: dependency-based reordering in Apertium
Currently, CG is only used for disambiguation
Many constraint grammars out there give dependency
information; this could be integrated into Apertium to provide
dependency-based reordering, simplifying the transfer step
20. Future work: integration with Matxin
We would like to get CG dependency information into a
Matxin-compatible format.
Apertium’s CG would handle analysis while Matxin handles the
transfer step. E.g., given the following analysis (Faroese):
"<Í>"
    "í" Pr @ADVL> #1->3
"<upphavi>"
    "upphav" N Neu Sg Dat Indef @P< #2->1
"<skapti>"
    "skapa" V Ind Prt Sg @VMAIN #3->0
"<Gud>"
    "gudur" N Msc Sg Acc Indef @<SUBJ #4->3
"<himmal>"
    "himmal" N Msc Sg Acc Indef @<OBJ #5->3
21. Future work: integration with Matxin
...we would like to get this dependency tree structure:
<SENTENCE ord="1">
  <NODE form="skapti" lem="skapa" ord="3" mi="V.Ind.Prt.Sg" si="VMAIN">
    <NODE form="Í" lem="Í" ord="1" mi="Pr" si="ADVL">
      <NODE form="upphavi" lem="upphav" ord="2" mi="N.Neu.Sg.Dat.Indef" si="P"/>
    </NODE>
    <NODE form="Gud" lem="Gud" ord="4" mi="N.Prop.Sg.Nom" si="SUBJ"/>
    <NODE form="himmal" lem="himmal" ord="5" mi="N.Msc.Sg.Acc.Indef" si="OBJ"/>
  </NODE>
</SENTENCE>
and let Matxin do reordering and other transfer operations
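A conversion along these lines could be sketched as below. This is only an illustration of the idea, not an existing Apertium or Matxin tool: the function names are hypothetical, the attribute names (form, lem, ord, mi, si) are copied from the tree shown above rather than checked against Matxin's own format definition, and the input is assumed to be fully disambiguated (one reading per cohort).

```python
import re
import xml.etree.ElementTree as ET

WORDFORM_RE = re.compile(r'^"<(.*)>"$')
# "lemma" TAG TAG ... @FUNCTION #self->head
READING_RE = re.compile(r'^"([^"]*)"\s+(.*?)\s+#(\d+)->(\d+)$')

def parse_cg(text):
    """Return {ord: node dict} from CG output with dependency relations."""
    nodes, form = {}, None
    for line in text.splitlines():
        line = line.strip()
        m = WORDFORM_RE.match(line)
        if m:
            form = m.group(1)
            continue
        m = READING_RE.match(line)
        if m and form is not None:
            lem, tags, ord_, head = m.groups()
            tags = tags.split()
            # Syntactic function tags start with '@'; the rest is morphology.
            si = [t.strip("@<>") for t in tags if t.startswith("@")]
            mi = [t for t in tags if not t.startswith("@")]
            nodes[int(ord_)] = {"form": form, "lem": lem,
                                "mi": ".".join(mi),
                                "si": si[0] if si else "",
                                "head": int(head)}
    return nodes

def to_tree(nodes):
    """Nest each node under its head; head 0 is the sentence root."""
    sentence = ET.Element("SENTENCE", ord="1")
    elems = {0: sentence}
    for ord_, n in nodes.items():
        elems[ord_] = ET.Element("NODE", form=n["form"], lem=n["lem"],
                                 ord=str(ord_), mi=n["mi"], si=n["si"])
    for ord_ in sorted(nodes):
        elems[nodes[ord_]["head"]].append(elems[ord_])
    return sentence

CG_OUTPUT = '''
"<Í>"
    "í" Pr @ADVL> #1->3
"<upphavi>"
    "upphav" N Neu Sg Dat Indef @P< #2->1
"<skapti>"
    "skapa" V Ind Prt Sg @VMAIN #3->0
"<Gud>"
    "gudur" N Msc Sg Acc Indef @<SUBJ #4->3
"<himmal>"
    "himmal" N Msc Sg Acc Indef @<OBJ #5->3
'''

root = to_tree(parse_cg(CG_OUTPUT))
print(ET.tostring(root, encoding="unicode"))
```

The sketch copies lemmas and tags verbatim from the analysis, so its output differs in detail from the hand-written target tree (for instance, it keeps the lemma "gudur" for "Gud").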
23. Licences
This presentation may be distributed under the terms of the GNU GPL,
GNU FDL and CC-BY-SA licences.
GNU GPL v. 3.0
http://www.gnu.org/licenses/gpl.html
GNU FDL v. 1.2
http://www.gnu.org/licenses/gfdl.html
CC-BY-SA v. 3.0
http://creativecommons.org/licenses/by-sa/3.0/