Constraint Grammar and Apertium

CG in Apertium

Kevin Brubeck Unhammer
University of Bergen, Norway

14th May 2009

What is Apertium?

An Open Source Machine Translation platform
both source code and data have Free / Open Source licences
Modular
stand-alone programs communicate through standard Unix pipes
particular language pairs need not use all modules!
Developed by universities, companies and independent
(volunteer and paid) developers

History of Apertium

Initially developed for closely related languages (Portuguese ↔
Spanish ↔ Catalan) by the Transducens group at the Universitat
d’Alacant
Later extended to allow more distant language pairs
Now also involves various companies in Spain, the universities of
Vigo, Reykjavík, Oviedo, Barcelona (Pompeu Fabra), etc.

Language pairs

“Stable”: Spanish ↔ Catalan, Spanish ← Romanian, French ↔
Catalan, Occitan ↔ Catalan, English ↔ Galician, Occitan ↔
Spanish, Spanish ↔ Portuguese, English ↔ Catalan, English ↔
Spanish, English → Esperanto, Spanish ↔ Galician, French ↔
Spanish, Esperanto ← Spanish, Welsh → English, Esperanto ←
Catalan, Portuguese ↔ Catalan, Portuguese ↔ Galician,
Basque → Spanish
Other pairs being developed (Spanish ↔ Asturian, Icelandic ↔
English, Swedish ↔ Danish, Nynorsk ↔ Bokmål, ...)

[Figure labels: Marginalised; Few free resources; Copious free resources]

Modules

Morphological dictionaries
lttoolbox: XML format, compiles to FSTs
Fast (seems to perform 5x faster than SFST)
one dictionary gives both analysis and generation
CG pre-disambiguation
Statistical disambiguation (HMM)
Bilingual dictionary for lexical transfer
Shallow syntactic transfer rules
Local re-ordering (nom adj → adj nom)
Chunking (adj adj nom → SN[adj adj nom])
Insertions, deletions and substitutions of lexical units and chunks
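
As a rough illustration of the local re-ordering rule above (nom adj → adj nom), the following Python sketch swaps each noun that is immediately followed by an adjective. The `(word, tag)` pairs and the function name `reorder_nom_adj` are assumptions made for this example only; real Apertium transfer rules are written in XML rule files, not in Python.

```python
def reorder_nom_adj(units):
    """Swap every noun directly followed by an adjective (nom adj -> adj nom).

    `units` is a list of (word, tag) pairs; the tag names "nom" and "adj"
    follow the slide, not any particular Apertium tagset.
    """
    out = []
    i = 0
    while i < len(units):
        is_nom_adj = (
            i + 1 < len(units)
            and units[i][1] == "nom"
            and units[i + 1][1] == "adj"
        )
        if is_nom_adj:
            # Emit the adjective first, then the noun, and skip both.
            out.extend([units[i + 1], units[i]])
            i += 2
        else:
            out.append(units[i])
            i += 1
    return out


phrase = [("la", "det"), ("casa", "nom"), ("blanca", "adj")]
print(reorder_nom_adj(phrase))
```

Only the order changes here; lexical transfer (replacing the words themselves) is a separate step.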

The Apertium Stream Format

Simple example from Norwegian Bokmål
“lese en” (‘read a/one’)
Morphological analysis gives:
^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/ene<vblex><imp>/en<det><ind><mf><sg>$
After CG:
^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/en<det><ind><mf><sg>$
Formatting information (like HTML tags) is saved in superblanks
making document and web translation easy
original:
Kva er det du <em>seier</em>?
deformatted:
Kva er det du[ <em>]seier[</em>]?
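
The deformatting step shown above can be sketched in a few lines of Python. This is only an approximation written for this example: the names `deformat` and `reformat_text` are hypothetical, and the sketch does not escape reserved characters or handle anything beyond simple inline tags, which a real deformatter has to do.

```python
import re

# One or more markup tags, together with the whitespace around them.
TAG_RUN = re.compile(r"\s*(?:<[^>]+>\s*)+")

# The same run once it has been wrapped in square brackets.
SUPERBLANK = re.compile(r"\[(\s*(?:<[^>]+>\s*)+)\]")


def deformat(text):
    """Wrap each run of tags (plus adjacent whitespace) in a superblank."""
    return TAG_RUN.sub(lambda m: "[" + m.group(0) + "]", text)


def reformat_text(text):
    """Undo deformat() by dropping the superblank brackets again."""
    return SUPERBLANK.sub(r"\1", text)


original = "Kva er det du <em>seier</em>?"
print(deformat(original))
print(reformat_text(deformat(original)) == original)
```

Because the markup travels inside the superblanks, the modules in between can treat it as an opaque blank and the formatting can be restored after translation.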

Visualising the process helps find errors

The platform provides

a language-independent machine translation engine
tools to manage the linguistic data necessary to build a machine
translation system for a given language pair
little programming knowledge required to get started
graphical user interfaces that show each step in the translation
process
many more advanced tools (e.g. for merging or sorting
dictionaries)

linguistic data for a growing number of language pairs
also usable for other NLP purposes (spelling & grammar checking,
...)

CG in Apertium

Used after morphological analysis for pre-disambiguation in
Nynorsk ↔ Bokmål, Welsh ↔ English, Breton ↔ French, Irish ↔
Scottish Gaelic
Apertium’s own statistical disambiguator makes a choice if CG
doesn’t completely disambiguate
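
The division of labour between CG and the statistical disambiguator can be sketched as follows, using the "lese en" example from the stream format slide. Everything here is an assumption for illustration: the rule `remove_imperative_after_infinitive` is a made-up CG-style REMOVE rule, and `pick_first` merely stands in for the HMM tagger, which would choose by probability rather than by position.

```python
def remove_imperative_after_infinitive(cohorts):
    """Hypothetical CG-style REMOVE rule.

    Drops imperative readings of a word that directly follows an
    unambiguous infinitive, but never removes the last remaining reading.
    Each cohort is (surface_form, [(lemma, [tags...]), ...]).
    """
    result = []
    for i, (surface, readings) in enumerate(cohorts):
        if i > 0:
            prev_readings = cohorts[i - 1][1]
            prev_is_inf = (
                len(prev_readings) == 1 and "inf" in prev_readings[0][1]
            )
            kept = [r for r in readings if "imp" not in r[1]]
            if prev_is_inf and kept:
                readings = kept
        result.append((surface, readings))
    return result


def pick_first(cohorts):
    """Placeholder for the statistical tagger: keep the first reading left."""
    return [(surface, readings[0]) for surface, readings in cohorts]


cohorts = [
    ("lese", [("lese", ["vblex", "inf"])]),
    ("en", [
        ("en", ["num", "sg", "mf"]),
        ("ene", ["vblex", "imp"]),
        ("en", ["det", "ind", "mf", "sg"]),
    ]),
]

after_cg = remove_imperative_after_infinitive(cohorts)
# "en" is still ambiguous between numeral and determiner after the rule,
# so the fallback step has to make the final choice.
print(after_cg[1])
print(pick_first(after_cg))
```

The point of the sketch is the ordering: rules only narrow the set of readings, and whatever ambiguity is left is resolved by the later module.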

CG in Apertium

Norwegian CG is from the Oslo-Bergen Tagger (GPL)
Sámi giellatekno provides Free grammars for Sámi languages
and Faroese
Irish grammar mostly converted manually from the An Gramadóir
project (GPL)
Other grammars made solely by Apertium members

Some statistics

Language  Sections  Rules  Sets  Tags

Welsh            2     98   141   128
Breton           4    121   125   154
Irish            1    285   298   292
Table: Rule counts for some of the CG grammars in Apertium

Same concepts apply between modules

CG         Apertium/lttoolbox       Apertium stream format
wordform   surface form             books
baseform   lemma                    book
cohort     ambiguous lexical unit   ^books/book<n><pl>/book<vblex><pres><p3><sg>$
reading    analysis                 /book<n><pl>/

Table: Terminology differences
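
To make the correspondence in the table concrete, here is a minimal Python sketch that parses Apertium stream format into objects named after the CG terms. The class and function names (`Reading`, `Cohort`, `parse_stream`) are assumptions for this example, and the parser ignores escaped characters and superblanks, so it is not a full implementation of the format.

```python
import re
from dataclasses import dataclass


@dataclass
class Reading:        # CG "reading"  = Apertium "analysis"
    baseform: str     # CG "baseform" = Apertium "lemma"
    tags: list


@dataclass
class Cohort:         # CG "cohort"   = Apertium "(ambiguous) lexical unit"
    wordform: str     # CG "wordform" = Apertium "surface form"
    readings: list


LEXICAL_UNIT = re.compile(r"\^(.*?)\$", re.DOTALL)
TAG = re.compile(r"<([^>]+)>")


def parse_stream(stream):
    """Split ^surface/analysis/analysis$ units into Cohort objects."""
    cohorts = []
    for match in LEXICAL_UNIT.finditer(stream):
        wordform, *analyses = match.group(1).split("/")
        readings = []
        for analysis in analyses:
            analysis = analysis.strip()
            # Everything before the first tag is the baseform (lemma).
            baseform = analysis.split("<", 1)[0]
            readings.append(Reading(baseform, TAG.findall(analysis)))
        cohorts.append(Cohort(wordform, readings))
    return cohorts


cohort = parse_stream("^books/book<n><pl>/book<vblex><pres><p3><sg>$")[0]
print(cohort.wordform)
for reading in cohort.readings:
    print(reading.baseform, reading.tags)
```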

Same format readable by all modules

Both SFST/HFST and vislcg3 read and write the Apertium stream
format.
Example from the Open Morphology of Finnish, output by the
Apertium reader in SFST/HFST:

^kaikki/kaikki<noun><7><a><sg><nom>$
^ihmiset/ihminen<noun><38><pl><acc>/ihminen<noun><38><pl><nom>$
^syntyvät/syntyä<verb><52><j><act><pcpva><pl><acc>
/syntyä<verb><52><j><act><pcpva><pl><nom>
/syntyä<verb><52><j><act><indv><pres><pl3>$
^vapaina/vapaa<noun><17><pl><ess>$ ^ja/*ja$
^tasavertaisina/*tasavertaisina$
^arvoltaan/arvo<noun><1><sg><abl><pl3>/arvo<noun><1><sg><abl><sg3>$
^ja/*ja$
^oikeuksiltaan/oikeus<noun><40><pl><abl><pl3>/oikeus<noun><40><pl><abl><sg3>$

Why Apertium

Rule-based MT
most languages of the world have little freely available textual
data, let alone parallel corpora for SMT purposes; Apertium is
thus suitable for marginalised languages
Rule-based systems are linguistically interesting, and provide test
beds for linguistic theory

Reuse and Interoperability
Monolingual dictionaries and constraint grammars are directly
reusable for new language pairs
apertium-dixtools: generates new language pairs from existing
ones
vislcg3 reads and outputs the Apertium stream format, as do
Stuttgart/Helsinki Finite State Tools
Free licences allow other systems to use Apertium data and tools

Why Apertium

Open Source + fairly simple learning curve = great potential for
contributors
E.g. Jacob Nordfalk: joined Apertium in autumn 2008, had an English →
Esperanto pair by March 2009
Very helpful and accessible community

Future work: dependency-based reordering in Apertium

Currently, CG is only used for disambiguation
Many existing constraint grammars also give dependency
information; this could be integrated into Apertium to provide
dependency-based reordering, simplifying the transfer step

Future work: integration with Matxin

Matxin is a Free Software sister project of Apertium which
currently uses FreeLing for dependency analyses:

<SENTENCE ord='1'>
  <CHUNK ord='2' type='grup-verb' si='top'>
    <NODE ord='4' alloc='19' form='sacude' lem='sacudir' mi='VMIP3S0'> </NODE>
    <CHUNK ord='1' type='sn' si='subj'>
      <NODE ord='3' alloc='10' form='atentado' lem='atentado' mi='NCMS000'>
        <NODE ord='1' alloc='0' form='Un' lem='uno' mi='DI0MS0'> </NODE>
        <NODE ord='2' alloc='3' form='triple' lem='triple' mi='AQ0CS0'> </NODE>
      </NODE>
    </CHUNK>
    <CHUNK ord='3' type='sn' si='obj'>
      <NODE ord='5' alloc='26' form='Bagdad' lem='Bagdad' mi='NP00000'> </NODE>
    </CHUNK>
    <CHUNK ord='4' type='F-term' si='modnomatch'>
      <NODE ord='6' alloc='32' form='.' lem='.' mi='Fp'> </NODE>
    </CHUNK>
  </CHUNK>
</SENTENCE>

Future work: integration with Matxin

We would like to get CG dependency information into a
Matxin-compatible format.
Apertium’s CG would handle analysis while Matxin handles the
transfer step. E.g. given the following analysis (Faroese):

"<Í>"
"í" Pr @ADVL> #1->3
"<upphavi>"
"upphav" N Neu Sg Dat Indef @P< #2->1
"<skapti>"
"skapa" V Ind Prt Sg @VMAIN #3->0
"<Gud>"
"gudur" N Msc Sg Acc Indef @<SUBJ #4->3
"<himmal>"
"himmal" N Msc Sg Acc Indef @<OBJ #5->3

Future work: integration with Matxin

...we would like to get this dependency tree structure:

<SENTENCE ord='1'>
  <NODE form='skapti' lem='skapa' ord='3' mi='V.Ind.Prt.Sg' si='VMAIN'>
    <NODE form='Í' lem='Í' ord='1' mi='Pr' si='ADVL'>
      <NODE form='upphavi' lem='upphav' ord='2' mi='N.Neu.Sg.Dat.Indef' si='P'/>
    </NODE>
    <NODE form='Gud' lem='Gud' ord='4' mi='N.Prop.Sg.Nom' si='SUBJ'/>
    <NODE form='himmal' lem='himmal' ord='5' mi='N.Msc.Sg.Acc.Indef' si='OBJ'/>
  </NODE>
</SENTENCE>

and let Matxin do reordering and other transfer operations
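
A possible shape for this conversion, sketched in Python with only the standard library. The function name `cg_to_tree` and the regular expressions are assumptions for this example; the attribute names (`form`, `lem`, `ord`, `mi`, `si`) follow the target tree on the slide rather than any confirmed Matxin schema, and the sketch assumes exactly one reading per cohort. Attribute values are copied straight from the CG analysis, so for "Gud" the lemma and tags come out as in the analysis ("gudur", N.Msc.Sg.Acc.Indef), not as in the target tree.

```python
import re
import xml.etree.ElementTree as ET

WORDFORM = re.compile(r'^"<(.*)>"$')
READING = re.compile(r'^"(.*?)"\s+(.*?)\s+@(\S+)\s+#(\d+)->(\d+)$')


def cg_to_tree(cg_text):
    """Turn CG output with #child->parent links into nested NODE elements."""
    nodes, parents = {}, {}
    form = None
    for line in cg_text.splitlines():
        line = line.strip()
        m = WORDFORM.match(line)
        if m:
            form = m.group(1)
            continue
        m = READING.match(line)
        if m:
            lem, tags, func, ord_, parent = m.groups()
            nodes[int(ord_)] = ET.Element("NODE", {
                "form": form,
                "lem": lem,
                "ord": ord_,
                "mi": ".".join(tags.split()),
                "si": func.strip("<>"),  # drop the CG direction arrows
            })
            parents[int(ord_)] = int(parent)
    sentence = ET.Element("SENTENCE", {"ord": "1"})
    for ord_ in sorted(nodes):
        parent = parents[ord_]
        # Parent 0 marks the root, which hangs directly under SENTENCE.
        (sentence if parent == 0 else nodes[parent]).append(nodes[ord_])
    return sentence


CG_OUTPUT = '''
"<Í>"
    "í" Pr @ADVL> #1->3
"<upphavi>"
    "upphav" N Neu Sg Dat Indef @P< #2->1
"<skapti>"
    "skapa" V Ind Prt Sg @VMAIN #3->0
"<Gud>"
    "gudur" N Msc Sg Acc Indef @<SUBJ #4->3
"<himmal>"
    "himmal" N Msc Sg Acc Indef @<OBJ #5->3
'''

tree = cg_to_tree(CG_OUTPUT)
print(ET.tostring(tree, encoding="unicode"))
```

Children are attached in word order, so the nesting mirrors the dependency links while the `ord` attribute preserves the original position of each word.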

Licences

This presentation may be distributed under the terms of the GNU GPL,
GNU FDL and CC-BY-SA licences.
GNU GPL v. 3.0
http://www.gnu.org/licenses/gpl.html
GNU FDL v. 1.2
http://www.gnu.org/licenses/gfdl.html
CC-BY-SA v. 3.0
http://creativecommons.org/licenses/by-sa/3.0/
