CG in Apertium

 Kevin Brubeck Unhammer
University of Bergen, Norway



      14th May 2009
What is Apertium?




      An Open Source Machine Translation platform
          both source code and data have Free / Open Source licences
      Modular
          stand-alone programs communicate through standard Unix pipes
          particular language pairs need not use all modules!
      Developed by universities, companies and independent
      (volunteer and paid) developers
History of Apertium




       Initially developed for closely related languages (Portuguese ↔
       Spanish ↔ Catalan) by the Transducens group at the Universitat
       d’Alacant
       Later extended to allow more distant language pairs
       Now also involves various companies in Spain, the universities of
       Vigo, Reykjavík, Oviedo, Barcelona (Pompeu Fabra), etc.
Language pairs



      “Stable”: Spanish ↔ Catalan, Spanish ← Romanian, French ↔
      Catalan, Occitan ↔ Catalan, English ↔ Galician, Occitan ↔
      Spanish, Spanish ↔ Portuguese, English ↔ Catalan, English ↔
      Spanish, English → Esperanto, Spanish ↔ Galician, French ↔
      Spanish, Esperanto ← Spanish, Welsh → English, Esperanto ←
      Catalan, Portuguese ↔ Catalan, Portuguese ↔ Galician,
      Basque → Spanish
      Other pairs being developed (Spanish ↔ Asturian, Icelandic ↔
      English, Swedish ↔ Danish, Nynorsk ↔ Bokmål, . . . )
[Figure: labels "Marginalised", "Few free resources", "Copious free resources"]
Modules


     Morphological dictionaries
          lttoolbox: XML format, compiles to FSTs
                Fast (seems to perform 5x faster than SFST)
          one dictionary gives both analysis and generation
     CG pre-disambiguation
     Statistical disambiguation (HMM)
     Bilingual dictionary for lexical transfer
     Shallow syntactic transfer rules
          Local re-ordering (nom adj → adj nom)
          Chunking (adj adj nom → SN[adj adj nom])
          Insertions, deletions and substitutions of lexical units and chunks
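
The two shallow-transfer operations named above, local re-ordering and chunking, can be sketched in a few lines. This is only an illustration of the idea, not Apertium's transfer rule format: the `(lemma, tag)` tuples, the function names and the `LU` label for unchunked units are assumptions made for this example.

```python
from typing import List, Tuple

# A lexical unit is reduced here to (lemma, part-of-speech tag).
LexicalUnit = Tuple[str, str]
Chunk = Tuple[str, List[LexicalUnit]]


def reorder_nom_adj(units: List[LexicalUnit]) -> List[LexicalUnit]:
    """Local re-ordering: turn each adjacent (nom, adj) pair into (adj, nom)."""
    out: List[LexicalUnit] = []
    i = 0
    while i < len(units):
        if (i + 1 < len(units)
                and units[i][1] == "nom" and units[i + 1][1] == "adj"):
            out.extend([units[i + 1], units[i]])
            i += 2
        else:
            out.append(units[i])
            i += 1
    return out


def chunk_noun_phrases(units: List[LexicalUnit]) -> List[Chunk]:
    """Chunking: group each run 'adj* nom' into one chunk labelled SN."""
    chunks: List[Chunk] = []
    pending: List[LexicalUnit] = []
    for unit in units:
        if unit[1] == "adj":
            pending.append(unit)
        elif unit[1] == "nom":
            chunks.append(("SN", pending + [unit]))
            pending = []
        else:
            # Adjectives not followed by a noun stay as single units.
            chunks.extend(("LU", [u]) for u in pending)
            pending = []
            chunks.append(("LU", [unit]))
    chunks.extend(("LU", [u]) for u in pending)
    return chunks


# Illustrative input (not from the slides): Spanish "casa blanca".
reordered = reorder_nom_adj([("casa", "nom"), ("blanca", "adj")])
chunked = chunk_noun_phrases([("big", "adj"), ("red", "adj"), ("house", "nom")])
```

Real transfer rules also carry agreement information and can insert, delete or substitute units; none of that is modelled here.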
A sketch of the architecture
The Apertium Stream Format

      Simple example from Norwegian Bokmål
          “lese en” (‘read a/one’)
          Morphological analysis gives:
          ^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>
          /ene<vblex><imp>/en<det><ind><mf><sg>$
          After CG:
          ^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>
          /en<det><ind><mf><sg>$
      Formatting information (like HTML tags) is saved in superblanks
      making document and web translation easy
          original:
          Kva er det du <em>seier</em>?
          deformatted:
          Kva er det du[ <em>]seier[</em>]?
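
To make the stream format concrete, here is a minimal Python sketch that splits the Bokmål example into surface forms and analyses. It assumes only what the example shows (`^` and `$` delimit a lexical unit, `/` separates the surface form from its analyses, tags sit in `<...>`); escaping and superblanks are deliberately ignored, and the names `Reading`, `Cohort` and `parse_stream` are invented for this illustration.

```python
import re
from typing import List, NamedTuple


class Reading(NamedTuple):
    lemma: str
    tags: List[str]


class Cohort(NamedTuple):
    surface: str
    readings: List[Reading]


# One lexical unit: everything between ^ and the next $.
COHORT_RE = re.compile(r"\^(.*?)\$", re.DOTALL)
TAG_RE = re.compile(r"<([^<>]+)>")


def parse_stream(text: str) -> List[Cohort]:
    """Parse a simplified Apertium stream into cohorts (no escapes handled)."""
    cohorts = []
    for match in COHORT_RE.finditer(text):
        surface, *analyses = match.group(1).split("/")
        readings = []
        for analysis in analyses:
            analysis = analysis.strip()
            lemma = analysis.split("<", 1)[0]
            readings.append(Reading(lemma, TAG_RE.findall(analysis)))
        cohorts.append(Cohort(surface.strip(), readings))
    return cohorts


example = ("^lese/lese<vblex><inf>$ "
           "^en/en<num><sg><mf>/ene<vblex><imp>/en<det><ind><mf><sg>$")
cohorts = parse_stream(example)
for cohort in cohorts:
    print(cohort.surface, [r.lemma for r in cohort.readings])
```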
Visualising the process helps find errors
The platform provides


       a language-independent machine translation engine
       tools to manage the linguistic data necessary to build a machine
       translation system for a given language pair
            little programming knowledge required to get started
            graphical user interfaces that show each step in the translation
            process
            many more advanced tools (e.g. for merging or sorting
            dictionaries)

       linguistic data for a growing number of language pairs
            also usable for other NLP purposes (spelling & grammar checking,
            ...)
CG in Apertium




      Used after morphological analysis for pre-disambiguation in
      Nynorsk ↔ Bokmål, Welsh ↔ English, Breton ↔ French, Irish ↔
      Scottish Gaelic
      Apertium’s own statistical disambiguator makes a choice if CG
      doesn’t completely disambiguate
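
The division of labour described above (CG removes readings it can rule out, the statistical tagger decides whatever is left) can be sketched as follows. Both parts are stand-ins: the rule is a hypothetical one written for the "lese en" example, not taken from any Apertium grammar, and the score table replaces the HMM with made-up numbers.

```python
from typing import Dict, List, Tuple

Reading = Tuple[str, Tuple[str, ...]]  # (lemma, tags)
Cohort = Tuple[str, List[Reading]]     # (surface form, readings)


def remove_imp_after_inf(sentence: List[Cohort]) -> List[Cohort]:
    """Hypothetical rule: drop imperative readings after an unambiguous infinitive."""
    result: List[Cohort] = []
    for surface, readings in sentence:
        prev = result[-1][1] if result else []
        if len(prev) == 1 and "inf" in prev[0][1]:
            kept = [r for r in readings if "imp" not in r[1]]
            # As in CG, the last remaining reading is never removed.
            readings = kept or readings
        result.append((surface, readings))
    return result


def pick_by_score(sentence: List[Cohort], scores: Dict[str, float]) -> List[Cohort]:
    """Stand-in for the statistical tagger: keep the best-scoring reading."""
    def score(reading: Reading) -> float:
        tags = reading[1]
        return scores.get(tags[0], 0.0) if tags else 0.0

    return [(surface, [max(readings, key=score)]) for surface, readings in sentence]


sentence = [
    ("lese", [("lese", ("vblex", "inf"))]),
    ("en", [("en", ("num", "sg", "mf")),
            ("ene", ("vblex", "imp")),
            ("en", ("det", "ind", "mf", "sg"))]),
]
after_cg = remove_imp_after_inf(sentence)                       # CG step
resolved = pick_by_score(after_cg, {"det": 0.6, "num": 0.4})    # made-up scores
```

After the rule, "en" still has two readings, so the fallback step is what makes the final choice.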
CG in Apertium




      Norwegian CG is from the Oslo-Bergen Tagger (GPL)
      Sámi giellatekno provides Free grammars for Sámi languages
      and Faroese
      Irish grammar mostly converted manually from the An Gramadóir
      project (GPL)
      Other grammars made solely by Apertium members
Some statistics




                        Sections    Rules    Sets    Tags

               Welsh    2           98       141     128
               Breton   4           121      125     154
               Irish    1           285      298     292
        Table: Rule counts for some of the CG grammars in Apertium
Same concepts apply between modules




   CG         Apertium/lttoolbox       Apertium stream format
   wordform   surface form             books
   baseform   lemma                    book
   cohort     ambiguous lexical unit   ^books/book<n><pl>
                                       /book<vblex><pres><p3><sg>$
   reading    analysis                 /book<n><pl>/

                     Table: Terminology differences
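
The correspondence in the table can be shown by rewriting one lexical unit from the stream format into the textual cohort layout used by CG (a quoted wordform line followed by indented readings). This is a sketch for illustration only; it is not how the Apertium/vislcg3 integration is implemented, and `cohort_to_cg` is a name made up here.

```python
import re

TAG_RE = re.compile(r"<([^<>]+)>")


def cohort_to_cg(cohort: str) -> str:
    """Render one ^...$ lexical unit as a CG-style cohort with its readings."""
    body = cohort.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("expected a single ^...$ lexical unit")
    wordform, *analyses = body[1:-1].split("/")
    lines = ['"<%s>"' % wordform]             # wordform = surface form
    for analysis in analyses:                 # each analysis becomes a reading
        baseform = analysis.split("<", 1)[0]  # baseform = lemma
        tags = TAG_RE.findall(analysis)
        lines.append("\t" + " ".join(['"%s"' % baseform] + tags))
    return "\n".join(lines)


print(cohort_to_cg("^books/book<n><pl>/book<vblex><pres><p3><sg>$"))
```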
Same format readable by all modules


        Both SFST/HFST and vislcg3 read and write the Apertium stream
        format.
        Example from the Open Morphology of Finnish, output by the
        Apertium reader in SFST/HFST:

   ^kaikki/kaikki<noun><7><a><sg><nom>$
   ^ihmiset/ihminen<noun><38><pl><acc>/ihminen<noun><38><pl><nom>$
   ^syntyvät/syntyä<verb><52><j><act><pcpva><pl><acc>
   /syntyä<verb><52><j><act><pcpva><pl><nom>
   /syntyä<verb><52><j><act><indv><pres><pl3>$
   ^vapaina/vapaa<noun><17><pl><ess>$ ^ja/*ja$
   ^tasavertaisina/*tasavertaisina$
   ^arvoltaan/arvo<noun><1><sg><abl><pl3>/arvo<noun><1><sg><abl><sg3>$
   ^ja/*ja$
   ^oikeuksiltaan/oikeus<noun><40><pl><abl><pl3>/oikeus<noun><40><pl><abl><sg3>$
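
Because every module emits the same format, simple diagnostics can be written once and run on any module's output. The sketch below counts ambiguous and unknown lexical units; it assumes, as in the Finnish example, that an analysis starting with `*` marks an unknown word. The function name and the counter keys are invented for this illustration.

```python
import re
from collections import Counter

COHORT_RE = re.compile(r"\^(.*?)\$", re.DOTALL)


def ambiguity_stats(stream: str) -> Counter:
    """Count lexical units, unknown words and ambiguous units in a stream."""
    stats: Counter = Counter()
    for match in COHORT_RE.finditer(stream):
        _surface, *analyses = match.group(1).split("/")
        stats["cohorts"] += 1
        if any(a.strip().startswith("*") for a in analyses):
            stats["unknown"] += 1      # e.g. ^ja/*ja$
        elif len(analyses) > 1:
            stats["ambiguous"] += 1    # more than one reading left
    return stats


sample = ("^vapaina/vapaa<noun><17><pl><ess>$ ^ja/*ja$ "
          "^ihmiset/ihminen<noun><38><pl><acc>/ihminen<noun><38><pl><nom>$")
stats = ambiguity_stats(sample)
```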
Why Apertium


      Rule-based MT
          most languages of the world have little freely available textual
          data, let alone parallel corpora for SMT purposes; Apertium is
          thus suitable for marginalised languages
          Rule-based systems are linguistically interesting, and provide test
          beds for linguistic theory

      Reuse and Interoperability
          Monolingual dictionaries and constraint grammars are directly
          reusable for new language pairs
          apertium-dixtools: generates new language pairs from existing
          ones
          vislcg3 reads and outputs the Apertium stream format, as do
          Stuttgart/Helsinki Finite State Tools
          Free licences allow other systems to use Apertium data and tools
Why Apertium




      Open Source + fairly simple learning curve = great potential for
      contributors
           E.g. Jacob Nordfalk: joined Apertium in autumn 2008, had an
           English → Esperanto pair by March 2009
      Very helpful and accessible community
Future work: dependency-based reordering in Apertium




      Currently, CG is only used for disambiguation
      Many existing constraint grammars give dependency
      information; this could be integrated into Apertium to provide
      dependency-based reordering, simplifying the transfer step
Future work: integration with Matxin

        Matxin is a Free Software sister project of Apertium which
        currently uses FreeLing for dependency analyses:

   <SENTENCE ord='1'>
   <CHUNK ord='2' type='grup-verb' si='top'>
     <NODE ord='4' alloc='19' form='sacude' lem='sacudir' mi='VMIP3S0'> </NODE>
     <CHUNK ord='1' type='sn' si='subj'>
       <NODE ord='3' alloc='10' form='atentado' lem='atentado' mi='NCMS000'>
         <NODE ord='1' alloc='0' form='Un' lem='uno' mi='DI0MS0'> </NODE>
         <NODE ord='2' alloc='3' form='triple' lem='triple' mi='AQ0CS0'> </NODE>
       </NODE>
     </CHUNK>
     <CHUNK ord='3' type='sn' si='obj'>
       <NODE ord='5' alloc='26' form='Bagdad' lem='Bagdad' mi='NP00000'> </NODE>
     </CHUNK>
     <CHUNK ord='4' type='F-term' si='modnomatch'>
       <NODE ord='6' alloc='32' form='.' lem='.' mi='Fp'> </NODE>
     </CHUNK>
   </CHUNK>
   </SENTENCE>
Future work: integration with Matxin

           We would like to get CG dependency information into a
           Matxin-compatible format.
           Apertium’s CG would handle analysis while Matxin handles the
           transfer step. E.g. given the following analysis (Faroese):


   "<Í>"
           "í" Pr @ADVL> #1->3
   "<upphavi>"
           "upphav" N Neu Sg Dat Indef @P< #2->1
   "<skapti>"
           "skapa" V Ind Prt Sg @VMAIN #3->0
   "<Gud>"
           "gudur" N Msc Sg Acc Indef @<SUBJ #4->3
   "<himmal>"
           "himmal" N Msc Sg Acc Indef @<OBJ #5->3
Future work: integration with Matxin



        ...we would like to get this dependency tree structure:

   <SENTENCE ord='1'>
     <NODE form='skapti' lem='skapa' ord='3' mi='V.Ind.Prt.Sg' si='VMAIN'>
       <NODE form='Í' lem='Í' ord='1' mi='Pr' si='ADVL'>
         <NODE form='upphavi' lem='upphav' ord='2' mi='N.Neu.Sg.Dat.Indef' si='P'/>
       </NODE>
       <NODE form='Gud' lem='Gud' ord='4' mi='N.Prop.Sg.Nom' si='SUBJ'/>
       <NODE form='himmal' lem='himmal' ord='5' mi='N.Msc.Sg.Acc.Indef' si='OBJ'/>
     </NODE>
   </SENTENCE>


        and let Matxin do reordering and other transfer operations
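
A rough Python sketch of the proposed conversion, from CG dependency output (`#n->m` means word n depends on word m, with 0 as the root) to nested `NODE` elements. The attribute names are copied from the slides; whether Matxin would accept this output has not been checked, and the function name, the parsing rules and the use of only the first reading per cohort are assumptions of this example.

```python
import re
import xml.etree.ElementTree as ET

WORDFORM_RE = re.compile(r'^"<(.*)>"$')
READING_RE = re.compile(r'^"(.*?)"\s+(.*)$')
DEP_RE = re.compile(r"^#(\d+)->(\d+)$")


def cg_to_tree(cg_text: str) -> ET.Element:
    """Build a SENTENCE element whose NODE children follow the CG dependencies."""
    nodes = {}   # word position -> (head position, XML element)
    form = None
    for raw in cg_text.splitlines():
        line = raw.strip()
        m = WORDFORM_RE.match(line)
        if m:
            form = m.group(1)
            continue
        m = READING_RE.match(line)
        if not m or form is None:
            continue
        lemma, rest = m.groups()
        tags, function, position, head = [], "", None, None
        for token in rest.split():
            dep = DEP_RE.match(token)
            if dep:
                position, head = int(dep.group(1)), int(dep.group(2))
            elif token.startswith("@"):
                function = token.strip("@<>")   # @<SUBJ -> SUBJ
            else:
                tags.append(token)
        if position is None:
            continue
        elem = ET.Element("NODE", form=form, lem=lemma, ord=str(position),
                          mi=".".join(tags), si=function)
        nodes[position] = (head, elem)
        form = None   # only the first reading of each cohort is used
    sentence = ET.Element("SENTENCE", ord="1")
    for position in sorted(nodes):
        head, elem = nodes[position]
        # Head 0 marks the root; a missing head would raise KeyError here.
        parent = sentence if head == 0 else nodes[head][1]
        parent.append(elem)
    return sentence


faroese = '''"<Í>"
        "í" Pr @ADVL> #1->3
"<upphavi>"
        "upphav" N Neu Sg Dat Indef @P< #2->1
"<skapti>"
        "skapa" V Ind Prt Sg @VMAIN #3->0
"<Gud>"
        "gudur" N Msc Sg Acc Indef @<SUBJ #4->3
"<himmal>"
        "himmal" N Msc Sg Acc Indef @<OBJ #5->3'''
tree = cg_to_tree(faroese)
print(ET.tostring(tree, encoding="unicode"))
```

The sketch copies lemmas and tags straight from the CG readings, so its `lem` and `mi` values follow the analysis rather than the hand-written target tree, and it does not detect cycles in the dependency links.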
Thanks for listening!
Licences



   This presentation may be distributed under the terms of the GNU GPL,
   GNU FDL and CC-BY-SA licences.
       GNU GPL v. 3.0
       http://www.gnu.org/licenses/gpl.html
       GNU FDL v. 1.2
       http://www.gnu.org/licenses/gfdl.html
       CC-BY-SA v. 3.0
       http://creativecommons.org/licenses/by-sa/3.0/
