Poio API - An annotation framework to bridge
Language Documentation and Natural Language
                  Processing
      Centro Interdisciplinar de Documentação Linguística e Social
                              Minde/Portugal

                        Vera Ferreira, vferreira@cidles.eu
                         Peter Bouda, pbouda@cidles.eu
                        António Lopes, alopes@cidles.eu




       Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu
                                 @ ACRH-2, Lisbon, 29.11.2012
Language documentation
●   Aim of developing a "lasting, multipurpose record of a
    language"
●   Collection, distribution, and preservation of primary
    data of a variety of communicative events
●   Data is normally transcribed, translated, and it should
    also be annotated
●   Archives to preserve and publish documentation
    ●   The Language Archive
    ●   Endangered Languages Archive (ELAR)

Natural Language Processing
●   Any kind of computer manipulation of natural language
●   Mostly for "major" languages like English, Spanish,
    German, etc.
●   NLP is rarely used on LD data
●   Archiving needs led to digitization
●   Now we see "corpus-based XYZ" in General
    Linguistics
●   Individual examples are hand-picked
●   (Semi-)automated tagging of lesser-known languages
Quantitative Language Comparison
●   In contrast to "corpus linguistics" (see Michael
    Cysouw's research group)
●   Based on LD data, Bible texts, movie subtitles, etc.
●   Supports typological research




Annotation Graphs, LAF and GrAF
●   ISO standard 24612 "Language resource
    management - Linguistic annotation framework (LAF)"
●   Annotation graphs as the underlying data model for
    linguistic annotations
●   Developed for MASC of the American National Corpus
●   Existing connectors for UIMA and GATE
●   Radical stand-off approach
     ●   Unsupervised collaboration
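
A minimal sketch of the stand-off idea, assuming a made-up primary text and invented annotation labels (none of this is Poio API or GrAF code). The primary data is never modified; each annotation only points at a character region of it, which is what lets several annotators add layers independently.

```python
# Hypothetical primary data; in a stand-off model this string is never edited.
primary_text = "the cat sleeps"

# Each region is a (start, end) character span into the primary text.
regions = {
    "r1": (0, 3),
    "r2": (4, 7),
    "r3": (8, 14),
}

# Annotations live apart from the text and refer to regions by id.
# The labels and values here are invented for illustration.
annotations = [
    {"region": "r1", "label": "pos", "value": "DET"},
    {"region": "r2", "label": "pos", "value": "NOUN"},
    {"region": "r3", "label": "pos", "value": "VERB"},
    {"region": "r2", "label": "animacy", "value": "animate"},
]


def annotated_spans(text, regions, annotations):
    """Resolve every annotation to the stretch of text it refers to."""
    resolved = []
    for ann in annotations:
        start, end = regions[ann["region"]]
        resolved.append((text[start:end], ann["label"], ann["value"]))
    return resolved


spans = annotated_spans(primary_text, regions, annotations)
for span in spans:
    print(span)
```

The second annotation on region "r2" is added without touching either the text or the existing "pos" layer.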


Poio API
●   Part of Clarin-D curation project at the University of
    Cologne
●   Connectors to "The Language Archive" and Clarin
    Weblicht
●   Layered architecture
    ●   API
    ●   Internal representation (LAF)
    ●   File format plugins (EAF, Toolbox, TCF)
●   Based on PyAnnotation and graf-python

Poio API




Data Structure Types (1/2)
●   List of lists, tree structure
    ●   [ 'utterance',
          ['word', 'wfw'],
        'translation' ]
●   For example GRAID (Grammatical Relations and
    Animacy in Discourse)
    ●   [ 'utterance',
          ['clause unit',
             [ 'word', 'wfw', 'graid1'],
          'graid2'],
        'translation' ]
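
The nested lists above can be read as a tier hierarchy. The sketch below is not Poio API code; it assumes that the first element of each list is the parent tier and that every following element (a string or a nested list) is one of its children. The helper name `list_tiers` is hypothetical.

```python
def list_tiers(structure, depth=0):
    """Yield (tier_name, depth) pairs for a nested data structure type.

    Assumption: the first element of each list is the parent tier and
    the remaining elements (strings or nested lists) are its children.
    """
    parent, *children = structure
    yield parent, depth
    for child in children:
        if isinstance(child, list):
            # A nested list is a subtree whose own first element is its root.
            yield from list_tiers(child, depth + 1)
        else:
            yield child, depth + 1


# The GRAID hierarchy from the slide.
graid = ["utterance",
         ["clause unit",
          ["word", "wfw", "graid1"],
          "graid2"],
         "translation"]

tiers = list(list_tiers(graid))
for name, depth in tiers:
    # Indent each tier by its depth to show the hierarchy.
    print("    " * depth + name)
```

Under that reading, "wfw" and "graid1" annotate words, "graid2" annotates clause units, and "translation" annotates the whole utterance.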

Data Structure Types (2/2)
●   Objective
    ●   Mapping the tree structures into GrAF structure
●   Advantages
    ●   Flexibility in the construction of annotation hierarchies
    ●   Automatic transformation of the tree structures into a user interface (Poio
        Editor and Analyzer)
    ●   Customization and collaboration
●   Disadvantages
    ●   Not all annotation schemes can be mapped onto a tree-like structure
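
One way to picture the mapping from a tree-shaped hierarchy to a graph is to turn it into parent-to-child edges between tiers. This is only a sketch under the assumption that the first element of each list is the parent tier; it does not show how Poio API builds its GrAF structures, and `hierarchy_edges` is a hypothetical helper.

```python
def hierarchy_edges(structure):
    """Return (parent_tier, child_tier) pairs for a nested hierarchy."""
    parent, *children = structure
    edges = []
    for child in children:
        if isinstance(child, list):
            # The nested list's first element is the child tier itself.
            edges.append((parent, child[0]))
            edges.extend(hierarchy_edges(child))
        else:
            edges.append((parent, child))
    return edges


simple = ["utterance", ["word", "wfw"], "translation"]
edges = hierarchy_edges(simple)
for parent, child in edges:
    print(parent, "->", child)

# In a tree every tier has exactly one parent. A scheme in which a tier
# depends on two parents cannot be written in this nested-list form.
child_tiers = [child for _, child in edges]
has_single_parents = len(child_tiers) == len(set(child_tiers))
```

The final check illustrates the disadvantage named above: the nested-list notation can only express hierarchies in which each tier hangs off a single parent.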




Annotation Tree




Graf-python (1/3)
●   Python implementation of GrAF
    ●   Developed by Stephen Matysik for ANC
●   Provides the underlying data structure for all data and
    annotations that Poio API can manage
    (interoperability)
    ●   Accessing the nodes, edges, regions and their annotations
        from the parsed files (GrAF ISO)




Graf-python (2/3)
●   Example: accessing the nodes in a "graid1" tier

Code:

    import codecs

    from graf import GraphParser

    # Parse a GrAF file into an annotation graph
    gparser = GraphParser()
    filename = 'example-graid1.xml'
    with codecs.open(filename, 'r', 'utf-8') as file_stream:
        g = gparser.parse(file_stream)

    for node in g.nodes:
        print(node)
        for annotation in node.annotations:
            print(annotation)
            # Only annotations that carry a 'graid1' feature print a value
            graid1 = annotation.features.get('graid1')
            if graid1 is not None:
                print(graid1)

Result:

    NodeID = word-n1
    Annotation('word', 'a-112')
    Annotation('graid1', 'a-508')
    comp
    NodeID = word-n2
    Annotation('word', 'a-113')
    Annotation('graid1', 'a-509')
    deti
    NodeID = word-n3
    Annotation('word', 'a-114')
    Annotation('graid1', 'a-510')
    np.h:s=cop:predp




Graf-python (3/3)
Utterance 1, Region [0-20]
    Word-n1, Region [0 2]
        word: "ki"
        graid1: "comp"
    Word-n2, Region [3 7]
        word: "yag"
        graid1: "deti"


The future: Usage of graphs
●   Graph-coloring algorithm to provide insight on LD data
    ●   Make common subgraphs visible after merging
        corpora
●   Graph-traversal algorithms to collect statistical data
    ●   Clusters of annotation values
●   Weighted graphs to reflect links between sources
    ●   Quantitative Historical Linguistics with dictionaries
    ●   Linked via Spanish translations
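
As a rough illustration of collecting statistics by graph traversal, the sketch below walks a toy graph breadth-first and counts annotation values. It uses plain dictionaries rather than graf-python. The "graid1" values are the ones shown in the graf-python example output, while the node id "utterance-n1" and the edges are assumptions made for the example.

```python
from collections import Counter, deque

# Toy graph: node id -> child node ids. The edges are assumed, not parsed.
edges = {
    "utterance-n1": ["word-n1", "word-n2", "word-n3"],
    "word-n1": [],
    "word-n2": [],
    "word-n3": [],
}

# graid1 values per node, as printed in the graf-python example.
graid1_values = {
    "word-n1": "comp",
    "word-n2": "deti",
    "word-n3": "np.h:s=cop:predp",
}


def count_values(root, edges, values):
    """Breadth-first traversal from root, counting annotation values."""
    counts = Counter()
    queue = deque([root])
    seen = {root}
    while queue:
        node = queue.popleft()
        if node in values:
            counts[values[node]] += 1
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return counts


counts = count_values("utterance-n1", edges, graid1_values)
for value, n in sorted(counts.items()):
    print(value, n)
```

On a real corpus the same counter could serve as the starting point for grouping annotation values, but how such clusters would be defined is not specified here.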


Thank you for your attention!
Centro Interdisciplinar de Documentação Linguística e Social
                        Minde/Portugal

                 Vera Ferreira, vferreira@cidles.eu
                  Peter Bouda, pbouda@cidles.eu
                 António Lopes, alopes@cidles.eu




Links
●   Poio (API): http://media.cidles.eu/poio/
●   ISO 24612:
    http://www.iso.org/iso/catalogue_detail.htm?csnumb
●   The Language Archive: http://tla.mpi.nl/
●   Weblicht:
    http://weblicht.sfs.uni-tuebingen.de/index.shtml




