Poio API - An annotation framework to bridge Language Documentation and Natural Language Processing

Centro Interdisciplinar de Documentação Linguística e Social
Minde/Portugal

Vera Ferreira, vferreira@cidles.eu
Peter Bouda, pbouda@cidles.eu
António Lopes, alopes@cidles.eu

Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu
@ ACRH-2, Lisbon, 29.11.2012

Language documentation
● Aim of developing a "lasting, multipurpose record of a
language"
● Collection, distribution, and preservation of primary
data of a variety of communicative events
● Data is normally transcribed and translated, and should
also be annotated
● Archives to preserve and publish documentation
● The Language Archive
● Endangered Languages Archive (ELAR)


Natural Language Processing
● Any kind of computer manipulation of natural language
● Mostly for "major" languages like English, Spanish,
German, etc.
● NLP is rarely used on LD data
● Archiving needs led to digitization
● Now we see "corpus-based XYZ" in General
Linguistics
● Individual examples are hand-picked
● (Semi-)automated tagging of lesser-known languages

Quantitative Language Comparison
● In contrast to "corpus linguistics" (see Michael
Cysouw's research group)
● Based on LD data, bible texts, movie subtitles etc.
● Supports typological research


Annotation Graphs, LAF and GrAF
● ISO standard 24612 "Language resource
management - Linguistic annotation framework (LAF)"
● Annotation graphs as the underlying data model for
linguistic annotations
● Developed for MASC of the American National Corpus
● Existing connectors for UIMA and GATE
● Radical stand-off approach
● Unsupervised collaboration


Poio API
● Part of Clarin-D curation project at the University of
Cologne
● Connectors to "The Language Archive" and Clarin
Weblicht
● Layered architecture
● API
● Internal representation (LAF)
● File format plugins (EAF, Toolbox, TCF)
● Based on PyAnnotation and graf-python


Data Structure Types (1/2)
● List of lists, tree structure
● [ 'utterance',
    [ 'word', 'wfw' ],
    'translation' ]
● For example GRAID (Grammatical Relations and
Animacy in Discourse)
● [ 'utterance',
    [ 'clause unit',
      [ 'word', 'wfw', 'graid1' ],
      'graid2' ],
    'translation' ]
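
The two hierarchies above are ordinary nested Python lists, so they can be walked with a short recursive function. The sketch below is an illustration only, not Poio API code: the names `SIMPLE`, `GRAID` and `flatten` are hypothetical. It lists all tier names in depth-first order.

```python
# The two hierarchies from the slide, written as nested Python lists.
SIMPLE = ["utterance", ["word", "wfw"], "translation"]
GRAID = [
    "utterance",
    ["clause unit", ["word", "wfw", "graid1"], "graid2"],
    "translation",
]


def flatten(hierarchy):
    """Return all tier names of a nested hierarchy in depth-first order."""
    tiers = []
    for item in hierarchy:
        if isinstance(item, list):
            # A nested list is a sub-hierarchy; descend into it.
            tiers.extend(flatten(item))
        else:
            tiers.append(item)
    return tiers


print(flatten(SIMPLE))
print(flatten(GRAID))
```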


Data Structure Types (2/2)
● Objective
● Mapping the tree structures into GrAF structure
● Advantages
● Flexibility in the construction of annotation hierarchies
● Automatic transformation of the tree structures into a user interface (Poio
Editor and Analyzer)
● Customization and collaboration
● Disadvantages
● Not all annotation schemes can be mapped onto a tree-like structure
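
One way to picture the mapping from a tree structure to a graph is to derive parent-child pairs between tiers. The sketch below is a simplified illustration, not the mapping Poio API performs internally. It assumes that the first element of each list is the parent of everything else in that list, and the function name `tier_edges` is hypothetical.

```python
# GRAID hierarchy as a nested list, as on the previous slide.
GRAID = [
    "utterance",
    ["clause unit", ["word", "wfw", "graid1"], "graid2"],
    "translation",
]


def tier_edges(hierarchy):
    """Yield (parent, child) tier pairs.

    Assumption: the first element of each list is the parent tier of all
    remaining elements of that list.
    """
    parent = hierarchy[0]
    for item in hierarchy[1:]:
        if isinstance(item, list):
            # The head of the sub-list hangs below the current parent.
            yield (parent, item[0])
            yield from tier_edges(item)
        else:
            yield (parent, item)


for parent, child in tier_edges(GRAID):
    print(parent, "->", child)
```

Each pair corresponds to one kind of edge in the resulting graph. An annotation scheme that needs a tier with two parents could not be written as such a nested list, which is the disadvantage named above.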


Annotation Tree


Graf-python (1/3)
● Python implementation of GrAF
● Developed by Stephen Matysik for ANC
● Provides the underlying data structure for all data and
annotations that Poio API can manage
(interoperability)
● Accessing the nodes, edges, regions and their annotations
from the parsed files (GrAF ISO)


Graf-python (2/3)
● Example: accessing the nodes in a "graid1" tier

Code:

    import codecs
    from graf import GraphParser  # provided by graf-python

    # Parse the GrAF file into a graph
    gparser = GraphParser()
    file = 'example-graid1.xml'
    file_stream = codecs.open(file, 'r', 'utf-8')
    g = gparser.parse(file_stream)

    # Print every node, its annotations and any "graid1" feature value
    for node in g.nodes:
        print(node)
        for annotation in node.annotations:
            print(annotation)
            graid1 = annotation.features.get('graid1')
            if graid1 is not None:
                print(graid1)

Result:

    NodeID = word-n1
    Annotation('word', 'a-112')
    Annotation('graid1', 'a-508')
    comp
    NodeID = word-n2
    Annotation('word', 'a-113')
    Annotation('graid1', 'a-509')
    deti
    NodeID = word-n3
    Annotation('word', 'a-114')
    Annotation('graid1', 'a-510')
    np.h:s=cop:predp


Graf-python (3/3)
Utterance 1, Region [0-20]
    Word-n1, Region [0-2]
        word: "ki"
        graid1: "comp"
    Word-n2, Region [3-7]
        word: "yag"
        graid1: "deti"


The future: Usage of graphs
● Graph-coloring algorithms to provide insight into LD data
● Make common subgraphs visible after merging
corpora
● Graph-traversal algorithms to collect statistical data
● Clusters of annotation values
● Weighted graphs to reflect links between sources
● Quantitative Historical Linguistics with dictionaries
● Linked via Spanish translations
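
As a rough sketch of what collecting statistical data by graph traversal could look like, the following Python example walks a toy annotation graph breadth-first and counts annotation values. Everything here is an assumption for illustration: the graph layout, the node ids `corpus`, `utterance-n2`, `word-n4` and `word-n5`, and the function name `count_annotation_values` are hypothetical and not part of Poio API or graf-python.

```python
from collections import Counter, deque

# Hypothetical toy graph: node id -> (annotation value or None, child ids).
# The first three word values come from the graid1 example; the second
# utterance is invented so that some values repeat.
GRAPH = {
    "corpus": (None, ["utterance-n1", "utterance-n2"]),
    "utterance-n1": (None, ["word-n1", "word-n2", "word-n3"]),
    "utterance-n2": (None, ["word-n4", "word-n5"]),
    "word-n1": ("comp", []),
    "word-n2": ("deti", []),
    "word-n3": ("np.h:s=cop:predp", []),
    "word-n4": ("comp", []),
    "word-n5": ("deti", []),
}


def count_annotation_values(graph, root):
    """Traverse the graph breadth-first from root and count annotation values."""
    counts = Counter()
    seen = {root}
    queue = deque([root])
    while queue:
        node_id = queue.popleft()
        value, children = graph[node_id]
        if value is not None:
            counts[value] += 1
        for child in children:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return counts


counts = count_annotation_values(GRAPH, "corpus")
for value, n in counts.most_common():
    print(value, n)
```

The `seen` set keeps the traversal correct even when merged corpora share nodes, so a shared node is counted once.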


Thank you for your attention!

Links
● Poio (API): http://media.cidles.eu/poio/
● ISO 24612:
http://www.iso.org/iso/catalogue_detail.htm?csnumb
● The Language Archive: http://tla.mpi.nl/
● Weblicht:
http://weblicht.sfs.uni-tuebingen.de/index.shtml
