After 20 years of multimedia data collection from endangered languages and the consequent creation of extensive corpora with large amounts of annotated linguistic data, a new trend in Language Documentation is now observable. It can be described as a shift from data collection and qualitative language analysis to quantitative language comparison based on the data previously collected. However, the heterogeneous annotation types and formats in the corpora hinder the application of newly developed computational methods to their analysis. A standardized representation is needed. Poio API, a scientific software library written in Python and based on the Linguistic Annotation Framework (LAF), fulfills this need and establishes a bridge between Language Documentation and Natural Language Processing (NLP). Hence, it represents an innovative approach which will open up new options in interdisciplinary collaborative linguistic research. This paper offers a contextualization of Poio API in the framework of current linguistic and NLP research as well as a description of its development.
Poio API - An annotation framework to bridge Language Documentation and Natural Language Processing
1. Poio API - An annotation framework to bridge
Language Documentation and Natural Language
Processing
Centro Interdisciplinar de Documentação Linguística e Social
Minde/Portugal
Vera Ferreira, vferreira@cidles.eu
Peter Bouda, pbouda@cidles.eu
António Lopes, alopes@cidles.eu
Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu
@ ACRH-2, Lisbon, 29.11.2012
2. Language documentation
● Aim of developing a "lasting, multipurpose record of a
language"
● Collection, distribution, and preservation of primary
data of a variety of communicative events
● Data is normally transcribed and translated, and should
also be annotated
● Archives to preserve and publish documentation
● The Language Archive
● Endangered Languages Archive (ELAR)
3. Natural Language Processing
● Any kind of computer manipulation of natural language
● Mostly for "major" languages like English, Spanish,
German, etc.
● NLP is rarely used on LD data
● Archiving needs led to digitization
● Now we see "corpus-based XYZ" in General
Linguistics
● Individual examples are hand-picked
● (Semi-)automated tagging of lesser-known languages
4. Quantitative Language Comparison
● In contrast to "corpus linguistics" (see Michael
Cysouw's research group)
● Based on LD data, bible texts, movie subtitles etc.
● Supports typological research
5. Annotation Graphs, LAF and GrAF
● ISO standard 24612 "Language resource
management - Linguistic annotation framework (LAF)"
● Annotation graphs as the underlying data model for
linguistic annotations
● Developed for MASC of the American National Corpus
● Existing connectors for UIMA and GATE
● Radical stand-off approach
● Unsupervised collaboration
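The stand-off idea can be sketched in a few lines of plain Python. This is only an illustration: the names `Region`, `Annotation`, and `covered_text`, the sample text, and the "pos" labels are hypothetical and belong neither to LAF/GrAF nor to any library. The point is that annotations reference character offsets and never modify the primary data.

```python
from dataclasses import dataclass

# Primary data stays untouched; annotations are stored separately
# and only point into it (hypothetical sample text).
PRIMARY_TEXT = "the dog barks"


@dataclass(frozen=True)
class Region:
    start: int  # offset of the first character
    end: int    # offset one past the last character


@dataclass(frozen=True)
class Annotation:
    label: str      # annotation type, e.g. a tier name
    value: str      # annotation value
    region: Region  # span of primary data the annotation refers to


def covered_text(text, region):
    """Return the stretch of primary data a region points to."""
    return text[region.start:region.end]


# Two independent annotations over the same, unmodified text.
annotations = [
    Annotation("pos", "DET", Region(0, 3)),
    Annotation("pos", "NOUN", Region(4, 7)),
]

for a in annotations:
    print(covered_text(PRIMARY_TEXT, a.region), a.label, a.value)
```

Because the annotations live apart from the text, several annotators can add their own layers over the same primary data without coordinating edits to a shared file.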
6. Poio API
● Part of the CLARIN-D curation project at the University of
Cologne
● Connectors to "The Language Archive" and CLARIN
WebLicht
● Layered architecture
● API
● Internal representation (LAF)
● File format plugins (EAF, Toolbox, TCF)
● Based on PyAnnotation and graf-python
8. Data Structure Types (1/2)
● List of lists, tree structure
● ['utterance',
    ['word', 'wfw'],
  'translation']
● For example GRAID (Grammatical Relations and
Animacy in Discourse)
● ['utterance',
    ['clause unit',
        ['word', 'wfw', 'graid1'],
     'graid2'],
  'translation']
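To make the nesting concrete, here is a minimal sketch that walks such a list of lists and reports each tier with its depth in the tree. It assumes that the first element of every list is the parent tier of the remaining elements; the function name `tiers_with_depth` is hypothetical and not part of Poio API.

```python
# The two hierarchies from the slide, written as plain Python lists.
SIMPLE = ["utterance", ["word", "wfw"], "translation"]
GRAID = [
    "utterance",
    ["clause unit",
        ["word", "wfw", "graid1"],
        "graid2"],
    "translation",
]


def tiers_with_depth(hierarchy, depth=0):
    """Return (tier, depth) pairs for a nested tier hierarchy.

    Assumption: the first element of each list is the parent tier,
    all following elements are its children.
    """
    head, *rest = hierarchy
    result = [(head, depth)]
    for item in rest:
        if isinstance(item, list):
            # A nested list opens a new sub-hierarchy one level deeper.
            result.extend(tiers_with_depth(item, depth + 1))
        else:
            result.append((item, depth + 1))
    return result


for tier, depth in tiers_with_depth(GRAID):
    print("    " * depth + tier)
```

Printing the tiers indented by depth gives a quick visual check that a hierarchy was written down as intended.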
9. Data Structure Types (2/2)
● Objective
● Mapping the tree structures into GrAF structure
● Advantages
● Flexibility in the construction of annotation hierarchies
● Automatic transformation of the tree structures into a user interface (Poio
Editor and Analyzer)
● Customization and collaboration
● Disadvantages
● Not all annotation schemes can be mapped onto a tree-like structure
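One way to picture the mapping from a tree structure to a graph is to turn the tier hierarchy into a set of nodes and directed parent-to-child edges. The sketch below uses plain Python containers, not the GrAF classes, and again assumes that the first element of each list is the parent of the rest; `hierarchy_to_graph` is a hypothetical name.

```python
GRAID = [
    "utterance",
    ["clause unit",
        ["word", "wfw", "graid1"],
        "graid2"],
    "translation",
]


def hierarchy_to_graph(hierarchy):
    """Map a nested tier hierarchy onto (nodes, edges).

    nodes: list of tier names in order of first appearance
    edges: list of (parent, child) tuples
    """
    nodes, edges = [], []

    def visit(sub, parent):
        head = sub[0]
        nodes.append(head)
        if parent is not None:
            edges.append((parent, head))
        for item in sub[1:]:
            if isinstance(item, list):
                visit(item, head)
            else:
                nodes.append(item)
                edges.append((head, item))

    visit(hierarchy, None)
    return nodes, edges


nodes, edges = hierarchy_to_graph(GRAID)
for parent, child in edges:
    print(parent, "->", child)
```

A tree always yields exactly one incoming edge per non-root node. Annotation schemes that need several parents for one element do not fit this shape, which is the disadvantage noted above.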
11. Graf-python (1/3)
● Python implementation of GrAF
● Developed by Stephen Matysik for ANC
● Provides the underlying data structure for all data and
annotations that Poio API can manage
(interoperability)
● Accessing the nodes, edges, regions and their annotations
from the parsed files (GrAF ISO)
12. Graf-python (2/3)
● Example: accessing the nodes in a "graid1" tier
Block Code:
    import codecs
    # Assumes graf-python exposes GraphParser at package level
    from graf import GraphParser

    # Parse a GrAF/XML file into a graph
    gparser = GraphParser()
    file = 'example-graid1.xml'
    file_stream = codecs.open(file, 'r', 'utf-8')
    g = gparser.parse(file_stream)

    # Print each node, its annotations, and the "graid1" value if present
    for node in g.nodes:
        print(node)
        for annotation in node.annotations:
            print(annotation)
            graid1 = annotation.features.get('graid1')
            if graid1 is not None:
                print(graid1)
Result:
    NodeID = word-n1
    Annotation('word', 'a-112')
    Annotation('graid1', 'a-508')
    comp
    NodeID = word-n2
    Annotation('word', 'a-113')
    Annotation('graid1', 'a-509')
    deti
    NodeID = word-n3
    Annotation('word', 'a-114')
    Annotation('graid1', 'a-510')
    np.h:s=cop:predp
13. Graf-python (3/3)
Utterance 1: Region [0-20]
    Word-n1: Region [0 2]
        word: "ki"
        graid1: "comp"
    Word-n2: Region [3 7]
        word: "yag"
        graid1: "deti"
14. The future: Usage of graphs
● Graph-coloring algorithm to provide insight into LD data
● Make common subgraphs visible after merge of
corpora
● Graph-traversal algorithms to collect statistical data
● Clusters of annotation values
● Weighted graphs to reflect links between sources
● Quantitative Historical Linguistics with dictionaries
● Linked via Spanish translations
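As a rough illustration of collecting statistical data by graph traversal, the sketch below runs a breadth-first traversal over a toy annotation graph and counts how often each "graid1" value occurs. The graph, the node identifiers, and the function name `count_feature_values` are invented for this example; it uses plain dictionaries rather than graf-python objects.

```python
from collections import Counter, deque

# Toy graph: node id -> list of child node ids (hypothetical data).
EDGES = {
    "root": ["utterance-n1", "utterance-n2"],
    "utterance-n1": ["word-n1", "word-n2"],
    "utterance-n2": ["word-n3", "word-n4"],
}

# Annotation features attached to nodes (hypothetical data).
FEATURES = {
    "word-n1": {"graid1": "comp"},
    "word-n2": {"graid1": "deti"},
    "word-n3": {"graid1": "np.h:s=cop:predp"},
    "word-n4": {"graid1": "comp"},
}


def count_feature_values(root, edges, features, key):
    """Breadth-first traversal that counts the values of one feature."""
    counts = Counter()
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        value = features.get(node, {}).get(key)
        if value is not None:
            counts[value] += 1
        for child in edges.get(node, []):
            if child not in seen:  # guard against revisiting shared nodes
                seen.add(child)
                queue.append(child)
    return counts


counts = count_feature_values("root", EDGES, FEATURES, "graid1")
print(counts.most_common())
```

The `seen` set matters once corpora are merged and nodes can be reached along more than one path; without it, shared nodes would be counted repeatedly.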
15. Thank you for your attention!
16. Links
● Poio (API): http://media.cidles.eu/poio/
● ISO 24612:
http://www.iso.org/iso/catalogue_detail.htm?csnumb
● The Language Archive: http://tla.mpi.nl/
● Weblicht:
http://weblicht.sfs.uni-tuebingen.de/index.shtml