How community software supports language documentation and data analysis

Peter Bouda
Centro Interdisciplinar de Documentação
Linguística e Social
Minde/Portugal

What is "open" in software?
• Open Source license (but be careful about
restrictions!)
• Make participation easy
– Documentation
– Transparent development process (e.g. discuss
features publicly)
– Attract programmers (code quality, make "giving
back" easy, online meetings, code sprints, …)
• Try to create and support a community from the
beginning, otherwise nobody will use your code

Community
• Software projects are not only source code:
– Feedback from users
– Write documentation
– Test!!! Report bugs
– In our case: provide data for tests
– Propose features
• Result: better code and software quality
• Websites for community development
(Github, Bitbucket, …)

Examples
• EOPAS
• LingPy and qlc
• NLTK (Natural Language Toolkit)
• Poio and PyAnnotation
• Scientific Python

EOPAS
• "Ethnographic E-Research Online Presentation
System for Interlinear Text"
• Present interlinear text online
• Supported files:
– Elan
– Transcriber
– Toolbox interlinear glossed text

Why EOPAS is open
• Published on Github
• EOPAS is community software:
– Based on a modern web framework (Ruby on Rails)
– clear and documented deployment strategy (how to
use it on your own)
– easy to maintain, low entry level, good code quality
– several options for participation listed on website
• Publish your data with EOPAS and support the
development!

Poio and PyAnnotation
• Started during my internship in the DoBeS project
"Minderico - An endangered language in
Portugal"
• Ideas and support from Prof. Johannes Helmbrecht
• Support from the Institute for General Linguistics and
Language Typology at the University of Munich
• Support from the Institute for General Linguistics at
the University of Bamberg

Poio and PyAnnotation
• PyAnnotation provides access to different file
formats
• Provides access to data programmatically (API)
• Poio is a graphical user interface (GUI) on top of
PyAnnotation
• Two software packages:
– Poio Editor
– Poio Analyzer

Poio Editor and Analyzer
• Start: Poio Editor as an "add-on" to Elan
• Open Elan transcription and add
morpho-syntactic annotations
• Analyzer to search in Elan and Toolbox files
• Now: adding support for GRAID (Grammatical
Relations and Animacy in Discourse) and any
other annotation types
• Goal: a highly customizable desktop software for
diverse annotation and analysis scenarios /
sparse annotation
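
Searching interlinear records can be sketched in a few lines of plain Python. The following is a minimal illustration, not the Poio Analyzer implementation: it assumes Toolbox-style text where every field starts with a backslash marker, and the markers used here (\ref, \tx, \ge, \ft), the sample data, and the function names are hypothetical. Real Toolbox projects configure their own markers.

```python
import re
from typing import Dict, List

# Hypothetical sample data; markers and values are illustrative only.
SAMPLE = r"""
\ref 001
\tx word1 word2
\ge gloss1 gloss2
\ft A free translation.

\ref 002
\tx word3
\ge gloss3
\ft Another translation.
"""


def parse_records(text: str, record_marker: str = "ref") -> List[Dict[str, str]]:
    """Split marker-prefixed lines into one dict per record.

    A new record starts at every line whose marker equals record_marker.
    Lines without a leading backslash marker are skipped in this sketch.
    """
    records: List[Dict[str, str]] = []
    current = None
    for line in text.splitlines():
        match = re.match(r"\\(\w+)\s*(.*)", line)
        if not match:
            continue
        marker, value = match.groups()
        if marker == record_marker:
            current = {}
            records.append(current)
        if current is not None:
            current[marker] = value.strip()
    return records


def search(records: List[Dict[str, str]], marker: str, pattern: str) -> List[Dict[str, str]]:
    """Return all records whose field `marker` matches the regular expression."""
    regex = re.compile(pattern)
    return [record for record in records if regex.search(record.get(marker, ""))]


if __name__ == "__main__":
    for hit in search(parse_records(SAMPLE), "ge", r"\bgloss3\b"):
        print(hit["ref"], hit["tx"])
```

The same search function works on any tier once the file is in a dict-per-record form, which is the idea behind searching Elan and Toolbox files through one interface.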

PyAnnotation
• Parses language documentation (LD) files:
– Elan
– Toolbox
– Kura
• Unified data access through API
• Modify data structure and write Elan files again
– good for batch processing
• Combines well with Scientific Python for analysis
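
To illustrate what programmatic access to an Elan file can look like, here is a minimal sketch that uses only the Python standard library. It is not the PyAnnotation API. It assumes the usual EAF XML layout (TIME_SLOT, TIER, ALIGNABLE_ANNOTATION and ANNOTATION_VALUE elements); the sample document, the tier name and the function name read_tiers are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced EAF document for illustration.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
          TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_tiers(eaf_xml: str) -> dict:
    """Map each tier ID to a list of (start_ms, end_ms, text) tuples.

    Only time-aligned annotations are read; reference annotations
    (used for dependent tiers) are ignored in this sketch.
    """
    root = ET.fromstring(eaf_xml)
    # Resolve time slot IDs to millisecond values.
    times = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    tiers = {}
    for tier in root.iter("TIER"):
        rows = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = times.get(ann.get("TIME_SLOT_REF1"))
            end = times.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((start, end, text))
        tiers[tier.get("TIER_ID")] = rows
    return tiers


if __name__ == "__main__":
    for tier_id, rows in read_tiers(SAMPLE_EAF).items():
        print(tier_id, rows)
```

For batch processing, a function like this can be applied in a loop over many files, and the resulting lists can be passed on to analysis libraries.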

Scientific Python
• Python programming language
• Collection of tools and scientific libraries:
– IPython
– NumPy and SciPy, scikit-learn, networkx, …
– An easy installer: Python(x,y)
• Alternative to Matlab, R, and other
mathematical tools
• But: also a general-purpose language for software
development

CLARIN
• "Common Language Resources and
Technology Infrastructure"
• Current projects of CLARIN-D:
– WebLicht: service-oriented architecture (SOA) for
creating annotated corpora
– Virtual Language Observatory
• Curation project ("Kurationsprojekt") to develop a
software framework for accessing fieldwork data
• Based, among other things, on code from
PyAnnotation

Framework for Fieldwork Data
• Improve annotation and analysis tasks based on
documentation data
• Build a bridge between LD and NLP data formats
and technology
– Lexan, UIMA, …
• DoBeS corpus as central resource
• Develop a basic software library
• Web API and web app as reference
implementation

Generic representation
• GrAF: Graph Annotation Framework
• Based on annotation graphs
• Developed at the American National Corpus project
• Common representation helps to process and
analyze data from different sources
• Map LD data (Elan, Toolbox, …) to GrAF
• Users work with "structures" (GRAID,
morpho-syntax, POS, …) that can be mapped to a GUI
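
The idea of mapping interlinear data to a graph can be sketched as follows. This is a strongly simplified illustration, not the GrAF data model (which also has regions and feature structures) and not the framework's actual code. The class names, the tier names and the helper from_interlinear are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    id: str
    tier: str   # e.g. "utterance", "word", "gloss"
    value: str  # the annotation text


@dataclass
class AnnotationGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Tuple[str, str]] = field(default_factory=list)  # (parent, child)

    def add(self, tier: str, value: str, parent: Optional[str] = None) -> str:
        """Add a node, optionally linked to a parent node; return its ID."""
        node_id = f"{tier}-n{len(self.nodes)}"
        self.nodes[node_id] = Node(node_id, tier, value)
        if parent is not None:
            self.edges.append((parent, node_id))
        return node_id

    def children(self, parent_id: str, tier: Optional[str] = None) -> List[Node]:
        """Return child nodes of parent_id, optionally filtered by tier."""
        return [
            self.nodes[child]
            for parent, child in self.edges
            if parent == parent_id and (tier is None or self.nodes[child].tier == tier)
        ]


def from_interlinear(utterance, words, glosses, translation):
    """Map one interlinear record to a small graph; return (graph, root ID)."""
    graph = AnnotationGraph()
    root = graph.add("utterance", utterance)
    for word, gloss in zip(words, glosses):
        word_id = graph.add("word", word, parent=root)
        graph.add("gloss", gloss, parent=word_id)
    graph.add("translation", translation, parent=root)
    return graph, root


if __name__ == "__main__":
    graph, root = from_interlinear("a b", ["a", "b"], ["G1", "G2"], "free translation")
    print([node.value for node in graph.children(root, tier="word")])
```

Once Elan and Toolbox data are both mapped to such a graph, the same query code can run on either source.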

Custom structures
• Morpho-syntactic vs. GRAID
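
One way to describe such custom structures is as nested lists, where the first item of each list is the parent tier and the remaining items are its children. The sketch below is an assumption for illustration: the tier names and the exact hierarchies are hypothetical and may differ from the ones used in Poio.

```python
# Hypothetical tier hierarchies; the first item of each list is the parent.
MORPHO_SYNTACTIC = ["utterance", ["word", ["morpheme", ["gloss"]]], "translation"]
GRAID = [
    "utterance",
    ["clause unit", ["word", "wfw", "graid1"], "graid2"],
    "translation",
    "comment",
]


def tiers_with_depth(hierarchy, depth=0):
    """Flatten a nested hierarchy into (tier name, nesting depth) pairs."""
    head, *rest = hierarchy
    result = [(head, depth)]
    for item in rest:
        if isinstance(item, list):
            # A sublist is a child tier that has children of its own.
            result.extend(tiers_with_depth(item, depth + 1))
        else:
            # A plain string is a child tier without children.
            result.append((item, depth + 1))
    return result


if __name__ == "__main__":
    for name, structure in [("Morpho-syntactic", MORPHO_SYNTACTIC), ("GRAID", GRAID)]:
        print(name)
        for tier, depth in tiers_with_depth(structure):
            print("  " * (depth + 1) + tier)
```

A GUI can walk the flattened list to decide which input rows to draw for each structure, so adding a new annotation type only needs a new hierarchy.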

Development
• Library is developed at CIDLeS
• Web API and app are developed at
CCeH/University of Cologne
• Coordination at the Institute for Linguistics,
University of Cologne
• August/September 2012 – July 2013

Support Open Software!
• Use existing projects whenever possible and
contribute by giving feedback
• In LD: data drives development, developers
need files to test
• Share your code as soon as possible
• Use existing infrastructure like Github to share

Thank you for your attention!
• pbouda@cidles.eu
• Become a member of CIDLeS to support our
software development:

www.cidles.eu

Links
• EOPAS: http://www.eopas.org/
• Poio and PyAnnotation:
http://www.cidles.eu/ltll/poio
• Python(x,y):
http://code.google.com/p/pythonxy/
• LingPy: http://lingulist.de/lingpy/
• QLC: https://github.com/pbouda/qlc
• NLTK: http://nltk.org/

Links
• Apache UIMA: http://uima.apache.org/
• CLARIN-D: http://de.clarin.eu/index.php/en/
