How community software supports language
     documentation and data analysis

                    Peter Bouda
     Centro Interdisciplinar de Documentação
                Linguística e Social
                  Minde/Portugal
What is „open“ in software?
• Open Source license (but be careful about
  restrictions!)
• Make participation easy
  – Documentation
  – Transparent development process (e.g. discuss
    features publicly)
  – Attract programmers (code quality, make „giving
    back“ easy, online meetings, code sprints, …)
• Try to create and support a community from the
  beginning, otherwise nobody will use your code
Community
• Software projects are not only source code:
  – Feedback from users
  – Write documentation
  – Test!!! Report bugs
  – In our case: provide data for tests
  – Propose features
• Best code and software quality
• Websites for community development
  (Github, Bitbucket, …)
Examples
•   EOPAS
•   LingPy and qlc
•   NLTK (Natural Language Toolkit)
•   Poio and PyAnnotation
•   Scientific Python
EOPAS
• „Ethnographic E-Research Online Presentation
  System for Interlinear Text”
• Present interlinear text online
• Supported files:
  – Elan
  – Transcriber
  – Toolbox interlinear glossed text
EOPAS
Why EOPAS is open
• Published on Github
• EOPAS is community software:
  – Based on a modern web framework (Ruby on Rails)
  – clear and documented deployment strategy (how to
    use it on your own)
  – easy to maintain, low entry level, good code quality
  – several options for participation listed on website
• Publish your data with EOPAS and support the
  development!
Poio and PyAnnotation
• Started during my internship in the DoBeS project
  „Minderico - An endangered language in
  Portugal“
• Ideas and support by Prof. Johannes Helmbrecht
• Support by Institute for General Linguistics and
  Language Typology at University of Munich
• Support by Institute for General Linguistics at
  University of Bamberg
Poio and PyAnnotation
• PyAnnotation provides access to different file
  formats
• Provides access to data programmatically (API)
• Poio is a graphical user interface (GUI) on top of
  PyAnnotation
• Two software packages:
  – Poio Editor
  – Poio Analyzer
Poio Editor and Analyzer
• Start: Poio Editor as an „add-on“ to Elan
• Open Elan transcription and add morpho-
  syntactic annotations
• Analyzer to search in Elan and Toolbox files
• Now: adding support for GRAID (Grammatical
  Relations and Animacy in Discourse) and any
  other annotation types
• Goal: highly customizable desktop software for
  diverse annotation and analysis scenarios /
  sparse annotation
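Searching Toolbox interlinear text can be pictured with a small sketch. This is not Poio Analyzer's code: it assumes only the general Toolbox convention of backslash markers (`\ref`, `\tx`, `\mb`, `\ge`, `\ft`). The function names `parse_records` and `search` are hypothetical, and the sample words are invented for illustration.

```python
import re

# Invented sample data in Toolbox's backslash-marker layout (not real language data).
SAMPLE_TOOLBOX = r"""
\ref 001
\tx mika tolu
\mb mika tolu
\ge dog sleep
\ft The dog sleeps.

\ref 002
\tx mika tolu-na
\mb mika tolu -na
\ge dog sleep -PST
\ft The dog slept.
"""


def parse_records(text, record_marker="ref"):
    """Split Toolbox text into one dict per record, keyed by marker name."""
    records = []
    current = None
    for line in text.splitlines():
        match = re.match(r"\\(\w+)\s*(.*)", line)
        if not match:
            continue  # blank line or a line without a marker
        marker, value = match.groups()
        if marker == record_marker:
            current = {marker: value.strip()}
            records.append(current)
        elif current is not None:
            # Repeated markers within one record are joined with a space.
            current[marker] = (current.get(marker, "") + " " + value).strip()
    return records


def search(records, marker, pattern):
    """Return the records whose field `marker` matches the regular expression."""
    return [r for r in records if re.search(pattern, r.get(marker, ""))]


records = parse_records(SAMPLE_TOOLBOX)
past_tense = search(records, "ge", r"\bPST\b")
print([r["ref"] for r in past_tense])
```

Keeping each record as a plain dictionary means the same `search` call works on any tier (text, morphemes, glosses, free translation) without special cases.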
Live Demo
PyAnnotation
• Parses LD files:
   – Elan
   – Toolbox
   – Kura
• Unified data access through API
• Modify data structure and write Elan files again
   – good for batch processing
• Combines well with Scientific Python for analysis
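The idea of reading Elan files, changing the data, and writing them back can be sketched with Python's standard library alone. This is not PyAnnotation's API. It assumes only that an Elan `.eaf` file is XML with `TIER`, `ALIGNABLE_ANNOTATION`, and `ANNOTATION_VALUE` elements; the sample document is hand-written and heavily reduced, and `read_tiers` and `strip_values` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily reduced sample in the shape of an Elan .eaf file.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="utterance" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>  second utterance </ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_tiers(eaf_text):
    """Return {tier id: [(start ms, end ms, value), ...]} for time-aligned annotations."""
    root = ET.fromstring(eaf_text)
    times = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    tiers = {}
    for tier in root.iter("TIER"):
        rows = []
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = times.get(ann.get("TIME_SLOT_REF1"))
            end = times.get(ann.get("TIME_SLOT_REF2"))
            rows.append((start, end, ann.findtext("ANNOTATION_VALUE", default="")))
        tiers[tier.get("TIER_ID")] = rows
    return tiers


def strip_values(eaf_text):
    """Batch edit: trim whitespace in every annotation value and re-serialize the XML."""
    root = ET.fromstring(eaf_text)
    for value in root.iter("ANNOTATION_VALUE"):
        value.text = (value.text or "").strip()
    return ET.tostring(root, encoding="unicode")


cleaned = strip_values(SAMPLE_EAF)
tiers = read_tiers(cleaned)
for start, end, text in tiers["utterance"]:
    print(start, end, text)
```

Running the same two functions over a directory of files is the batch-processing case, and the tuples returned by `read_tiers` are easy to hand to analysis code.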
Scientific Python
• Python programming language
• Collection of tools and scientific libraries:
   – IPython
   – NumPy and SciPy, scikit-learn, networkx, …
   – An easy installer: Python(x,y)
• Alternative to Matlab, R, and other
  mathematical tools
• But: also a general-purpose language for software development
Live Demo
CLARIN
• „Common Language Resources and
  Technology Infrastructure“
• Current projects of CLARIN-D:
  – Weblicht: SOA to create annotated corpora
  – Virtual Language Observatory
• “Kurationsprojekt” (curation project) to develop a software
  framework to access fieldwork data
• Among others: based on code of
  PyAnnotation
Framework for Fieldwork Data
• Improve annotation and analysis tasks based on
  documentation data
• Build a bridge between LD and NLP data formats
  and technology
  – Lexan, UIMA, …
• DoBeS corpus as central resource
• Develop a basic software library
• Web API and web app as reference
  implementation
Library to access LD data
Generic representation
• GrAF: Graph Annotation Framework
• Based on annotation graphs
• Developed at American National Corpus
• Common representation helps to process and
  analyze data from different sources
• Map LD data (Elan, Toolbox, …) to GrAF
• Users work with „structures“ (GRAID, Morph-
  Syntax, POS, …) that can be mapped to a GUI
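A graph-based representation of an interlinear annotation can be sketched as follows. This is a toy illustration of the annotation-graph idea, not the GrAF data model or the library's actual classes. `Node`, `AnnotationGraph`, the tier labels, and the sample words are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str   # annotation type, e.g. "utterance", "word", "gloss"
    value: str   # annotation content


@dataclass
class AnnotationGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # (parent_id, child_id) pairs

    def add(self, node_id, label, value, parent=None):
        """Add a node and, if given, an edge from its parent annotation."""
        self.nodes[node_id] = Node(node_id, label, value)
        if parent is not None:
            self.edges.append((parent, node_id))
        return node_id

    def children(self, node_id, label=None):
        """Child nodes in insertion order, optionally filtered by label."""
        return [
            self.nodes[child]
            for parent, child in self.edges
            if parent == node_id and (label is None or self.nodes[child].label == label)
        ]


# A hypothetical morpho-syntactic "structure": utterance > word > gloss.
graph = AnnotationGraph()
graph.add("u1", "utterance", "mika tolu")
graph.add("w1", "word", "mika", parent="u1")
graph.add("w2", "word", "tolu", parent="u1")
graph.add("g1", "gloss", "dog", parent="w1")
graph.add("g2", "gloss", "sleep", parent="w2")
graph.add("t1", "translation", "The dog sleeps.", parent="u1")

for word in graph.children("u1", label="word"):
    glosses = [g.value for g in graph.children(word.node_id, label="gloss")]
    print(word.value, glosses)
```

Because the hierarchy lives in the edges rather than in the file format, a different structure such as GRAID would only need different labels, while code that walks the graph (for a GUI or for analysis) stays the same.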
Morpho-syntactic structure
Custom structures
• Morpho-syntactic vs. GRAID
Development
• Library is developed at CIDLeS
• Web API and app are developed at
  CCeH/University of Cologne
• Coordination at Institute for Linguistics,
  University of Cologne
• August/September 2012 – July 2013
Support Open Software!
• Use existing projects whenever possible and
  contribute by giving feedback
• In LD: data drives development, developers
  need files to test
• Share your code as soon as possible
• Use existing infrastructure like Github to share
Thank you for your attention!
• pbouda@cidles.eu
• Become a member of CIDLeS to support our
  software development:

               www.cidles.eu
Links
• EOPAS: http://www.eopas.org/
• Poio and PyAnnotation:
  http://www.cidles.eu/ltll/poio
• Python(x,y):
  http://code.google.com/p/pythonxy/
• LingPy: http://lingulist.de/lingpy/
• QLC: https://github.com/pbouda/qlc
• NLTK: http://nltk.org/
Links
• Apache UIMA: http://uima.apache.org/
• CLARIN-D: http://de.clarin.eu/index.php/en/
