This document describes CorpusStudio, a web application for corpus linguistics research in which users define queries to analyze text corpora in various formats. Users create corpus research projects containing metadata, definitions, queries and result databases. The application includes editors for defining queries and constructing output, as well as viewers for results and corpora. Query execution is handled asynchronously through a queuing system. Future plans include expanding the grouping and filtering of query results.
CorpusStudio web application
Erwin R. Komen
Meertens Instituut // Radboud University Nijmegen // SIL-International
E.Komen@ru.nl
1. Background
• Existing software:
  • CorpusStudio – Windows
  • Cesax – Windows
  • Successfully used in linguistic research
• Web application version?
  • Central location for corpora (always the latest version)
  • Platform independent: macOS/Linux/Windows
  • Fast parallel processing
2. Formats
• FoLiA XML
  • Dutch: Nederlab, CGN, Sonar/Lassy
• TEI-Psdx XML
  • English historical + SLA
  • Caucasian: Chechen, Lak, Lezgi
  • Old Welsh
  • Dutch
• Additional formats
  • Convert via ‘Cesax’ (Alpino, Negra, …)
  • Add handler into CorpusStudio
4. Defining queries
• Definition editor
  • Constants
  • Functions (XQuery)
• Query editor
  • Subcategorization (XQuery)
• Constructor editor
  • Execution order
  • Options (examples, output, complement)
• Result database feature editor
  • XQuery user functions calculate the features
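
The interplay of a definition (a constant plus a reusable function) and a query with subcategorization can be sketched roughly as follows. This is an illustration in Python rather than XQuery, run on a tiny made-up XML fragment. The element and attribute names (`forest`, `node`, `label`, `text`) and the names `SUBJECT_LABELS`, `is_subject` and `run_query` are assumptions for this sketch, not the FoLiA or TEI-Psdx layout and not CorpusStudio's own API.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical mini-corpus; NOT the real FoLiA or TEI-Psdx structure.
SAMPLE = """
<corpus>
  <forest id="s1">
    <node label="IP-MAT">
      <node label="NP-SBJ"><node label="PRO" text="he"/></node>
      <node label="VBD" text="saw"/>
      <node label="NP-OB1"><node label="N" text="ships"/></node>
    </node>
  </forest>
  <forest id="s2">
    <node label="IP-MAT">
      <node label="NP-SBJ"><node label="N" text="men"/></node>
      <node label="VBD" text="came"/>
    </node>
  </forest>
</corpus>
"""

# "Definition" part: a constant and a function that queries can reuse.
SUBJECT_LABELS = {"NP-SBJ", "NP-NOM"}

def is_subject(node):
    """True when the node's label is one of the subject labels."""
    return node.get("label") in SUBJECT_LABELS

def run_query(root):
    """Find subject nodes and subcategorize each hit by its first child."""
    hits = []
    for forest in root.iter("forest"):
        for node in forest.iter("node"):
            if not is_subject(node):
                continue
            first = node.find("node")  # first direct child, if any
            child_label = first.get("label") if first is not None else ""
            subcat = "pronoun" if child_label == "PRO" else "noun"
            hits.append((forest.get("id"), subcat))
    return hits

root = ET.fromstring(SAMPLE)
hits = run_query(root)
counts = Counter(subcat for _, subcat in hits)
```

Each hit carries the sentence identifier and its subcategory, so `counts` gives the kind of per-category totals a result viewer could display.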
6. Availability
• CorpusStudio sources (build your own version)
  • https://github.com/ErwinKomen
• CLARIN-NL access
  • http://www.clarin.nl/node/2095
7. References
Boag, Scott, Don Chamberlin, Mary F. Fernández, Daniela Florescu, Jonathan Robie, and Jérôme Siméon. 2010. XQuery 1.0: An XML Query Language (Second Edition). W3C Recommendation. <http://www.w3.org/XML/Query>.
van Gompel, Maarten, and Martin Reynaert. 2013. FoLiA: A practical XML format for linguistic annotation - a descriptive and comparative study. Computational Linguistics in the Netherlands Journal 3:63-81.
Komen, Erwin R. 2013. Corpus databases with feature pre-calculation. In Proceedings of the twelfth workshop on treebanks and linguistic theories (TLT12), Sandra Kübler, Petya Osenova & Martin Volk (eds), 85-96. Sofia, Bulgaria: The institute of information and communication technologies, Bulgarian Academy of Sciences.
[Figure: CorpusStudio architecture diagram. Labels: User information; Project information; Meta Data Editor; Definition Editor; Query Editor; Constructor Editor; Input Selector; Database feature editor; Definitions; Queries; Corpus Research Project (.crpx); Search service: crpp; Query Executor; Database Creator; Output Monitor; Status; Documents (.xml); Results (.xml); Corpus Research Database (.xml); Result database; Result Viewer; Table Viewer; Corpus Viewer; Result Grouping; Standard grouping (.json); Grouping Viewer; Result dbase Viewer; Result dbase Editor. Connections are labelled xml or json.]
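
Query execution is described as asynchronous and queue-based, with a search service (`crpp`), a query executor and status reporting. A minimal sketch of that general pattern, using Python's `asyncio`, might look like the code below. The job layout, the status strings (`queued`, `working`, `completed`) and the word-counting "query" are all assumptions for illustration; they do not reproduce the `crpp` service or its messages.

```python
import asyncio

async def worker(queue, status, results):
    """Take jobs from the queue one at a time and record their status."""
    while True:
        job_id, texts, word = await queue.get()
        status[job_id] = "working"
        await asyncio.sleep(0)  # yield control; stands in for real query work
        # Toy "query": count how often `word` occurs in the job's texts.
        results[job_id] = sum(t.split().count(word) for t in texts)
        status[job_id] = "completed"
        queue.task_done()

async def run_jobs(jobs, n_workers=2):
    """Queue all jobs, process them with a few workers, return status and results."""
    queue = asyncio.Queue()
    status, results = {}, {}
    for job_id, (texts, word) in jobs.items():
        status[job_id] = "queued"
        queue.put_nowait((job_id, texts, word))
    workers = [asyncio.create_task(worker(queue, status, results))
               for _ in range(n_workers)]
    await queue.join()          # wait until every queued job is done
    for w in workers:
        w.cancel()              # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return status, results

# Hypothetical jobs: (list of texts, word to search for)
jobs = {
    "job1": (["the cat sat", "the dog"], "the"),
    "job2": (["a b a"], "a"),
}
status, results = asyncio.run(run_jobs(jobs))
```

A client such as an output monitor could poll the `status` dictionary while jobs are running and fetch `results` once a job is marked completed.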
3. Corpus Research Projects
• All information for one research project
  • Meta information (author, dates, goal)
  • Input (language, corpus, filter)
  • All definition and query files used
  • Execution order
  • Optional: result database features
• Exchange
  • Upload/download
  • Compatible with Windows CorpusStudio
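
To make the contents of a research project more concrete, the sketch below models the items listed above as a small Python data class and serializes it for exchange. It is only an illustration: the class `ResearchProject`, its field names and the use of JSON are assumptions, and the actual `.crpx` schema is not reproduced here.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ResearchProject:
    # Meta information
    name: str
    author: str
    goal: str
    # Input selection
    language: str
    corpus: str
    filter: str = ""
    # Definition and query files, keyed by name (file contents as text)
    definitions: dict = field(default_factory=dict)
    queries: dict = field(default_factory=dict)
    # Names of the queries, in the order they should run
    execution_order: list = field(default_factory=list)
    # Optional result database features
    db_features: list = field(default_factory=list)

    def missing_queries(self):
        """Names in the execution order that have no matching query."""
        return [q for q in self.execution_order if q not in self.queries]

    def to_json(self):
        """Serialize the whole project, e.g. for upload or download."""
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text):
        """Rebuild a project from its serialized form."""
        return cls(**json.loads(text))

# Hypothetical example project
project = ResearchProject(
    name="SubjectTypes",
    author="A. Researcher",
    goal="Count subject types",
    language="eng_hist",
    corpus="example-corpus",
    queries={"findSubjects": "(: query text goes here :)"},
    execution_order=["findSubjects"],
)
restored = ResearchProject.from_json(project.to_json())
```

Keeping everything in one object is what makes exchange simple: a single file carries the metadata, the input selection, every definition and query, and the execution order.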
[Figure: CorpusStudio components: Meta Data Editor, Definition Editor, Input Selector, Query Editor, Constructor Editor, Output Monitor, Query Executor, Result Viewer, Corpus Viewer, Database feature editor.]
5. Future
• Grouping editor
  • Group output over meta-data categories
  • User-definable (XQuery)
• Query/project wizard
  • Tabular input of principal components
  • Relations, names, feature calculations
• Result database editor
  • View and edit result database records
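
Grouping output over meta-data categories, as planned for the grouping editor, amounts to summing hit counts per category value. A minimal Python sketch under assumed data is shown below; the record layout (`file`, `period`, `genre`, `hits`), the sample values and the function `group_hits` are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical per-document query results with some metadata attached.
results = [
    {"file": "doc1.xml", "period": "early", "genre": "law", "hits": 12},
    {"file": "doc2.xml", "period": "early", "genre": "sermon", "hits": 5},
    {"file": "doc3.xml", "period": "late", "genre": "law", "hits": 7},
    {"file": "doc4.xml", "genre": "law", "hits": 1},  # no period recorded
]

def group_hits(records, category):
    """Sum the hits per value of one metadata category."""
    totals = defaultdict(int)
    for record in records:
        # Documents lacking the category are collected under "(unknown)".
        totals[record.get(category, "(unknown)")] += record["hits"]
    return dict(totals)

by_period = group_hits(results, "period")
by_genre = group_hits(results, "genre")
```

Passing the category name as a parameter keeps the grouping user-definable: the same function groups by period, genre, or any other metadata field present in the records.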