Dear people, it's nice to be here; a pity it can't be in Greece. Welcome to this short
journey through years of development and testing. My name is Eduard Drenth, and I work
at the Fryske Akademy in the Netherlands as a software architect. Four years ago at
EURALEX, Pieter Duijff and I talked about an online Frisian dictionary and our plans for a
Frisian portal. Today I want to show you where we are now. We like what we have
accomplished and increasingly enjoy the benefits. My main motivation for presenting is
that I would very much like to seek cooperation. There is probably some good, some
bad and perhaps a lot of similarity in what we all do. Let us investigate, connect and
join forces. So, let me first explain what we have.
1
Frisian dictionaries digitized from A to Z
Solutions for low-resource languages
It all starts with our data model, based on investigating and testing existing solutions
such as TEI-Lex0 and freedict. It is important to mention that our focus isn't the model
itself, but what you can do with it. Hence we found we needed a stricter
model that enables easier and more reliable software development. Basically, the model
allows fewer elements than default TEI, offers enumerations for attributes such
as type, and adds validation rules and documentation. For example, the content of
gram is restricted to an enumeration of Universal Dependencies terms. We trust you
will find a place for most lexical information in our model. If not, join us to discuss
and improve.
2
Data model...
Open
TEI
UD
Strict
Reliable development
teiCorpus
TEI
Entries
Senses
Forms + info
Texts / translations
https://bitbucket.org/fryske-akademy/online-dictionaries
https://frisian.eu/dictionary-services/dictionary xsddocumentation.html
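To make the idea of a strict model concrete, here is a minimal sketch of checking that the content of gram comes from a fixed enumeration. It assumes the enumeration is the set of Universal Dependencies part-of-speech tags (UPOS) written in upper case and that part of speech is marked with type="pos"; the sample entry and the function name are hypothetical, and the real model and its validation rules are the ones published in the repository above.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# The 17 universal part-of-speech tags defined by Universal Dependencies.
UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

# Hypothetical entry, for illustration only (not taken from the real data).
SAMPLE = """<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="hypothetical-1">
  <form type="lemma"><orth>hûs</orth></form>
  <gramGrp><gram type="pos">NOUN</gram></gramGrp>
</entry>"""


def invalid_pos_values(entry_xml):
    """Return every part-of-speech value that is not in the enumeration."""
    root = ET.fromstring(entry_xml)
    bad = []
    for gram in root.iter(f"{{{TEI_NS}}}gram"):
        # Assumption: part of speech is marked with type="pos".
        if gram.get("type") == "pos":
            value = (gram.text or "").strip()
            if value not in UPOS:
                bad.append(value)
    return bad
```

In a schema-driven setup this check lives in the generated validation rather than in application code; the sketch only shows what the restriction buys you: software can rely on a closed set of values.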
As said before, the data model is important but only a starting point, our solid ground.
Because it is a TEI customization, a lot can be generated. We do that using the TEI
stylesheets and custom transformations, automated in a robust Maven build
process. On top of the model and what is generated, we develop generic code to
query, export and transform dictionary data. We have, for example, a standard
transformation to TEI-Lex0. The generic applications we developed, a user interface and a
JSON service, can be run directly from Docker setups we provide and maintain.
3
… as basis for
datamodel
Generate:
validation
documentation
jaxb
Develop:
services
webapp
maven libs
transformations
editor support
data sync
Use:
query
export
transform
https://bitbucket.org/fryske-akademy/online-dictionaries
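As an illustration of what "query" and "export" can look like on top of a predictable model, here is a minimal sketch that selects entries by a lemma wildcard and exports the hits as JSON. The sample document, the element paths and the function names are assumptions made for this example; they are not the API of the libraries in the repository.

```python
import json
import xml.etree.ElementTree as ET
from fnmatch import fnmatchcase

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical mini dictionary, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <entry><form type="lemma"><orth>dei</orth></form>
    <sense><def>day</def></sense></entry>
  <entry><form type="lemma"><orth>dak</orth></form>
    <sense><def>roof</def></sense></entry>
  <entry><form type="lemma"><orth>hûs</orth></form>
    <sense><def>house</def></sense></entry>
</body></text></TEI>"""


def query_lemmas(tei_xml, pattern):
    """Return entries whose lemma matches a shell-style wildcard such as 'da*'."""
    root = ET.fromstring(tei_xml)
    hits = []
    for entry in root.iter(TEI + "entry"):
        orth = entry.find(f"{TEI}form/{TEI}orth")
        if orth is None or not fnmatchcase(orth.text or "", pattern):
            continue
        senses = [d.text for d in entry.iter(TEI + "def") if d.text]
        hits.append({"lemma": orth.text, "senses": senses})
    return hits


def export_json(hits):
    """Serialize query hits as JSON, keeping non-ASCII characters readable."""
    return json.dumps(hits, ensure_ascii=False, indent=2)


print(export_json(query_lemmas(SAMPLE, "da*")))
```

Because the model fixes where lemmas and senses live, the same few lines work for any dictionary that follows it.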
Frisian.eu: JSON service (translate, usage, lex0), app (options, i18n, da*, details)
4
Demo
• Frisian.eu/dictionaries
Our next stop in the journey is this spider's web, which should give you an idea of how
the dictionaries fit into the bigger picture. The green lines show how data are edited in
one place and reused in other places. For example, we have a generic transformer
that extracts a parallel corpus from dictionaries for a translation solution. The Frisian
lexicon is an important source for spelling, synonyms, variants and the paradigms of
lemmata. The red lines show how data in several systems are used in a generic
language service we developed. Finally, the blue lines show how user applications
query language data via the language service, spelling being the exception. Of course
we have more tools that also reuse data, for example a POS tagger and a dependency
parser; you can find those at frisian.eu.
5
A language API
https://frisian.eu/languageapidocs/
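The parallel-corpus extraction mentioned in the notes can be sketched as follows. This is a minimal illustration, assuming that an example sentence is encoded as a cit of type "example" with a nested cit of type "translation" (a common TEI dictionary pattern); the sample entry and the function name are hypothetical and do not reproduce the actual transformer.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical entry with one translated example, for illustration only.
SAMPLE = """<entry xmlns="http://www.tei-c.org/ns/1.0">
  <form type="lemma"><orth>hûs</orth></form>
  <sense>
    <cit type="example" xml:lang="fy">
      <quote>It hûs is grut.</quote>
      <cit type="translation" xml:lang="nl"><quote>Het huis is groot.</quote></cit>
    </cit>
  </sense>
</entry>"""


def parallel_pairs(entry_xml, target_lang):
    """Collect (source sentence, translation) pairs for one target language."""
    root = ET.fromstring(entry_xml)
    pairs = []
    for example in root.iter(TEI + "cit"):
        if example.get("type") != "example":
            continue
        source = example.find(TEI + "quote")
        if source is None or not source.text:
            continue
        # Only direct child translations belong to this example.
        for trans in example.findall(TEI + "cit"):
            if trans.get("type") != "translation":
                continue
            if trans.get(XML_LANG) != target_lang:
                continue
            quote = trans.find(TEI + "quote")
            if quote is not None and quote.text:
                pairs.append((source.text.strip(), quote.text.strip()))
    return pairs
```

Each pair is one aligned sentence for a translation system, which is how data edited once in a dictionary can be reused elsewhere.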
API docs: documented, readable API; contract first; generation; available libraries
Frysker: search ("sykje") "nadat", links, "dag*", texts, "nadat" page 2, "moeder"; translator ("oersetter") "dit
congres had in griekenland moeten zijn" ("this conference should have been in Greece")
6
Demo
• frisian.eu/languageapidocs
• Frysker.nl
• https://bitbucket.org/fryske-akademy/languageapi/src/master/model/src/main/resources/graphql/languageapi.svg
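For readers new to GraphQL, here is a minimal sketch of how a client shapes a request. It deliberately avoids guessing at the language API's schema: the query only asks for __typename, a meta-field every GraphQL schema answers, and the endpoint URL is a placeholder, not the real service address. The request is built but not sent; the actual fields and types are described in the API documentation listed above.

```python
import json
import urllib.request

# Placeholder endpoint (assumption); replace with the address from the API docs.
ENDPOINT = "https://example.org/graphql"


def build_graphql_request(query, variables=None):
    """Build (but do not send) a GraphQL POST request over HTTP."""
    payload = {"query": query, "variables": variables or {}}
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# '{ __typename }' is valid against any schema, so nothing about the
# real API is assumed here.
request = build_graphql_request("{ __typename }")
```

Sending it would be a matter of passing the request to urllib.request.urlopen and decoding the JSON response; a contract-first API means the query text can be checked against the published schema before any call is made.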
For our institution, the investment in unifying data and systems saved us a lot of
time and effort, and enabled us to develop new functionality exposed via a
state-of-the-art API.
7
Main achievements
• Unified dictionaries (5), lexicon and corpora
• Reliable build and deploy pipelines
• Automatically publish data
• GraphQL API
• Established our institution as data and service provider
In the daily language work of our institution we increasingly benefit from what we
developed over the years. We love to work with automated pipelines from editing to
publication. We love to work with focused systems that integrate well. Standardized
terminology and stable APIs in most of our systems enable us to manage data in one
place and reuse them elsewhere, which frees up time.
8
Advantages
• Limit effort: only manage data
• Stable, usable datamodel based on well-known TEI and UD
• Editor support: documentation, dropdowns, validation
• Available service and webapp
• Separate editing from running with data sync
• Integrates well (e.g. TEI-Lex0, freedict)
• Open, free
• Stable institution to support and maintain solution
Mainly on a technical level we have some doubts concerning complexity and
developer support: this is not mainstream, widely supported technology. Still, the
communities involved are active, enthusiastic and willing to help. Performance needs
constant attention and is a bit hard to tune, though dictionaries holding 65,000+
lemmas are running fine. The issue of the installed base will be solved: after this
presentation you will of course all line up to join in and use our solutions. In search of
a friendly editing environment, we are considering two options for further
investigation: Oxygen Web Author and e-editions.org. Both offer configurable editing
support; e-editions also offers configurable publication, and Web Author offers a bit of
workflow.
9
Disadvantages
• Complexity of TEI / ODD
• Declining attention for XML in ICT and Java
• eXist-db / XQuery lacks sophisticated debugging and profiling
• No large installed base (yet)
• No friendly editing environment in place
Luckily there is still a lot left to be organized. Provided we can find the resources, in
the coming years we will extend functionality, following the open and modular
approach that brought us here.
10
Wishes
• Crowd involvement
• More (dialect, historical) dictionaries
• Phonetics and spoken
• Integration
• Corpora
• GIS
• Linked Open Data
• Data pipelines
We enjoy what we do and will continue to extend, improve and connect to other
initiatives. We would like to see cooperation intensify. For example, a community
could be formed around our dictionary setup that steers its development and widens
its use. I hope to hear from you. Don't hesitate to take a peek and contact us. Thank
you for listening.
11
Join us, ...or we'll join you
• Benefit from our applications and libraries
• Provide data => run dictionaries
• Use dictionary libraries in your applications
• Maintain, improve datamodel
• Develop functionality
• Set up friendly editing environment
https://bitbucket.org/fryske-akademy/online-dictionaries
Edrenth@fryske-akademy.nl
+31620943428
TEI based dictionaries