It has been shown that data management should start as early as possible in the research workflow to minimize the risks of data loss. Given the large numbers of datasets produced every day, curators may be unable to describe them all, so researchers should take an active part in the process. However, since they are not data management experts, they must be provided with user-friendly but powerful tools to capture the context information necessary for others to interpret and reuse their datasets. In this paper, we present Dendro, a fully ontology-based collaborative platform for research data management. Its graph data model innovates in the sense that it allows domain-specific lightweight ontologies to be used in resource description, acting as a staging area for later deposit in long-term preservation solutions.
The Dendro research data management platform: Applying ontologies to long-term preservation in a collaborative environment
1. The Dendro research data management platform
Applying ontologies to long-term preservation in a collaborative environment
João Rocha da Silva (joaorosilva@gmail.com)
João Aguiar Castro (joaoaguiarcastro@gmail.com)
Faculdade de Engenharia da Universidade do Porto / INESC TEC
Cristina Ribeiro (mcr@fe.up.pt)
João Correia Lopes (jlopes@fe.up.pt)
DEI, Faculdade de Engenharia da Universidade do Porto / INESC TEC
iPRES 2014, October 06 - 10 2014, Melbourne, Australia
2. Contents
• Research data management in the long tail
• Linked Open Data: why do we need it?
• Collaboration for easier metadata production
• The Dendro platform
• Conclusions
4. The long tail of research
• 2011: Science magazine asked its reviewers about their data requirements
• ~1700 replied
5. Source
Dealing with data. Challenges and opportunities. Introduction. (2011). Science
(New York, N.Y.), 331(6018), 692–3. doi:10.1126/science.331.6018.692
13. Linked Open Data
• Simplicity
- LOD is a very simple model for representing knowledge
• Meaning
- Resources are interlinked by properties with established meaning
• Interoperability
- Standard methods for querying data: SPARQL
- Representations use standard formats: RDF, OWL
14. [Figure: RDF graph describing a file in Dendro]
Resource: http://dendro.fe.up.pt/project/datanotes/data/base%20data.xls
• rdf:type: nie:File
• nie:title: base data.xls
• dc:title: “Base data of the DCB experiments”
• dcb:initialCrackLength: 180mm
• nie:isLogicalPartOf: http://dendro.fe.up.pt/project/datanotes/data
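The graph on this slide can be written down as a handful of subject-predicate-object triples. The following is only a minimal sketch in Python that serializes them as N-Triples. It is not Dendro's own code. The `rdf` and `nie` namespace URIs are the published ones, the `dc` prefix is assumed to mean Dublin Core Elements 1.1, and the `dcb` namespace is a hypothetical placeholder because the slides do not give it. The pairing of each property with its value follows the most plausible reading of the diagram.

```python
# Namespace URIs. DC is an assumed binding; DCB is a hypothetical placeholder.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
NIE = "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#"
DC = "http://purl.org/dc/elements/1.1/"   # assumption: "dc" = DC Elements 1.1
DCB = "http://example.org/dcb#"           # hypothetical, not given in the slides

FILE = "http://dendro.fe.up.pt/project/datanotes/data/base%20data.xls"
FOLDER = "http://dendro.fe.up.pt/project/datanotes/data"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# One tuple per edge of the graph shown on the slide.
triples = [
    (iri(FILE), iri(RDF + "type"), iri(NIE + "File")),
    (iri(FILE), iri(NIE + "title"), literal("base data.xls")),
    (iri(FILE), iri(DC + "title"), literal("Base data of the DCB experiments")),
    (iri(FILE), iri(DCB + "initialCrackLength"), literal("180mm")),
    (iri(FILE), iri(NIE + "isLogicalPartOf"), iri(FOLDER)),
]


def to_ntriples(rows):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in rows)


if __name__ == "__main__":
    print(to_ntriples(triples))
```

Generic (`dc`, `nie`) and domain-specific (`dcb`) descriptors sit side by side on the same resource, which is the point the slide makes.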
15. Generic vs. domain-specific descriptors
• Analytical Chemistry Dataset
- Generic: Author, Description, Creation date, …
- Domain-specific: Sample Count, Analysed Substance, …
• Fracture Mechanics Dataset
- Generic: Author, Description, Creation date, …
- Domain-specific: Initial Crack Length, Specimen Type, …
• …
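The split between generic and domain-specific descriptors can be illustrated with a small Python sketch. The descriptor names come from the slide. The data structure and the `descriptors_for` helper are hypothetical and are not part of Dendro.

```python
# Descriptors shared by every dataset, regardless of research domain.
GENERIC = ["Author", "Description", "Creation date"]

# Extra descriptors contributed by each domain's lightweight ontology.
DOMAIN_SPECIFIC = {
    "Analytical Chemistry": ["Sample Count", "Analysed Substance"],
    "Fracture Mechanics": ["Initial Crack Length", "Specimen Type"],
}


def descriptors_for(domain):
    """Return generic descriptors followed by the domain's own, if any."""
    return GENERIC + DOMAIN_SPECIFIC.get(domain, [])


if __name__ == "__main__":
    for name in DOMAIN_SPECIFIC:
        print(name, "->", ", ".join(descriptors_for(name)))
```

A domain with no ontology loaded yet still gets the generic descriptors, so a dataset can always be described at a basic level.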
21. The Dendro platform
An open-source platform for Linked Open Data in
research environments
22. Description
[Diagram layers: Metadata, Ontologies]
• Data store fully built on Linked Data
• No relational database to preserve
• Model can grow by loading more ontologies
• External systems can retrieve resources via SPARQL
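As an illustration of retrieval via SPARQL, the sketch below builds a query for all files in a folder and encodes it as a GET request URL, following the standard SPARQL 1.1 Protocol `query` parameter. The endpoint address is a hypothetical placeholder, since the slides do not state where an instance exposes its endpoint. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Hypothetical endpoint address; a real instance would publish its own.
ENDPOINT = "http://dendro.example.org/sparql"

NIE = "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#"
FOLDER = "http://dendro.fe.up.pt/project/datanotes/data"


def files_in_folder_query(folder_iri):
    """SPARQL query listing each file in a folder together with its title."""
    return (
        f"PREFIX nie: <{NIE}>\n"
        "SELECT ?file ?title WHERE {\n"
        f"  ?file nie:isLogicalPartOf <{folder_iri}> ;\n"
        "        nie:title ?title .\n"
        "}"
    )


def query_url(endpoint, query):
    """Encode the query as a SPARQL 1.1 Protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    print(query_url(ENDPOINT, files_in_folder_query(FOLDER)))
```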
23. Deposit
[Diagram layers: Metadata, Ontologies, File Storage]
• GridFS cluster for large or numerous files
• Can work in the cloud if needed
24. Collaboration
[Diagram layers: Metadata, Ontologies, File Storage, Business Logic]
• Flexible access control system
• Backup / Restore
• Version history
• File type previews
• Integration
- DSpace (SWORD)
- EPrints (SWORD)
- CKAN
- Figshare
- …
25. Sharing
[Diagram layers: Metadata, Ontologies, File Storage, Business Logic, API, Web UI]
• All operations available via RESTful API using JSON
• All resources are dereferenceable (HTTP content negotiation)
• Plugin architecture allows integration with external systems
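HTTP content negotiation means a client asks for the same resource URL in different representations by changing the `Accept` header. The sketch below prepares such requests with Python's standard library without sending them. The media types are an assumption based on the HTML, JSON and RDF/XML formats named elsewhere in these slides, and the `negotiated_request` helper is hypothetical.

```python
from urllib.request import Request

RESOURCE = "http://dendro.fe.up.pt/project/datanotes/data/base%20data.xls"

# Assumed mapping from representation name to media type.
FORMATS = {
    "html": "text/html",
    "json": "application/json",
    "rdf": "application/rdf+xml",
}


def negotiated_request(url, fmt):
    """Build a GET request asking for one representation of the resource."""
    return Request(url, headers={"Accept": FORMATS[fmt]})


if __name__ == "__main__":
    for name in FORMATS:
        req = negotiated_request(RESOURCE, name)
        print(req.get_method(), req.full_url, "Accept:", req.get_header("Accept"))
```

Passing one of these requests to `urllib.request.urlopen` would perform the call. Whether a given server honours each media type depends on the instance.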
26. For curators
• Curators can work with researchers to build more
ontologies using existing tools (e.g. Protégé)
• Established ontologies can be loaded (DC, FOAF…)
• Ontologies mature (reuse across Dendro instances)
• Data, metadata and its meaning go together
Related papers:
• Beyond INSPIRE: An ontology for biodiversity metadata records
Rocha da Silva, J., Castro, J., Ribeiro, C., Honrado, J., Lomba, A., Gonçalves, J.
10th International Workshop on Ontology Content (OntoContent 2014)
(pre-print available at http://dendro.fe.up.pt/)
• Creating lightweight ontologies for dataset description: Practical applications in a cross-domain research data management workflow
Castro, J., Rocha da Silva, J., Ribeiro, C.
Digital Libraries 2014 (DL2014)
(pre-print available at http://dendro.fe.up.pt/)
27. For programmers
• 100% Open-source software
• Rich API allows Dendro to be connected to almost
any system (e.g. mobile apps)
Related papers:
• Ontology-based multi-domain metadata for research data management using triple stores
Rocha da Silva, J., Ribeiro, C., Correia Lopes, J.
18th International Database Engineering & Applications Symposium (IDEAS 2014)
(pre-print available at http://dendro.fe.up.pt/)
• LabTablet: semantic metadata collection on a multi-domain laboratory notebook
Amorim, R., Castro, J., Rocha da Silva, J., Ribeiro, C.
8th Metadata and Semantics Research Conference (MTSR 2014)
(pre-print available at http://dendro.fe.up.pt/)
28. Dendro dies, data lives on
• Triple Store: the “Database”
• Ontologies: the “Documentation”
29. Conclusions
• Research data management should start early
• Linked Open Data: simple, interoperable, flexible
• Collaboration support helps researchers while
gathering metadata for later deposit
• Dendro: a fully open-source platform for RDM, built
on Linked Open Data
• Dendro integrates with major repository platforms
30. Conclusions (cont’d)
• Ontologies: source of metadata descriptors
• Data model grows as more ontologies are loaded
• Curators can model and share the ontologies
• Domain ontologies evolve with reuse
32. João Rocha da Silva
PhD Student, Senior Web Developer, Semantic Web at INESC TEC
João Rocha da Silva is an Informatics Engineering PhD student at the Faculty of Engineering of the University of Porto. He specializes in research data management, applying the latest Semantic Web technologies to the adequate preservation and discovery of research data assets.
He is also an experienced freelance iOS developer with several apps published on the App Store, and a self-taught DIY mechanic with a special interest in classic cars, particularly his 1987 Toyota Corolla GT Twin Cam, also known as Hachi-Roku or AE86.
João Aguiar Castro
PhD Student, Research Data Management researcher at INESC TEC
João Aguiar Castro holds a Master's degree in Information Science and is currently a Digital Platforms PhD student at the Faculty of Engineering of the University of Porto. He is a research data management researcher, working particularly on the definition of application profiles that meet the metadata needs of different research domains.
Cristina Ribeiro
Assistant Professor in Informatics Engineering at Universidade do Porto, Researcher at INESC TEC
Cristina Ribeiro is an Assistant Professor in Informatics Engineering at Universidade do Porto and a researcher at INESC TEC. She graduated in Electrical Engineering and holds a Master's degree in Electrical and Computer Engineering and a Ph.D. in Informatics. Her teaching includes undergraduate and graduate courses in information retrieval, digital libraries, knowledge representation and markup languages. She has been involved in research projects in the areas of cultural heritage, multimedia databases and information retrieval. Currently her main research interests are information retrieval, digital preservation and the management of research data.
João Correia Lopes
Assistant Professor in Informatics Engineering at Universidade do Porto, Researcher at INESC TEC
João Correia Lopes is an Assistant Professor in Informatics Engineering at Universidade do Porto and a researcher at INESC TEC. He graduated in Electrical Engineering from the University of Porto in 1984 and holds a PhD in Computing Science from Glasgow University (1997). His teaching includes undergraduate and graduate courses in databases and web applications, software engineering and object-oriented programming, markup languages and semantic web. He has been involved in research projects in the areas of long-term preservation, service-oriented architectures and e-Science. Currently his main research interests are e-Science and the management of research data.
34. [Figure: Dendro architecture]
• Presentation: Web Interface built with AngularJS (JavaScript), serving human users on the Web; output formats HTML, JSON and RDF/XML; SPARQL Endpoint; API
• Business Logic: NodeJS (JavaScript), with a DB Adapter, an ES Endpoint and a GridFS Client, each exchanging JSON
• Data: Graph Database (LOD) on OpenLink Virtuoso 7; distributed document index on ElasticSearch; File Storage Cluster on MongoDB (GridFS)
35. [Figure: Dendro in the research data management workflow, steps 1 to 4]
• Actors: Data producers, Curator, Data reuser
• Data flow: Working Files, Dendro, Metadata validation, Curated Dataset, Deposit (CKAN, Dryad)
• Access: Web Portal, SPARQL Endpoint, Free-Text Search, API
• Ontologies: Specification of new metadata ontologies; Domain-Specific Lightweight Ontologies (dcb); Ontology concept reuse (FOAF, DC); Sharing & evolution; “Mature” ontologies on the web
• Example descriptors: dc:title, nie:isPartOf, dcb:specimenLength