The document discusses the Deep Carbon Virtual Observatory (DCvO), which aims to provide integrated access to scientific data and information related to deep carbon science. It describes the DCvO's architecture, which leverages semantic technologies and ontologies to link different types of data and entities. The DCvO provides visualization tools to discover information by clicking through linked concepts, facilitates potential collaborations, and acts as an integrated portal for diverse scientific content. It also discusses some of the DCvO's boundary activities, like linking datasets to publications and integrating with external data repositories.
From data portal to knowledge portal: Leveraging semantic technologies to support interdisciplinary studies
1. deepcarbon.net
Xiaogang Ma, Patrick West, John Erickson, Stephan Zednik,
Yu Chen, Han Wang, Hao Zhong, Peter Fox
Tetherless World Constellation
Rensselaer Polytechnic Institute
From data portal to knowledge portal:
Leveraging semantic technologies to support interdisciplinary studies
2. Outline
• Deep Carbon Observatory
• Deep Carbon Virtual Observatory (DCvO)
– Architecture of DCvO
– DCO Ontologies
– Boundary activities
– Discovering information by clicking through
• Summary
2
3. A 10-year (2009-2019) initiative to intensify
global attention and scientific effort in the
burgeoning field of deep carbon science
3
4. • Faculty, staff and students from the Tetherless
World Constellation (TWC) at Rensselaer
Polytechnic Institute (RPI)
• Responsible for
– DCO Architecture and technology infrastructure
– DCO Computer Cluster
– The Deep Carbon Virtual Observatory DCvO
Deep Carbon Observatory – Data Science
4
5. Deep Carbon Virtual Observatory
Scientists – actually ANYONE - should be able to access a
global, distributed knowledge base of scientific data and
information that:
• appears to be integrated
• appears to be locally available
• is in a language (written, programming, or science)
that is understandable and can be shared
Data intensive – volume, complexity, mode, scale,
heterogeneity, … in an OPEN WORLD
5
6. Deep Carbon Virtual Observatory
• A vision of the DCvO:
– A conceptual model of the interplay between data, people,
publication, instruments, models, organizations, etc.
– Identify, annotate and link all key entities, agents and activities
– A repository for datasets and associated metadata
– Unique and powerful data and metadata visualization for
dissemination of information
– Facilitates the discovery of potential collaborations
– An integrated portal for diverse content and applications
(Fox et al., 2014)
6
8. vivo.cornell.edu
VIVO - represents
academic research
communities
DCO ontology:
a model for concept types and relationships
DCO ontologies extend each
other and the VIVO ontology 8
9. Ontologies and schemas
used in the DCO web portal
9
Name Prefix
Dublin Core Metadata Element Set dc
DCMI Metadata Terms dct
VIVO Core vivo
VIVO Scientific Research Ontology scires
Data Catalog Vocabulary dcat
Bibliographic Ontology bibo
Citation Counting and Context Characterization Ontology c4o
Citation Typing Ontology cito
FRBR-Aligned Bibliographic Ontology fabio
Event Ontology event
Friend of a Friend foaf
vCard Ontology vcard
Geopolitical Ontology geo
Simple Knowledge Organization System skos
DCO Ontology dco
PROV Ontology prov
10. Ontologies and schemas
used in the DCO web portal
10
DCO Boundary Activities are driving the extensions
within the DCO Ontologies
12. 12
Dynamically generated list of Grants
that are part of the Deep Carbon
Observatory. Users can click through to
learn more, and members can create
reports to be sent to funding orgs
13. 13
Grant page lists all projects and
reporting updates for each of the
projects and field studies
15. 15
A Few Boundary Activities
• Given a DOI pull publication information from CrossRef
and/or Web of Science
• DCO IGSN Allocation Agent to work with the IGSN
Registry
• Integration with existing data portals and repositories
• Data Rescue activities
16. Modern informatics enables a new scale-free
framework approach
• Use cases
• Stakeholders
• Modeling
• Ontologies
• Evaluation
16
17. What does a DCO data
publication look like?
17
27. Repository for archiving datasets
Archived datasets of ‘Noble
gas isotope abundances in
terrestrial fluids’
27
28. Collaboration tools
Group Based CollaborationGroup data
deposit and
reporting
Listings of
group content
Group
management
and messaging
28
29. 29
RDA DTR and PIT adoption
The DTR
primitives are
comparable
to a list of
BASIC DATA
TYPE CLASSES
in the DCO
ontology, e.g.
Dataset,
Image, Video,
Audio, etc.
A registered DCO dataset is asserted as an instance of one of
those basic data type classes.
It is possible
to further
annotate the
dataset with
the SPECIFIC
DATA TYPES
defined
within a DTR,
and each
data type
has a unique
PID.
A Few Boundary Activities
30. Results of data type specification
• Updates to the DCO Ontology:
– A new class dco:DataType. Each specific data type is an instance of it
– An object property dco:hasDataType linking a dataset and a data type
– A collection of other classes and properties associated with dco:DataType
30
31. 31
• New datasets available via dataset browser
• Includes citations to the originating publication
• Data files accessible through dataset repository
Thermodynamic Data Rescue
35. Mediation
From: C. Borgman, 2008, NSF Cyberlearning Report, Illustration by Roy Pea and Jillian C. Wallis
6th
Generation
All these generations of
mediation are in effect as we
collaborate
35
Editor's Notes
Scientific research practices regularly adopt new technologies and platforms in an effort to increase information timeliness, sharing and discoverability. There are many initiatives related to open data, open code, open access, open collections, composing the topic of Open Science in academia. Being open has two levels of meanings. The first is to make the data, code, sample collections and publications, etc. freely accessible online. The other is the annotation and connection between those resources to establish linked open data, a web of data. Moving from a data portal, adding context and experience using semantic technology in order to create a knowledge portal.
We present here our work on Deep Carbon Observatory
The Alfred P. Sloan Foundation pledged $50 million over the duration of the initiative to fund infrastructure development, scientific workshops, novel technology development, and preliminary research and fieldwork. The project spans four different scientific communities, shown here. Deep Life, Reservoirs and Fluxes, Deep Energy and Extreme Physics and Chemistry. A wide range of science domains are covered, Earth and Environmental science, ocean science, volcanology and more. There are close to a thousand members from hundreds of organizations all across the world. Diversity of research, diversity of people, diversity of information.
There are a lot of initiatives out there creating their own data portals. Unfortunately in a lot of cases they are data silos. Information is obtained by different means (instruments, models, analysis) using various (often opaque) protocols, in differing vocabularies, using (sometimes unstated) assumptions, with inconsistent (or non-existent) meta-data.
It may be inconsistent, incomplete, evolving, and distributed and created and organized in a manner to facilitate its generation, over its use
And… there exist(ed) significant levels of semantic heterogeneity, large-scale data, complex data types, legacy systems, inflexible and unsustainable implementation technology…
Specifically the use cases generated as part of the boundary activity development, is driving the extensions and additions found within the DCO ontology
Boundary activities are independent collaborations with organizations in an effort to build additional features and integrate additional data for the virtual observatory
Ability for projects collaborators to enter information stating the progress of their projects, updating their project information for use in reports to funding agencies
Dynamically generating a page in Drupal showing all of the Grants that are and have been funded by the Alfred P. Sloan Foundation. The example shown here is the grant for the DCO-DS work. Shows different information about the grant, can click through to find additional information, and you can click through to a reporting year to …
See the list of projects, field studies and any instruments associated with the grant. With the projects and field studies you can see a textual representation of the progress that has been made in the projet. What we were able to do is print out this page as a PDF and send that PDF to the funding organization as the report for 2015.
Stems from our work to implement the Data Type Registry as defined by the Research Data Alliance
International Geo-Sample Number. For example taking a core sample from the Arctic, that sample gets assigned an IGSN. Take slices of that core to run experiments, each of those slices gets an IGSN with provenance to the original core sample
Many portals already exist with their own representation of information, if any. The goal here isn’t to smash the existing models into the DCO model but to extend the DCO ontology with concepts and relationships specific to that portal
Data Rescue, pulling data using OCR software from PDF documents from the 70s, 80s and 90s, creating new datasets and using provenance to tie it back to the original publications.
Look at scale free network definitions in wikipedia, web, semantic networks.
Boundary activity working with the DCO communities to pull publication information from CrossRef and Web of Science given a Digital Object ID
Here we can see the dataset registered within CKAN, and the two files that are a part of the distribution of the dataset. This gives you the access to the data. Discovering the data is done through the community portal and the data portal.
We’ve installed Drupal Organic Groups to facilitate teams of people working on a common initiative within DCO. Many of the collaboration tools are outside of DCO, such as Google Docs, Webex, Skype, Dropbox, Slack and more. Here a team can coordinate their efforts, links to these various collaboration tools, broadcast messages to team members, and provide the links to create content for the group, such as publications and datasets.
RDA: Research Data Alliance; DTR: Data Type Registry; and PIT: Persistent Identifier Type
The primitives in a DTR are comparable to a list of BASIC DATA TYPE CLASSES in the DCO ontology, such as Dataset, Image, Video, and Audio, etc.
A registered DCO dataset is asserted as an instance of one of those basic data type classes.
It is possible to further annotate a registered dataset with the SPECIFIC DATA TYPES defined within a DTR, and each data type has a unique PID.
Illustration by Roy Pea and Jillian C. Wallis, from C. L. Borgman, H. Abelson, L. Dirks, R. Johnson, K. R. Koedinger, M. C. Linn, C. A. Lynch, D. G. Oblinger, R. D. Pea, K. Salen, M. S. Smith, and A. Szalay, “Fostering Learning in the Networked World: The Cyberlearning Opportunity and Challenge. A 21st Century Agenda for the National Science Foundation. Report of the NSF Task Force on Cyberlearning,” Office of Cyberinfrastructure and Directorate for Education and Human Resources. National Science Foundation, Washington, D.C., 2008.
?=smart agents. Open world, semantic agents, with rules… it is notable that these capabilities NEVER made it into the top row of capabilities…
?=data agents. Ones that can find data for you, and perhaps even convert it to the right format, find contextual information, etc.