This presentation gives an overview of the dataset description specification developed in the Open PHACTS project (http://www.openphacts.org/). The creation of the specification was driven by a real need within the project to track the datasets used.
Details of the dataset metadata captured and the vocabularies used to model this metadata are given together with the tools developed to enable the specification's uptake.
Over the last 12 months, the W3C Health Care and Life Sciences (HCLS) Interest Group has been developing a community profile for dataset descriptions, drawing on the ideas developed in the Open PHACTS specification. A brief overview of the forthcoming community profile is given in the presentation.
This presentation was given to the Network Data Exchange (NDEx) project (http://www.ndexbio.org/) on 2 April 2014.
Dataset Descriptions in Open PHACTS and HCLS
1. Dataset Descriptions in Open PHACTS and W3C HCLS IG
Alasdair J G Gray
Heriot-Watt University
www.alasdairjggray.co.uk A.J.G.Gray@hw.ac.uk
NDEx Call, April 2014
2. Architecture diagram; labels shown: Apps; Linked Data API (RDF/XML, TTL, JSON); Core Platform; Semantic Workflow Engine; Data Cache (Virtuoso Triple Store); Domain Specific Services; Identity Resolution Service; Identifier Management Service; Chemistry Registration Normalisation & Q/C; Indexing; example inputs P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”; data sources Db, Nanopub, VoID; Public Content; Commercial; Public Ontologies; User Annotations
3. Architecture diagram; labels shown: Apps; Linked Data API (RDF/XML, TTL, JSON); Core Platform; Semantic Workflow Engine; Data Cache (Triple Store); Domain Specific Services; Identity Resolution Service; Identifier Management Service; example inputs P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”; datasets ChEMBL-RDF, ChEMBL, Chem2Bio2RDF, SD; version labels v13, v12, v2 or v8
4.
5. ChemSpider
• Data aggregator: over 400 sources
– What data does it contain?
– What version of ?? did they load?
– When are new versions loaded?
• OPS data covers
– ChEBI
– ChEMBL
– DrugBank
2 April 2014 OPS Dataset Descriptions – A. J. G. Gray 5
6. Metadata Challenges
• Datasets available
– In many versions over time
– In different formats
– From many mirrors/registries
• Datasets build on each other
• Files do not carry metadata
• Registries
– Can be out-of-date
– Can contain conflicting information
Users require data provenance!
7.
8.
10. Realisation of Dataset Descriptions
• Needs to be incorporated into data publishing pipeline
• Hard for publishers to provide conformant descriptions
– Datasets are complex
– Evolve over time
– Seen as yet another burden
15. Future Vision
Metadata: Write once, use many times
• Provide rich and accurate provenance trail of data
– Automatic pipeline from VoID file to registries
• Align Open PHACTS with W3C HCLS
– Update tools for HCLS profile
Large number of datasets with differing update rates and different characteristics. Require an automated process.
Specifies a checklist of properties. Draws upon existing vocabularies. Aims to be simple to use: extensive guidance notes.
Checklist and guidance notes: user friendly. Minimal, easy-to-follow model. Draws upon existing vocabularies. Required and optional properties.
Agent-entity-action model can be cumbersome for datasets; the agent is not always known beyond the data provider, i.e. not the individual. Extension requirement is by design.
Provide two tools to help
Dataset description creator: generates an outline description through a web form and allows you to see the generated content.
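As a rough illustration of what generating an outline description from form input could look like, here is a minimal Python sketch. It is not the Open PHACTS tool itself. The `form` field names, the `outline_description` function, the example.org URIs, and the choice of properties are all assumptions made for illustration; only the vocabulary namespaces (VoID, Dublin Core Terms, PAV) are real.

```python
# Real vocabulary namespaces; which properties to emit is an assumption,
# not the actual Open PHACTS checklist.
PREFIXES = {
    "void": "http://rdfs.org/ns/void#",
    "dcterms": "http://purl.org/dc/terms/",
    "pav": "http://purl.org/pav/",
}


def escape_literal(text):
    """Escape backslashes, double quotes and newlines for a Turtle string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def outline_description(form):
    """Build an outline Turtle description from (hypothetical) form fields."""
    title = escape_literal(form["title"])
    version = escape_literal(form["version"])
    lines = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in PREFIXES.items()]
    lines += [
        "",
        f"<{form['uri']}> a void:Dataset ;",
        f'    dcterms:title "{title}"@en ;',
        f"    dcterms:license <{form['license']}> ;",
        f'    pav:version "{version}" .',
    ]
    return "\n".join(lines)


# Hypothetical form submission; all values are placeholders.
form = {
    "uri": "http://example.org/dataset/demo",
    "title": 'The "Demo" dataset',
    "license": "http://example.org/licence",
    "version": "17",
}
print(outline_description(form))
```

Printing the result lets the person filling in the form see the generated content before publishing it, which is the behaviour the note describes.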
Given a dataset description, does it conform to the OPS guidelines? Generates error (red) and warning (orange) reports: error for MUST properties, warning for SHOULD properties, information for MAY properties.
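The MUST/SHOULD/MAY reporting can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CHECKLIST` contents are hypothetical and do not reproduce the OPS guidelines, the description is modelled as a plain dictionary rather than parsed RDF, and the `validate` and `conforms` names are invented here.

```python
# Hypothetical checklist mapping a property to its requirement level.
# The real OPS guidelines define their own properties and levels.
CHECKLIST = {
    "dcterms:title": "MUST",
    "dcterms:license": "MUST",
    "pav:version": "SHOULD",
    "void:dataDump": "MAY",
}

# Requirement level -> report severity, as described in the note above.
SEVERITY = {"MUST": "error", "SHOULD": "warning", "MAY": "information"}


def validate(description):
    """Return (severity, message) pairs for checklist properties that are absent."""
    report = []
    for prop, level in CHECKLIST.items():
        if not description.get(prop):
            report.append((SEVERITY[level], f"{level} property {prop} is missing"))
    return report


def conforms(report):
    """A description conforms when the report contains no errors."""
    return all(severity != "error" for severity, _ in report)


# Hypothetical description with a title and version but no licence or data dump.
description = {"dcterms:title": "Demo dataset", "pav:version": "17"}
report = validate(description)
for severity, message in report:
    print(f"{severity}: {message}")
print("conforms:", conforms(report))
```

Treating only missing MUST properties as failures keeps warnings and information messages advisory, so a publisher can still pass with a minimal description.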
Large community buy-in, including EBI. Builds on OPS document: checklist and guidance notes! Wide range of use cases. Should be finalised by end of May; not final URL.
Three-tier model: more complex. More required properties (not shown). Richer metadata.
Open PHACTS: 28 partners (9 pharmaceuticals, 3 biotechs, 1 triplestore firm, 15 academic).