Make your data great again - Ver 2
1. Daniel Jacob – INRA - 2018
How to ensure that open data
works for research
Make your data great again
Daniel Jacob
INRA UMR 1332 BFP – Metabolism Group
Bordeaux Metabolomics Facility
Oct 2018
Following: https://fr.slideshare.net/danieljacob771282/make-your-data-great-now
Give open access to your data
and make it ready to be mined
Open Data for Access and Mining
ODAM Framework
2. Daniel Jacob – INRA - 2018
Develop if needed, lightweight tools
- R scripts, lightweight GUI (R shiny)
Minimal effort, Maximal efficiency
…
Use existing tools
- Spreadsheets, R studio,
BioStatFlow, Galaxy,
Cytoscape, …
Data
Format
TSV
EDTMS
ODAM
F - A - INTEROPERABLE - R (FAIR)
Experiment
Data Tables
2 metadata files
+
Research question Project Experiment Experimental set-up
Data emancipation
regarding Tools
Data API Tools
Data Tools
3. Daniel Jacob – INRA - 2018
Develop if needed, lightweight tools
- R scripts, lightweight GUI (R shiny)
…
Use existing tools
- Spreadsheets, R studio,
BioStatFlow, Galaxy,
Cytoscape, …
Data
Format
TSV
Multi-species
Data Integration
Data integration
Towards Linked Data
Phenotype Information System
EDTMS
ODAM
F - A - INTEROPERABLE - R (FAIR)
« Plant Physiology and Metabolism»
https://www.quora.com/What-is-plant-physiology-and-metabolism
« Plant Growth»
4. Daniel Jacob – INRA - 2018
http://cgi.di.uoa.gr/~pms509/past_lectures/introduction-to-rdf.pdf
EDTMS
ODAM Resource Description Framework (RDF)
5. Daniel Jacob – INRA - 2018
s_subsets.tsv: this metadata file associates a key concept with each data subset file
Creation of the metadata files - Subsets
EDTMS
ODAM
Optional:
an annotation based on
ontology
CV Term
X
…
Optional:
an annotation based
on ontology
Plants
Harvests
Samples
Compounds
…
a_attributes.tsv: this metadata file annotates each attribute (variable) with minimal but relevant metadata
CV Term
X
Resource Description Framework (RDF)
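The two metadata files can be read with a few lines of code. The following is a minimal sketch in Python, assuming tab-separated files with a header row. The column names (subset, identifier, cv_term, attribute, category) and the sample rows are hypothetical placeholders chosen for illustration, not the actual ODAM layout; the optional CV term annotation is left empty.

```python
import csv
import io

# Hypothetical content of s_subsets.tsv: one row per data subset file.
S_SUBSETS = (
    "subset\tidentifier\tcv_term\n"
    "plants\tPlanteID\t\n"
    "samples\tSampleID\t\n"
)

# Hypothetical content of a_attributes.tsv: one row per attribute (variable).
A_ATTRIBUTES = (
    "subset\tattribute\tcategory\tcv_term\n"
    "plants\tPlanteID\tidentifier\t\n"
    "samples\tSampleID\tidentifier\t\n"
    "samples\tweight\tquantitative\t\n"
)


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def attributes_by_subset(subset_rows, attribute_rows):
    """Group attribute names under the subset they belong to."""
    grouped = {row["subset"]: [] for row in subset_rows}
    for row in attribute_rows:
        grouped.setdefault(row["subset"], []).append(row["attribute"])
    return grouped


subsets = read_tsv(S_SUBSETS)
attributes = read_tsv(A_ATTRIBUTES)
grouped = attributes_by_subset(subsets, attributes)
print(grouped)
```

With real files, the same functions could be fed the file contents instead of the inline strings.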
6. Daniel Jacob – INRA - 2018
Data / Metadata
Entities
Attributes
categories
subsets CV Term
s_subsets.tsv
a_attributes.tsv
CV Term ?
attributes CV Term
EDTMS
ODAM Resource Description Framework (RDF)
7. Daniel Jacob – INRA - 2018
Data / Metadata
Entities
Attributes
attributes CV Term
subsets CV Term
s_subsets.tsv
a_attributes.tsv
CV Term
Entity + Attribute = Trait
Trait (characteristic / feature)
categories
EDTMS
ODAM Resource Description Framework (RDF)
8. Daniel Jacob – INRA - 2018
TO
Plant Trait
Ontology EO
Plant Env.
Ontology
PO
Plant
Structure &
Dev. Stage
Ontology
CHEBI
Ontology
GO
Ontology
…
TO
EO
PO
Entity + Attribute = Trait
Trait (characteristic / feature)
Plant Trait Ontology
as the core / kernel of all ontologies
http://agroportal.lirmm.fr/ontologies
EDTMS
ODAM Resource Description Framework (RDF)
« Plant Physiology and Metabolism»
« Plant Growth»
9. Daniel Jacob – INRA - 2018
factor
quantitative
qualitative
identifier
categories
Plants
Compounds
Enzymes
Harvests
Samples
plants.tsv
PlanteID
harvests.tsv
Lot
samples.tsv
SampleID
compounds.tsv
enzymes.tsv
SampleID
SampleID
Entities
TO
Plant Trait
Ontology
EO
Plant Env.
Ontology
PO
Plant Structure &
Dev. Stage
Ontology
GO
Ontology
CHEBI
Ontology
…
Attributes CV Term
CV Term
CV Term
http://agroportal.lirmm.fr/ontologies
CV Term
EDTMS
ODAM
A TBox is a "terminological component":
a conceptualization associated with a set of facts.
TBox
Reference ontologies
Resource Description Framework (RDF)
10. Daniel Jacob – INRA - 2018
Data / Metadata
Category CV Term
Entities
Attributes
Typical queries:
Search for a particular Trait
Entity + Attribute = Trait
CV Term
Attribute Subset
CV Term
Category Species
EDTMS
ODAM
An ABox is an "assertion component":
a fact associated with a conceptual model or ontologies within a knowledge base.
ABox
Application ontologies
Resource Description Framework (RDF)
11. Daniel Jacob – INRA - 2018
factor
quantitative
qualitative
identifier
rdfs:range
categories
For each
Dataset
RDF
Schema
rdfs:label
<description>
rdfs:label
<description>
#description
Attributes Subsets
attribute
node
subset
node
rdf:type rdf:type
rdf:Bag
xsd:string xsd:string
Attribute Entity
#hasEntity
#hasAttribute
Category Species
#hasCategory #hasSpecies
#description
#hasCategory
xsd:string
TO
EO
PO
CHEBI
GO
…
Taxon
rdf:resource
rdf:resource
…
xsd:string
rdf:resource
CV Term
Abox - Application ontologies
Tbox - Reference ontologies
EDTMS
ODAM
https://schema.org/Dataset
measurementTechnique variableMeasured
Resource Description Framework (RDF)
12. Daniel Jacob – INRA - 2018
Category CV Term
Entities
Attributes
Data / Metadata
Traits
Values
Phenotype (observed)
=
Traits + Values
Towards a Phenotype Information System
Automatic populating of the knowledge base
from the metadata files
defined within ODAM data subsets
Attributes Subsets
attribute
node
subset
node
rdf:type rdf:type
Attribute Entity
#hasEntity
#hasAttribute
Category Species
#hasCategory #hasSpecies
EDTMS
ODAM
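The automatic populating of the knowledge base can be pictured as turning each metadata row into a handful of triples. The sketch below is an illustration in plain Python, not the ODAM implementation. The predicate names (#hasEntity, #hasAttribute, #hasCategory, #hasSpecies) are taken from the slide, but which node carries which predicate, the node naming scheme, and the example row are assumptions.

```python
def metadata_to_triples(subset, attribute, category, species):
    """Build (subject, predicate, object) triples for one attribute row.

    The direction of each predicate is an assumption made for this sketch.
    """
    subset_node = f"#subset/{subset}"                    # hypothetical node naming
    attribute_node = f"#attribute/{subset}/{attribute}"  # hypothetical node naming
    return [
        (subset_node, "rdf:type", "#Entity"),
        (attribute_node, "rdf:type", "#Attribute"),
        (subset_node, "#hasAttribute", attribute_node),
        (attribute_node, "#hasEntity", subset_node),
        (attribute_node, "#hasCategory", category),
        (subset_node, "#hasSpecies", species),
    ]


# Hypothetical metadata row: the attribute "weight" of the subset "fruit".
triples = metadata_to_triples("fruit", "weight", "quantitative", "Tomato")
for triple in triples:
    print(triple)
```

Keeping the triples as plain tuples makes the mapping easy to inspect; a real knowledge base would load them into an RDF store.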
13. Daniel Jacob – INRA - 2018
Trait: Fruit + weight = Fruit weight
Constraint
and
Species = Tomato
Typical queries:
Search for a particular Trait
with or without Constraints
hasSynonym Tomato
Towards a Phenotype Information System
Attributes
Entities
EDTMS
ODAM
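The typical query on this slide (search for a Trait, with or without a constraint) can be sketched as a simple filter. This is a minimal illustration in Python over an in-memory list. The record fields, the dataset names, and the second species are hypothetical, and a real system would more likely query the knowledge base itself.

```python
# Hypothetical trait records, one per (entity, attribute) pair found in a dataset.
TRAITS = [
    {"entity": "Fruit", "attribute": "weight", "species": "Tomato", "dataset": "dataset_A"},
    {"entity": "Fruit", "attribute": "weight", "species": "Grape", "dataset": "dataset_B"},
    {"entity": "Leaf", "attribute": "weight", "species": "Tomato", "dataset": "dataset_A"},
]


def search_traits(records, entity, attribute, **constraints):
    """Return records matching Entity + Attribute and any extra constraints."""
    hits = []
    for record in records:
        if record["entity"] != entity or record["attribute"] != attribute:
            continue
        if all(record.get(key) == value for key, value in constraints.items()):
            hits.append(record)
    return hits


# Trait "Fruit weight" without constraint, then with Species = Tomato.
without_constraint = search_traits(TRAITS, "Fruit", "weight")
with_constraint = search_traits(TRAITS, "Fruit", "weight", species="Tomato")
print(len(without_constraint), len(with_constraint))
```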
14. Daniel Jacob – INRA - 2018
Trait: Fruit + weight = Fruit weight
Constraint
and
Species = Tomato
Typical queries:
Search for a particular Trait
with or without Constraints
Phenotype (observed)
=
(Entity + Attribute) + Values
Towards a Phenotype Information System
EDTMS
ODAM
15. Daniel Jacob – INRA - 2018
Category CV Term
Entities
Attributes
Data mapping
Values
Data capture
EDTMS
Entity + Attribute = Trait
Trait (characteristic / feature)
Attributes Subsets
attribute
node
subset
node
rdf:type rdf:type
Attribute Entity
#hasEntity
#hasAttribute
Category Species
#hasCategory #hasSpecies
Data linking
Develop if needed, lightweight tools
- R scripts (Galaxy), lightweight GUI (R shiny)
EDTMS
ODAM
16. Daniel Jacob – INRA - 2018
Category CV Term
Entities
Attributes
Data mapping
Values
Data capture
EDTMS
Phenotype
(observed)
=
Traits + Values
Data Exploration
Entity + Attribute = Trait
Trait (characteristic / feature)
Towards a Phenotype
Information System
Attributes Subsets
attribute
node
subset
node
rdf:type rdf:type
Attribute Entity
#hasEntity
#hasAttribute
Category Species
#hasCategory #hasSpecies
Data linking
Data = Phenotypic data +
Molecular data +
Environment data
Phenotypic metadata =
Descriptors of Traits
(Entity-Attribute) +
Environment Factors
Data accumulation
Knowledge Base
EDTMS
ODAM
17. Daniel Jacob – INRA - 2018
Bayes' theorem, the general formula:
$y$: data; $\theta$: parameters
$[y, \theta] = [y \mid \theta] \cdot [\theta] = [\theta \mid y] \cdot [y]$
where $[\cdot]$ denotes a density or a probability
$[\theta \mid y]$: posterior density, or simply the so-called "posterior"
$[\theta]$: prior density of $\theta$, or simply the so-called "prior"
$[y \mid \theta]$: likelihood (function of $\theta$)
$[y]$: marginal density (data, model)
Model-Based Bayesian Inference:
Data mining
Phenotype
Information
System
Example: model for
phenotypic variance and
biomass prediction (Y)
based on environmental
parameters ($\theta$)
Machine
Learning
« Plant Growth»
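Bayes' theorem as stated on this slide can be checked numerically on a small grid of parameter values. The sketch below is only an illustration in Python: the normal likelihood with known standard deviation, the uniform prior, the grid, and the observed values are all assumptions chosen for the example, not the model used in the presentation.

```python
import math


def normal_pdf(x, mean, sigma):
    """Density of a normal distribution at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def grid_posterior(y, thetas, prior, sigma=1.0):
    """Return likelihood, posterior and marginal over a discrete grid of theta."""
    # Likelihood [y | theta]: product of independent normal densities (assumption).
    likelihood = [math.prod(normal_pdf(yi, t, sigma) for yi in y) for t in thetas]
    # Joint [y, theta] = [y | theta] . [theta]
    joint = [lik * p for lik, p in zip(likelihood, prior)]
    # Marginal [y]: sum of the joint over the grid.
    marginal = sum(joint)
    # Posterior [theta | y] = [y | theta] . [theta] / [y]
    posterior = [j / marginal for j in joint]
    return likelihood, posterior, marginal


# Hypothetical observations and parameter grid.
y = [4.8, 5.1, 5.3]
thetas = [4.0, 4.5, 5.0, 5.5, 6.0]
prior = [1.0 / len(thetas)] * len(thetas)  # uniform prior (assumption)

likelihood, posterior, marginal = grid_posterior(y, thetas, prior)
best_theta = thetas[posterior.index(max(posterior))]
print(best_theta, posterior)
```

On a grid, the marginal is a plain sum, which makes the identity between the two factorizations of the joint easy to verify term by term.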
18. Daniel Jacob – INRA - 2018
Make your data great again
Metadata: not just at the "top" level,
linked to datasets, but linked more
deeply to the variables.
The data management system becomes completely
independent of data usage.
One dataset, several applications
&
One application, several datasets
Making open data work for research
Data accumulation
Knowledge Base
Keep data "alive" in the data processing loop,
in a similar way to DNA/protein
sequences, which can be
integrated into annotation pipelines.
Machine Learning
Model-Based Bayesian Inference:
An ABox is an "assertion component": a fact associated with a terminological vocabulary within a knowledge base.
TBox statements describe a system in terms of controlled vocabularies, for example a set of classes and properties. ABox statements are TBox-compliant statements about that vocabulary.
Typical questions:
What is the set of "Traits" (quantitative/qualitative) for a given sample (identifier)?
What is the set of "Traits" (quantitative/qualitative) for one or more given CV terms { subset type, e.g. CV subset in (metabolite, enzyme) (CHEBI); attribute type, e.g. CV attribute == tissue == "fruit pericarp" (PO) }, with or without additional constraints?
E.g. factor type