Scientific Lenses over 
Linked Data 
An approach to support 
multiple integrated views 
Alasdair J G Gray 
A.J.G.Gray@hw.ac.uk 
alasdairjggray.co.uk 
@gray_alasdair
Open PHACTS Use Case 
“Let me compare MW, logP 
and PSA for launched 
inhibitors of human & 
mouse oxidoreductases” 
 Chemical Properties (Chemspider) 
 Launched drugs (Drugbank) 
 Human => Mouse (Homologene) 
 Protein Families (Enzyme) 
 Bioactivity Data (ChEMBL) 
 … other info (Uniprot/Entrez etc.) 
16 October 2014 Scientific Lenses – A. J. G. Gray 1
Discovery Platform 
Apps 
Method 
Calls 
Domain API 
Drug Discovery Platform 
Interactive 
responses 
Production quality 
integration platform 
App Ecosystem An “App Store”? 
Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatrek Helium 
MOE Collector Cytophacts Utopia Garfield SciBite 
KNIME Mol. Data Sheets PipelinePilot scinav.it Taverna 
http://www.openphactsfoundation.org/apps.html 
API Hits 
April 2013 – March 2014: 15.8m 
April 2014 – Sept 2014: 14m 
Total: 29.8 million 
Linked Data API 
Drug 
Target Pathway 
Disease (1.4) 
https://dev.openphacts.org/ 
Open PHACTS Data 
Source         Initial Records   Triples          Properties
ChEMBL         1,481,473         304,360,749      77
DrugBank       19,628            517,584          74
UniProt        564,246           405,473,138      82
ENZYME         6,187             73,838           2
ChEBI          40,575            1,673,863        2
GeneOntology   38,137            2,447,682        26
GOA            661,232           1,765,622,393    15
ChemSpider     1,361,568         215,193,441      23
ConceptWiki    2,828,966         4,291,131        1
WikiPathways   946               1,949,074        34
Dataset Descriptions in the 
Open Pharmacological 
Space 
14 January 2013 
Being replaced by W3C 
HCLS community profile 
http://tiny.cc/hcls-datadesc-ed 
OPS Discovery Platform 
Apps 
Linked Data API (RDF/XML, TTL, JSON) 
Domain Specific Services 
Semantic Workflow Engine 
Data Cache (Virtuoso Triple Store) 
Identity Resolution Service 
Identifier Management Service 
Chemistry Registration, Normalisation & Q/C 
Indexing 
Core Platform 
Example identifiers: “Adenosine receptor 2a”, P12374, EC2.43.4, CS4532 
Data sources (VoID, Nanopub, Db): Public Content, Commercial, Public Ontologies, User Annotations 
Multiple Identities 
Andy Law's Third Law 
“The number of unique identifiers assigned to an individual is 
never less than the number of Institutions involved in the study” 
http://bioinformatics.roslin.ac.uk/lawslaws/ 
GB:29384 
P12047 
X31045 
Are these the same thing?
Gleevec®: Imatinib Mesylate 
Imatinib 
Imatinib Mesylate 
Mesylate 
YLMAHDNUQAMNNX-UHFFFAOYSA-N 
ChemSpider Drugbank PubChem 
Gleevec®: Imatinib Mesylate 
Imatinib 
Are these records the same? 
It depends upon your task! 
Imatinib Mesylate 
Mesylate 
YLMAHDNUQAMNNX-UHFFFAOYSA-N 
ChemSpider Drugbank PubChem 
Genes == Proteins? 
BRCA1: Chromosome 17 
Breast cancer type 1 
susceptibility protein 
http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png 
http://en.wikipedia.org/wiki/File:BRCA1_en.png 
Genes == Proteins? 
BRCA1: Chromosome 17 
Breast cancer type 1 
susceptibility protein 
http://en.wikipedia.org/wiki/File:Protein_BRCA1_PDB_1jm7.png 
http://en.wikipedia.org/wiki/File:BRCA1_en.png 
Are these records the same? 
It depends upon your task! 
Example Use Cases 
I need to perform an 
analysis, give me details 
of the active compound 
in Gleevec. 
Which targets are 
known to interact 
with Gleevec? 
Structure Lens 
I need to perform an analysis, give me details of the active compound in Gleevec. 
Strict Relaxed 
Analysing Browsing 
skos:exactMatch (InChI) 
Name Lens 
Which targets are known to interact with Gleevec? 
Strict Relaxed 
Analysing Browsing 
skos:closeMatch (Drug Name) 
skos:exactMatch (InChI) 
skos:closeMatch (Drug Name) 
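The two lenses above can be read as filters over the same set of mappings. The following is a minimal sketch of that idea, not the Open PHACTS implementation: the `Mapping` class, the `LENSES` table, the `equivalents` function and every `ex:` identifier are hypothetical names introduced for illustration. Only the predicates and justifications (skos:exactMatch via InChI, skos:closeMatch via Drug Name) come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    source: str
    target: str
    predicate: str      # e.g. "skos:exactMatch"
    justification: str  # e.g. "InChI" or "Drug Name"


# Hypothetical lens table: each lens names the justifications it accepts.
LENSES = {
    "strict": {"InChI"},                # analysing
    "relaxed": {"InChI", "Drug Name"},  # browsing
}

# Hypothetical identifiers, for illustration only.
MAPPINGS = [
    Mapping("ex:chemspider/imatinib", "ex:drugbank/imatinib",
            "skos:exactMatch", "InChI"),
    Mapping("ex:drugbank/imatinib", "ex:drugbank/gleevec",
            "skos:closeMatch", "Drug Name"),
]


def equivalents(iri, mappings, lens):
    """Return iri plus every identifier directly linked to it under the lens."""
    allowed = LENSES[lens]
    result = {iri}
    for m in mappings:
        if m.justification not in allowed:
            continue  # this lens does not treat the link as an equivalence
        if m.source == iri:
            result.add(m.target)
        elif m.target == iri:
            result.add(m.source)
    return result


strict = equivalents("ex:drugbank/imatinib", MAPPINGS, "strict")
relaxed = equivalents("ex:drugbank/imatinib", MAPPINGS, "relaxed")
print(sorted(strict))
print(sorted(relaxed))
```

Under the strict lens only the structure-based link is followed; the relaxed lens also follows the name-based link, which suits browsing but not analysis.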
What is a Scientific Lens? 
A lens defines a conceptual view over the data 
 Specifies operational equivalence conditions 
Consists of: 
 Identifier (URI) 
 Title (dct:title) 
 Description (dct:description) 
 Documentation link (dcat:landingPage) 
 Creator (pav:createdBy) 
 Timestamp (pav:createdOn) 
 Equivalence rules (bdb:linksetJustification) 
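The components listed above map naturally onto a small record type. The sketch below assumes one field per listed property; the class name `Lens`, the field names, the `accepts` helper and all example values (including the `example.org` URIs) are hypothetical and are not taken from the Open PHACTS code base.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Lens:
    uri: str                   # Identifier (URI)
    title: str                 # dct:title
    description: str           # dct:description
    landing_page: str          # dcat:landingPage
    created_by: str            # pav:createdBy
    created_on: datetime       # pav:createdOn
    justifications: frozenset  # bdb:linksetJustification values

    def accepts(self, justification):
        """True if links with this justification count as equivalences."""
        return justification in self.justifications


# Hypothetical example values, for illustration only.
structure_lens = Lens(
    uri="http://example.org/lens/structure",
    title="Structure Lens",
    description="Treats records as equivalent only when structures match.",
    landing_page="http://example.org/docs/lens/structure",
    created_by="http://example.org/people/curator",
    created_on=datetime(2014, 10, 16),
    justifications=frozenset({"InChI"}),
)

print(structure_lens.accepts("InChI"))
print(structure_lens.accepts("Drug Name"))
```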
Lens Effects: Ibuprofen 
Ibuprofen consists of two equally active stereoisomers. 
• Stereoisomers not always represented in data 
Users wish to retrieve information for any stereoisomer. 
CHEMBL427526 
CHEMBL521 
CHEMBL175 
Default Lens 
Ibuprofen consists of two equally active stereoisomers. 
• Stereoisomers not always represented in data 
Users wish to retrieve information for any stereoisomer. 
Stereoisomer Lens 
Ibuprofen consists of two equally active stereoisomers. 
• Stereoisomers not always represented in data 
Users wish to retrieve information for any stereoisomer. 
Mapping Generation 
✔ 
ops:OPS437281 
has_stereoundefined_parent 
[ci:CHEMINF_000456] 
ops:OPS380297 
is_stereoisomer_of 
[ci:CHEMINF_000461] 
ops:OPS380292 
Other relationships 
• has part 
• is tautomer of 
• uncharged counterpart 
• isotope 
… 
Initial Connectivity 
Datasets 37 
Linksets 104 
Links 7,096,712 
Justifications 7 
Compound Information 
Proceed with Caution! 
Co-reference Computation 
Rules ensure 
 Unrestricted transitivity within conceptual type 
 Restrict crossing conceptual types 
Based on justifications 
Provenance captured 
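One way to picture the two rules is a union-find pass that merges identifiers only when a link stays within a single conceptual type. This is a sketch under a simplifying assumption, namely that links crossing conceptual types are never chained at all; the slides only say such crossings are restricted. The function `infer_groups`, the `TYPE_OF` table and all identifiers below are hypothetical.

```python
def infer_groups(links, type_of):
    """Group identifiers by transitive closure over same-type links only."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        # Assumption: a link is chained only if both ends share a type.
        if type_of[a] == type_of[b] and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical identifiers and conceptual types, for illustration only.
TYPE_OF = {
    "ex:cs/1": "compound",
    "ex:chembl/1": "compound",
    "ex:db/1": "compound",
    "ex:gene/BRCA1": "gene",
    "ex:protein/BRCA1": "protein",
}
LINKS = [
    ("ex:cs/1", "ex:chembl/1"),
    ("ex:chembl/1", "ex:db/1"),
    ("ex:gene/BRCA1", "ex:protein/BRCA1"),  # crosses conceptual types
]

groups = infer_groups(LINKS, TYPE_OF)
print(groups)
```

The three compound identifiers end up in one group through transitivity, while the gene and the protein stay apart because their only link crosses conceptual types.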
Initial Connectivity 
Datasets 37 
Linksets 104 
Links 7,096,712 
Justifications 7 
Inferred Connectivity 
Datasets 37 
Linksets 883 
Links 17,383,846 
Justifications 7 
BridgeDb 
Lenses: Under the hood 
GRAPH <http://rdf.chemspider.com> { 
  # Original pattern in Q: 
  #   cw:979b545d-f9a9 cheminf:logd ?logd . 
  # Expanded pattern in Q’: 
  ?iri cheminf:logd ?logd . 
  FILTER (?iri = cw:979b545d-f9a9 || 
          ?iri = cs:2157 || 
          ?iri = chembl:1280 || 
          ?iri = db:db00945 ) 
} 
GRAPH <http://… 
Query Expander Service: takes Q, L1 and returns Q’ 
Identity Mapping Service (BridgeDb), backed by Mappings and Profiles: 
takes cw:979b545d-f9a9, L1 and returns 
[cw:979b545d-f9a9, cs:2157, chembl:1280, db:db00945] 
• Can also be achieved through UNION 
• IMS call adds overhead 
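The FILTER-based expansion shown on this slide can be sketched as a small string-building step. The function name `expand_with_filter` and its parameters are hypothetical, and the list of equivalent IRIs is hard-coded where a real system would obtain it from the mapping service; only the graph name, the predicate and the four identifiers come from the slide.

```python
def expand_with_filter(graph, predicate, variable, iris):
    """Build a GRAPH block matching any of the given IRIs via FILTER."""
    conditions = " || ".join(f"?iri = {iri}" for iri in iris)
    return (
        f"GRAPH <{graph}> {{\n"
        f"  ?iri {predicate} {variable} .\n"
        f"  FILTER ({conditions})\n"
        f"}}"
    )


# Stand-in for the mapping service response for (cw:979b545d-f9a9, L1).
EQUIVALENT_IRIS = ["cw:979b545d-f9a9", "cs:2157", "chembl:1280", "db:db00945"]

filter_block = expand_with_filter(
    "http://rdf.chemspider.com", "cheminf:logd", "?logd", EQUIVALENT_IRIS
)
print(filter_block)
```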
Experiment 
Is it feasible to use a stand-off mapping service? 
 Baselines (no external call): 
   “Perfect” URIs 
   Linked data querying 
 Expansion approaches (external service call): 
   FILTER by Graph 
   UNION by Graph 
C. Y. A. Brenninkmeijer, C. A. Goble, A. J. G. Gray, P. T. Groth, A. Loizou, S. 
Pettifer: Including Co-referent URIs in a SPARQL Query. COLD 2013. 
http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf
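For comparison with the FILTER approach, the UNION-by-graph expansion can be sketched as one branch per equivalent IRI. As before, this is an illustration rather than the code evaluated in the paper: `expand_with_union` and its parameters are hypothetical names, and the IRI list is hard-coded in place of a call to the mapping service.

```python
def expand_with_union(graph, predicate, variable, iris):
    """Build a GRAPH block with one UNION branch per equivalent IRI."""
    branches = [f"{{ {iri} {predicate} {variable} . }}" for iri in iris]
    body = "\n  UNION\n  ".join(branches)
    return f"GRAPH <{graph}> {{\n  {body}\n}}"


# Stand-in for the mapping service response.
EQUIVALENT_IRIS = ["cw:979b545d-f9a9", "cs:2157", "chembl:1280", "db:db00945"]

union_block = expand_with_union(
    "http://rdf.chemspider.com", "cheminf:logd", "?logd", EQUIVALENT_IRIS
)
print(union_block)
```

Both expansions return the same bindings; they differ in how the triple store evaluates them, which is what the experiment measures.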
“Perfect” URI Baseline 
WHERE { 
  GRAPH <chemspider> { 
    cs:2157 cheminf:logp ?logp . 
  } 
  GRAPH <chembl> { 
    chembl_mol:m1280 cheminf:mw ?mw . 
  } 
} 
Linked Data Baseline 
WHERE { 
  GRAPH <chemspider> { 
    cs:2157 cheminf:logp ?logp . 
  } 
  GRAPH <chembl> { 
    ?chemblid cheminf:mw ?mw . 
  } 
  cs:2157 skos:exactMatch ?chemblid . 
} 
Queries 
Drawn from Open PHACTS API: 
1. Simple compound information (1) 
2. Compound information (1) 
3. Compound pharmacology (M) 
4. Simple target information (1) 
5. Target information (1) 
6. Target pharmacology (M) 
Experiment Data 
Data: 167,783,592 triples 
Mappings: 2,114,584 triples 
Lenses: 1 
Average execution times 
[Figure: average execution times; panel: Q6: Target Pharmacology]
Explorer Screenshot 
Conclusions 
 Scientific data is complex and messy 
 Requires flexibility in linking 
 Equivalence depends upon context 
 Lenses provide support for operational 
equivalence 
 Chemical structures support automatic 
computing of links with justification 
Acknowledgements 
Royal Society of Chemistry 
 Colin Batchelor 
 Karen Karapetyan 
 Jon Steele 
 Valery Tkachenko 
 Antony Williams 
University of Manchester 
 Christian Brenninkmeijer 
 Ian Dunlop 
 Carole Goble 
 Steve Pettifer 
 Robert Stevens 
Swiss Institute for Bioinformatics 
 Christine Chichester 
European Bioinformatics Institute 
 Mark Davies 
 Anna Gaulton 
 John Overington 
University of Vienna 
 Daniela Digles 
Maastricht University 
 Chris Evelo 
 Andra Waagmeester 
 Egon Willighagen 
VU University of Amsterdam 
 Paul Groth 
 Antonis Loizou 
Connected Discovery 
 Lee Harland 
Questions 
Alasdair J G Gray 
A.J.G.Gray@hw.ac.uk 
alasdairjggray.co.uk 
@gray_alasdair 
Open PHACTS 
pmu@openphacts.org 
openphacts.org 
@open_phacts

Scientific Lenses over Linked Data An approach to support multiple integrated views

  • 1. Scientific Lenses over Linked Data An approach to support multiple integrated views Alasdair J G Gray A.J.G.Gray@hw.ac.uk alasdairjggray.co.uk @gray_alasdair
  • 2. Open PHACTS Use Case “Let me compare MW, logP and PSA for launched inhibitors of human & mouse oxidoreductases”  Chemical Properties (Chemspider)  Launched drugs (Drugbank)  Human => Mouse (Homologene)  Protein Families (Enzyme)  Bioactivty Data (ChEMBL)  … other info (Uniprot/Entrez etc.) 16 October 2014 Scientific Lenses – A. J. G. Gray 1
  • 3. Discovery Platform Apps Method Calls Domain API Drug Discovery Platform Interactive responses Production quality integration platform 16 October 2014 Scientific Lenses – A. J. G. Gray 2
  • 4. App Ecosystem An “App Store”? Explorer Explorer2 ChemBioNavigator Target Dossier Pharmatrek Helium MOE Collector Cytophacts Utopia Garfield SciBite KNIME Mol. Data Sheets PipelinePilot scinav.it Taverna http://www.openphactsfoundation.org/apps.html 16 October 2014
  • 5. API Hits April 2013 – March 2014: 15.8m April 2014 – Sept 2014: 14m Total: 29.8 million 16 October 2014 Scientific Lenses – A. J. G. Gray 4
  • 6. Linked Data API Drug Target Pathway Disease (1.4) https://dev.openphacts.org/ 16 October 2014 Scientific Lenses – A. J. G. Gray 5
  • 7. Open PHACTS Data Source Initial Records Triples Properties ChEMBL 1,481,473 304,360,749 77 DrugBank 19,628 517,584 74 UniProt 564,246 405,473,138 82 ENZYME 6,187 73,838 2 ChEBI 40,575 1,673,863 2 GeneOntology 38,137 2,447,682 26 GOA 661,232 1,765,622,393 15 ChemSpider 1,361,568 215,193,441 23 ConceptWiki 2,828,966 4,291,131 1 WikiPathways 946 1,949,074 34 16 October 2014 Scientific Lenses – A. J. G. Gray 6
  • 8. Dataset Descriptions in the Open Pharmacological Space 14 January 2013 Being replaced by W3C HCLS community profile http://tiny.cc/hcls-datadesc-ed OPS Dataset Descriptions – A. J. G. Gray 7
  • 9. OPS Discovery Platform Linked Data API (RDF/XML, TTL, JSON) Semantic Workflow Engine VoID Nanopub Db Data Cache (Virtuoso Triple Store) Domain Specific Services Identity Resolution Service Chemistry Registration Normalisation & Q/C Identifier Management Service Indexing Core Platform “Adenosine receptor 2a” P12374 EC2.43.4 CS4532 VoID Db VoID Nanopub Db VoID Db VoID Nanopub Public Content Commercial Public Ontologies User Annotations Apps
  • 10. Multiple Identities Andy Law's Third Law “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study” http://bioinformatics.roslin.ac.uk/lawslaws/ GB:29384 P12047 X31045 16 October 2014 Scientific Lenses – A. J. G. Gray 9 Are these the same thing?
  • 11. Gleevec®: Imatinib Mesylate Imatinib Imatinib MesylateMesylate YLMAHDNUQAMNNX-UHFFFAOYSA-N ChemSpider Drugbank PubChem 16 October 2014 Scientific Lenses – A. J. G. Gray 10
  • 12. Gleevec®: Imatinib Mesylate Imatinib Are these records the same? It depends upon your task! Imatinib MesylateMesylate YLMAHDNUQAMNNX-UHFFFAOYSA-N ChemSpider Drugbank PubChem 16 October 2014 Scientific Lenses – A. J. G. Gray 11
  • 13. Genes == Proteins? BRCA1: Chromosome 17 Breast cancer type 1 susceptibility protein http://en.wikipedia.org/wiki/File:Protei n_BRCA1_PDB_1jm7.png http://en.wikipedia.org/wiki/File:BRCA1 _en.png 16 October 2014 Scientific Lenses – A. J. G. Gray 12
  • 14. Genes == Proteins? BRCA1: Chromosome 17 Breast cancer type 1 susceptibility protein http://en.wikipedia.org/wiki/File:Protei n_BRCA1_PDB_1jm7.png http://en.wikipedia.org/wiki/File:BRCA1 _en.png Are these records the same? It depends upon your task! 16 October 2014 Scientific Lenses – A. J. G. Gray 13
  • 15. Example Use Cases. “I need to perform an analysis, give me details of the active compound in Gleevec.” “Which targets are known to interact with Gleevec?”
  • 16. Structure Lens. “I need to perform an analysis, give me details of the active compound in Gleevec.” Scale: Strict (Analysing) to Relaxed (Browsing). Links: skos:exactMatch (InChI).
  • 17. Name Lens. “Which targets are known to interact with Gleevec?” Scale: Strict (Analysing) to Relaxed (Browsing). Links: skos:exactMatch (InChI), skos:closeMatch (Drug Name).
  • 18. What is a Scientific Lens? A lens defines a conceptual view over the data and specifies operational equivalence conditions. Consists of: identifier (URI); title (dct:title); description (dct:description); documentation link (dcat:landingPage); creator (pav:createdBy); timestamp (pav:createdOn); equivalence rules (bdb:linksetJustification).
  • 19. Lens Effects: Ibuprofen. Ibuprofen consists of two equally active stereoisomers. Stereoisomers are not always represented in the data. Users wish to retrieve information for any stereoisomer. Identifiers shown: CHEMBL427526, CHEMBL521, CHEMBL175.
  • 20. Default Lens. Ibuprofen consists of two equally active stereoisomers. Stereoisomers are not always represented in the data. Users wish to retrieve information for any stereoisomer.
  • 21. Stereoisomer Lens. Ibuprofen consists of two equally active stereoisomers. Stereoisomers are not always represented in the data. Users wish to retrieve information for any stereoisomer.
  • 22. Mapping Generation. ✔ ops:OPS437281 has_stereoundefined_parent [ci:CHEMINF_000456] ops:OPS380297 is_stereoisomer_of [ci:CHEMINF_000461] ops:OPS380292. Other relationships: has part; is tautomer of; uncharged counterpart; isotope; …
  • 23. Initial Connectivity. Datasets: 37. Linksets: 104. Links: 7,096,712. Justifications: 7.
  • 24. Compound Information
  • 25. Proceed with Caution!
  • 26. Co-reference Computation. Rules ensure unrestricted transitivity within a conceptual type and restrict crossing between conceptual types. Based on justifications. Provenance captured.
  • 27. Initial Connectivity. Datasets: 37. Linksets: 104. Links: 7,096,712. Justifications: 7.
  • 28. Inferred Connectivity. Datasets: 37. Linksets: 883. Links: 17,383,846. Justifications: 7.
  • 29. BridgeDb
  • 30. Lenses: Under the hood. Query fragment: GRAPH <http://rdf.chemspider.com> { cw:979b545d-f9a9 cheminf:logd ?logd . ?iri cheminf:logd ?logd . FILTER (?iri = cw:979b545d-f9a9 || ?iri = cs:2157 || ?iri = chembl:1280 || ?iri = db:db00945 ) } GRAPH <http://… [Diagram] Query Q with lens L1 enters the Query Expander Service, which returns Q’. The expander passes (cw:979b545d-f9a9, L1) to the Identity Mapping Service (BridgeDB; Mappings, Profiles) and receives [cw:979b545d-f9a9, cs:2157, chembl:1280, db:db00945]. Can also be achieved through UNION. IMS call adds overhead.
  • 31. Experiment. Is it feasible to use a stand-off mapping service? Baselines (no external call): “Perfect” URIs; linked data querying. Expansion approaches (external service call): FILTER by graph; UNION by graph. Reference: C. Y. A. Brenninkmeijer, C. A. Goble, A. J. G. Gray, P. T. Groth, A. Loizou, S. Pettifer: Including Co-referent URIs in a SPARQL Query. COLD 2013. http://ceur-ws.org/Vol-1034/BrenninkmeijerEtAl_COLD2013.pdf
  • 32. “Perfect” URI Baseline. WHERE { GRAPH <chemspider> { cs:2157 cheminf:logp ?logp . } GRAPH <chembl> { chembl_mol:m1280 cheminf:mw ?mw . } }
  • 33. Linked Data Baseline. WHERE { GRAPH <chemspider> { cs:2157 cheminf:logp ?logp . } GRAPH <chembl> { ?chemblid cheminf:mw ?mw . } cs:2157 skos:exactMatch ?chemblid . }
  • 34. Queries. Drawn from Open PHACTS API: 1. Simple compound information (1); 2. Compound information (1); 3. Compound pharmacology (M); 4. Simple target information (1); 5. Target information (1); 6. Target pharmacology (M).
  • 35. Queries. Drawn from Open PHACTS API: 1. Simple compound information (1); 2. Compound information (1); 3. Compound pharmacology (M); 4. Simple target information (1); 5. Target information (1); 6. Target pharmacology (M).
  • 36. Experiment Data. Data: 167,783,592 triples. Mappings: 2,114,584 triples. Lenses: 1.
  • 40. Explorer Screenshot
  • 41. Explorer Screenshot
  • 42. Conclusions. Scientific data is complex and messy. Requires flexibility in linking. Equivalence depends upon context. Lenses provide support for operational equivalence. Chemical structures support automatic computing of links with justification.
  • 43. Acknowledgements. Royal Society of Chemistry: Colin Batchelor, Karen Karapetyan, Jon Steele, Valery Tkachenko, Antony Williams. University of Manchester: Christian Brenninkmeijer, Ian Dunlop, Carole Goble, Steve Pettifer, Robert Stevens. Swiss Institute for Bioinformatics: Christine Chichester. European Bioinformatics Institute: Mark Davies, Anna Gaulton, John Overington. University of Vienna: Daniela Digles. Maastricht University: Chris Evelo, Andra Waagmeester, Egon Willighagen. VU University of Amsterdam: Paul Groth, Antonis Loizou. Connected Discovery: Lee Harland.
  • 44. Questions Alasdair J G Gray A.J.G.Gray@hw.ac.uk alasdairjggray.co.uk @gray_alasdair Open PHACTS pmu@openphacts.org openphacts.org @open_phacts
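
The query expansion described on slide 30 can be sketched roughly as follows. This is only an illustration, not the Open PHACTS implementation: the `MAPPINGS` table is a hypothetical in-memory stand-in for the Identity Mapping Service, and the function names `lookup_equivalents`, `expand_with_filter`, and `expand_with_union` are invented for this example. Only the URIs, the lens label L1, and the FILTER/UNION idea come from the slide.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the Identity Mapping Service (IMS):
# (URI, lens) -> co-referent URIs. The entry mirrors the example on slide 30.
MAPPINGS: Dict[Tuple[str, str], List[str]] = {
    ("cw:979b545d-f9a9", "L1"): [
        "cw:979b545d-f9a9", "cs:2157", "chembl:1280", "db:db00945",
    ],
}


def lookup_equivalents(uri: str, lens: str) -> List[str]:
    """Return the URIs considered equivalent to `uri` under `lens`.

    Falls back to the URI itself when no mapping is known.
    """
    return MAPPINGS.get((uri, lens), [uri])


def expand_with_filter(uri: str, predicate: str, obj_var: str,
                       lens: str, subject_var: str = "?iri") -> str:
    """Rewrite `uri predicate obj_var` as a variable pattern plus FILTER."""
    equivalents = lookup_equivalents(uri, lens)
    condition = " || ".join(f"{subject_var} = {e}" for e in equivalents)
    return f"{subject_var} {predicate} {obj_var} . FILTER ({condition})"


def expand_with_union(uri: str, predicate: str, obj_var: str,
                      lens: str) -> str:
    """Rewrite the same pattern as one UNION branch per equivalent URI."""
    equivalents = lookup_equivalents(uri, lens)
    branches = [f"{{ {e} {predicate} {obj_var} . }}" for e in equivalents]
    return " UNION ".join(branches)


if __name__ == "__main__":
    print(expand_with_filter("cw:979b545d-f9a9", "cheminf:logd", "?logd", "L1"))
    print(expand_with_union("cw:979b545d-f9a9", "cheminf:logd", "?logd", "L1"))
```

Changing the lens argument changes the set of URIs returned by the lookup, and therefore the expanded query, without touching the query template itself. In a real deployment the lookup would be an external service call, which is the overhead the experiment measures.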

Editor's notes

  1. One of 83 business driver questions. It took a team of 5 experienced researchers 6 hours to manually gather the answer.
  2. A platform for integrated pharmacology data, relied upon by pharma companies. Public domain, commercial, and private data sources. Provides a domain-specific API, making it easy to build multiple drug discovery applications; examples were developed in the project.
  3. Actively being used. Since launch (April 2013): 30 million hits.
  4. Linked data API: multiple response formats (JSON, RDF, XML, CSV …). 3scale deployment, extensive memcaching. Public dataset. Provenance of data returned in response.
  5. Hosted on beefy hardware; data in memory (aim).
  6. Specifies MIM checklist. Reuses terms from VoID, PAV, DCTerms, PROV predicates.
  7. Import data into cache; API calls populate SPARQL queries. Integration approach: data kept in original model; data cached in central triple store; API call translated to SPARQL query; query expressed in terms of original data; queries expanded by IMS to cover URIs of original datasets.
  8. A concept appears in multiple datasets, each with its own identifier. This talk is about supporting the multiple identities that exist. Rather than define a single approach, we want to support the use of multiple identifiers.
  9. Example drug: Gleevec, a cancer drug for leukemia. Lookup in three popular public chemical databases → different results. Chemistry is complicated, often simplified for convenience. Data is messy!
  10. Are these records the same? It depends on what you are doing with the data! Each captures a subtly different view of the world. Chemistry is complicated, often simplified for convenience. Data is messy!
  11. Do genes == proteins? Different conceptual types: gene and protein. Biological data is complicated → simplified for convenience. ---- But if you’re saying why genes=proteins you may also want to be prepared for questions of when genes!=proteins. Splice variation is a common example. In the FAS receptor (http://en.wikipedia.org/wiki/Alternative_splicing#Exon_definition:_Fas_receptor) there is one gene but it can be made into two distinct proteins, which have different biological effects, so you can obviously mix bio data that shouldn't be mixed by integrating these two functions on the same ID. [We currently don't handle this well in OPS.] And the most used example here: the ghrelin gene is transcribed into a protein which is cleaved in two to form two completely different hormones, ghrelin and obestatin, which do very different things but come from the same gene (http://en.wikipedia.org/wiki/Ghrelin#Synthesis_and_variants).
  12. Often used as a shortcut for retrieval: BRCA1 is easier to remember and type! Require the ability to equate them in the IMS.
  13. Analysis requires precise knowledge of the form of the compound across datasets. Looking up targets is a search activity; some entries are likely to be mis-entered. We use lenses to change the links between the data.
  14. Interested in physicochemical properties of Gleevec.
  15. Interested in biomedical and pharmacological properties. sameAs != sameAs: it depends on your point of view. Links relate individual data instances: source, target, predicate, reason. Links are grouped into linksets, which have a VoID header providing provenance and justification for the link.
  16. A lens enables certain relationships and disables others. Alters links between the data.
  17. Default lens matches structures. Only get data back associated with the structure entered. Really want all information about Ibuprofen. Need a different lens.
  18. Validate structure: source data is messy! Identify common problems: charge imbalance, stereochemistry. Compute physicochemical properties. Identify related properties based on structure. 17 relationship types.
  19. Can enter with IDs from any of the supported datasets.
  20. Platform extracts data from certain datasets. These need to be connected. Here there is no issue in computing transitive links, as they are all the same compound based on InChI key. Would compute the full set of links.
  21. Insulin Receptor. Issue when linking through PDB due to the way that proteins are crystallised.
  22. Can enter with IDs from any of the supported datasets.
  23. These are 1.3 figures. In 1.4: 130 raw linksets with 6,985,278 links; 40,802 computed linksets with 25,584,293 links.
  24. Implementation available. IMS takes query and expands URIs.
  25. Query with URIs. Extract URIs. Find equivalents under a certain lens (isolates lens behaviour). Expand query. Optimise based on context.
  26. Result size in brackets.
  27. Orange are actual OPS queries.
  28. Subset of the OPS data.
  29. Linked data approach performs badly with query 6 due to the query construction. Name being bound to the chemical structure returned.
  30. Focus on other queries. In general, expansion is slower than baselines. Worst case delta: 0.01842 s (under 20 ms). Human perception is 0.050 to 0.2 s (50 to 200 ms).
  31. Focus on query 6. No linked data, as it performed very poorly on this query. Size of result obliterates external call cost.
  32. Pharmacology count: 2370 → 3044.
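
The notes on lenses enabling or disabling relationships (16, 17) and on computing transitive links (20) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `ex:` identifiers, the justification labels `inchi` and `stereoisomer`, and the `LENSES` table are hypothetical, and the sketch leaves out the rule from slide 26 that restricts crossing between conceptual types.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical links: (source, target, justification).
LINKS: List[Tuple[str, str, str]] = [
    ("ex:cs-racemate", "ex:chembl-racemate", "inchi"),
    ("ex:chembl-racemate", "ex:chembl-isomerS", "stereoisomer"),
    ("ex:chembl-racemate", "ex:chembl-isomerR", "stereoisomer"),
]

# Hypothetical lenses: each lens enables a set of justifications.
LENSES: Dict[str, Set[str]] = {
    "default": {"inchi"},
    "stereoisomer": {"inchi", "stereoisomer"},
}


def equivalents(uri: str, lens: str) -> Set[str]:
    """Return all URIs reachable from `uri` via links the lens enables.

    Links are treated as symmetric and followed transitively.
    """
    enabled = LENSES[lens]
    graph: Dict[str, Set[str]] = defaultdict(set)
    for source, target, justification in LINKS:
        if justification in enabled:
            graph[source].add(target)
            graph[target].add(source)

    # Depth-first traversal to collect the connected component of `uri`.
    seen = {uri}
    stack = [uri]
    while stack:
        current = stack.pop()
        for neighbour in graph[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


if __name__ == "__main__":
    print(sorted(equivalents("ex:cs-racemate", "default")))
    print(sorted(equivalents("ex:cs-racemate", "stereoisomer")))
```

Under the hypothetical default lens only the structure-matched record is returned, while the stereoisomer lens also pulls in both isomer records, which is the behaviour the Ibuprofen slides describe.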