5. Why graph? Why not relational?
• biomedical data / healthcare data is highly connected
• => variety of data
=> unstructured
=> heterogeneous
=> not connected
=> not FAIR (findable, accessible, interoperable, reusable)
• easy to model
• extremely flexible / easily adaptable ("re-shaping the graph") vs. static SQL model
• scalable (billions of nodes and relationships on a single machine)
• easy to query (cyclic dependencies)
• Graph Data Science library + graph embeddings
6. Biological question:
Do the human genes found in T2D (type 2 diabetes) GWAS encode enzymes that act on
metabolites which, in turn, are regulated in the pig diabetes model?
The actual question (from a data point of view):
Is there a connection between A and R?
Easy scientific question
8. Back to the question
Do the human genes found in T2D (type 2 diabetes) GWAS encode enzymes that act on metabolites which, in turn, are regulated in the pig diabetes model?
[Figure: two data sources joined in one graph.
Genomics (human diabetic data): Genes, SNPs, Proteins, Enzymes, Pathways, Metabolites.
Metabolomics (pre-diabetic pig): Metabolites.
Input lists: list of SNPs, list of loci, lists of genes, proteins, enzymes, pathways and metabolites of species 1, and list of metabolites of species 2.]
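Once the separate lists are linked, "Is there a connection between A and R?" becomes a path search. The following is a minimal sketch of that idea using a breadth-first search in plain Python. All node identifiers (`SNP:s1`, `Gene:g1`, and so on) and the function names are hypothetical placeholders, not data or code from the talk.

```python
from collections import deque

# Hypothetical toy edges; the identifiers are placeholders, not real data.
EDGES = [
    ("SNP:s1", "Gene:g1"),
    ("Gene:g1", "Protein:p1"),
    ("Protein:p1", "Enzyme:e1"),
    ("Enzyme:e1", "Pathway:pw1"),
    ("Pathway:pw1", "Metabolite:m1"),
    ("Metabolite:m1", "PigMetabolite:m1"),
    ("SNP:s2", "Gene:g2"),  # an island that is not linked to the rest
]


def build_adjacency(edges):
    """Turn an edge list into an undirected adjacency map."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns one shortest path or None."""
    if start not in adjacency or goal not in adjacency:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(adjacency[node]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


ADJ = build_adjacency(EDGES)
print(shortest_path(ADJ, "SNP:s1", "PigMetabolite:m1"))
print(shortest_path(ADJ, "SNP:s2", "PigMetabolite:m1"))
```

In a graph database the same question is a path pattern in the query language, so no join logic has to be written by hand.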
15. Use case 1
Handle mapping identifiers of molecular entities
Knowledge Graph
16. Query "friends of a friend" at the gene level
Example: diabetes-relevant gene 'TCF7L2'
MATCH path = (g:Gene {sid: 'TCF7L2'})-[:MAPS|SYNONYM*0..2]-(g1:Gene)
RETURN path
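To illustrate what the `*0..2` bound in the query does, here is a rough Python sketch that collects every node reachable within two hops over `MAPS` or `SYNONYM` edges. The edge list and the identifiers `alias_1`, `xref_1`, `xref_2` and `transcript_1` are invented placeholders, and the sketch ignores node labels, which the Cypher pattern does check.

```python
# Hypothetical typed edges: (source, relationship type, target).
TYPED_EDGES = [
    ("TCF7L2", "SYNONYM", "alias_1"),
    ("alias_1", "MAPS", "xref_1"),
    ("xref_1", "MAPS", "xref_2"),
    ("TCF7L2", "CODES", "transcript_1"),  # wrong type, must be ignored
]


def neighbours_within(edges, start, rel_types, max_hops):
    """Return {node: hop distance} for nodes reachable in 0..max_hops hops,
    following only the given relationship types in either direction."""
    adjacency = {}
    for src, rel, dst in edges:
        if rel in rel_types:
            adjacency.setdefault(src, set()).add(dst)
            adjacency.setdefault(dst, set()).add(src)
    reached = {start: 0}  # hop 0 is the start node itself
    frontier = [start]
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for node in frontier:
            for nxt in adjacency.get(node, ()):
                if nxt not in reached:
                    reached[nxt] = hop
                    next_frontier.append(nxt)
        frontier = next_frontier
    return reached


print(neighbours_within(TYPED_EDGES, "TCF7L2", {"MAPS", "SYNONYM"}, 2))
```

Because the lower bound is 0, the start gene is part of the result, which is why the query also returns 'TCF7L2' itself.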
17. Use case 2
Find information that is NOW connected
Knowledge Graph
18. Query for SNPs (mutations) associated with diabetes
Output: relevant protein and its function (ontology terms)
MATCH (tr:Trait)
WHERE tr.name CONTAINS 'diabetes mellitus'
WITH tr AS disease
MATCH path = (disease)<-[:ASSOCIATED_WITH_TRAIT]-(asso:Association)<-[:SNP_HAS_ASSOCIATION]-(snp:SNP)
  -[:SNP_HAS_GENE]-(gene:Gene)-[:MAPS]-(g1:Gene)-[x:CODES]->(transcript:Transcript)
  -[:CODES]->(prot:Protein)-[:ASSOCIATION]->(term:Term)--(o:Ontology)
RETURN path
19. Use case 3
Transform text into knowledge
Annotate and enrich text information
Natural Language Processing
Ontologies
Knowledge Graph
20. Angiotensin-converting enzyme 2 (ACE2) as a SARS-CoV-2 receptor:
molecular mechanisms and potential therapeutic target. SARS-CoV-2
has been sequenced [3]. A phylogenetic analysis [3, 4] found a bat
origin for the SARS-CoV-2. There is a diversity of possible intermediate
hosts for SARS-CoV-2, including pangolins, but not mice and rats [5].
There are many similarities of SARS-CoV-2 with the original SARS-CoV.
Using computer modeling, Xu et al. [6] found that the spike proteins of
SARS-CoV-2 and SARS-CoV have almost identical 3-D structures in the
receptor-binding domain that maintains van der Waals forces. SARS-CoV
spike protein has a strong binding affinity to human ACE2, based on
biochemical interaction studies and crystal structure analysis [7].
SARS-CoV-2 and SARS-CoV spike proteins share 76.5% identity in amino acid
sequences
1 of 30 million scientific abstracts
21. NLP: transform text into knowledge
Re-integrate Named Entities into the graph
Angiotensin-converting enzyme 2 GENE_OR_GENOME ( ACE2
GENE_OR_GENOME ) as a SARS-CoV-2 CORONAVIRUS receptor:
molecular mechanisms and potential therapeutic target. SARS-CoV-2
CORONAVIRUS has been sequenced [3 CARDINAL]. A phylogenetic
analysis [3 CARDINAL, 4 CARDINAL] found a bat WILDLIFE origin for
the SARS-CoV-2 CORONAVIRUS. There is a diversity of possible
intermediate hosts for SARS-CoV-2 CORONAVIRUS, including pangolins
WILDLIFE, but not mice EUKARYOTE and rats EUKARYOTE [5
CARDINAL].
There are many similarities of SARS-CoV-2 CORONAVIRUS with the
original SARS-CoV CORONAVIRUS. Using computer modeling, Xu et al.
[6 CARDINAL] found that the spike proteins GENE_OR_GENOME of
SARS-CoV-2 CORONAVIRUS and SARS-CoV CORONAVIRUS have almost
identical 3-D structures in the receptor-binding domain that maintains
van der Waals forces PHYSICAL_SCIENCE. SARS-CoV CORONAVIRUS
spike protein has a strong binding affinity to human ACE2
GENE_OR_GENOME, based on biochemical interaction studies and
crystal structure analysis [7 CARDINAL]. SARS-CoV-2 CORONAVIRUS and
SARS-CoV spike proteins GENE_OR_GENOME share 76.5% identity in
amino acid sequences
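A minimal sketch of how tagged entities like the ones above could be turned back into graph edges. The entity labels are taken from the annotated text; the relationship name `MENTIONS`, the abstract id `abstract_1`, the function name and the decision to skip `CARDINAL` tags are assumptions made for illustration, not the pipeline used in the talk.

```python
from collections import Counter

# (mention text, entity label) pairs as an NER step might emit them
# for one abstract. The labels follow the annotated example above.
MENTIONS = [
    ("ACE2", "GENE_OR_GENOME"),
    ("SARS-CoV-2", "CORONAVIRUS"),
    ("SARS-CoV-2", "CORONAVIRUS"),
    ("bat", "WILDLIFE"),
    ("3", "CARDINAL"),
]


def mentions_to_edges(abstract_id, mentions, skip_labels=("CARDINAL",)):
    """Collapse repeated mentions into one edge per entity with a count.

    The relationship name MENTIONS is a hypothetical choice.
    """
    counts = Counter(
        (text, label) for text, label in mentions if label not in skip_labels
    )
    return [
        {
            "abstract": abstract_id,
            "rel": "MENTIONS",
            "entity": text,
            "label": label,
            "count": n,
        }
        for (text, label), n in sorted(counts.items())
    ]


for edge in mentions_to_edges("abstract_1", MENTIONS):
    print(edge)
```

Each resulting record can then be merged into the graph as a relationship between the abstract node and the entity node.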
22. Use case 4
Using graph algorithms to infer new insights
Natural Language Processing
Ontologies
Knowledge Graph
23. GDS - PageRank - find the most relevant gene
finding ACE2 - the receptor the SARS-CoV-2 virus uses to enter the cell
• 140,000 abstracts from COVID-19-related publications
• NER of gene names
• PageRank identified 'ACE2' as the most relevant gene
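To show the idea behind the ranking, here is a textbook PageRank power iteration in plain Python. It is a sketch only and not the Graph Data Science library implementation; the toy edges, the placeholder node names `abs1` to `abs3` and `gene_x`, and the default damping factor of 0.85 are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {
            n: (1.0 - damping) / count + damping * dangling / count
            for n in nodes
        }
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical toy graph: abstracts point to the genes they mention.
TOY_EDGES = [
    ("abs1", "ACE2"),
    ("abs2", "ACE2"),
    ("abs3", "ACE2"),
    ("abs3", "gene_x"),
]

RANKS = pagerank(TOY_EDGES)
print(max(RANKS, key=RANKS.get))
```

A gene that is mentioned by many abstracts collects rank from all of them, which is the intuition behind 'ACE2' coming out on top.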
24. Use case 5
Using node embeddings to subphenotype diabetic patients
Natural
29. k-nearest neighbour clustering with k=5
representing the 5 diabetes subtypes
[Figure: patients 01 to 05 before and after applying graph algorithms; subphenotyping of diabetic patients]
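A small sketch of the nearest-neighbour step on node embeddings, using cosine similarity in plain Python. The two-dimensional vectors are invented placeholders, not patient data, and the function names are hypothetical. Note that in k-nearest neighbours, k is the number of neighbours per patient; the slide does not show how the five subtypes are derived from the neighbour graph, so that step is left out here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def k_nearest(embeddings, k):
    """For every node, list the k most similar other nodes."""
    result = {}
    for name, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other_name)
            for other_name, other_vec in embeddings.items()
            if other_name != name
        ]
        # Highest similarity first; the name breaks exact ties.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[name] = [other_name for _, other_name in scored[:k]]
    return result


# Hypothetical embeddings; real node embeddings have many more dimensions.
EMBEDDINGS = {
    "patient 01": [1.0, 0.1],
    "patient 02": [0.9, 0.2],
    "patient 03": [0.1, 1.0],
    "patient 04": [0.2, 0.9],
    "patient 05": [0.7, 0.7],
}

print(k_nearest(EMBEDDINGS, 1))
```

Patients whose embeddings point in a similar direction end up as neighbours, and groups of mutually close patients are candidates for a shared subphenotype.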
30. DZDconnect
connect patient data with knowledge graph
[Figure: graph schema with node types Transcript, Gene, Synonyms, Abstract, PubMed Article, Keyword, MeSH term, Ontology term]
33. Take home message
• Knowledge graph
  • as single point of truth
  • connect in-house data
  • scalability
  • infer new insights
• Use cases:
  • simple and advanced (Cypher) queries
  • Graph Data Science library (PageRank, kNN)
  • Node embeddings for complex data
  • NLP