1. Day 2 of Computing on the
shoulders of giants:
how existing knowledge is represented
and applied in bioinformatics
Benjamin Good
bgood@scripps.edu
Assistant Professor, Department of
Molecular and Experimental Medicine
2. Recap from Day 1
• Make things (articles, genes,
antibodies, etc.) easier to find
• Answer questions
• Generate hypotheses
Controlled vocabularies (MeSH)
Ontologies (Gene Ontology)
knowledge graphs on the Web:
the SPARQL query language
knowledge plus computation =
inference, the ABC model
3. Computing with knowledge
• Challenges with knowledge graphs
• Too much data
• ->> query, sort, visualize, interact
• Not enough data
• ->> mine for more
• Goal for practical day: Go beyond PubMed!
• gain hands-on experience using a knowledge graph
• either with tools built for the purpose or with your own code…
4. Assignment: knowledge graph to hypothesis
• Option 1 Coding
• Implement and apply an ABC Model style hypothesis generating program (can adapt
from example provided)
• explain its logic, explain how you used it to generate a hypothesis, explain the
hypothesis (provide a visual)
• Option 2 Non-coding
• Use a knowledge discovery application(s) (list provided) to define a new hypothesis
• if you can’t think of where to start, try to explain why Metformin may contribute to
cancer survival
• Assignment deliverables: a document containing
• the inputs you gave to your program or the online tool(s) you used
• what was generated in response and the underlying logic
• an image and text describing the results, especially any hypothesis you could derive
• (for Option 1 also submit any code written or files generated as a tar or zip archive)
5. Online tools for knowledge discovery
• http://knowledge.bio (* we make this one…)
• http://www.biograph.be (this is a good tool, but often breaks down)
• http://epiphanet.uth.tmc.edu (also on the flaky side, but can be
good)
• https://skr3.nlm.nih.gov/SemMed/ (works okay, requires a (free)
account)
• http://arrowsmith.psych.uic.edu (ugly interface, but good tool)
8. Example question: repurposing all drugs
http://tinyurl.com/hwm9388
[Diagram: ?drug interacts with protein; protein encoded by gene; gene has genetic association with ?disease; inferred link: ?drug treats ?disease?]
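The pattern in the diagram can be written as a Wikidata SPARQL query. The sketch below is an assumption about how such a query might look, not a copy of the query behind the short link. The property IDs are Wikidata properties (P129 physically interacts with, P702 encoded by, P2293 genetic association, P2175 medical condition treated); the function names build_repurposing_query and run_query and the agent string are hypothetical.

```python
ENDPOINT = "https://query.wikidata.org/sparql"


def build_repurposing_query(limit=100):
    """Return a SPARQL query for drug -> protein -> gene -> disease paths
    where the drug is not already recorded as treating the disease."""
    return f"""
SELECT ?drug ?drugLabel ?gene ?geneLabel ?disease ?diseaseLabel WHERE {{
  ?drug wdt:P129 ?protein .      # drug physically interacts with protein
  ?protein wdt:P702 ?gene .      # protein is encoded by gene
  ?gene wdt:P2293 ?disease .     # gene has a genetic association with disease
  FILTER NOT EXISTS {{ ?drug wdt:P2175 ?disease . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {int(limit)}
"""


def run_query(query, endpoint=ENDPOINT):
    """Send the query with SPARQLWrapper and return the JSON result bindings.

    Needs network access and the SPARQLWrapper package, so the import is
    done here rather than at module level.
    """
    from SPARQLWrapper import SPARQLWrapper, JSON

    client = SPARQLWrapper(endpoint, agent="abc-model-course-example")
    client.setQuery(query)
    client.setReturnFormat(JSON)
    return client.query().convert()["results"]["bindings"]


query = build_repurposing_query(limit=50)
```

Calling run_query(query) returns one dictionary per result row. The FILTER NOT EXISTS clause is a design choice of this sketch: it keeps only drug-disease pairs that are not already stated as treatments, which is what makes each row a candidate hypothesis.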
9. Example program (feel free to follow or adapt
to your interest)
• Example
• Input = a disease (A)
• Output = a ranked list of drugs (C) that might be used for treatment
• Render the results of your workflow as a cytoscape network that illustrates the
reasoning behind the predictions
• Implementation
• Python
• Use a SPARQL endpoint such as http://query.wikidata.org
• + identify and use another endpoint (e.g. EBI, UniProt)
• ++ access pubmed articles and MeSH indexing
10. Python setup
• pip install rdflib SPARQLWrapper pandas …
• Jupyter is hopefully already installed; if not, install it:
http://jupyter.readthedocs.io/en/latest/install.html
• get notebook from
https://github.com/SuLab/sparql_to_pandas/blob/master/SPARQL_pandas.ipynb
• go to directory where you put the notebook
• run it with
• >jupyter notebook
• should be ready to run
11. the notebook
• will run a basic search for disease-gene-drug connections in wikidata
• will sort the results by the number of intervening genes
• will export the data to a tab-delimited file you can view in Excel or a text
editor, or load into Cytoscape
• Your job:
• Run it and extend it by one or more of:
• adapting the query
• changing the way the results are sorted
• working with the output in cytoscape to produce an informative visualization
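The sort-and-export step described above can also be sketched outside the notebook. This is a minimal standard-library version under stated assumptions, not the notebook's own code: the row layout (disease, gene, drug), the function names rank_by_shared_genes and write_tsv, and the example rows are all hypothetical placeholders.

```python
import csv
from collections import defaultdict


def rank_by_shared_genes(rows):
    """rows: iterable of (disease, gene, drug) tuples.

    Returns a list of (disease, drug, n_genes) sorted by the number of
    distinct intervening genes, largest first.
    """
    genes = defaultdict(set)
    for disease, gene, drug in rows:
        genes[(disease, drug)].add(gene)  # a set ignores duplicate rows
    ranked = [(d, c, len(g)) for (d, c), g in genes.items()]
    ranked.sort(key=lambda r: (-r[2], r[0], r[1]))
    return ranked


def write_tsv(ranked, path):
    """Write the ranking as a tab-delimited file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["disease", "drug", "n_genes"])
        writer.writerows(ranked)


# Made-up rows standing in for query results.
rows = [
    ("diseaseX", "gene1", "drugA"),
    ("diseaseX", "gene2", "drugA"),
    ("diseaseX", "gene1", "drugB"),
    ("diseaseX", "gene1", "drugA"),  # duplicate row, counted once
]
ranked = rank_by_shared_genes(rows)
```

Changing the sort key in rank_by_shared_genes is one way to change how the results are sorted; write_tsv(ranked, "ranked.tsv") produces a file that can be opened in a spreadsheet or imported into Cytoscape as a table.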
13. Other queries from Day 1 (slides 48-54)
• Drugs that target a cancer and impact a specific biological process
• http://tinyurl.com/j222k6g
• Drugs that could target a new disease that is linked, via a biological pathway
with shared genes, to the disease the drug is currently used to treat
• http://tinyurl.com/gpfr9kj
14. Possible inputs for adaptations
• Browse and examine wikidata.org to see what you might make use of
• e.g.
• Type of physical interaction between gene and drug
• Gene ontology annotation (what evidence codes?)
• Disease ontology hierarchy
• Drug characteristics
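As one possible adaptation, the disease hierarchy can be used to widen a query from a single disease to all of its subtypes. The sketch below assumes the Wikidata properties P279 (subclass of), P2293 (genetic association), P702 (encoded by) and P129 (physically interacts with); the function name build_disease_subtree_query is hypothetical, and the item ID used in the example is assumed to be the Wikidata item for cancer and should be checked on wikidata.org.

```python
import re


def build_disease_subtree_query(disease_qid, limit=100):
    """Return a SPARQL query for drugs linked, through a protein and gene,
    to the given disease or any of its subclasses.

    disease_qid: a Wikidata item ID such as "Q12078".
    """
    if not re.fullmatch(r"Q[1-9]\d*", disease_qid):
        raise ValueError(f"not a Wikidata item ID: {disease_qid!r}")
    return f"""
SELECT ?drug ?drugLabel ?gene ?geneLabel ?disease ?diseaseLabel WHERE {{
  ?disease wdt:P279* wd:{disease_qid} .   # the disease itself or any subclass
  ?gene wdt:P2293 ?disease .              # gene genetically associated with it
  ?protein wdt:P702 ?gene .               # protein encoded by that gene
  ?drug wdt:P129 ?protein .               # drug physically interacts with protein
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {int(limit)}
"""


# Assumed to be the item for cancer; verify before use.
query = build_disease_subtree_query("Q12078", limit=200)
```

The `*` after wdt:P279 is a SPARQL property path meaning "zero or more steps", so the starting disease is included along with everything below it in the hierarchy.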
15. Other possible knowledge sources
• SPARQL
• UniProt http://sparql.uniprot.org
• EBI SPARQL https://www.ebi.ac.uk/rdf/documentation/sparql-endpoints
• look for unique identifiers on genes and proteins that you can use to link
wikidata content to their content
• Text
• use the NCBI E-utils API to programmatically access PubMed articles and
MeSH indexing http://www.ncbi.nlm.nih.gov/books/NBK25501/
• Can use to build co-occurrence networks of e.g. MeSH terms
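A co-occurrence network of MeSH terms can be sketched in a few lines. The example below assumes the E-utils efetch endpoint with db=pubmed and retmode=xml, and that MeSH descriptors appear in DescriptorName elements inside each PubmedArticle; the function names and the sample XML (with placeholder terms) are hypothetical and heavily trimmed compared with a real response.

```python
import itertools
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(pmids):
    """Build a URL that asks efetch for PubMed records as XML."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def mesh_terms_per_article(xml_text):
    """Return one sorted list of MeSH descriptor names per article."""
    root = ET.fromstring(xml_text)
    articles = []
    for article in root.iter("PubmedArticle"):
        names = {d.text for d in article.iter("DescriptorName") if d.text}
        articles.append(sorted(names))
    return articles


def cooccurrence_counts(articles):
    """Count how many articles each unordered pair of terms shares."""
    counts = Counter()
    for terms in articles:
        counts.update(itertools.combinations(terms, 2))
    return counts


# Made-up, trimmed sample standing in for an efetch response.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><MeshHeadingList>
    <MeshHeading><DescriptorName>Term A</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Term B</DescriptorName></MeshHeading>
  </MeshHeadingList></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><MeshHeadingList>
    <MeshHeading><DescriptorName>Term A</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Term B</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Term C</DescriptorName></MeshHeading>
  </MeshHeadingList></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

edges = cooccurrence_counts(mesh_terms_per_article(SAMPLE))
```

Each key in edges is a pair of terms and each value is the number of articles indexed with both, which can be used directly as an edge weight in a network.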
17. ABC ranking algorithms
• Out of all C, which are most strongly
related to A?
• Rank by N shared B concepts
• c2: 4
• c4: 3
• c1: 1
• c3: 1
• c5: 1
• c6: 1
• Next level: adjust to down-weight highly
connected nodes
[Diagram: network with three columns, A (start concept), B (linking concepts), and C (candidate concepts c1 to c6)]
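Both ranking steps can be sketched on a toy graph. The edges below are hypothetical: they were chosen only so that the shared-B counts match the numbers listed on this slide, and are not taken from the figure. The down-weighting rule, 1 / log2(1 + degree of B), is one possible choice among many and is an assumption of this sketch.

```python
import math
from collections import defaultdict

# Hypothetical toy graph: A links to four B concepts, each B links to some C.
A_TO_B = {"b1", "b2", "b3", "b4"}
B_TO_C = {
    "b1": {"c1", "c2"},
    "b2": {"c2", "c3", "c4"},
    "b3": {"c2", "c4", "c5"},
    "b4": {"c2", "c4", "c6"},
}


def rank_by_shared_b(a_to_b, b_to_c):
    """Rank each C by the number of B concepts it shares with A."""
    counts = defaultdict(int)
    for b in a_to_b:
        for c in b_to_c.get(b, ()):
            counts[c] += 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def rank_down_weighted(a_to_b, b_to_c):
    """Like rank_by_shared_b, but each B contributes less the more C
    concepts it connects to (1 / log2(1 + number of C neighbours))."""
    scores = defaultdict(float)
    for b in a_to_b:
        targets = b_to_c.get(b, ())
        if not targets:
            continue
        weight = 1.0 / math.log2(1 + len(targets))
        for c in targets:
            scores[c] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


simple = rank_by_shared_b(A_TO_B, B_TO_C)
weighted = rank_down_weighted(A_TO_B, B_TO_C)
```

In this toy graph the plain count leaves c1, c3, c5 and c6 tied at 1, while the down-weighted score separates c1 from the others because its only linking concept, b1, connects to fewer C concepts than b2, b3 or b4.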
18. ABC ranking algorithms – advanced (require
large networks to be useful)
• Average Minimum Weight (AMW) (Wren)
• http://bioinformatics.oxfordjournals.org/content/20/3/389.full.pdf
• Linking Term Count with Average Minimum Weight (LTC-AMW)
(Yetisgen-Yildiz and Pratt)
• https://www.researchgate.net/publication/23759128_A_new_evaluation_methodology_for_literature-based_discovery_systems
• Predicate inter-dependence (Rastegar-Mojarad)
• https://s3.amazonaws.com/uploads.hipchat.com/25885/154162/UaGvvQqbrhPBAWN/A%20new%20method.pdf
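A rough sketch of the first two measures, as they are commonly summarized: AMW averages, over all B concepts linking A and C, the smaller of the A-B and B-C edge weights, and LTC-AMW ranks by linking term count and breaks ties with AMW. This reading should be checked against the cited papers before relying on it. The edge weights below are made-up numbers; how the papers derive weights from the literature is not reproduced here, and the function names are hypothetical.

```python
def amw(a_weights, c_weights):
    """Average Minimum Weight for one candidate C.

    a_weights: {b: weight of the A-B edge}
    c_weights: {b: weight of the B-C edge}
    """
    shared = a_weights.keys() & c_weights.keys()
    if not shared:
        return 0.0
    return sum(min(a_weights[b], c_weights[b]) for b in shared) / len(shared)


def rank_ltc_amw(a_weights, candidates):
    """Rank candidates by linking term count, then by AMW.

    candidates: {c: {b: weight of the B-C edge}}
    Returns a list of (c, linking_term_count, amw) tuples, best first.
    """
    scored = []
    for c, c_weights in candidates.items():
        ltc = len(a_weights.keys() & c_weights.keys())
        scored.append((c, ltc, amw(a_weights, c_weights)))
    scored.sort(key=lambda s: (-s[1], -s[2], s[0]))
    return scored


# Made-up weights for illustration only.
a_weights = {"b1": 0.9, "b2": 0.4, "b3": 0.7}
candidates = {
    "c1": {"b1": 0.5, "b2": 0.8},
    "c2": {"b1": 0.2, "b3": 0.3},
    "c3": {"b3": 1.0},
}
ranked = rank_ltc_amw(a_weights, candidates)
```

With these numbers c3 has the highest AMW but only one linking term, so LTC-AMW places it below c1 and c2, which illustrates how the two measures can disagree.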
Editor's notes
This picture is derived from Greek mythology: the blind giant Orion carried his servant Cedalion on his shoulders to act as the giant's eyes.