Describes the tribulations of building a large biomedical knowledge graph. Provides a comparison between the UMLS and Wikidata in terms of content and structure. Concludes with the idea of anchoring the knowledge graph in Wikidata items and properties.
2. Unstructured data
PubMed
Clinical Trials
Etc.
NLP tools
SemRep
DeepDive
Implicitome
etc.
Knowledge Graph
SemmedDB
Literome
etc.
Applications
Semantic MEDLINE
BioGraph
etc.
Microtasks
Mark2Cure
AMT
Structured data
Gene Ontology etc.
http://tinyurl.com/jbmn8mz
The Knowledge Garden Idea.
Circa Jan. 2015.
3. The devil is in the details…
Unstructured data
PubMed
Clinical Trials
Etc.
NLP tools
SemRep
DeepDive
Implicitome
etc.
Knowledge Graph
SemmedDB
Literome
etc.
Application
Semantic MEDLINE
BioGraph
etc.
Microtasks
Mark2Cure
AMT
Structured data
Gene Ontology etc.
6. knowledge.bio – Data challenges
• V1 – V2.5
• All content from SemmedDB or Implicitome
• custom schema to support these.
• V3 key requirement:
• allow import of content from many other sources,
Gene Ontology, DeepDive output, User-generated…
7. This part is important…
Not nailing it down makes everything else harder
Knowledge Garden
content managed as:
csv files
json documents
mysql databases
PostgreSQL databases
neo4j databases
None of which had any
coherent plan or
structure
8. Requirements for a knowledge graph
• Syntax:
• How to refer to nodes and edges
• identifiers
• schema (structure of graph)
• Semantics:
• What things mean
• How you decide on the ‘?’:
• node1 ‘?’ node2
• are they the same (to you?)
• if not, what is the edge? Mind the Gap…
(one node in “Amino Acid” namespace,
the other in “Biologically Active Substance” namespace)
9. Options at kb3 scale
(millions of concepts and relations)
• The Unified Medical Language System (UMLS)
• The Semantic Web
• Wikidata ?
10. The UMLS
(CUIs, Atoms, Types)
C0026106
HP:0001256
Mild mental retardation,
Mild and nonprogressive
mental retardation
SNOMEDCT_US:86765009
Moron (mental age 8-12 years)
MEDCIN:35101
Mild intellectual disabilities
OMIM:MTHU035844
Intellectual disability, mild
Atoms
CUI
equivalent to
https://uts.nlm.nih.gov
C0233630
SNOMEDCT_US:32386009
Logical Thinking
Mental or Behavioral Dysfunction
Disease or Syndrome
isa
isa
Types
Behavior
Activity
affects
isa
Event
isa
isa
affects ?
Types organized into a
“Semantic Network”
~ 133 types, 54 predicates
13 high level ‘groups’
CUI
11. The UMLS in 2016
• 3,200,922 CUIs
• 211 source vocabularies (e.g. MeSH, SNOMED, RxNorm, etc.)
• 12,287,973 total terms (“ATOMS”)
• Every edge in the system is a manual product of NLM
• every Atom->CUI
• every CUI->Type
• every Type->Type
12. The Semantic Web
• Concepts uniquely identified by
resolvable URIs
• Meaning (e.g. equivalency)
encoded in OWL axioms
• Concepts and mappings
created and maintained by
anyone who can host them
• No other structure
• No governance
13. UMLS versus Semantic Web
• UMLS
• PROs: covers large portion of biomedical concept space, manually curated,
we are already using it by default, the semantic types are handy
• CONs: does not exist on the semantic web - no stable URI to associate with a
CUI, license is obscure and apparently limiting, weak representation of
molecular biology domain, no control over its extension (e.g. no Human
Disease Ontology)
• Semantic Web
• PROs: universal, open, infrastructure is the Web itself
• CONs: need for organization, curation, mapping
14. Not thrilled with my options
https://commons.wikimedia.org/wiki/File:A_frustrated_and_depressed_man_holds_his_head_in_his_hand.jpg
15. Meanwhile...
• Genes and proteins for human, mouse,
rat, yeast, macaque, and 120+
microbes
• Gene Ontology terms
• Human Disease Ontology terms
• 120,000+ chemicals
• Cancer genome variants
• Other people adding and using
data!!!
17. Wikidata
(QIDs, ids, Types)
Q183560
HP:0001256
Mild mental retardation,
Mild and nonprogressive
mental retardation
SNOMEDCT_US:86765009
Moron (mental age 8-12 years)
MEDCIN:35101
Mild intellectual disabilities
OMIM:MTHU035844
Intellectual disability, mild
QID
external id
https://www.wikidata.org/wiki/Q412194
Q412194
PubChem: 2477
buspirone
Specific Developmental Disorder
developmental disorder of mental health
subclass of
subclass of
treated by
Poly-Ontology
Drug
QID
Chemical
isa
mental disorder
disorder
subclass of
subclass of
(DO)
ids
18. ACTIVE! Knowledge Flow for Wikidata
Unstructured data
The Internet
NLP tools
StrepHit
Knowledge Graph Applications
Wikipedia
Wikigenomes
Wikidata.org
Microtasks
Wikidata game
MixnMatch
Structured data
Gene Ontology etc.
19. Wikidata is a Functioning and Flourishing
Knowledge Garden
20. Wikidata
• ~27,000,000 concepts identified by Qids like ‘Q183560’
• ~1350 source vocabularies (e.g. MeSH, RxNorm, IMDb, etc.)
• (Based on properties tagged with type ‘ExternalId’)
• ? total terms integrated = labels + aliases (a lot)
• Mappings to Qids product of the unwashed masses
• Constantly updated
21. What concept scheme do we use ?
• Wikidata
• PROs: universal, open, infrastructure,
active community, largely curated content
• CONs: limited biomedical content so far
?
22. Challenge: Relevant Scientific Applications
NLP tools
SemRep
Literome
Implicitome
PubTator
DeepDive
Snorkel
ContentMine
TEES
….
Knowledge Graph
Applications
Wikigenomes
HetioNet
Knowledge.Bio
…
Structured data
Gene Expression etc.
…
A. Advancing science is
the goal and this is
how we can help
B. We need experts to
help refine and build
the knowledge graph
and apps are the bait
23. On the plane
Oct. 11, 2016…
“Screw it, let's go all in”
I got really excited…
https://www.flickr.com/photos/alexnormand/5992512756
https://www.flickr.com/photos/k6lcs/15374887957
24. knowledge.bio 3.0
• All nodes to be concepts from Wikidata
• All predicates to be properties from Wikidata
• All edges to be linked to references that could be ‘stated in’ Wikidata
• Edges (‘claims’) can come from any source
• Now
• We have one consistent format for data import
• We have a consistent pattern for gathering more data about a concept
• We have access to 27 million concepts and growing (and we can add more)
• We have the beginnings of a new tool for expert-sourcing curation of Wikidata content
• Our code is getting simpler and cleaner
25. KB3.0 – next step seeding content
• You are now basically up to date…
• Rest of talk is about mapping content from SemmedDB to the new
structure
• 3.0 release will allow users to add new nodes and edges
• If you want data in there:
1. map it to Wikidata items and properties
2. make a tab-delimited file (Qid Pid Qid referenceUrl sentence)
3. load it (or ask me to)
• Users needed!
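A minimal sketch of step 2 above, assuming the five columns come in the order given (subject Qid, property Pid, object Qid, reference URL, supporting sentence) and that the file has no header row. The names `Claim`, `validate`, `write_claims` and `read_claims` are hypothetical and not part of knowledge.bio, and the identifiers in the usage note are placeholders rather than real claims.

```python
import csv
import io
import re
from typing import Iterable, List, NamedTuple, TextIO

# Assumed identifier shapes: "Q" or "P" followed by a positive integer.
QID = re.compile(r"Q[1-9]\d*")
PID = re.compile(r"P[1-9]\d*")


class Claim(NamedTuple):
    """One edge: subject item, property, object item, plus its provenance."""
    subject: str
    predicate: str
    obj: str
    reference_url: str
    sentence: str


def validate(claim: Claim) -> Claim:
    """Raise ValueError if the row does not look like Qid Pid Qid + reference."""
    if not QID.fullmatch(claim.subject):
        raise ValueError(f"bad subject Qid: {claim.subject!r}")
    if not PID.fullmatch(claim.predicate):
        raise ValueError(f"bad Pid: {claim.predicate!r}")
    if not QID.fullmatch(claim.obj):
        raise ValueError(f"bad object Qid: {claim.obj!r}")
    if not claim.reference_url.startswith(("http://", "https://")):
        raise ValueError(f"bad reference URL: {claim.reference_url!r}")
    return claim


def write_claims(claims: Iterable[Claim], handle: TextIO) -> None:
    """Write validated claims as tab-delimited rows, one claim per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for claim in claims:
        writer.writerow(validate(claim))


def read_claims(handle: TextIO) -> List[Claim]:
    """Read tab-delimited rows back into validated Claim tuples."""
    return [validate(Claim(*row)) for row in csv.reader(handle, delimiter="\t")]
```

Usage would look like `write_claims([Claim("Q111", "P222", "Q333", "https://example.org/ref", "A placeholder sentence.")], open("claims.tsv", "w", newline=""))`. Validating on both write and read is a design choice meant to catch unmapped identifiers before anything is loaded.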
26. How many concepts in the UMLS are now
items in Wikidata?
?
27,000,000
3,000,000
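One way to start answering this question is to ask the Wikidata query service which items carry a given UMLS CUI as an external identifier. The sketch below is illustrative only: it assumes that the UMLS CUI property is P2892 and that the public SPARQL endpoint is https://query.wikidata.org/sparql, and the function names are hypothetical. It is not the code used for the talk.

```python
import json
import re
import urllib.parse
import urllib.request
from typing import Dict, Iterable, List

ENDPOINT = "https://query.wikidata.org/sparql"  # assumed public endpoint
UMLS_CUI_PROPERTY = "P2892"  # assumption: Wikidata property for UMLS CUI
CUI = re.compile(r"C\d{7}")  # assumed CUI shape, e.g. C0026106


def build_query(cuis: Iterable[str]) -> str:
    """Build a SPARQL query returning items whose UMLS CUI is in `cuis`."""
    cuis = list(cuis)
    for cui in cuis:
        if not CUI.fullmatch(cui):
            raise ValueError(f"not a CUI: {cui!r}")
    values = " ".join(f'"{cui}"' for cui in cuis)
    return (
        "SELECT ?item ?cui WHERE { "
        f"VALUES ?cui {{ {values} }} "
        f"?item wdt:{UMLS_CUI_PROPERTY} ?cui . }}"
    )


def parse_bindings(result: dict) -> Dict[str, List[str]]:
    """Turn SPARQL JSON results into {cui: [qid, ...]}."""
    mapping: Dict[str, List[str]] = {}
    for row in result["results"]["bindings"]:
        qid = row["item"]["value"].rsplit("/", 1)[-1]  # entity URI -> Qid
        mapping.setdefault(row["cui"]["value"], []).append(qid)
    return mapping


def lookup(cuis: Iterable[str]) -> Dict[str, List[str]]:
    """Send the query to the endpoint (needs network access)."""
    params = urllib.parse.urlencode({"query": build_query(cuis), "format": "json"})
    request = urllib.request.Request(
        f"{ENDPOINT}?{params}",
        headers={"User-Agent": "cui2qid-sketch/0.1 (illustrative example)"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))
```

A call such as `lookup(["C0026106"])` would return a dictionary from CUI to a list of Qids. A CUI that maps to more than one Qid shows up as a list with several entries, which makes ambiguous mappings easy to spot.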
37. Adding label matching actually doesn’t help
that much…
• Checked only 460,080 (including all 288,552 from SemmedDB)
• 21% (96,843) had an identifier match
• 6.9% (31,645) had a match on the UMLS Preferred Label
• 3.1% (14,319) matched one of the UMLS synonyms
• Removing anything that matched more than 1 Wikidata item we get
129,726 concepts.
• Limiting to concepts used in SemmedDB we get 113,623
• (43% coverage with most matches coming from identifiers)
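The matching cascade summarized above (identifier match first, then the UMLS preferred label, then synonyms, dropping anything that hits more than one Wikidata item) could be sketched roughly as follows. This is a toy illustration under stated assumptions, not the code used for the mapping: the index structures, the function names and the case-insensitive label comparison are all hypothetical, and stopping at the first stage that produces any hit is an assumption about how the stages combine.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Index = Dict[str, Set[str]]  # lookup key -> set of candidate Qids


def _hits(keys: Iterable[str], index: Index) -> Set[str]:
    """Collect every Qid reachable from any of the keys."""
    found: Set[str] = set()
    for key in keys:
        found |= index.get(key, set())
    return found


def match_concept(
    concept: dict, id_index: Index, label_index: Index
) -> Optional[Tuple[str, str]]:
    """Return (qid, stage) for an unambiguous match, else None.

    `concept` is assumed to look like
    {"ids": [...], "preferred": "...", "synonyms": [...]};
    `label_index` keys are assumed to be lower-cased labels and aliases.
    """
    stages = (
        ("identifier", concept["ids"], id_index),
        ("preferred_label", [concept["preferred"].lower()], label_index),
        ("synonym", [s.lower() for s in concept["synonyms"]], label_index),
    )
    for stage, keys, index in stages:
        found = _hits(keys, index)
        if len(found) == 1:
            return next(iter(found)), stage
        if len(found) > 1:
            return None  # ambiguous: matched more than one Wikidata item
    return None  # no match at any stage
```

Returning the stage name alongside the Qid keeps the provenance of each mapping, so per-stage counts like the ones above can be tallied afterwards.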
38. SemmedDB as Wikidata, version 1
• 15,957,582 predications with 13 relation types
• All Concepts Wikidata items
• All relation types Wikidata properties
• (Data available at http://tinyurl.com/cui2qid-1 )
• Will be accessible in kb3.0 next week or the following
39. Next steps / project opportunities
• More Wikidata bots!
• Establish a more consistent typing strategy in Wikidata (e.g. make
each item an instance of some semantic group)
• Finish the mapping of the UMLS predicates to Wikidata Properties
• Add missing properties (e.g. ‘Activates’, ‘Inhibits’)
• Use the existing ‘subproperty of’ property to build a property ontology inside Wikidata
• Populate kb3.0 with knowledge pertinent to your disease area
• Extend the user interface
• Use the underlying neo4j database to extend HetioNet and related (or
add HetioNet to it)
40. Pick an edge or node and create or improve it
Unstructured data
PubMed
Clinical Trials
Etc.
NLP tools
SemRep
DeepDive
Implicitome
etc.
Knowledge Graph
SemmedDB
Literome
etc.
Applications
Semantic MEDLINE
BioGraph
etc.
Microtasks
Mark2Cure
AMT
Structured data
Gene Ontology etc.
41. Thanks!
• Richard Bruskiewich! and Star Informatics team for persevering…
(v1,v2.1...5, v3.0)
• Gene Wiki team! Especially bot developers: Sebastian B, Andra W,
Tim P., Greg S. who planted the seeds that are making this possible.
• Su laboratory!
• I hope you can find something useful here and help grow the garden…
• Especially you HetNetters!
https://www.flickr.com/photos/alexnormand/5992512756
Editor's notes
Amino Acid, Peptide, or Protein
Biologically Active Substance
Is there one node for gene and one for the protein? Are orthologs different nodes ? What about sequence variants?
Note we could be checking the references to increase precision and provenance…
And that is largely the important take home message. Identifier mapping is hard, generally boring work that should never be repeated! Doing it in the context of Wikidata means that it can be done once and for all – and you can even describe how it was accomplished in the references and qualifiers!!! We should do this!
Noting that I was hammering the query service around 10 times/second for around 24 hours and it never complained or slowed down.