Building a Biomedical Knowledge Garden

Benjamin Good
Su Laboratory, Group Meeting
Dec. 2, 2016

Unstructured data
PubMed
Clinical Trials
Etc.
NLP tools
SemRep
DeepDive
Implicitome
etc.
Knowledge Graph
SemmedDB
Literome
etc.
Applications
Semantic MEDLINE
BioGraph
etc.
Microtasks
Mark2Cure
AMT
Structured data
Gene Ontology etc.
http://tinyurl.com/jbmn8mz
The Knowledge Garden Idea.
Circa Jan. 2015.

The devil is in the details…
Unstructured data
PubMed
Clinical Trials
Etc.
NLP tools
SemRep
DeepDive
Implicitome
etc.
Knowledge Graph
SemmedDB
Literome
etc.
Application
Semantic MEDLINE
BioGraph
etc.
Microtasks
Mark2Cure
AMT
Structured data
Gene Ontology etc.

Reality, November 2016
Knowledge Graph
SemmedDB
Application
knowledge.bio
Microtasks
Mark2Cure
AMT

knowledge.bio
Explore all biomedical knowledge as a graph with edges
connected back to supporting references
v2.5 demo

knowledge.bio – Data challenges
• V1 – V2.5
• All content from SemmedDB or Implicitome
• custom schema to support these.
• V3 key requirement:
• allow import of content from many other sources:
Gene Ontology, DeepDive output, user-generated…

This part is important…
Not nailing it down makes everything else harder
Knowledge Garden
content managed as:
CSV files
JSON documents
MySQL databases
PostgreSQL databases
Neo4j databases
None of which had any coherent plan or structure

Requirements for a knowledge graph
• Syntax:
• How to refer to nodes and edges
• identifiers
• schema (structure of graph)
• Semantics:
• What things mean
• How you decide on the ‘?’:
• node1 ‘?’ node2
• are they the same (to you?)
• if not, what is the edge? Mind the Gap…
(one node in the “Amino Acid” namespace,
the other in the “Biologically Active Substance” namespace)
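
A minimal sketch of these requirements in Python, to make the syntax/semantics split concrete. Everything here is an assumption for illustration: the names `Node`, `Edge`, `decide_edge`, and the default predicate `related_to` are hypothetical and not part of knowledge.bio. The two identifiers come from the UMLS example later in the deck.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set


@dataclass(frozen=True)
class Node:
    namespace: str  # vocabulary the identifier comes from, e.g. "HP"
    local_id: str   # identifier inside that vocabulary

    @property
    def curie(self) -> str:
        # Syntax: one agreed way to refer to a node.
        return f"{self.namespace}:{self.local_id}"


@dataclass(frozen=True)
class Edge:
    subject: Node
    predicate: str
    obj: Node


def decide_edge(node1: Node, node2: Node,
                same_pairs: Set[FrozenSet[str]],
                predicate: str = "related_to") -> Optional[Edge]:
    """Semantics: return None if the nodes count as the same concept,
    otherwise return an edge carrying the chosen predicate."""
    if node1 == node2 or frozenset((node1.curie, node2.curie)) in same_pairs:
        return None
    return Edge(node1, predicate, node2)


# Whether these two identifiers are "the same (to you)" is a recorded decision.
hp = Node("HP", "0001256")
medcin = Node("MEDCIN", "35101")
same_pairs = {frozenset((hp.curie, medcin.curie))}
```

The point of the sketch is that "are they the same?" has to be stored somewhere explicit (here, `same_pairs`) before any edge can be interpreted.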

Options at kb3 scale
(millions of concepts and relations)
• The Unified Medical Language System (UMLS)
• The Semantic Web
• Wikidata ?

The UMLS
(CUIs, Atoms, Types)
CUI: C0026106
Atoms (each equivalent to the CUI):
HP:0001256 Mild mental retardation, Mild and nonprogressive mental retardation
SNOMEDCT_US:86765009 Moron (mental age 8-12 years)
MEDCIN:35101 Mild intellectual disabilities
OMIM:MTHU035844 Intellectual disability, mild
https://uts.nlm.nih.gov
CUI: C0233630 (Logical Thinking)
Types: Mental or Behavioral Dysfunction, Disease or Syndrome, Behavior, Activity, Event
Edges: isa (CUI to Type, Type to Type), affects (Type to Type), and an open ‘affects ?’ edge
Types organized into a “Semantic Network”:
~133 types, 54 predicates, 13 high-level ‘groups’

The UMLS in 2016
• 3,200,922 CUIs
• 211 source vocabularies (e.g. MeSH, SNOMED, RxNorm)
• 12,287,973 total terms (“Atoms”)
• Every edge in the system is a manual product of NLM
• every Atom->CUI
• every CUI->Type
• every Type->Type

The Semantic Web
• Concepts uniquely identified by
resolvable URIs
• Meaning (e.g. equivalency)
encoded in OWL axioms
• Concepts and mappings
created and maintained by
anyone who can host them
• No other structure
• No governance

UMLS versus Semantic Web
• UMLS
• PROs: covers large portion of biomedical concept space, manually curated,
we are already using it by default, the semantic types are handy
• CONs: does not exist on the semantic web - no stable URI to associate with a
CUI, license is obscure and apparently limiting, weak representation of
molecular biology domain, no control over its extension (e.g. no Human
Disease Ontology)
• Semantic Web
• PROs: universal, open, infrastructure is the Web itself
• CONs: need for organization, curation, mapping

Not thrilled with my options
https://commons.wikimedia.org/wiki/File:A_frustrated_and_depressed_man_holds_his_head_in_his_hand.jpg

Meanwhile...
• Genes and proteins for human, mouse, rat, yeast,
macaque, and 120+ microbes
• Gene Ontology terms
• Human Disease Ontology terms
• 120,000+ chemicals
• Cancer genome variants
• Other people adding and using
data!!!

Wikidata
(QIDs, ids, Types)
QID: Q183560
External ids and labels:
HP:0001256 Mild mental retardation, Mild and nonprogressive mental retardation
Moron (mental age 8-12 years)
MEDCIN:35101 Mild intellectual disabilities
OMIM:MTHU035844 Intellectual disability, mild
https://www.wikidata.org/wiki/Q412194
QID: Q412194 (buspirone), external id PubChem: 2477
Types (isa): Drug, Chemical
Items linked by ‘subclass of’ (DO): Specific Developmental Disorder, developmental disorder of mental health, mental disorder, disorder
Edge between QIDs: treated by
Types form a Poly-Ontology

ACTIVE! Knowledge Flow for Wikidata
Unstructured data
The Internet
NLP tools
StrepHit
Knowledge Graph
Wikidata.org
Applications
Wikipedia
Wikigenomes
Microtasks
Wikidata game
MixnMatch
Structured data
Gene Ontology etc.

Wikidata is a Functioning and Flourishing
Knowledge Garden

Wikidata
• ~27,000,000 concepts identified by Qids like ‘Q183560’
• ~1350 source vocabularies (e.g. MeSH, RxNorm, IMDb)
• (Based on properties tagged with type ‘ExternalId’)
• ? total terms integrated = labels + aliases (a lot)
• Mappings to Qids are the product of the unwashed masses
• Constantly updated

What concept scheme do we use?
• Wikidata
• PROs: universal, open, infrastructure,
active community, largely curated content
• CONs: limited biomedical content so far
?

Challenge: Relevant Scientific Applications
NLP tools
SemRep
Literome
Implicitome
PubTator
DeepDive
Snorkel
ContentMine
TEES
….
Knowledge Graph
Applications
Wikigenomes
HetioNet
Knowledge.Bio
…
Structured data
Gene Expression etc.
…
A. Advancing science is
the goal and this is
how we can help
B. We need experts to
help refine and build
the knowledge graph
and apps are the bait

On the plane
Oct. 11, 2016…
“Screw it, let’s go all in”
I got really excited...
https://www.flickr.com/photos/alexnormand/5992512756
https://www.flickr.com/photos/k6lcs/15374887957

knowledge.bio 3.0
• All nodes to be concepts from Wikidata
• All predicates to be properties from Wikidata
• All edges to be linked to references that could be ‘stated in’ Wikidata
• Edges (‘claims’) can come from any source
• Now
• We have one consistent format for data import
• We have a consistent pattern for gathering more data about a concept
• We have access to 27 million concepts and growing (and we can add more)
• We have the beginnings of a new tool for expert-sourcing curation of Wikidata content
• Our code is getting simpler and cleaner

KB3.0 – next step: seeding content
• You are now basically up to date…
• Rest of talk is about mapping content from SemmedDB to the new
structure
• 3.0 release will allow users to add new nodes and edges
• If you want data in there:
1. map it to Wikidata items and properties
2. make a tab-delimited file (Qid Pid Qid referenceUrl sentence)
3. load it (or ask me to)
• Users needed!
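
A minimal sketch of what checking such a file could look like before loading, assuming exactly five tab-separated columns in the order given above (Qid, Pid, Qid, referenceUrl, sentence) and no header row. The names `Claim` and `read_claims` are hypothetical; this is not the actual kb3.0 loader. The sample row is made up for illustration (the Pid and reference URL are placeholders, not a curated claim).

```python
import csv
import io
import re
from typing import Iterator, NamedTuple, TextIO


class Claim(NamedTuple):
    subject_qid: str
    property_pid: str
    object_qid: str
    reference_url: str
    sentence: str


QID = re.compile(r"^Q[1-9]\d*$")  # Wikidata item id, e.g. Q183560
PID = re.compile(r"^P[1-9]\d*$")  # Wikidata property id, e.g. P1402


def read_claims(handle: TextIO) -> Iterator[Claim]:
    """Yield one Claim per non-empty line; raise ValueError on bad rows."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 5:
            raise ValueError(f"line {line_no}: expected 5 columns, got {len(row)}")
        claim = Claim(*(field.strip() for field in row))
        if not (QID.match(claim.subject_qid)
                and PID.match(claim.property_pid)
                and QID.match(claim.object_qid)):
            raise ValueError(f"line {line_no}: malformed Qid or Pid")
        yield claim


# Made-up example row; the Pid and URL are placeholders.
sample = "Q183560\tP2176\tQ412194\thttps://example.org/reference\tAn example sentence.\n"
claims = list(read_claims(io.StringIO(sample)))
```

Validating identifiers up front keeps malformed rows out of the graph, which matters once users can add their own nodes and edges.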

How many concepts in the UMLS are now
items in Wikidata?
[Venn diagram: ~27,000,000 Wikidata items and ~3,000,000 UMLS concepts; size of the overlap: ?]

Direct identifier mapping (15 shared ontologies)
CUI → Qid
UMLS_vocab  Concepts      Wikidata_property                                Prop id  Usage
NCBI        1014837       NCBI Taxonomy ID                                 P685     379589
MSH         359116        MeSH ID                                          P486     5979
ICD10PCS    178278        ICD-10-PCS                                       P1690    5
NCI         119620        NCI Thesaurus ID                                 P1748    5562
ICD10CM     98899         ICD-10                                           P494     8826
OMIM        86181         OMIM ID                                          P492     5835
FMA         82042         Foundational Model of Anatomy ID                 P1402    3378
GO          60412         Gene Ontology ID                                 P686     43693
MDR         51961         Medical Dictionary for Regulatory Activities ID  P3201    1
HGNC        39261         HGNC gene symbol                                 P353     63691
HGNC        Sometimes...  HGNC-ID                                          P354     39758
NDFRT       38206         NDF-RT ID                                        P2115    1509
ICD9CM      20993         ICD-9-CM                                         P1692    88
ICD10       11552         ICD-10                                           P494     8826
RXNORM      205998        RxNorm CUI                                       P3345    5671
Example: C0001629 (Adrenal Medulla)
1. Local MySQL query: source id FMA: 15633
2. Build SPARQL: ?qid wdt:P1402 “15633”
3. Run at query.wikidata.org: Q934888
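
The lookup on this slide can be sketched in Python as follows. This is a sketch under assumptions, not the code used for the mapping: the function names are hypothetical, the endpoint is the public Wikidata Query Service, and the User-Agent string is a placeholder. The source id is assumed to have been fetched already from the local UMLS database.

```python
import json
from typing import List
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_lookup_query(property_id: str, external_id: str) -> str:
    """Build the SPARQL pattern from the slide: ?qid wdt:<Pid> "<source id>"."""
    escaped = external_id.replace("\\", "\\\\").replace('"', '\\"')
    return f'SELECT ?qid WHERE {{ ?qid wdt:{property_id} "{escaped}" . }}'


def qids_from_response(payload: dict) -> List[str]:
    """Extract bare Qids from a SPARQL JSON result document."""
    bindings = payload["results"]["bindings"]
    return [b["qid"]["value"].rsplit("/", 1)[-1] for b in bindings]


def lookup_qids(property_id: str, external_id: str) -> List[str]:
    """Run the query against the live endpoint (network access required)."""
    params = urlencode({"query": build_lookup_query(property_id, external_id),
                        "format": "json"})
    request = Request(f"{WDQS_ENDPOINT}?{params}",
                      headers={"User-Agent": "cui2qid-sketch/0.1 (placeholder)"})
    with urlopen(request, timeout=30) as response:
        return qids_from_response(json.load(response))


# Offline illustration using the FMA example from the slide.
query = build_lookup_query("P1402", "15633")
example_payload = {"results": {"bindings": [
    {"qid": {"type": "uri", "value": "http://www.wikidata.org/entity/Q934888"}}
]}}
example_qids = qids_from_response(example_payload)
```

Returning a list rather than a single Qid keeps one-to-many matches visible, so a stricter mapping step can drop them later.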

Strict identifier mapping
CUI → Qid
UMLS_vocab  Concepts      Wikidata_property                                Prop id  Usage
NCBI        1014837       NCBI Taxonomy ID                                 P685     379589
MSH         359116        MeSH ID                                          P486     5979
ICD10PCS    178278        ICD-10-PCS                                       P1690    5
NCI         119620        NCI Thesaurus ID                                 P1748    5562
ICD10CM     98899         ICD-10                                           P494     8826
OMIM        86181         OMIM ID                                          P492     5835
FMA         82042         Foundational Model of Anatomy ID                 P1402    3378
GO          60412         Gene Ontology ID                                 P686     43693
MDR         51961         Medical Dictionary for Regulatory Activities ID  P3201    1
HGNC        39261         HGNC gene symbol                                 P353     63691
HGNC        Sometimes...  HGNC-ID                                          P354     39758
NDFRT       38206         NDF-RT ID                                        P2115    1509
ICD9CM      20993         ICD-9-CM                                         P1692    88
ICD10       11552         ICD-10                                           P494     8826 -> 8292
RXNORM      205998        RxNorm CUI                                       P3345    0 -> 5671
-> Thanks to Sebastian’s recent work.

How many concepts in the UMLS are now
items in Wikidata? (according to identifiers)
[Venn diagram: 463,059 shared concepts, about 15% of the ~3,000,000 UMLS concepts; Wikidata has ~27,000,000 items]

463,059 Wikidata items by UMLS source id

Coverage of shared identifiers by item
[Bar chart: UMLS CUIs vs. Wikidata items per shared identifier; axis cut off, NCBI Taxonomy has > 1 million]
Good targets for Wikidata bots

463,059 mapped concepts, by semantic group
[Bar chart, log scale from 1 to 1,000,000: N 1-to-1 mappings; labeled groups: NCBI Taxons, Gene Ontology, Genes, Diseases, Drugs]

Where are the Gaps?
[Bar chart: N no Map, axis from 0 to 800,000]
600,000 missing drugs
550,000 missing disorders

Where are(n’t) the Gaps?
[Bar chart: percent_mapped, axis from 0 to 0.6]

Adding label matching actually doesn’t help
that much…
• Checked only 460,080 (including all 288,552 from SemmedDB)
• 21% (96,843) had an identifier match
• 6.9% (31,645) had a match on the UMLS Preferred Label
• 3.1% (14,319) matched one of the UMLS synonyms
• Removing anything that matched more than 1 Wikidata item we get
129,726 concepts.
• Limiting to concepts used in SemmedDB we get 113,623
• (43% coverage with most matches coming from identifiers)
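
One way to read the procedure above is as a cascade: try the identifier match first, then the preferred label, then synonyms, and discard any concept whose best evidence points at more than one Wikidata item. The sketch below illustrates that reading only. The tier order, the rule that ambiguity is judged within the first tier that produces hits, and the name `map_concept` are all assumptions, not the method actually used for these numbers.

```python
from typing import Optional, Set, Tuple


def map_concept(identifier_hits: Set[str],
                preferred_label_hits: Set[str],
                synonym_hits: Set[str]) -> Tuple[Optional[str], str]:
    """Return (qid, evidence) for one UMLS concept.

    Each argument is the set of candidate Qids found by one kind of match.
    The first non-empty tier decides: one candidate is accepted, several
    candidates mean the concept is dropped as ambiguous.
    """
    tiers = (
        ("identifier", identifier_hits),
        ("preferred_label", preferred_label_hits),
        ("synonym", synonym_hits),
    )
    for evidence, hits in tiers:
        if not hits:
            continue
        if len(hits) == 1:
            return next(iter(hits)), evidence
        return None, "ambiguous"
    return None, "unmatched"


# Candidate sets are illustrative; the Qids come from earlier slides.
by_identifier = map_concept({"Q934888"}, set(), set())
by_label = map_concept(set(), {"Q412194"}, {"Q183560"})
dropped = map_concept(set(), {"Q412194", "Q183560"}, set())
```

Keeping the evidence tier alongside each Qid makes it possible to report, as on this slide, how many matches came from identifiers versus labels.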

SemmedDB as Wikidata, version 1
• 15,957,582 predications with 13 relation types
• All Concepts Wikidata items
• All relation types Wikidata properties
• (Data available at http://tinyurl.com/cui2qid-1 )
• Will be accessible in kb3.0 next week or the following

Next steps / project opportunities
• More Wikidata bots!
• Establish a more consistent typing strategy in Wikidata (e.g. make
each item an instance of some semantic group)
• Finish the mapping of the UMLS predicates to Wikidata Properties
• Add missing properties (e.g. ‘Activates’, ‘Inhibits’)
• Use the existing ‘subproperty of’ property to build a property ontology inside Wikidata
• Populate kb3.0 with knowledge pertinent to your disease area
• Extend the user interface
• Use the underlying Neo4j database to extend HetioNet and related (or
add HetioNet to it)

Pick an edge or node and create or improve it
Unstructured data
PubMed
Clinical Trials
Etc.
NLP tools
SemRep
DeepDive
Implicitome
etc.
Knowledge Graph
SemmedDB
Literome
etc.
Applications
Semantic MEDLINE
BioGraph
etc.
Microtasks
Mark2Cure
AMT
Structured data
Gene Ontology etc.

Thanks!
• Richard Bruskiewich! and Star Informatics team for persevering…
(v1,v2.1...5, v3.0)
• Gene Wiki team! Especially bot developers: Sebastian B, Andra W,
Tim P., Greg S. who planted the seeds that are making this possible.
• Su laboratory!
• I hope you can find something useful here and help grow the garden…
• Especially you HetNetters!
https://www.flickr.com/photos/alexnormand/5992512756
