The Monarch Initiative: An integrated genotype-phenotype platform for disease...
Haendel clingenetics.3.14.14
1. Expanding the Clinical
Phenotype Space with
Semantics and Model Systems
Melissa Haendel
March 14th, 2014
Updates in Clinical Genetics 2014
2. Outline
Issues in candidate prioritization
Computational techniques for comparing
phenotypes
Undiagnosed Disease Program semantic
phenotyping
Minimum phenotype requirements
Tools leveraging phenotypes
3. The Challenge: Interpretation of
Disease Candidates
?
What’s in the box?
How are
candidates
identified?
How do they
compare?
Prioritized
Candidates, Models,
functional validation
M1
M2
M3
M4
...
Phenotypes
P1
P2
P3
…
Genotype info
G1
G2
G3
G4
…
Pathogenicity, frequency,
protein interactions, gene
expression, gene
networks, epigenomics,
metabolomics, ...
4. Candidate gene prioritization
Phenotypic information
Genetic information
Gene/gene product info
Phenotypes
collected for
individual patients
Sequences from an
individual, family, or
related group
Candidate interpretation
Human sequence reference
sequences (e.g. reference
sequence, 1K genome data,
genomic location)
Community phenotype data (e.g.
literature, MODs, KOMP2, OMIM,
EHRs, GWAS, ClinVar, disease-
specific repositories, etc.)
Pathway
Functional (GO)
Gene
expression,
OMICS data
Protein-Protein
Interactions
Enrichment analysis
(e.g. GATACA, Galaxy)
Combined variant +
phenotype candidate
reporting (e.g. Exomiser)
Biomedical Knowledge
Individual's Information
Phenotypic comparison
methods
Variant calling
(e.g. GATK)
Pathogenicity/impact
calling (e.g.
VAAST, SIFT)
Orthologs
Network module analysis
8. “Expanding” the phenotypic coverage
of the human genome
[Figure: bar chart of % human coding genes (0% to 100%) with phenotype annotations, for OMIM and OMIM+GWAS; legend: Ortholog only, Human+Ortholog, Human only]
Five model organisms (mouse, zebrafish, fly, yeast, rat)
provide almost 80% phenotypic coverage of the human
genome
9. How can we take advantage of
this model organism
phenotype data?
11. Using ontologies to compare phenotypes
across species
Washington, N. L., Haendel, M. A., Mungall, C. J., Ashburner, M., Westerfield, M., & Lewis, S. E. (2009).
Linking Human Diseases to Animal Models Using Ontology-Based Phenotype Annotation. PLoS
Biol, 7(11). doi:10.1371/journal.pbio.1000247
12. What is an ontology?
A set of logically defined, inter-related terms
used to annotate data
Use of common or logically related terms across
databases enables integration
Relationships between terms allow annotations to
be grouped in scientifically meaningful ways
Reasoning software enables computation of inferred
knowledge
Groups of annotations can be compared using
semantic similarity algorithms
13. An ontology provides the logical
basis of classification
Any sense organ that functions in the
detection of smell is an olfactory sense
organ
sense organ
capable_of
some
detection of
smell
olfactory
sense
organ
14. nose
sense organ
nose
capable_of some
detection of smell
sense organ
capable_of
some
detection of
smell
olfactory
sense
organ
nose
=> These are necessary and sufficient conditions
Classifying
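The classification step on this slide can be illustrated with a minimal Python sketch. It is not an OWL reasoner: the dictionaries `IS_A`, `PROPERTIES`, and `DEFINITIONS` and the function `inferred_classes` are hypothetical names introduced only for illustration, and the sketch handles just one genus plus one required relation per defined class.

```python
# Toy illustration of classification by necessary and sufficient conditions.
# All data structures here are assumptions for illustration, not an OWL reasoner.

# Asserted parent class of each term.
IS_A = {"nose": "sense organ"}

# Asserted (relation, filler) pairs for each term.
PROPERTIES = {"nose": {("capable_of", "detection of smell")}}

# Defined classes: genus plus one required (relation, filler) pair.
# "olfactory sense organ" == "sense organ" AND capable_of some "detection of smell"
DEFINITIONS = {
    "olfactory sense organ": ("sense organ", ("capable_of", "detection of smell")),
}


def inferred_classes(term):
    """Return the defined classes whose conditions the term satisfies."""
    matches = []
    for defined_class, (genus, required) in DEFINITIONS.items():
        has_genus = IS_A.get(term) == genus
        has_property = required in PROPERTIES.get(term, set())
        if has_genus and has_property:
            matches.append(defined_class)
    return matches


print(inferred_classes("nose"))
```

Because the conditions are both necessary and sufficient, anything asserted to be a sense organ that is capable of detecting smell is inferred to be an olfactory sense organ, without anyone asserting that link by hand.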
16. Human Phenotype Ontology
Used to annotate:
• Patients
• Disorders
• Genotypes
• Genes
• Sequence variants
In human
Reduced pancreatic
beta cells
Abnormality of
pancreatic islet
cells
Abnormality of endocrine
pancreas physiology
Pancreatic islet
cell adenoma
Pancreatic islet cell
adenoma
Insulinoma
Multiple pancreatic
beta-cell adenomas
Abnormality of exocrine
pancreas physiology
Köhler et al. The Human Phenotype Ontology project: linking molecular biology and
disease through phenotype data. Nucleic Acids Res. 2014 Jan 1;42(1):D966-74.
17. Mammalian Phenotype Ontology
Smith et al. (2005). The Mammalian Phenotype Ontology as a
tool for annotating, analyzing and comparing phenotypic
information. Genome Biol, 6(1). doi:10.1186/gb-2004-6-1-r7
Used to annotate and
query:
• Genotypes
• Alleles
• Genes
In mice
abnormal
pancreatic
beta cell
mass
abnormal
pancreatic
beta cell
morphology
abnormal
pancreatic islet
morphology
abnormal
endocrine
pancreas
morphology
abnormal
pancreatic
beta cell
differentiation
abnormal
pancreatic
alpha cell
morphology
abnormal
pancreatic
alpha cell
differentiation
abnormal
pancreatic
alpha cell
number
18. Post-composed models of
phenotype annotation
Entity
Anatomy: head
Anatomy: heart
Anatomy: ventral mandibular arch
Gene Ontology: swim bladder inflation
Quality
Small size
Edematous
Thick
Arrested
19. A human phenotype example
Abnormality
of the eye
Vitreous
hemorrhage
Abnormal
eye
morphology
Abnormality of the
cardiovascular system
Abnormal
eye
physiology
Hemorrhage
of the eye
Internal
hemorrhage
Abnormality
of the globe
Abnormality of
blood circulation
20. lung
lung
lobular organ
parenchymatous
organ
solid organ
pleural sac
thoracic
cavity organ
thoracic
cavity
abnormal lung
morphology
abnormal respiratory
system morphology
Mammalian Phenotype
Mouse Anatomy
FMA
abnormal pulmonary
acinus morphology
abnormal pulmonary
alveolus morphology
lung
alveolus
organ system
respiratory
system
Lower
respiratory
tract
alveolar sac
pulmonary
acinus
organ system
respiratory
system
Human development
lung
lung bud
respiratory
primordium
pharyngeal region
Problem: Data silos
develops_from
part_of
is_a (SubClassOf)
surrounded_by
21. Solution: bridging semantics
Mungall, C. J., Torniai, C., Gkoutos, G. V., Lewis, S. E., & Haendel, M. A. (2012). Uberon, an integrative
multi-species anatomy ontology. Genome Biology, 13(1), R5. doi:10.1186/gb-2012-13-1-r5
anatomical
structure
endoderm of
foregut
lung bud
lung
respiration organ
organ
foregut
alveolus
alveolus of lung
organ part
FMA:lung
MA:lung
endoderm
GO: respiratory
gaseous exchange
MA:lung
alveolus
FMA:
pulmonary
alveolus
is_a (taxon equivalent)
develops_from
part_of
is_a (SubClassOf)
capable_of
NCBITaxon: Mammalia
EHDAA:
lung bud
only_in_taxon
pulmonary acinus
alveolar sac
lung primordium
swim bladder
respiratory
primordium
NCBITaxon:
Actinopterygii
Köhler et al. (2014) Construction and accessibility of a cross-species phenotype ontology along with
gene annotations for biomedical research F1000Research 2014, 2:30
22. Phenotype representation requires
more than “phenotype ontologies”
glucose
metabolism
(GO:0006006)
Gene/protein
function data
glucose
(CHEBI:17234)
Metabolomics,
toxicogenomics
data
Disease &
phenotype
data
type II
diabetes
mellitus
(DOID:9352)
pyruvate
(CHEBI:15361)
Disease Gene Ontology Chemical
pancreatic
beta cell
(CL:0000169)
transcriptomic
data
Cell
23. OWLsim: Phenotype similarity
across patients or organisms
Unstable
posture
Constipation
Neuronal loss in
Substantia Nigra
Shuffling gait
Resting tremors
REM disorder
Hyposmia
poor rotarod
performance
decreased gut
peristalsis
axon
degeneration
decreased
stride length
stereotypic
behavior
abnormal
EEG
failure to find
food
abnormal
coordination
abnormal
digestive
physiology
CNS neuron
degeneration
abnormal
locomotion
abnormal
motor function
sleep
disturbance
abnormal
olfaction
https://code.google.com/p/owltools/wiki/OwlSim
25. General exome analysis
Single Exome
Remove off-target and common
variants, filter on predicted
deleteriousness, candidate gene
strategies
Prioritize based on known
genes, allele frequency, and
pathogenicity
Homozygous recessive, X-linked,
de novo (if trio)
26. Undiagnosed Disease Program
exome analysis
Family exome data
Prioritize based on alignment quality, allele
frequency, predicted deleteriousness, and PubMed
Filter using SNP chip data,
Mendelian models of inheritance
and Population frequency
27. PhenIX exome analysis
Recessive, De novo filters
Remove off-target, common
variants, and variants not in known
disease causing genes
Zemojtel et al., manuscript submitted
http://compbio.charite.de/PhenIX/
28. Remove off-target and common
variants
Recessive, De novo filters
https://www.sanger.ac.uk/resources/databases/exomiser/
Robinson et al.
http://genome.cshlp.org/content/early/2013/10/2
Exomiser exome analysis
29. Current UDP analysis with
semantic phenotyping
Family Exome Data
Combined
Score
Phenotype
Data
Filter using SNP chip
data, Mendelian models of
inheritance, and population
frequency
30. Benchmarking
1092 unaffected exomes
28,516 disease-associated variants
100,000 simulated exomes
Annotate variants
Remove off-target, synonymous, and common (>1% MAF)
variants (plus optional inheritance model
filtering)
Prioritize based on combined score
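The filter-then-prioritize steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the Exomiser or PhenIX implementation: the `Variant` fields, the gene names, the scores, and in particular the combined score (a plain mean of a variant score and a phenotype score) are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    on_target: bool
    synonymous: bool
    maf: float            # minor allele frequency in a reference population
    variant_score: float  # hypothetical pathogenicity/frequency score in [0, 1]


def passes_filters(variant, max_maf=0.01):
    """Keep on-target, non-synonymous variants that are not common (>1% MAF)."""
    return variant.on_target and not variant.synonymous and variant.maf <= max_maf


def rank_genes(variants, phenotype_scores):
    """Rank genes by the best combined score among their surviving variants.

    The combined score is ASSUMED here to be the mean of the variant score
    and a per-gene phenotype similarity score; the real tools may differ.
    """
    best = {}
    for v in variants:
        if not passes_filters(v):
            continue
        combined = (v.variant_score + phenotype_scores.get(v.gene, 0.0)) / 2
        best[v.gene] = max(best.get(v.gene, 0.0), combined)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical example data.
variants = [
    Variant("GENE_A", True, False, 0.001, 0.90),
    Variant("GENE_B", True, False, 0.002, 0.95),
    Variant("GENE_C", True, False, 0.200, 0.99),  # common: filtered out
    Variant("GENE_D", True, True, 0.000, 0.50),   # synonymous: filtered out
]
phenotype_scores = {"GENE_A": 0.8, "GENE_B": 0.1}

ranked = rank_genes(variants, phenotype_scores)
for gene, score in ranked:
    print(gene, round(score, 3))
```

In this toy data GENE_B has the higher variant score, but GENE_A ranks first once the phenotype score is included, which is the effect the benchmark is designed to measure.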
32. Results
Correct gene as top scoring hit in 68.3% of exomes out
of an average of 272 post-filtering candidate genes
Improvement of between 1.8 and 5.1 fold in the
percentage of candidate genes correctly ranked in first
place compared to just using pathogenicity and
frequency data
Shows utility of structured phenotype data for
computational analysis
39. Defining minimum phenotype
standard:
1. Is the annotation specificity similar to or better than the
corpus of available phenotype data?
2. Is the number of annotations/patient similar or better?
3. How does the ontology and annotation set differ across
anatomical systems in terms of granularity? Does this
change specificity requirements for phenotypic profiles?
4. How does use of NOT annotations help further specify
the uniqueness of an undiagnosed patient?
5. How do onset, temporal ordering, and severity affect
specificity?
40. UDP phenotype annotation
metrics
UDP annotations have a similar information content (IC) and a
larger average number of annotations per disease/patient
41. Anatomical annotation distribution in
the corpus
Nervous system, skeletal system, and immune system are highest =>
these categories require greater specificity and numbers of annotations
43. Making the patient phenotype profiles
as good as can be
Total requests from UDP: 614
Number of requests assigned to HPO terms: 423
  Example: Chronic limb pain -> limb pain
Number of terms that need consideration by UDP: 145
  Example: Expressive language -> delay? Increase? Abnormal?
Number of requests that belong in other parts of the patient record: 68
  Example: Abnormal aCGH 12q21.1-12q.2 (662 kb duplication), paternal origin -> move to genotype information portion of the record
It is a community effort to contribute requests to the ontologies and
quality profiling helps make our tools work better for everyone
44. Limitations and ongoing work
Adding negation to the algorithm
Temporal ordering of phenotypes
Leveraging severity, expressivity, and
penetrance data
52. Exome Walker: Network based exploration
of phenotypically similar diseases
http://compbio.charite.de/ExomeWalker/
Walking the interactome for prioritization of candidate disease genes.
Am J Hum Genet. 2008 Apr;82(4):949-58. doi: 10.1016/j.ajhg.2008.02.013.
Bare Lymphocyte Syndrome Type 1 Protein-Interaction Network
Exploits vicinity in the protein interaction network between phenotypically related
diseases and uses this to rank exome candidates
Large boost in rankings of candidate genes using 250 disease gene-families
Prototype version online, manuscript in preparation
53. PhenoViz: Integrate all human, mouse, and
fish data to understand CNVs
Desktop application
for differential
diagnostics in CNVs
Explain manifestations of CNV diseases based on genes
contained in CNV
E.g., Supravalvular aortic stenosis in Williams syndrome can be
explained by haploinsufficiency for elastin
Double the number of explanations using model data
Doelken, Köhler, et al. (2013) Dis Model Mech 6:358-72
54. Conclusions
Cross-species phenotype data can be used to
perform semantic similarity
Structured phenotype data for rare and
undiagnosed disease patients can aid
candidate evaluation
We are experimenting with these methods for
UDP patient phenotypes to aid candidate
prioritization, identify models, explore
mechanisms, and find collaborators
55. NIH-UDP
William Bone
Murat Sincan
David Adams
Amanda Links
David Draper
Neal Boerkoel
Cyndi Tifft
Bill Gahl
OHSU
Nicole Vasilevsky
Matt Brush
Lawrence Berkeley
Nicole Washington
Suzanna Lewis
Chris Mungall
UCSD
Amarnath Gupta
Jeff Grethe
Anita Bandrowski
Maryann Martone
U of Pitt
Chuck Boromeo
Jeremy Espino
Harry Hochheiser
Acknowledgments
Sanger
Anika Oehlrich
Jules Jacobson
Damian Smedley
Toronto
Marta Girdea
Sergiu Dumitriu
Mike Brudno
JAX
Cynthia Smith
Charité
Sebastian Köhler
Sandra Doelken
Sebastian Bauer
Peter Robinson
Funding:
NIH Office of Director: 1R24OD011883
NIH-UDP: HHSN268201300036C
Editor's notes
Note: these searches don’t seem to work in OMIM anymore, they may have gotten rid of the ability to search for quoted strings.
Different terminology is used to describe clinical manifestations than is used to describe model system biological features.
The distributions of human annotations from the GWAS catalog and OMIM Morbidmap are largely disjoint and together touch only 38% of protein-coding genes. Combining human and ortholog data, nearly 80% of human protein-coding genes have phenotype annotations in at least one organism, with more than half present only in animal models. Note that human "phenotypes" here are those linked via the GWAS catalog and OMIM, so some of the numbers may be artificially low because we aren't yet mapping CNVs to their constituent genes. Note also that this does not include the ClinVar data that we recently ingested, and covers only these model organisms: mouse, zebrafish, fly, yeast, and rat. We now have a lot more phenotype data coming from other databases and organisms. These statistics will be available soon.
Also point out the functional classification axis
Things like finding models of sirenomelia due to disruption of the lateral plate mesoderm. This helps to find models and gene candidates based on developmental relationships.
Without additional knowledge and linking, computers can’t make the connections. These links take us from the molecular to the protein, to the cellular and anatomical, to the disease level of phenotypes
OWLsim computes semantic similarity between sets of phenotypes within and across species using the bridging semantics. Phenotypes in common from the bridging ontologies relate human clinical phenotypes to model organism phenotypes. Examples include motor systems, olfaction, and digestion. In this case, data encoded using the Human Phenotype Ontology has been made interoperable with mouse, zebrafish, and other model system ontologies. This also enables the use of more complex algorithms to detect similarity that are not based solely on mapping or string matching; e.g., constipation and decreased gut peristalsis are both subtypes of abnormal digestive system physiology.
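To make the contrast with string matching concrete, here is a minimal Python sketch of profile similarity over a shared hierarchy. It is not OWLsim's implementation: the `PARENTS` hierarchy is a toy assumption built from the term pairings on the slide (with a hypothetical root term), and a Jaccard index over ancestor closures stands in for whichever measure the real tool applies.

```python
# Toy hierarchy (child -> parents), assumed for illustration only.
PARENTS = {
    "constipation": ["abnormal digestive physiology"],
    "decreased gut peristalsis": ["abnormal digestive physiology"],
    "hyposmia": ["abnormal olfaction"],
    "failure to find food": ["abnormal olfaction"],
    "abnormal digestive physiology": ["phenotypic abnormality"],
    "abnormal olfaction": ["phenotypic abnormality"],
}


def ancestors(term):
    """Return the term itself plus every term reachable via parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def closure(profile):
    """Union of the ancestor sets of all terms in a phenotype profile."""
    result = set()
    for term in profile:
        result |= ancestors(term)
    return result


def jaccard_similarity(profile_a, profile_b):
    """Jaccard index of the two ancestor closures (0.0 if both are empty)."""
    a, b = closure(profile_a), closure(profile_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


patient = ["constipation", "hyposmia"]
mouse = ["decreased gut peristalsis", "failure to find food"]

print("exact term overlap:", set(patient) & set(mouse))
print("semantic similarity:", round(jaccard_similarity(patient, mouse), 3))
```

The two profiles share no term labels, so string matching finds nothing, yet they share three ancestor classes out of seven in the combined closure, giving a nonzero similarity.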
The norm in exome analysis is to run either single exomes or trio analysis. These methods generally use some combination of a quality filter, a frequency filter, a form of predicted deleteriousness, and often a candidate gene method. This is followed by some basic Mendelian filtration, and then the remaining variants are ranked by allele frequency, correlation to phenotype according to an annotation source like HGMD, and apparent pathogenicity. The result is a candidate gene list.
The procedure used at the Undiagnosed Disease Program puts more emphasis on the Mendelian inheritance models. Normally we use SNP chip data coupled with Mendelian filters for the exome data. A script, or the program VarSifter, is used to filter out all variants that do not fit a homozygous recessive, compound heterozygous, de novo dominant, or X-linked model. Then, after using the BAM files to check the quality of the variants, a final, very labor-intensive step is done in which these variants are curated and annotated by hand based on allele frequency, predicted deleteriousness, and PubMed articles. It is not uncommon to end up with no way to determine which of the last few variants is causing disease.
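A Mendelian filter of this kind can be sketched in Python for a child-mother-father trio. This is only an illustration, not the UDP script or VarSifter: genotypes are assumed to be encoded as alternate-allele counts (0, 1, or 2), the function names are hypothetical, and X-linked inheritance, genotype quality, and affected status of the parents are deliberately left out.

```python
def single_variant_models(child, mother, father):
    """Return the simple inheritance models one variant is consistent with.

    Genotypes are alternate-allele counts: 0 (hom ref), 1 (het), 2 (hom alt).
    Assumes unaffected parents; X-linked models are not handled in this sketch.
    """
    models = []
    if child == 2 and mother == 1 and father == 1:
        models.append("homozygous recessive")
    if child == 1 and mother == 0 and father == 0:
        models.append("de novo dominant")
    return models


def compound_het_genes(variants):
    """Return genes with one het variant from each parent.

    `variants` is a list of (gene, child, mother, father) tuples. A gene
    qualifies if the child is heterozygous for at least one variant carried
    only by the mother and at least one carried only by the father.
    """
    from_mother, from_father = set(), set()
    for gene, child, mother, father in variants:
        if child == 1 and mother == 1 and father == 0:
            from_mother.add(gene)
        if child == 1 and mother == 0 and father == 1:
            from_father.add(gene)
    return from_mother & from_father


# Hypothetical trio calls.
calls = [("G1", 1, 1, 0), ("G1", 1, 0, 1), ("G2", 1, 1, 0)]
print(single_variant_models(2, 1, 1))
print(single_variant_models(1, 0, 0))
print(compound_het_genes(calls))
```

Variants that match none of the models, such as a heterozygous call inherited from one heterozygous parent with no second hit in the same gene, are the ones the filter discards.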
PhenIX uses human data (HGMD, ClinVar, OMIM, Orphanet) and predicted deleteriousness.
Exomiser uses mouse phenotype data and predicted deleteriousness.
The analysis that we have been experimenting with runs the UDP standard operating procedure script on a family's exome data (homozygous recessive, compound heterozygous, de novo, and X-linked filters, plus frequency). The output of those filters is then funneled into phenotypic and variant analysis for ranking, using either mouse phenotype data via Exomiser or human data via PhenIX.
Run through pipeline: Exomiser LOT is a version of Exomiser that is less restrictive about which transcripts it recognizes (we are not as worried about off-target reads because of the Mendelian filters and the ability to look at the BAMs). Configurations: Exomiser, Exomiser LOT, phenotype only.
The goal of the following computational analysis is to specifically understand the minimum human phenotype annotation that will enable useful identification of candidate genes and additional related phenotypes for UDP patients based on the current corpus available in Monarch (covering a large set of annotated human diseases from OMIM, Decipher, and Orphanet, as well as phenotype data from mouse, zebrafish and many other species).
Shown is a survey of the human annotations currently in the Monarch system. IC is information content, and higher numbers are a graph measure of specificity. sumIC is a combined indicator of depth of annotation. For the UDP set, each patient id is considered a distinct disease.
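As a rough illustration of these two metrics, the Python sketch below computes information content as the negative log of a term's annotation frequency, and sumIC as the sum of IC over a profile. Both definitions are assumptions here (the notes do not give the exact formulas used in Monarch), and the `CORPUS` of four diseases is hypothetical toy data.

```python
import math

# Hypothetical corpus: disease -> phenotype terms, already expanded to
# include ancestor terms. Toy data for illustration only.
CORPUS = {
    "disease_1": {"phenotypic abnormality", "abnormal olfaction", "hyposmia"},
    "disease_2": {"phenotypic abnormality", "abnormal olfaction"},
    "disease_3": {"phenotypic abnormality", "abnormal locomotion"},
    "disease_4": {"phenotypic abnormality"},
}


def information_content(term):
    """IC = -log2(fraction of corpus entries annotated with the term).

    Assumed definition: rarer (more specific) terms get higher IC.
    """
    count = sum(1 for terms in CORPUS.values() if term in terms)
    if count == 0:
        raise ValueError(f"term not in corpus: {term}")
    return -math.log2(count / len(CORPUS))


def sum_ic(profile):
    """Sum of IC over a profile's terms (assumed reading of sumIC)."""
    return sum(information_content(term) for term in profile)


print(information_content("phenotypic abnormality"))
print(information_content("abnormal olfaction"))
print(information_content("hyposmia"))
print(sum_ic({"abnormal olfaction", "hyposmia"}))
```

A term used by every disease carries no information, while a term used by one disease in four carries two bits, which is why specific annotations contribute more to a patient profile than general ones.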
Each anatomical system is indicated in a color that is inversely listed in the legend to the graph (e.g. Skeletal System is at the top of the graph). Data are combined from Orphanet, OMIM, and Decipher. The graph shows that the systems with the largest proportion of annotations are the skeletal system and the nervous system. Note that the data is not disjoint - some annotations may fall into multiple categories according to the structure of the ontology.
First implementation. User guidelines written by Monarch will be implemented in PhenoTips in the next few days as a help menu.
Monarch is curating and assisting clinicians to create quality annotation profiles and the clinicians are helping to improve the ontology and therefore the corpus against which the similarity algorithms run.
Large-scale data integration of genotypes, phenotypes, and many other data. Based on NIF, it contains a large number of integrated databases (157 to date, more added every week). We are building innovative visualization tools to explore model system phenotype data in the context of other biomedical data. Widgets and services are publicly available. Why an initiative? Because it is a partnership to promote standardization and integration across model systems and clinical applications, and all are welcome.
Using the phenotypes associated with the patient, one can query all model systems to find the ones that have the most closely related sets of phenotypes. Choosing the right model for a co-clinical trial, or for further analysis, must involve an understanding of how well the model recapitulates (or not) the full spectrum of phenotypes. One would want to choose the model with the phenotype that one is most interested in understanding, assaying, or treating. This also has the benefit of providing collaborator suggestions, since the person who phenotyped each model is linked to the model. They could be asked to help perform the co-clinical trial or further phenotyping. These people are best at phenotyping the model, can inform human phenotyping, and conversely can be trained to perform additional clinical assays in the models. This visualization is under active development and will be available in PhenoTips in the next few weeks. Also, one can drill down on the right side to see more specific annotations as to how, for example, the cerebrum is abnormal.
Lewy bodies, a hallmark of this disease, seem to manifest phenotypes from only a few of the genes, resulting in cerebral abnormalities and other CNS morphological changes in the mice. Lewy bodies maps to these LCS (genes): Abnormality of the cerebrum (Snca, Slc6a3, Mfn2, Cox8a); Morphological abnormality of the central nervous system (Uchl1, Uchl3).
Taking a closer look at bradykinesia and the double-mutant mouse in Uchl1 and Uchl3 (Uchl1<gad>/Uchl1<gad>; Uchl3<tm1Tilg>/Uchl3<tm1Tilg>), we examine the mouse phenotypes in our model that are related to bradykinesia. There are three recorded phenotypes for this mouse that show some similarity.
Each model organism has a different suite of phenotypes that are examined, because different models are used to explore different types of biological function and malfunction. By using a diversity of model systems, we have the potential to identify candidates based on partial overlaps with the patient phenotype profile by looking at different models with mutations in potential candidates or related via interactions, co-expression, genomic regulatory region, etc.
Bare Lymphocyte Syndrome Type 1 Protein-Interaction Network: The protein-interaction network associated with bare lymphocyte syndrome type 1, which comprises the genes TAP1, TAP2, and TAPBP. Each of these genes is shown in red. The DI and SP methods additionally identified the unrelated genes PSMB8 and PSMB9 (shown in yellow) as potential disease genes because they each have an interaction with one of the true disease genes. The RWR method ranks the true disease genes higher because each true disease gene has interactions with two other family members and because there is a dense net of proteins that connect the disease genes via paths with two interactions.
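The idea behind a random walk with restart (RWR) can be sketched in a few lines of Python. This is a generic textbook formulation, not the code behind Exome Walker: the five-node graph, the node names, the restart probability of 0.5, and the function name are all hypothetical. At each step a walker either jumps back to a seed (known disease gene) or moves to a random neighbor, and the steady-state visiting probability is used as the score.

```python
def random_walk_with_restart(edges, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a random walk with restart.

    edges: iterable of undirected (u, v) pairs; seeds: set of start nodes,
    all of which must appear in the graph. Update rule per iteration:
    p_next = (1 - restart) * W * p + restart * p0, with W the
    column-normalized adjacency matrix.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in neighbors}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in neighbors}
        for node, prob in p.items():
            share = (1.0 - restart) * prob / len(neighbors[node])
            for neighbor in neighbors[node]:
                nxt[neighbor] += share
        delta = sum(abs(nxt[n] - p[n]) for n in neighbors)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical toy network: A and B are known disease genes (seeds).
# C touches both seeds, D touches one seed, E is two steps away.
EDGES = [("A", "C"), ("B", "C"), ("A", "D"), ("D", "E")]
SEEDS = {"A", "B"}

scores = random_walk_with_restart(EDGES, SEEDS)
candidates = sorted((n for n in scores if n not in SEEDS),
                    key=lambda n: scores[n], reverse=True)
print(candidates)
```

In this toy network the candidate connected to both seeds scores above the one connected to a single seed, which illustrates why a global measure like RWR can separate candidates that a direct-interaction count would treat alike.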
PhenoViz is a new graphviz plugin that can be used as a standalone app for Windows, Mac, or Linux. The user uploads a list of CNVs detected by array CGH (SNP chips or even genome sequence data would also work as a starting point, but the program expects a simple list). You also enter a list of the HPO terms observed in the patient. The application then tries to find "matches" based on the single-gene disorders (human, HPO annotations), the mouse models (mainly knockouts, MP annotations from MGI), or fish models (ZFIN E/Q annotations). This is being used in the Charité array CGH diagnostics service to help with interpretation of CNVs. Subjectively, the tool helps you to quickly find good candidates in order to write reports. The program also picks out the best-matching CNV in case the user enters several (a typical array CGH finding in our lab has up to 50 CNVs, of which 2-5 are not found in databases of common variants like DGV).
There are a lot of people who have contributed to this work over many years.