2. Progress in biomedical discovery has
been enabled by technological progress
• New sequencing technology: hundreds of genomes a day are now produced
• Advances in software
  • Standard DNA analysis codes have emerged
  • New versions are continuously released
  • Custom software is developed for unconventional analysis
  • Analysis pipelines are developed for automated analysis and compilation of results
• Advances in computational hardware
  • Codes standardized on Intel processor-based systems ease porting to new systems
  • Continuous advances in the Intel product line enable us to easily “keep up”
  • The bottom line: with process advances and new Intel MIC processors, we have seen speedups from 1 genome/2 weeks to 50 genomes/day. It is straightforward to expand hardware in response to computational demand
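As a rough check on the throughput figures quoted in this slide, the change from 1 genome per 2 weeks to 50 genomes per day can be expressed as a single ratio. The sketch below uses only those two figures; the batch size of 1,000 genomes and the helper name `days_to_process` are hypothetical and chosen purely for illustration.

```python
from fractions import Fraction

# Throughput figures quoted on the slide, in genomes per day.
before_rate = Fraction(1, 14)   # 1 genome per 2 weeks
after_rate = Fraction(50, 1)    # 50 genomes per day

# Ratio of new to old throughput (exact, thanks to Fraction).
speedup = after_rate / before_rate


def days_to_process(n_genomes, genomes_per_day):
    """Days needed to analyze n_genomes at a constant daily throughput."""
    return n_genomes / genomes_per_day


if __name__ == "__main__":
    batch = 1000  # hypothetical batch size, not from the source
    print(f"Speedup: {int(speedup)}x")
    print(f"Old rate: {int(days_to_process(batch, before_rate))} days for {batch} genomes")
    print(f"New rate: {int(days_to_process(batch, after_rate))} days for {batch} genomes")
```

Exact fractions are used instead of floats so that the ratio comes out as a whole number without rounding noise.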
3. Computing is primarily done on a
machine we developed: SHADOWFAX
A heterogeneous computing environment for data-intensive computations
~2,524 CPUs, > 12 TB RAM (spectrum of Intel processors)
8 Intel® Xeon® E5-2600/FPGA hybrid core systems (in partnership with Convey)
~0.8 PB disk arrays (DDN)
100 PB Sun/Oracle tape storage system
4. Computing is primarily done on a
machine we developed: SHADOWFAX
With local synchronized copies of major databases: Medline, arXiv, PubMed Central, GenBank, SwissProt, 1000 Genomes Project, The Cancer Genome Atlas, Wikipedia
To meet the needs of applications that demand HPC: deep sequencing assembly and analysis, molecular modeling, simulations, proteomics analysis, text mining, health IT
5. NextGen DNA sequence analysis is now
the rate limiting step
• The cost of sequencing has dropped from $3B/genome to ~$1K/genome.
• New genomes are sequenced daily.
• It is estimated that 30,000 human genomes have been completed, with 15,000 of these in the public domain.
• Analysis has focused on Single Nucleotide Polymorphisms (“SNPs”), which are single-letter changes in the DNA code.
• For complex diseases like cancer, heart disease and mental disorders, extensive work has so far explained only 10-20% of the known genetic component.
• Recent research indicates that, due to experimental measurement noise, perhaps most of the measured variations are false positives.
6. Microsatellites, or repetitive DNA
sequences, are particularly challenging
• Microsatellites, also called Simple Sequence Repeats or Short Tandem Repeats, are an understudied portion of the genome because they are considered part of our “Junk DNA” or, more recently, “Dark Matter” DNA; research focus has been on Single Nucleotide Polymorphisms (“SNPs”)
• Microsatellites have known value: they have long been used for paternity and forensic testing and are linked to neurological diseases (e.g. Huntington’s and Fragile-X)
• None of the major genomic research projects has focused on microsatellites: not the Human Genome Project, the 1000 Genomes Project, The Cancer Genome Atlas, ENCODE or the iCOGS study.
7. Genomeon’s Research Methodology
Download and rebuild thousands of “healthy” and “affected” genomes
Create genotype distributions for “healthy” and “affected” populations
Compute Fisher’s Exact Test p-value for each of ~1 million loci and rank results
Identify “Patterns of Informative Microsatellites” (PIM) from loci that pass Bonferroni and Benjamini–Hochberg False Discovery Rate tests
Manually review, do QC, compute sensitivity and specificity
Annotate with ontologies, literature, input from experts
Validate PIM with sequencing of well-characterized samples
Business analysis; product definition; IP
Publish; translate, regulatory approval, reimbursement; team with established clinical services co.
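The statistical steps named in this methodology (a per-locus Fisher's Exact Test followed by Bonferroni and Benjamini–Hochberg filtering) can be sketched as follows. This is a minimal illustration using only the Python standard library, not Genomeon's actual pipeline. It assumes each locus has been collapsed to a 2x2 table of counts (for example, samples with and without a given genotype in the healthy and affected groups); the source does not say how genotype distributions are tabulated, so that collapse, the function names, the locus names and the counts are all hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(total, 1.0)


def bonferroni_pass(pvalues, alpha=0.05):
    """True where p <= alpha / m, with m the number of tests."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg_pass(pvalues, alpha=0.05):
    """True where the test is rejected by the Benjamini-Hochberg step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k_max = rank  # largest rank whose p-value is under its threshold
    passed = [False] * m
    for rank, i in enumerate(order, start=1):
        passed[i] = rank <= k_max
    return passed


if __name__ == "__main__":
    # Hypothetical counts per locus:
    # (healthy with genotype, healthy without, affected with, affected without)
    loci = {
        "locus_A": (30, 10, 10, 30),
        "locus_B": (20, 20, 19, 21),
        "locus_C": (25, 15, 12, 28),
    }
    names = list(loci)
    pvals = [fisher_exact_two_sided(*loci[n]) for n in names]
    bonf = bonferroni_pass(pvals)
    bh = benjamini_hochberg_pass(pvals)
    # Rank loci by p-value, most significant first.
    for i in sorted(range(len(names)), key=lambda i: pvals[i]):
        print(f"{names[i]}: p={pvals[i]:.3g} bonferroni={bonf[i]} bh={bh[i]}")
```

Bonferroni controls the chance of any false positive and is the stricter filter, while Benjamini–Hochberg controls the expected fraction of false discoveries; with around a million loci, a library routine would normally replace the hand-written test for speed.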
8. Genomeon has created a unique library of
over 7,700 genomes from the 1000 Genomes
Project and The Cancer Genome Atlas with
corrected microsatellites
• “Healthy Population” representing many ethnicities
• Ovarian cancer
• Breast cancer
• Brain cancer: Glioma; Glioblastoma; Medulloblastoma
• Lung adenocarcinoma
• Prostate cancer
• Melanoma
• Autism
10. Pattern of 55 informative microsatellites
differentiates Breast Cancer germlines from
healthy germlines
Sensitivity = 84%
Specificity = 87%
BRCA positive samples
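For reference, sensitivity and specificity are derived from the counts of correctly and incorrectly classified samples. The sketch below shows the standard definitions; the counts are hypothetical and chosen only so that they reproduce the percentages on this slide, since the actual sample sizes are not given here.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    tp: affected samples called affected     fn: affected samples called healthy
    tn: healthy samples called healthy       fp: healthy samples called affected
    """
    sensitivity = tp / (tp + fn)  # fraction of affected samples detected
    specificity = tn / (tn + fp)  # fraction of healthy samples correctly cleared
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical counts: 100 affected and 100 healthy germlines.
    sens, spec = sensitivity_specificity(tp=84, fn=16, tn=87, fp=13)
    print(f"Sensitivity = {sens:.0%}, Specificity = {spec:.0%}")
```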
11. Applications of these microsatellite loci
variations
Cancer Risk Diagnostics - Microsatellite profiling for increased risk of cancer, and the tissues at highest risk
Companion/Treatment Diagnostics - Many informative microsatellites are functional elements implicated in therapeutic response
Clinical Trial Support - Use of microsatellite profile to differentiate sub-populations in clinical trials
Drug Targets - Identification of a large number of genes previously unassociated with cancer, many with functions associated with cancer processes
Toxicology - Quantification of stress-induced exposures via microsatellite mutation screen
Prognosis - Comparison of microsatellite variations between germlines and tumors
Non-cancer Diseases - PTSD, Autism, MS, cardiac diseases, aging