1. “Using Supercomputers and Data Analytics to Discover the Differences in Health and Disease”
Briefing for
Dell Analytics Team
Calit2’s Qualcomm Institute
University of California, San Diego
April 7, 2016
Dr. Larry Smarr
Director, California Institute for Telecommunications and Information Technology
Harry E. Gruber Professor,
Dept. of Computer Science and Engineering
Jacobs School of Engineering, UCSD
http://lsmarr.calit2.net
2. We Gathered Raw Illumina Reads on 275 Humans
and Generated a Time Series of My Gut Microbiome
• “Healthy” Individuals
– 250 Subjects, 1 Point in Time
• Inflammatory Bowel Disease (IBD) Patients
– 5 Ileal Crohn’s Patients, 3 Points in Time
– 2 Ulcerative Colitis Patients, 6 Points in Time
• Larry Smarr (Colonic Crohn’s)
– 7 Points in Time
Each Sample Has 100-200 Million Illumina Short Reads (100 Bases)
Total of 27 Billion Reads, or 2.7 Trillion Bases
Source: Jerry Sheehan, Calit2; Weizhong Li, Sitao Wu, CRBS, UCSD
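The totals on this slide are related by simple arithmetic: reads times bases per read. A minimal sketch of that check follows; the function and constant names are illustrative and do not come from the project's pipeline.

```python
# Back-of-the-envelope check of the sequencing totals quoted on the slide.
# Names here are illustrative assumptions, not part of the original pipeline.

READ_LENGTH_BASES = 100          # slide: Illumina short reads of 100 bases
TOTAL_READS = 27_000_000_000     # slide: total of 27 billion reads


def total_bases(reads: int, read_length: int) -> int:
    """Total sequenced bases = number of reads x bases per read."""
    return reads * read_length


bases = total_bases(TOTAL_READS, READ_LENGTH_BASES)
print(f"{bases / 1e12:.1f} trillion bases")  # prints "2.7 trillion bases"
```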
3. Mapping Out the Dynamics of Autoimmune Microbiome Ecology
Couples Next Generation Genome Sequencers to Big Data Supercomputers
Our Team Used 25 CPU-Years to Compute Comparative Gut Microbiomes,
Starting From 2.7 Trillion DNA Bases of My Samples and Healthy and IBD Controls
Illumina HiSeq 2000 at JCVI
SDSC Gordon Data Supercomputer
Source: Weizhong Li, UCSD
4. To Expand the IBD Project, the Knight/Smarr Labs Were Awarded
~1 CPU-Century of Supercomputing Time
• Smarr Gut Microbiome Time Series
– From 7 Samples Over 1.5 Years
– To 50 Samples Over 4 Years
• IBD Patients: From 5 Crohn’s Disease and 2 Ulcerative Colitis Patients to ~100 Patients
– 50 Carefully Phenotyped Patients Drawn from Sandborn BioBank
– 43 Metagenomes from the RISK Cohort of Newly Diagnosed IBD Patients
• New Software Suite from Knight Lab
– Re-annotation of Reference Genomes, Functional / Taxonomic Variations
– Novel Compute-Intensive Assembly Algorithms from Pavel Pevzner
8x Compute Resources Over Prior Study
5. Next Step
Programmability, Scalability, and Reproducibility using bioKepler
www.kepler-project.org
www.biokepler.org
National Resources: (Gordon) (Comet) (Stampede) (Lonestar)
Cloud Resources
Optimized
Local Cluster Resources
Source: Ilkay Altintas, SDSC
6. Using HPC and Data Analytics
to Discover Microbial Diagnostics for Disease Dynamics
• Can Data Distinguish Between Health and Disease Subtypes?
• Can Data Track the Time Development of the Disease State?
• Can Data Create Novel Microbial Diagnostics for Identifying Health and Disease States?
• Can Data Discover Functional Microbiome Gene Changes Between Health and Disease?
8. Dell Analytics Separates The 4 Patient Types in Our Data
Using Our Microbiome Species Data
Source: Thomas Hill, Ph.D.
Executive Director Analytics
Dell | Information Management Group, Dell Software
Healthy
Ulcerative Colitis
Colonic Crohn’s
Ileal Crohn’s
10. I Built on Dell Analytics to Show Dynamic Evolution of My Microbiome
Toward and Away from Healthy State – Colonic Crohn’s
Healthy
Ileal Crohn’s
Seven Time Samples Over 1.5 Years
Colonic Crohn’s
Source: Thomas Hill, Ph.D.
Executive Director Analytics
Dell | Information Management Group, Dell Software
11. Variation in My Gut Microbiome by 16S Families –
40 Samples Over 3.5 Years
Data from Justine Debelius & Jose Navas, Knight Lab, UCSD; Larry Smarr Analysis, January 2016
12. Larry Smarr Gut Microbiome Ecology Shifted After Drug Therapy
Between Two Time-Stable Equilibriums Correlated to Physical Symptoms
• Antibiotics, Prednisone: 1/1/12 to 5/1/12
• Frequent IBD Symptoms, Weight Loss: 5/1/12 to 12/1/14
– Blue Balls on Diagram to the Right
• Lialda & Uceris: 12/1/13 to 1/1/14
• Few IBD Symptoms, Weight Gain: 1/1/14 to 1/1/16
– Red Balls on Diagram to the Right
Weekly Weight (Red Dots Stool Sample)
Principal Coordinate Analysis of Microbiome Ecology
PCoA by Justine Debelius and Jose Navas, Knight Lab, UCSD
Weight Data from Larry Smarr, Calit2, UCSD
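Principal Coordinate Analysis places samples on axes derived from a pairwise distance matrix, so that similar microbiomes land close together. The sketch below computes only the first principal coordinate in pure Python. It assumes Bray-Curtis dissimilarity (the slide does not name the distance used) and substitutes power iteration for a full eigendecomposition, which is adequate only when the dominant eigenvalue of the centered matrix is positive. The abundance values are made up for illustration.

```python
import math


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def first_principal_coordinate(samples, iterations=200):
    """First PCoA axis: double-center the squared distances, then take the
    dominant eigenvector (power iteration) scaled by sqrt(eigenvalue)."""
    n = len(samples)
    d2 = [[bray_curtis(samples[i], samples[j]) ** 2 for j in range(n)]
          for i in range(n)]
    row_mean = [sum(row) / n for row in d2]
    grand_mean = sum(row_mean) / n
    # Gower double-centering: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    vec = [float(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        nxt = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]

    # Rayleigh quotient gives the eigenvalue for the converged vector.
    unit = math.sqrt(sum(x * x for x in vec))
    vec = [x / unit for x in vec]
    bv = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(x * y for x, y in zip(vec, bv))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [scale * x for x in vec]


# Hypothetical relative abundances: two samples from each of two states.
demo_samples = [
    [0.70, 0.20, 0.10],
    [0.68, 0.22, 0.10],
    [0.10, 0.20, 0.70],
    [0.12, 0.18, 0.70],
]
demo_coords = first_principal_coordinate(demo_samples)
print(demo_coords)
```

In this toy input the first two samples receive one sign on the axis and the last two the opposite sign, which is the kind of separation between two stable states that the slide's diagram describes. The overall sign of the axis is arbitrary.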
13. Can Data Create Novel Microbial Diagnostics
for Identifying Health and Disease States?
14. Dell Analytics Tree Graphs Classify
the 4 Health/Disease States With Just 3 Microbe Species
Source: Thomas Hill, Ph.D.
Executive Director Analytics
Dell | Information Management Group, Dell Software
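A tree classifier is built from threshold splits on individual features, here the abundance of a microbe species. The sketch below shows how one such split can be chosen by minimizing Gini impurity. It is a generic illustration and not the Dell Analytics implementation; the abundance values and labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best rule `value <= threshold`.

    The threshold is None when no split improves on the unsplit impurity.
    """
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical relative abundances of one species in four subjects.
abundance = [0.001, 0.002, 0.1, 0.2]
state = ["Healthy", "Healthy", "UC", "UC"]
print(best_split(abundance, state))
```

A full tree repeats this search on each resulting subgroup, choosing among all candidate species at every node, until the groups are pure or a depth limit is reached.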
15. Our Relative Abundance Results Across ~300 People
Show Why Dell Analytics Tree Classifier Works
UC 100x Healthy
LS 100x UC
We Produced Similar Results for ~2500 Microbial Species
Healthy 100x CD
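Labels such as "UC 100x Healthy" compare the mean relative abundance of one species between two groups. A minimal sketch of that comparison follows, assuming per-sample read counts per species as input. The species names and counts are hypothetical and are not results from the study.

```python
def relative_abundance(counts):
    """Convert raw per-species counts for one sample into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: value / total for name, value in counts.items()}


def group_mean_abundance(samples, species):
    """Mean relative abundance of `species` across a group of samples."""
    fractions = [relative_abundance(s).get(species, 0.0) for s in samples]
    return sum(fractions) / len(fractions)


def fold_ratio(group_a, group_b, species):
    """Ratio of group A's mean abundance to group B's, or None if B's mean is zero."""
    mean_b = group_mean_abundance(group_b, species)
    if mean_b == 0.0:
        return None
    return group_mean_abundance(group_a, species) / mean_b


# Hypothetical counts: species_A is rare in healthy subjects, common in UC.
healthy = [{"species_A": 1, "other": 999}, {"species_A": 1, "other": 999}]
uc = [{"species_A": 100, "other": 900}, {"species_A": 100, "other": 900}]
print(fold_ratio(uc, healthy, "species_A"))
```

A species whose ratio is large in one direction for one disease state and small for another is a natural candidate for a split in a tree classifier.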
16. Ayasdi Enables Discovery of Differences Between
Healthy and Disease States Using Microbiome Species
Healthy, LS, Ileal Crohn’s, Ulcerative Colitis
Using Multidimensional Scaling Lens with Correlation Metric
High in Healthy and LS
High in Healthy and Ulcerative Colitis
High in Both LS and Ileal Crohn’s Disease
Analysis by Mehrdad Yazdani, Calit2
17. Can Data Discover Functional Microbiome Gene Changes
Between Health and Disease?
18. We Computed the Relative Abundance of Microbial Gene Families -
~10,000 KEGG Orthologous Genes, Across Healthy and IBD Subjects
How Large is the Microbiome’s Genetic Change
Between Health and Disease States?
19. In a “Healthy” Gut Microbiome:
Large Taxonomy Variation, Low Protein Family Variation
Source: Nature, 486, 207-212 (2012)
Over 200 People
20. Ratio of HE11529 to Ave HE
Test to see How Much Variation There is Within Healthy
Most KEGGs Are Within 10x
Of Healthy for a Random HE
Ratio of Random HE11529 to Healthy Average for Each Nonzero KEGG
Similar to HMP Healthy Results
21. Our Research Shows Large Changes
in Protein Families Between Health and Disease – Ileal Crohn’s
KEGGs Greatly Increased
In the Disease State
KEGGs Greatly Decreased
In the Disease State
Over 7000 KEGGs Which Are Nonzero
in Health and Disease States
Ratio of CD Average to Healthy Average for Each Nonzero KEGG
Note Hi/Low Symmetry
Similar Results for UC and LS
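The plotted quantity is the ratio of the disease-group average to the healthy average for each KEGG that is nonzero in both states. The sketch below computes that ratio on a log10 scale and lists the KEGGs that move by more than a chosen fold in either direction. The 10x cutoff mirrors the "within 10x" language of the healthy-variation test, the KEGG identifiers are placeholders, and the abundance values are invented for illustration.

```python
import math


def kegg_log_ratios(disease_avg, healthy_avg):
    """log10(disease / healthy) for each KEGG that is nonzero in both states."""
    ratios = {}
    for kegg, healthy in healthy_avg.items():
        disease = disease_avg.get(kegg, 0.0)
        if healthy > 0 and disease > 0:
            ratios[kegg] = math.log10(disease / healthy)
    return ratios


def split_by_fold(ratios, fold=10.0):
    """Return (increased, decreased) KEGG lists that exceed `fold` in either direction."""
    cutoff = math.log10(fold)
    increased = sorted(k for k, r in ratios.items() if r > cutoff)
    decreased = sorted(k for k, r in ratios.items() if r < -cutoff)
    return increased, decreased


# Placeholder KEGG names with hypothetical average relative abundances.
healthy_avg = {"K_a": 1e-4, "K_b": 1e-4, "K_c": 1e-4, "K_d": 0.0}
disease_avg = {"K_a": 5e-3, "K_b": 2e-6, "K_c": 2e-4, "K_d": 1e-3}

ratios = kegg_log_ratios(disease_avg, healthy_avg)
print(split_by_fold(ratios))  # prints (['K_a'], ['K_b'])
```

Working in log space treats a 50x increase and a 50x decrease as equal and opposite, which makes any symmetry between the high and low tails easy to see.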
22. We Found a Set of Ayasdi Lenses That Separate Out
the 43 Extreme KEGGs Common to the Disease States
K00108(choline_dehydrogenase)
K00673(arginine_N-succinyltransferase)
K00867(type_I_pantothenate_kinase)
K01169(ribonuclease_I_(enterobacter_ribonuclease))
K01484(succinylarginine_dihydrolase)
K01682(aconitate_hydratase_2)
K01690(phosphogluconate_dehydratase)
K01825(3-hydroxyacyl-CoA_dehydrogenase_/_enoyl-CoA_hydratase_/3-hydroxybutyryl-CoA_epimerase_/_e
K02173(hypothetical_protein)
K02317(DNA_replication_protein_DnaT)
K02466(glucitol_operon_activator_protein)
K02846(N-methyl-L-tryptophan_oxidase)
K03081(3-dehydro-L-gulonate-6-phosphate_decarboxylase)
K03119(taurine_dioxygenase)
K03181(chorismate--pyruvate_lyase)
K03807(AmpE_protein)
K05522(endonuclease_VIII)
K05775(maltose_operon_periplasmic_protein)
K05812(conserved_hypothetical_protein)
K05997(Fe-S_cluster_assembly_protein_SufA)
K06073(vitamin_B12_transport_system_permease_protein)
K06205(MioC_protein)
K06445(acyl-CoA_dehydrogenase)
K06447(succinylglutamic_semialdehyde_dehydrogenase)
K07229(TrkA_domain_protein)
K07232(cation_transport_protein_ChaC)
K07312(putative_dimethyl_sulfoxide_reductase_subunit_YnfH_(DMSO_reductase_anchor_subunit))
K07336(PKHD-type_hydroxylase)
K08989(putative_membrane_protein)
K09018(putative_monooxygenase_RutA)
K09456(putative_acyl-CoA_dehydrogenase)
K09998(arginine_transport_system_permease_protein)
K10748(DNA_replication_terminus_site-binding_protein)
K11209(GST-like_protein)
K11391(ribosomal_RNA_large_subunit_methyltransferase_G)
K11734(aromatic_amino_acid_transport_protein_AroP)
K11735(GABA_permease)
K11925(SgrR_family_transcriptional_regulator)
K12288(pilus_assembly_protein_HofM)
K13255(ferric_iron_reductase_protein_FhuF)
K14588()
K15733()
K15834()
L-Infinity Centrality Lens Using Norm Correlation as Metric (Resolution: 242, Gain: 5.7)
Entropy & Variance Lens Using Angle as Metric (Resolution: 30, Gain: 3.00)
Analysis by Mehrdad Yazdani, Calit2
23. Disease Arises from Perturbed Protein Family Networks:
Dynamics of a Prion Perturbed Network in Mice
Source: Lee Hood, ISB
Our Next Goal is to Create
Such Perturbed Networks in Humans
24. Calit2’s Qualcomm Institute Has Developed
Interactive Scalable Visualization for Biological Networks
20,000 Samples
60,000 OTUs
18 Million Edges
Runs Natively on 64 Million Pixels
26. Thanks to Our Great Team!
Calit2@UCSD
Future Patient Team
Jerry Sheehan
Tom DeFanti
Joe Keefe
John Graham
Kevin Patrick
Mehrdad Yazdani
Jurgen Schulze
Andrew Prudhomme
Philip Weber
Fred Raab
Ernesto Ramirez
JCVI Team
Karen Nelson
Shibu Yooseph
Manolito Torralba
Ayasdi
Devi Ramanan
Pek Lum
UCSD Metagenomics Team
Weizhong Li
Sitao Wu
SDSC Team
Michael Norman
Mahidhar Tatineni
Robert Sinkovits
Ilkay Altintas
UCSD Health Sciences Team
David Brenner
Rob Knight Lab
Justine Debelius
Jose Navas
Bryn Taylor
Gail Ackermann
Greg Humphrey
William J. Sandborn Lab
Elisabeth Evans
John Chang
Brigid Boland
Dell/R Systems
Brian Kucic
John Thompson
Thomas Hill