SlideShare une entreprise Scribd logo
1  sur  58
Télécharger pour lire hors ligne
RIDING THE DATA 
TIDAL WAVE IN 
MICROBIOLOGY 
Adina Howe 
germslab.org (Genomics and Environmental Research in Microbial Systems) 
Argonne National Laboratory / Michigan State University 
Iowa State University, Ag & Biosystems Engr (January) 
Slides available at www.slideshare.com/adinachuanghowe 
Future of Big Data, Lincoln, NE 11/6/2014
Microbes are critical 
Climate Change 
USGCRP 2009 
Energy Supply 
www.alutiiq.com 
Human & Animal 
Health 
http://guardianlv.com/ 
Global Food 
Security 
An understanding 
of microbial ecology
Understanding community 
dynamics 
 Who is there? 
 What are they doing? 
 How are they doing it?
Understanding community 
dynamics 
 Who is there? 
 What are they doing? 
 How are they doing it? 
Kim Lewis, 2010
Gene / Genome Sequencing 
 Collect samples 
 Extract DNA 
 Sequence DNA 
 “Analyze” DNA to identify its content and origin 
Taxonomy 
(e.g., pathogenic E. Coli) 
Function 
(e.g., degrades cellulose)
Cost of Sequencing 
100,000,000 
1,000,000 
100,000 
10,000 
1,000 
100 
10 
1 
Stein, Genome Biology, 2010 
E. Coli genome 4,500,000 bp ($4.5M, 1992) 
1990 1992 1994 1996 1998 2000 2003 2004 2006 2008 2010 2012 
Year 
0.1 
DNA Sequencing, Mbp per $ 
10,000,000
Rapidly decreasing costs with 
NGS Sequencing 
100,000,000 
10,000,000 
1,000,000 
100,000 
10,000 
1,000 
100 
10 
1 
Stein, Genome Biology, 2010 
Next Generation Sequencing 
4,500,000 bp (E. Coli, $200, presently) 
1990 1992 1994 1996 1998 2000 2003 2004 2006 2008 2010 2012 
Year 
0.1 
DNA Sequencing, Mbp per $
Effects of low cost 
sequencing… 
First free-living bacterium sequenced 
for billions of dollars and years of 
analysis 
Personal genome can be 
mapped in a few days and 
hundreds to few thousand 
dollars
The experimental continuum 
Single Isolate 
Pure Culture 
Enrichment 
Mixed Cultures 
Natural systems
The era of big data in biology 
NGS (Shotgun) Sequencing 
(doubling time 5 months) 
100,000,000 
100,000,000 
10,000,000 
1,000,000 
1,000,000 
100,000 
100,000 
10,000 
10,000 
1,000 
1,000 
100 
100 
10 
10 
1 
1 
Stein, Genome Biology, 2010 
Computational Hardware 
(doubling time 14 months) 
Sanger Sequencing 
(doubling time 19 months) 
1990 1992 1994 1996 1998 2000 2003 2004 2006 2008 2010 2012 
Year 
1,000,000 
100,000 
10,000 
1,000 
100 
10 
1 
0 
Disk Storage, Mb/$ 
0.1 
DNA Sequencing, Mbp per $ 
10,000,000 
0.1
Postdoc experience with data 
2003-2008 Cumulative sequencing in PhD = 2000 bp 
2008-2009 Postdoc Year 1 = 50 Gbp 
2009-2010 Postdoc Year 2 = 450 Gbp 
2014 = 50 Tbp 
2015 = 500 Tbp budgeted
THE DIRT ON SOIL 
MAGNIFICENT BIODIVERSITY 
Biodiversity in the dark, Wall et al., Nature Geoscience, 2010 Jeremy Burgress
THE DIRT ON SOIL 
SPATIAL HETEROGENEITY 
http://www.fao.org/ www.cnr.uidaho.edu
THE DIRT ON SOIL 
DYNAMIC
THE DIRT ON SOIL 
INTERACTIONS: BIOTIC, ABIOTIC, ABOVE, BELOW, SCALES 
Philippot, 2013, Nature Reviews Microbiology
I. Technical side of microbial big data in 
biology 
II. Future of big data in soil microbial 
communities 
III. Bottlenecks for microbiologists
Tackling Soil Biodiversity 
Source: Chuck Haney 
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU) 
Janet Jansson, Susannah Tringe (JGI)
Lesson #1: Accessing information in 
data 
http://siliconangle.com/files/2010/09/image_thumb69.png
de novo assembly 
Raw sequencing data (“reads”) Computational algorithms Informative genes / genomes 
Compresses dataset size significantly 
Improved data quality (longer sequences, gene order) 
Reference not necessary (novelty)
Metagenome assembly…a scaling 
problem.
Shotgun sequencing and de novo 
assembly 
It was the Gest of times, it was the wor 
, it was the worst of timZs, it was the 
isdom, it was the age of foolisXness 
, it was the worVt of times, it was the 
mes, it was Ahe age of wisdom, it was th 
It was the best of times, it Gas the wor 
mes, it was the age of witdom, it was th 
isdom, it was tIe age of foolishness 
It was the best of times, it was the worst of times, it was the 
age of wisdom, it was the age of foolishness
Practical Challenges – Intensive 
computing 
Months of 
“computer 
crunching” on a 
super computer 
Howe et al, 2014, PNAS
Practical Challenges – Intensive 
computing 
Months of 
“computer 
crunching” on a 
super computer 
Howe et al, 2014, PNAS 
Assembly of 300 Gbp (70,000 
genomes worth) can be done with 
any assembly program in less 
than 14 GB RAM and less than 
24 hours. 
50 Gbp = 10,000 genomes
Natural community characteristics 
 Diverse 
 Many organisms 
(genomes)
Natural community characteristics 
 Diverse 
 Many organisms 
(genomes) 
 Variable abundance 
 Most abundant organisms, sampled 
more often 
 Assembly requires a minimum amount 
of sampling 
 More sequencing, more errors 
Sample 1x
Natural community characteristics 
 Diverse 
 Many organisms 
(genomes) 
 Variable abundance 
 Most abundant organisms, sampled 
more often 
 Assembly requires a minimum amount 
of sampling 
 More sequencing, more errors 
Sample 1x Sample 10x
Natural community characteristics 
 Diverse 
 Many organisms 
(genomes) 
 Variable abundance 
 Most abundant organisms, sampled 
more often 
 Assembly requires a minimum amount 
of sampling 
 More sequencing, more errors 
Overkill 
Sample 1x Sample 10x
Digital normalization 
Brown et al., 2012, arXiv 
Howe et al., 2014, PNAS 
Zhang et al., 2014, PLOS One
Digital normalization 
Brown et al., 2012, arXiv 
Howe et al., 2014, PNAS 
Zhang et al., 2014, PLOS One
Digital normalization 
Brown et al., 2012, arXiv 
Howe et al., 2014, PNAS 
Zhang et al., 2014, PLOS One
Digital normalization 
Brown et al., 2012, arXiv 
Howe et al., 2014, PNAS 
Zhang et al., 2014, PLOS One
Digital normalization 
Brown et al., 2012, arXiv 
Howe et al., 2014, PNAS 
Zhang et al., 2014, PLOS One
Digital normalization 
Brown et al., 2012, arXiv 
Howe et al., 2014, PNAS 
Zhang et al., 2014, PLOS One 
 Scales datasets for assembly up to 95% - same assembly 
outputs. 
 Genomes, mRNA-seq, metagenomes (soils, gut, water)
Tackling Soil Biodiversity 
Source: Chuck Haney 
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU) 
Janet Jansson, Susannah Tringe (JGI)
The reality?
More like… 
Howe et. al, 2014, PNAS Source: Chuck Haney
What we learned from deeply sequencing 
soil 
 Grand Challenge effort – 
10% of soil biodiversity 
sampled 
 Incredible soil biodiversity 
(estimate required 10 
Tbp/sample) 
 “To boldly go where no man 
has gone before”: >60% 
Unknown 
400 
300 
200 
100 
0 
amino acid metabolism 
carbohydrate metabolism 
membrane transport 
signal transduction 
translation 
folding, sorting and degradation 
metabolism of cofactors and vitamins 
energy metabolism 
transport and catabolism 
lipid metabolism 
transcription 
cell growth and death 
replication and repair 
xenobiotics biodegradation and metabolism 
nucleotide metabolism 
glycan biosynthesis and metabolism 
metabolism of terpenoids and polyketides 
cell motility 
Total Count 
KO 
corn and prairie 
corn only 
prairie only 
Howe et al, 2014, PNAS 
Managed agriculture soils exhibit less 
diversity, likely from its history of 
cultivation.
If soil is so diverse, what are the most 
consistent signals we can see at the plot 
levels?
Is there an identifiable “soil 
functional core” (carbon cycling 
focus)? 
Ames, Iowa, COBS Field Site, Fertilized Prairie Whole Soil Samples 
4 deeply sampled whole soil metagenomes (16S rRNA 5000 reads, Shotgun 20-50 million Kirsten Hofmockel, Iowa State University
How much genetic sequence do 
you think is shared in 4 replicates? 
 More than 1%? 10%? 50%? 
 What kind of genes do you expect in this core? 
 Minimal critical genes will be abundant & diverse
How much genetic sequence do 
you think is shared in 4 replicates? 
 More than 1%? 10%? 50%? 
 What kind of genes do you expect in this core? 
 Minimal critical genes are varying abundances & 
diverse
Core genes: soil-specific
My vision (and future research) 
Microbial markers for ecosystem services 
Nutrient cycling 
Pathogens 
Antibiotic resistance 
Biodiversity
Capabilities exist already… 
Indoor Microbiome Project Animation 
http://vimeo.com/90059732
Is more data better? 
Bottlenecks for the emerging 
microbiologists
Technical obstacles in the big data 
deluge 
 Access to the data 
 Access to the resources 
Democratization of both data and resource access 
“80% of awards and 50% of $$ are for grants < 
$350,000” (Ian Foster) 
 Data volume and velocity 
 Previous efforts are difficult to integrate 
 Innovation is necessary
Data intensive microbiology 
Software Developers 
Computer Scientists 
Clinicians 
PIs 
Data generators 
Microbiologists 
Data Analyzers 
Statisticians 
Bioinformaticians 
http://ivory.idyll.org/blog/2014-the-emerging-field-of-data-intensive-biology.html
Big data nebraska
Big data nebraska
Social obstacles – the main 
challenge 
Shift of costs do not mean shift of 
expectations 
http://www.deluxebattery.com/25-hilarious-expectation-vs-reality-photos/ 
Dear PI, 
It will take longer than 
the time it took you to do 
your experiment to 
analyze the data. Please 
do not write me for 
results within 24 hours of 
your sequences 
becoming available. 
- Adina
Culture of sharing 
Metagenomic Datasets 
http://www.heathershumaker.com/
Training / Incentives 
Emails between collaborators don’t contain as 
much “science” as I’d like:
All analysis: accessible, 
reproducible, and automated
All analysis: accessible, 
reproducible, and automated 
To reproduce analysis in a publication, 
1. Rent Amazon EC2 computer 
2. Clone github repository containing data and scripts 
3. Open IPython notebook and execute 
To run same analysis on different dataset, 
1. Replace data files with your own data, execute notebook. 
2. Tweak scripts as needed.
The journey in summary
RIDING THE BIG DATA 
TIDAL WAVE OF MODERN 
MICROBIOLOGY 
Adina Howe 
Argonne National Laboratory / Michigan State University 
Iowa State University, Ag & Biosystems Engr (January) 
“ 
”
RIDING THE BIG DATA 
TIDAL WAVE OF MODERN 
MICROBIOLOGY 
Adina Howe 
Argonne National Laboratory / Michigan State University 
Iowa State University, Ag & Biosystems Engr (January)
Acknowledgements 
 C. Titus Brown (MSU) 
 James Tiedje (MSU) 
 Daina Ringus (UC) 
 Folker Meyer (ANL) 
 Eugene Chang (UC) 
 NSF Biology Postdoc Fellowship 
 DOE Great Lakes Bioenergy Research Center

Contenu connexe

Tendances

Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...Larry Smarr
 
Analyzing the Human Gut Microbiome Dynamics in Health and Disease Using Super...
Analyzing the Human Gut Microbiome Dynamics in Health and Disease Using Super...Analyzing the Human Gut Microbiome Dynamics in Health and Disease Using Super...
Analyzing the Human Gut Microbiome Dynamics in Health and Disease Using Super...Larry Smarr
 
Bayesian Taxonomic Assignment for the Next-Generation Metagenomics
Bayesian Taxonomic Assignment for the Next-Generation MetagenomicsBayesian Taxonomic Assignment for the Next-Generation Metagenomics
Bayesian Taxonomic Assignment for the Next-Generation MetagenomicsJonathan Eisen
 
Using Supercomputers and Gene Sequencers to Discover Your Inner Microbiome
Using Supercomputers and Gene Sequencers to Discover Your Inner MicrobiomeUsing Supercomputers and Gene Sequencers to Discover Your Inner Microbiome
Using Supercomputers and Gene Sequencers to Discover Your Inner MicrobiomeLarry Smarr
 
Using Supercomputers and Supernetworks to Explore the Ocean of Life
Using Supercomputers and Supernetworks to Explore the Ocean of LifeUsing Supercomputers and Supernetworks to Explore the Ocean of Life
Using Supercomputers and Supernetworks to Explore the Ocean of LifeLarry Smarr
 
Quantified Self On Being A Personal Genomic Observatory
Quantified Self On Being A Personal Genomic ObservatoryQuantified Self On Being A Personal Genomic Observatory
Quantified Self On Being A Personal Genomic ObservatoryLarry Smarr
 
ISU ENVSCI690 Graduate Seminar Slides
ISU ENVSCI690 Graduate Seminar SlidesISU ENVSCI690 Graduate Seminar Slides
ISU ENVSCI690 Graduate Seminar SlidesAdina Chuang Howe
 
Building an Information Infrastructure to Support Microbial Metagenomic Sciences
Building an Information Infrastructure to Support Microbial Metagenomic SciencesBuilding an Information Infrastructure to Support Microbial Metagenomic Sciences
Building an Information Infrastructure to Support Microbial Metagenomic SciencesLarry Smarr
 
Supercomputing Your Inner Microbiome
Supercomputing Your Inner MicrobiomeSupercomputing Your Inner Microbiome
Supercomputing Your Inner MicrobiomeNicole McLaughlin
 
Using Supercomputers and Data Science to Reveal Your Inner Microbiome
Using Supercomputers and Data Science to Reveal Your Inner MicrobiomeUsing Supercomputers and Data Science to Reveal Your Inner Microbiome
Using Supercomputers and Data Science to Reveal Your Inner MicrobiomeLarry Smarr
 
Quantifying Your Superorganism Body Using Big Data Supercomputing
Quantifying Your Superorganism Body Using Big Data SupercomputingQuantifying Your Superorganism Body Using Big Data Supercomputing
Quantifying Your Superorganism Body Using Big Data SupercomputingLarry Smarr
 
Using Supercomputers to Discover the 100 Trillion Bacteria Living Within Each...
Using Supercomputers to Discover the 100 Trillion Bacteria Living Within Each...Using Supercomputers to Discover the 100 Trillion Bacteria Living Within Each...
Using Supercomputers to Discover the 100 Trillion Bacteria Living Within Each...Larry Smarr
 
The Human Microbiome and the Revolution in Digital Health
The Human Microbiome and the Revolution in Digital HealthThe Human Microbiome and the Revolution in Digital Health
The Human Microbiome and the Revolution in Digital HealthLarry Smarr
 
Discovering the Other 90% of our Human Superorganism
Discovering the Other 90% of our Human SuperorganismDiscovering the Other 90% of our Human Superorganism
Discovering the Other 90% of our Human SuperorganismLarry Smarr
 
Scott Edmunds at DataCite 2012: Adventures in Data Citation
Scott Edmunds at DataCite 2012: Adventures in Data CitationScott Edmunds at DataCite 2012: Adventures in Data Citation
Scott Edmunds at DataCite 2012: Adventures in Data CitationGigaScience, BGI Hong Kong
 
Tim Brown ACEAS Phenocams
Tim Brown ACEAS PhenocamsTim Brown ACEAS Phenocams
Tim Brown ACEAS Phenocamsaceas13tern
 
Microbial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
Microbial Phylogenomics (EVE161) Class 10-11: Genome SequencingMicrobial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
Microbial Phylogenomics (EVE161) Class 10-11: Genome SequencingJonathan Eisen
 
How Studying Astrophysics and Coral Reefs Enabled Me to Become an Empowered,...
How Studying Astrophysics and Coral Reefs Enabled Me to Become an Empowered,...How Studying Astrophysics and Coral Reefs Enabled Me to Become an Empowered,...
How Studying Astrophysics and Coral Reefs Enabled Me to Become an Empowered,...Larry Smarr
 
Reframing Phylogenomics
Reframing PhylogenomicsReframing Phylogenomics
Reframing PhylogenomicsJoe Parker
 
Quantifying The Dynamics of Your Superorganism Body Using Big Data Supercompu...
Quantifying The Dynamics of Your Superorganism Body Using Big Data Supercompu...Quantifying The Dynamics of Your Superorganism Body Using Big Data Supercompu...
Quantifying The Dynamics of Your Superorganism Body Using Big Data Supercompu...Larry Smarr
 

Tendances (20)

Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
 
Analyzing the Human Gut Microbiome Dynamics in Health and Disease Using Super...
Analyzing the Human Gut Microbiome Dynamics in Health and Disease Using Super...Analyzing the Human Gut Microbiome Dynamics in Health and Disease Using Super...
Analyzing the Human Gut Microbiome Dynamics in Health and Disease Using Super...
 
Bayesian Taxonomic Assignment for the Next-Generation Metagenomics
Bayesian Taxonomic Assignment for the Next-Generation MetagenomicsBayesian Taxonomic Assignment for the Next-Generation Metagenomics
Bayesian Taxonomic Assignment for the Next-Generation Metagenomics
 
Using Supercomputers and Gene Sequencers to Discover Your Inner Microbiome
Using Supercomputers and Gene Sequencers to Discover Your Inner MicrobiomeUsing Supercomputers and Gene Sequencers to Discover Your Inner Microbiome
Using Supercomputers and Gene Sequencers to Discover Your Inner Microbiome
 
Using Supercomputers and Supernetworks to Explore the Ocean of Life
Using Supercomputers and Supernetworks to Explore the Ocean of LifeUsing Supercomputers and Supernetworks to Explore the Ocean of Life
Using Supercomputers and Supernetworks to Explore the Ocean of Life
 
Quantified Self On Being A Personal Genomic Observatory
Quantified Self On Being A Personal Genomic ObservatoryQuantified Self On Being A Personal Genomic Observatory
Quantified Self On Being A Personal Genomic Observatory
 
ISU ENVSCI690 Graduate Seminar Slides
ISU ENVSCI690 Graduate Seminar SlidesISU ENVSCI690 Graduate Seminar Slides
ISU ENVSCI690 Graduate Seminar Slides
 
Building an Information Infrastructure to Support Microbial Metagenomic Sciences
Building an Information Infrastructure to Support Microbial Metagenomic SciencesBuilding an Information Infrastructure to Support Microbial Metagenomic Sciences
Building an Information Infrastructure to Support Microbial Metagenomic Sciences
 
Supercomputing Your Inner Microbiome
Supercomputing Your Inner MicrobiomeSupercomputing Your Inner Microbiome
Supercomputing Your Inner Microbiome
 
Using Supercomputers and Data Science to Reveal Your Inner Microbiome
Using Supercomputers and Data Science to Reveal Your Inner MicrobiomeUsing Supercomputers and Data Science to Reveal Your Inner Microbiome
Using Supercomputers and Data Science to Reveal Your Inner Microbiome
 
Quantifying Your Superorganism Body Using Big Data Supercomputing
Quantifying Your Superorganism Body Using Big Data SupercomputingQuantifying Your Superorganism Body Using Big Data Supercomputing
Quantifying Your Superorganism Body Using Big Data Supercomputing
 
Using Supercomputers to Discover the 100 Trillion Bacteria Living Within Each...
Using Supercomputers to Discover the 100 Trillion Bacteria Living Within Each...Using Supercomputers to Discover the 100 Trillion Bacteria Living Within Each...
Using Supercomputers to Discover the 100 Trillion Bacteria Living Within Each...
 
The Human Microbiome and the Revolution in Digital Health
The Human Microbiome and the Revolution in Digital HealthThe Human Microbiome and the Revolution in Digital Health
The Human Microbiome and the Revolution in Digital Health
 
Discovering the Other 90% of our Human Superorganism
Discovering the Other 90% of our Human SuperorganismDiscovering the Other 90% of our Human Superorganism
Discovering the Other 90% of our Human Superorganism
 
Scott Edmunds at DataCite 2012: Adventures in Data Citation
Scott Edmunds at DataCite 2012: Adventures in Data CitationScott Edmunds at DataCite 2012: Adventures in Data Citation
Scott Edmunds at DataCite 2012: Adventures in Data Citation
 
Tim Brown ACEAS Phenocams
Tim Brown ACEAS PhenocamsTim Brown ACEAS Phenocams
Tim Brown ACEAS Phenocams
 
Microbial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
Microbial Phylogenomics (EVE161) Class 10-11: Genome SequencingMicrobial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
Microbial Phylogenomics (EVE161) Class 10-11: Genome Sequencing
 
How Studying Astrophysics and Coral Reefs Enabled Me to Become an Empowered,...
How Studying Astrophysics and Coral Reefs Enabled Me to Become an Empowered,...How Studying Astrophysics and Coral Reefs Enabled Me to Become an Empowered,...
How Studying Astrophysics and Coral Reefs Enabled Me to Become an Empowered,...
 
Reframing Phylogenomics
Reframing PhylogenomicsReframing Phylogenomics
Reframing Phylogenomics
 
Quantifying The Dynamics of Your Superorganism Body Using Big Data Supercompu...
Quantifying The Dynamics of Your Superorganism Body Using Big Data Supercompu...Quantifying The Dynamics of Your Superorganism Body Using Big Data Supercompu...
Quantifying The Dynamics of Your Superorganism Body Using Big Data Supercompu...
 

Similaire à Big data nebraska

Job Talk Iowa State University Ag Bio Engineering
Job Talk Iowa State University Ag Bio EngineeringJob Talk Iowa State University Ag Bio Engineering
Job Talk Iowa State University Ag Bio EngineeringAdina Chuang Howe
 
ISB nov 2014
ISB nov 2014ISB nov 2014
ISB nov 2014mcdonadt
 
The Emerging Global Community of Microbial Metagenomics Researchers
The Emerging Global Community of Microbial Metagenomics ResearchersThe Emerging Global Community of Microbial Metagenomics Researchers
The Emerging Global Community of Microbial Metagenomics ResearchersLarry Smarr
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grcc.titus.brown
 
2015 Soil Science of America Meeting
2015 Soil Science of America Meeting2015 Soil Science of America Meeting
2015 Soil Science of America MeetingAdina Chuang Howe
 
Diversity Diversity Diversity Diversity ....
Diversity Diversity Diversity Diversity ....Diversity Diversity Diversity Diversity ....
Diversity Diversity Diversity Diversity ....Jonathan Eisen
 
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...GigaScience, BGI Hong Kong
 
Bioinformatics issues and challanges presentation at s p college
Bioinformatics  issues and challanges  presentation at s p collegeBioinformatics  issues and challanges  presentation at s p college
Bioinformatics issues and challanges presentation at s p collegeSKUASTKashmir
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global communityExternalEvents
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorialc.titus.brown
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talkc.titus.brown
 
iEvoBio Keynote: Frontiers of discovery with Encyclopedia of Life -- TRAITBANK
iEvoBio Keynote: Frontiers of discovery with Encyclopedia of Life -- TRAITBANK iEvoBio Keynote: Frontiers of discovery with Encyclopedia of Life -- TRAITBANK
iEvoBio Keynote: Frontiers of discovery with Encyclopedia of Life -- TRAITBANK Cyndy Parr
 
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data HandlingScott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data HandlingGigaScience, BGI Hong Kong
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsmikaelhuss
 
"The Quest for A field Guide to the Microbes" talk by Jonathan Eisen February...
"The Quest for A field Guide to the Microbes" talk by Jonathan Eisen February..."The Quest for A field Guide to the Microbes" talk by Jonathan Eisen February...
"The Quest for A field Guide to the Microbes" talk by Jonathan Eisen February...Jonathan Eisen
 

Similaire à Big data nebraska (20)

Big data nebraska
Big data nebraskaBig data nebraska
Big data nebraska
 
Job Talk Iowa State University Ag Bio Engineering
Job Talk Iowa State University Ag Bio EngineeringJob Talk Iowa State University Ag Bio Engineering
Job Talk Iowa State University Ag Bio Engineering
 
2014 nyu-bio-talk
2014 nyu-bio-talk2014 nyu-bio-talk
2014 nyu-bio-talk
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
2014 mmg-talk
2014 mmg-talk2014 mmg-talk
2014 mmg-talk
 
2014 sage-talk
2014 sage-talk2014 sage-talk
2014 sage-talk
 
ISB nov 2014
ISB nov 2014ISB nov 2014
ISB nov 2014
 
The Emerging Global Community of Microbial Metagenomics Researchers
The Emerging Global Community of Microbial Metagenomics ResearchersThe Emerging Global Community of Microbial Metagenomics Researchers
The Emerging Global Community of Microbial Metagenomics Researchers
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
 
2015 Soil Science of America Meeting
2015 Soil Science of America Meeting2015 Soil Science of America Meeting
2015 Soil Science of America Meeting
 
Diversity Diversity Diversity Diversity ....
Diversity Diversity Diversity Diversity ....Diversity Diversity Diversity Diversity ....
Diversity Diversity Diversity Diversity ....
 
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th...
 
Bioinformatics issues and challanges presentation at s p college
Bioinformatics  issues and challanges  presentation at s p collegeBioinformatics  issues and challanges  presentation at s p college
Bioinformatics issues and challanges presentation at s p college
 
Building bioinformatics resources for the global community
Building bioinformatics resources for the global communityBuilding bioinformatics resources for the global community
Building bioinformatics resources for the global community
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talk
 
iEvoBio Keynote: Frontiers of discovery with Encyclopedia of Life -- TRAITBANK
iEvoBio Keynote: Frontiers of discovery with Encyclopedia of Life -- TRAITBANK iEvoBio Keynote: Frontiers of discovery with Encyclopedia of Life -- TRAITBANK
iEvoBio Keynote: Frontiers of discovery with Encyclopedia of Life -- TRAITBANK
 
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data HandlingScott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
Scott Edmunds: GigaScience - Big-Data, Data Citation and Future Data Handling
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomics
 
"The Quest for A field Guide to the Microbes" talk by Jonathan Eisen February...
"The Quest for A field Guide to the Microbes" talk by Jonathan Eisen February..."The Quest for A field Guide to the Microbes" talk by Jonathan Eisen February...
"The Quest for A field Guide to the Microbes" talk by Jonathan Eisen February...
 

Plus de Adina Chuang Howe

Merrill Retreat 2018 - Nebraska City, Nebraska
Merrill Retreat 2018 - Nebraska City, NebraskaMerrill Retreat 2018 - Nebraska City, Nebraska
Merrill Retreat 2018 - Nebraska City, NebraskaAdina Chuang Howe
 
Adina's Faculty Introduction - ISU ABE
Adina's Faculty Introduction - ISU ABEAdina's Faculty Introduction - ISU ABE
Adina's Faculty Introduction - ISU ABEAdina Chuang Howe
 
ANL Soil Metagenomics 2014 Soil Reference Database - Let's do this
ANL Soil Metagenomics 2014 Soil Reference Database - Let's do thisANL Soil Metagenomics 2014 Soil Reference Database - Let's do this
ANL Soil Metagenomics 2014 Soil Reference Database - Let's do thisAdina Chuang Howe
 
Metagenomic data analysis discussion NEON Workshop
Metagenomic data analysis discussion NEON WorkshopMetagenomic data analysis discussion NEON Workshop
Metagenomic data analysis discussion NEON WorkshopAdina Chuang Howe
 
ASM 2013 Metagenomic Assembly Workshop Slides
ASM 2013 Metagenomic Assembly Workshop SlidesASM 2013 Metagenomic Assembly Workshop Slides
ASM 2013 Metagenomic Assembly Workshop SlidesAdina Chuang Howe
 
EPA 2013 Air Sensors Meeting Big Data Talk
EPA 2013 Air Sensors Meeting Big Data TalkEPA 2013 Air Sensors Meeting Big Data Talk
EPA 2013 Air Sensors Meeting Big Data TalkAdina Chuang Howe
 

Plus de Adina Chuang Howe (6)

Merrill Retreat 2018 - Nebraska City, Nebraska
Merrill Retreat 2018 - Nebraska City, NebraskaMerrill Retreat 2018 - Nebraska City, Nebraska
Merrill Retreat 2018 - Nebraska City, Nebraska
 
Adina's Faculty Introduction - ISU ABE
Adina's Faculty Introduction - ISU ABEAdina's Faculty Introduction - ISU ABE
Adina's Faculty Introduction - ISU ABE
 
ANL Soil Metagenomics 2014 Soil Reference Database - Let's do this
ANL Soil Metagenomics 2014 Soil Reference Database - Let's do thisANL Soil Metagenomics 2014 Soil Reference Database - Let's do this
ANL Soil Metagenomics 2014 Soil Reference Database - Let's do this
 
Metagenomic data analysis discussion NEON Workshop
Metagenomic data analysis discussion NEON WorkshopMetagenomic data analysis discussion NEON Workshop
Metagenomic data analysis discussion NEON Workshop
 
ASM 2013 Metagenomic Assembly Workshop Slides
ASM 2013 Metagenomic Assembly Workshop SlidesASM 2013 Metagenomic Assembly Workshop Slides
ASM 2013 Metagenomic Assembly Workshop Slides
 
EPA 2013 Air Sensors Meeting Big Data Talk
EPA 2013 Air Sensors Meeting Big Data TalkEPA 2013 Air Sensors Meeting Big Data Talk
EPA 2013 Air Sensors Meeting Big Data Talk
 

Dernier

biosynthesis of the cell wall and antibiotics
biosynthesis of the cell wall and antibioticsbiosynthesis of the cell wall and antibiotics
biosynthesis of the cell wall and antibioticsSafaFallah
 
Main Exam Applied biochemistry final year
Main Exam Applied biochemistry final yearMain Exam Applied biochemistry final year
Main Exam Applied biochemistry final yearmarwaahmad357
 
SCIENCE 6 QUARTER 3 REVIEWER(FRICTION, GRAVITY, ENERGY AND SPEED).pptx
SCIENCE 6 QUARTER 3 REVIEWER(FRICTION, GRAVITY, ENERGY AND SPEED).pptxSCIENCE 6 QUARTER 3 REVIEWER(FRICTION, GRAVITY, ENERGY AND SPEED).pptx
SCIENCE 6 QUARTER 3 REVIEWER(FRICTION, GRAVITY, ENERGY AND SPEED).pptxROVELYNEDELUNA3
 
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...marwaahmad357
 
Pests of ragi_Identification, Binomics_Dr.UPR
Pests of ragi_Identification, Binomics_Dr.UPRPests of ragi_Identification, Binomics_Dr.UPR
Pests of ragi_Identification, Binomics_Dr.UPRPirithiRaju
 
Identification of Superclusters and Their Properties in the Sloan Digital Sky...
Identification of Superclusters and Their Properties in the Sloan Digital Sky...Identification of Superclusters and Their Properties in the Sloan Digital Sky...
Identification of Superclusters and Their Properties in the Sloan Digital Sky...Sérgio Sacani
 
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky WayShiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky WaySérgio Sacani
 
CW marking grid Analytical BS - M Ahmad.docx
CW  marking grid Analytical BS - M Ahmad.docxCW  marking grid Analytical BS - M Ahmad.docx
CW marking grid Analytical BS - M Ahmad.docxmarwaahmad357
 
Role of herbs in hair care Amla and heena.pptx
Role of herbs in hair care  Amla and  heena.pptxRole of herbs in hair care  Amla and  heena.pptx
Role of herbs in hair care Amla and heena.pptxVaishnaviAware
 
IB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptxIB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptxUalikhanKalkhojayev1
 
Physics Serway Jewett 6th edition for Scientists and Engineers
Physics Serway Jewett 6th edition for Scientists and EngineersPhysics Serway Jewett 6th edition for Scientists and Engineers
Physics Serway Jewett 6th edition for Scientists and EngineersAndreaLucarelli
 
Substances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening TestSubstances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening TestAkashDTejwani
 
TORSION IN GASTROPODS- Anatomical event (Zoology)
TORSION IN GASTROPODS- Anatomical event (Zoology)TORSION IN GASTROPODS- Anatomical event (Zoology)
TORSION IN GASTROPODS- Anatomical event (Zoology)chatterjeesoumili50
 
Basic Concepts in Pharmacology in molecular .pptx
Basic Concepts in Pharmacology in molecular  .pptxBasic Concepts in Pharmacology in molecular  .pptx
Basic Concepts in Pharmacology in molecular .pptxVijayaKumarR28
 
Contracts with Interdependent Preferences (2)
Contracts with Interdependent Preferences (2)Contracts with Interdependent Preferences (2)
Contracts with Interdependent Preferences (2)GRAPE
 
geometric quantization on coadjoint orbits
geometric quantization on coadjoint orbitsgeometric quantization on coadjoint orbits
geometric quantization on coadjoint orbitsHassan Jolany
 
Q3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptx
Q3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptxQ3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptx
Q3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptxArdeniel
 

Dernier (20)

biosynthesis of the cell wall and antibiotics
biosynthesis of the cell wall and antibioticsbiosynthesis of the cell wall and antibiotics
biosynthesis of the cell wall and antibiotics
 
Main Exam Applied biochemistry final year
Main Exam Applied biochemistry final yearMain Exam Applied biochemistry final year
Main Exam Applied biochemistry final year
 
SCIENCE 6 QUARTER 3 REVIEWER(FRICTION, GRAVITY, ENERGY AND SPEED).pptx
SCIENCE 6 QUARTER 3 REVIEWER(FRICTION, GRAVITY, ENERGY AND SPEED).pptxSCIENCE 6 QUARTER 3 REVIEWER(FRICTION, GRAVITY, ENERGY AND SPEED).pptx
SCIENCE 6 QUARTER 3 REVIEWER(FRICTION, GRAVITY, ENERGY AND SPEED).pptx
 
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
001 Case Study - Submission Point_c1051231_attempt_2023-11-23-14-08-42_ABS CW...
 
Applying Cheminformatics to Develop a Structure Searchable Database of Analyt...
Applying Cheminformatics to Develop a Structure Searchable Database of Analyt...Applying Cheminformatics to Develop a Structure Searchable Database of Analyt...
Applying Cheminformatics to Develop a Structure Searchable Database of Analyt...
 
Pests of ragi_Identification, Binomics_Dr.UPR
Pests of ragi_Identification, Binomics_Dr.UPRPests of ragi_Identification, Binomics_Dr.UPR
Pests of ragi_Identification, Binomics_Dr.UPR
 
Identification of Superclusters and Their Properties in the Sloan Digital Sky...
Identification of Superclusters and Their Properties in the Sloan Digital Sky...Identification of Superclusters and Their Properties in the Sloan Digital Sky...
Identification of Superclusters and Their Properties in the Sloan Digital Sky...
 
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky WayShiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
 
CW marking grid Analytical BS - M Ahmad.docx
CW  marking grid Analytical BS - M Ahmad.docxCW  marking grid Analytical BS - M Ahmad.docx
CW marking grid Analytical BS - M Ahmad.docx
 
Role of herbs in hair care Amla and heena.pptx
Role of herbs in hair care  Amla and  heena.pptxRole of herbs in hair care  Amla and  heena.pptx
Role of herbs in hair care Amla and heena.pptx
 
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
Data delivery from the US-EPA Center for Computational Toxicology and Exposur...
 
IB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptxIB Biology New syllabus B3.2 Transport.pptx
IB Biology New syllabus B3.2 Transport.pptx
 
Physics Serway Jewett 6th edition for Scientists and Engineers
Physics Serway Jewett 6th edition for Scientists and EngineersPhysics Serway Jewett 6th edition for Scientists and Engineers
Physics Serway Jewett 6th edition for Scientists and Engineers
 
Substances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening TestSubstances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening Test
 
TORSION IN GASTROPODS- Anatomical event (Zoology)
TORSION IN GASTROPODS- Anatomical event (Zoology)TORSION IN GASTROPODS- Anatomical event (Zoology)
TORSION IN GASTROPODS- Anatomical event (Zoology)
 
Basic Concepts in Pharmacology in molecular .pptx
Basic Concepts in Pharmacology in molecular  .pptxBasic Concepts in Pharmacology in molecular  .pptx
Basic Concepts in Pharmacology in molecular .pptx
 
Contracts with Interdependent Preferences (2)
Contracts with Interdependent Preferences (2)Contracts with Interdependent Preferences (2)
Contracts with Interdependent Preferences (2)
 
geometric quantization on coadjoint orbits
geometric quantization on coadjoint orbitsgeometric quantization on coadjoint orbits
geometric quantization on coadjoint orbits
 
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
Cheminformatics tools and chemistry data underpinning mass spectrometry analy...
 
Q3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptx
Q3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptxQ3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptx
Q3W4part1-SSSSSSSSSSSSSSSSSSSSSSSSCI.pptx
 

Big data nebraska

  • 1. RIDING THE DATA TIDAL WAVE IN MICROBIOLOGY Adina Howe germslab.org (Genomics and Environmental Research in Microbial Systems) Argonne National Laboratory / Michigan State University Iowa State University, Ag & Biosystems Engr (January) Slides available at www.slideshare.com/adinachuanghowe Future of Big Data, Lincoln, NE 11/6/2014
  • 2. Microbes are critical Climate Change USGCRP 2009 Energy Supply www.alutiiq.com Human & Animal Health http://guardianlv.com/ Global Food Security An understanding of microbial ecology
  • 3. Understanding community dynamics  Who is there?  What are they doing?  How are they doing it?
  • 4. Understanding community dynamics  Who is there?  What are they doing?  How are they doing it? Kim Lewis, 2010
  • 5. Gene / Genome Sequencing  Collect samples  Extract DNA  Sequence DNA  “Analyze” DNA to identify its content and origin Taxonomy (e.g., pathogenic E. Coli) Function (e.g., degrades cellulose)
  • 6. Cost of Sequencing 100,000,000 1,000,000 100,000 10,000 1,000 100 10 1 Stein, Genome Biology, 2010 E. Coli genome 4,500,000 bp ($4.5M, 1992) 1990 1992 1994 1996 1998 2000 2003 2004 2006 2008 2010 2012 Year 0.1 DNA Sequencing, Mbp per $ 10,000,000
  • 7. Rapidly decreasing costs with NGS Sequencing 100,000,000 10,000,000 1,000,000 100,000 10,000 1,000 100 10 1 Stein, Genome Biology, 2010 Next Generation Sequencing 4,500,000 bp (E. Coli, $200, presently) 1990 1992 1994 1996 1998 2000 2003 2004 2006 2008 2010 2012 Year 0.1 DNA Sequencing, Mbp per $
  • 8. Effects of low cost sequencing… First free-living bacterium sequenced for billions of dollars and years of analysis Personal genome can be mapped in a few days and hundreds to few thousand dollars
  • 9. The experimental continuum Single Isolate Pure Culture Enrichment Mixed Cultures Natural systems
  • 10. The era of big data in biology NGS (Shotgun) Sequencing (doubling time 5 months) 100,000,000 100,000,000 10,000,000 1,000,000 1,000,000 100,000 100,000 10,000 10,000 1,000 1,000 100 100 10 10 1 1 Stein, Genome Biology, 2010 Computational Hardware (doubling time 14 months) Sanger Sequencing (doubling time 19 months) 1990 1992 1994 1996 1998 2000 2003 2004 2006 2008 2010 2012 Year 1,000,000 100,000 10,000 1,000 100 10 1 0 Disk Storage, Mb/$ 0.1 DNA Sequencing, Mbp per $ 10,000,000 0.1
  • 11. Postdoc experience with data 2003-2008 Cumulative sequencing in PhD = 2000 bp 2008-2009 Postdoc Year 1 = 50 Gbp 2009-2010 Postdoc Year 2 = 450 Gbp 2014 = 50 Tbp 2015 = 500 Tbp budgeted
  • 12. THE DIRT ON SOIL MAGNIFICENT BIODIVERSITY Biodiversity in the dark, Wall et al., Nature Geoscience, 2010 Jeremy Burgress
  • 13. THE DIRT ON SOIL SPATIAL HETEROGENEITY http://www.fao.org/ www.cnr.uidaho.edu
  • 14. THE DIRT ON SOIL DYNAMIC
  • 15. THE DIRT ON SOIL INTERACTIONS: BIOTIC, ABIOTIC, ABOVE, BELOW, SCALES Philippot, 2013, Nature Reviews Microbiology
  • 16. I. Technical side of microbial big data in biology II. Future of big data in soil microbial communities III. Bottlenecks for microbiologists
  • 17. Tackling Soil Biodiversity Source: Chuck Haney C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU) Janet Jansson, Susannah Tringe (JGI)
  • 18. Lesson #1: Accessing information in data http://siliconangle.com/files/2010/09/image_thumb69.png
  • 19. de novo assembly Raw sequencing data (“reads”) Computational algorithms Informative genes / genomes Compresses dataset size significantly Improved data quality (longer sequences, gene order) Reference not necessary (novelty)
  • 21. Shotgun sequencing and de novo assembly It was the Gest of times, it was the wor , it was the worst of timZs, it was the isdom, it was the age of foolisXness , it was the worVt of times, it was the mes, it was Ahe age of wisdom, it was th It was the best of times, it Gas the wor mes, it was the age of witdom, it was th isdom, it was tIe age of foolishness It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
  • 22. Practical Challenges – Intensive computing Months of “computer crunching” on a super computer Howe et al, 2014, PNAS
  • 23. Practical Challenges – Intensive computing Months of “computer crunching” on a super computer Howe et al, 2014, PNAS Assembly of 300 Gbp (70,000 genomes worth) can be done with any assembly program in less than 14 GB RAM and less than 24 hours. 50 Gbp = 10,000 genomes
  • 24. Natural community characteristics  Diverse  Many organisms (genomes)
  • 25. Natural community characteristics  Diverse  Many organisms (genomes)  Variable abundance  Most abundant organisms, sampled more often  Assembly requires a minimum amount of sampling  More sequencing, more errors Sample 1x
  • 26. Natural community characteristics  Diverse  Many organisms (genomes)  Variable abundance  Most abundant organisms, sampled more often  Assembly requires a minimum amount of sampling  More sequencing, more errors Sample 1x Sample 10x
  • 27. Natural community characteristics  Diverse  Many organisms (genomes)  Variable abundance  Most abundant organisms, sampled more often  Assembly requires a minimum amount of sampling  More sequencing, more errors Overkill Sample 1x Sample 10x
  • 28. Digital normalization Brown et al., 2012, arXiv Howe et al., 2014, PNAS Zhang et al., 2014, PLOS One
  • 29. Digital normalization Brown et al., 2012, arXiv Howe et al., 2014, PNAS Zhang et al., 2014, PLOS One
  • 30. Digital normalization Brown et al., 2012, arXiv Howe et al., 2014, PNAS Zhang et al., 2014, PLOS One
  • 31. Digital normalization Brown et al., 2012, arXiv Howe et al., 2014, PNAS Zhang et al., 2014, PLOS One
  • 32. Digital normalization Brown et al., 2012, arXiv Howe et al., 2014, PNAS Zhang et al., 2014, PLOS One
  • 33. Digital normalization Brown et al., 2012, arXiv Howe et al., 2014, PNAS Zhang et al., 2014, PLOS One  Scales datasets for assembly up to 95% - same assembly outputs.  Genomes, mRNA-seq, metagenomes (soils, gut, water)
  • 34. Tackling Soil Biodiversity Source: Chuck Haney C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU) Janet Jansson, Susannah Tringe (JGI)
  • 36. More like… Howe et. al, 2014, PNAS Source: Chuck Haney
  • 37. What we learned from deeply sequencing soil  Grand Challenge effort – 10% of soil biodiversity sampled  Incredible soil biodiversity (estimate required 10 Tbp/sample)  “To boldly go where no man has gone before”: >60% Unknown 400 300 200 100 0 amino acid metabolism carbohydrate metabolism membrane transport signal transduction translation folding, sorting and degradation metabolism of cofactors and vitamins energy metabolism transport and catabolism lipid metabolism transcription cell growth and death replication and repair xenobiotics biodegradation and metabolism nucleotide metabolism glycan biosynthesis and metabolism metabolism of terpenoids and polyketides cell motility Total Count KO corn and prairie corn only prairie only Howe et al, 2014, PNAS Managed agriculture soils exhibit less diversity, likely from its history of cultivation.
  • 38. If soil is so diverse, what are the most consistent signals we can see at the plot levels?
  • 39. Is there an identifiable “soil functional core” (carbon cycling focus)? Ames, Iowa, COBS Field Site, Fertilized Prairie Whole Soil Samples 4 deeply sampled whole soil metagenomes (16S rRNA 5000 reads, Shotgun 20-50 million Kirsten Hofmockel, Iowa State University
  • 40. How much genetic sequence do you think is shared in 4 replicates?  More than 1%? 10%? 50%?  What kind of genes do you expect in this core?  Minimal critical genes will be abundant & diverse
  • 41. How much genetic sequence do you think is shared in 4 replicates?  More than 1%? 10%? 50%?  What kind of genes do you expect in this core?  Minimal critical genes are varying abundances & diverse
  • 43. My vision (and future research) Microbial markers for ecosystem services Nutrient cycling Pathogens Antibiotic resistance Biodiversity
  • 44. Capabilities exist already… Indoor Microbiome Project Animation http://vimeo.com/90059732
  • 45. Is more data better? Bottlenecks for the emerging microbiologists
  • 46. Technical obstacles in the big data deluge  Access to the data  Access to the resources Democratization of both data and resource access “80% of awards and 50% of $$ are for grants < $350,000” (Ian Foster)  Data volume and velocity  Previous efforts are difficult to integrate  Innovation is necessary
  • 47. Data intensive microbiology Software Developers Computer Scientists Clinicians PIs Data generators Microbiologists Data Analyzers Statisticians Bioinformaticians http://ivory.idyll.org/blog/2014-the-emerging-field-of-data-intensive-biology.html
  • 50. Social obstacles – the main challenge Shift of costs do not mean shift of expectations http://www.deluxebattery.com/25-hilarious-expectation-vs-reality-photos/ Dear PI, It will take longer than the time it took you to do your experiment to analyze the data. Please do not write me for results within 24 hours of your sequences becoming available. - Adina
  • 51. Culture of sharing Metagenomic Datasets http://www.heathershumaker.com/
  • 52. Training / Incentives Emails between collaborators don’t contain as much “science” as I’d like:
  • 53. All analysis: accessible, reproducible, and automated
  • 54. All analysis: accessible, reproducible, and automated To reproduce analysis in a publication, 1. Rent Amazon EC2 computer 2. Clone github repository containing data and scripts 3. Open IPython notebook and execute To run same analysis on different dataset, 1. Replace data files with your own data, execute notebook. 2. Tweak scripts as needed.
  • 55. The journey in summary
  • 56. RIDING THE BIG DATA TIDAL WAVE OF MODERN MICROBIOLOGY Adina Howe Argonne National Laboratory / Michigan State University Iowa State University, Ag & Biosystems Engr (January) “ ”
  • 57. RIDING THE BIG DATA TIDAL WAVE OF MODERN MICROBIOLOGY Adina Howe Argonne National Laboratory / Michigan State University Iowa State University, Ag & Biosystems Engr (January)
  • 58. Acknowledgements  C. Titus Brown (MSU)  James Tiedje (MSU)  Daina Ringus (UC)  Folker Meyer (ANL)  Eugene Chang (UC)  NSF Biology Postdoc Fellowship  DOE Great Lakes Bioenergy Research Center

Notes de l'éditeur

  1. Thank organizers Big data new, journey with big data 3 facts, 1” = 1000 yr, Major nutrient for crop growth, 50 yr recovery, Phosphorus, sustainable ag can mitiaget 0.3 – 1 ton of C/ha per year, 10% of all car carbon emissions
  2. Microbes are responsible for biogeochemical cyling of nutrients, proviidng benefits for plants for crop growth (especially in non ideal conditions), and protection against pathogens and disease. global population will require 70-100% increase in agricultural yields by 2050 Soil is black box Little luck in the past figuring out some very simple questions….
  3. The questions we have in understanding microbes have not changed much…
  4. Historically, we have been asking these questions in model organisms. The challenge of model organisms…comparing them to what we know is in the environment…
  5. First automated DNA sequencing machines late 80s, New ay of asking questions.
  6. Sequencing opened up the door to start building a catalog of some observed key microbial players. Driven by health and biotech interests. This is the same set of reference genes that are still in use today. Expensive.
  7. What changed the field was the invention of next generation technlogies, bascially allowing the throughput of these automated sequencers to be much higher and the cost of sequencing much cheaper. So cheap in fact that instead of sequencing only one bacteria you could start sequencing multiple, even bacteria from complex environments.
  8. Highlighted in recent news
  9. Opportunities and changes in the systems we study. So then the question is not only who is there and what they are doing? But what are they doing together and how?
  10. The growth – point out NGS imapct Accompanied by challenges of computation…even to store data on.
  11. Data during my career really reflects this groth. During postdoc, first year, 50 million reads to about 40x that within literally 9 months. data increased 25x million times…. Notice the gap from 2010 – 2014, figuring stuff out.
  12. Why do we need such vol. of data? Most of us now recognize that microbial communities generally exhibit a high level of diversity, much highter than previously assume by what was revealed by classical microscopy and basic culturing techniques. In soil, even in one gram of soil, there is estimated to be more microbial species than there are stars in the galaxy. We have far to go for any comprehensive characterization of any single soil community. A key question then Is why is soil diversity so high?
  13. One reason may be that the soil structure provides unique niche that provide a high diversity of food resources. Its varied structure provides stable, protective, and even ancient environments for microorganims.
  14. Soil investigations are further complicated by the primarily dormant state of the large majority of the soil microbial population. The turnover rate of soil microbes is predicted to be over 30 fold and even up to 300 fold slower than that of microbes in the oceans. And these microbes live in relatively unpredicatlbe patterns of pertubations – for example rainfall or leaf litter introduction. They also undergo defined temporal perturbations – diurnal energy input.
  15. This complexity in the soil has formed a dynamic microbial ecosystem which interacts with nutrients, plants, and the soil structure itself at multiple scales. I would argue that we as a field are still trying to find tractable methods of accessing these interactions and understanding the drivers of “healthy” or “productive” soils.
  16. Given soils complexity
  17. Soil biodiversity is amazing. Great Prairie – world’s most fertile. Important reference site for the biological baseis and ecosystems of soil microbial communities. It sequesters most carbon, produces large amount of biomass anually, key for biofuels and security. Know little about the who / what in these soils. Excitement about what we could clean now with the technologies.
  18. With growing volumes of data, the most obvious way to tractably access this data is to “smartly” reduce this data.
  19. One genomic way to reduce data is a process known as assembly. Assembly has been around since the sequencing of single organisms.
  20. Metagenomics…a problem of scale
  21. Assembly is the process of trying to come up with a consensus sequence based on finding overlaps in small fragments. In this example, we are coming up with a solution of one sentence using 8 fragments. In metagenomic assembly, you are trying to come up with hundreds to thousands to even millions of genomes using billions of fragments. And to do this, you have to compare each fragment to every other one in the dataset, making it very computationally intensive.
  22. Even the smallest dataset that I had at the beginning of my postdoc required several months on a supercomputer, something having over 100 GB of RAM. These were resources I simply didn’t have at this time. And for my larger datasets, there was simply nothing I could do with them, they would essentially crash any available assembly program that existed. So I had to come up with a way to deal with all of this data or essentially, there were a handful of Pis that had just invested tens of thousands of dollars in a project where we couldn’t tractably handle the datasets.
  23. I’m going to tell you now a little bit about how we were able to do this and there actually two different strategies we had to combine. Point out how many genomes this is. 70000
  24. First start thinking about what tare the natural chracteristics of environemntal communities. Diverse.: There are multiple genomes, and even potentially millions of species, in a sample.
  25. Variable abundance in nature, some are highly abundant some are not.
  26. This diversity and distribution of abundances means that we are unevenly sampling strains in the environment. If we want to sample the rarerest species….we need
  27. A strategy we came up with was can we come up with a way to come up with the minimal dataset that you need for assembly, discarding these reads from this overkill section?
  28. Lossy compression From a sequencing standpoint then, what we see is that for a given genome (represented here as a dotted line), we start sampling fragments from it.
  29. As we sample more, we will have some sequences which will have errors in it.
  30. And we’ll keep sequencing this genome, randomly sampling different parts of it. We’ll get to a point, where we’ll have enough sequences where we can make a good guess at what the original sequence may have looked like.
  31. For assembly, you need a minimum amount of information. So anything beyond this 6 is excessive or redundant information.
  32. So we can discard or set aside this read and not use it for our assembly. And that actually turns out to be a good thing because in discarding this information, we’re actually removing data with errors in it.
  33. minimal dataset needed for an assembly of the dataset here in pink and a redundant set of information which we have set aside. In setting aside these reads here in the red, discard errors Improve assembly
  34. Soil biodiversity is amazing. Great Prairie – world’s most fertile. Important reference site for the biological baseis and ecosystems of soil microbial communities. It sequesters most carbon, produces large amount of biomass anually, key for biofuels and security. Know little about the who / what in these soils. Excitement about what we could clean now with the technologies.
  35. More than half, 50-80% sequences unknown in soil, gut microbiome
  36. Overall, many funcitons are shared between corn and prairies soils. Interestingly, prairie soils have much many more unique functions (indicated here as blue bars) compared to unique functions in the corn (here green). This result may reflect the varying management history of these two soils. Unlike the prairie soils, which have never been tilled, the corn soils have been cultivated for more than 100 y and have had annual additions of animal manure that potentially could enrich specific metabolic pathways with decreased diversity.
  37. I took multiple samples from the same soil plot, and I wanted to know if there was an identifiable soil functional or even taxonomic core among all samples. In collaboration with a team at ISU, we extracted DNA from soils from a fertilzed prairie plot in Iowa and performed whole genome shotgun random sequencing on the extracted DNA. I then took this DNA, I processed with the data analysis tools I’d developed, and I think looked for sequences that were present in ALL four replicates and there at a minimal pretty conservative abundance. Further, I focused this analysis specifically on sequences which were similar to known carbon cycling genes.
  38. Explain why this is important. For biological markers in the soil, we have very site specific microbes that are providing services.
  39. The challenge of this is determining what these markers are…big data and sequencing Determining what these are across multiple ecosystems (variety of soils and even ater and air)…. a transition perhaps to breadth of sequencing in multiple sample rather than depth. combine it with sensors and environmental quality measurements. Identify site specific, function, specific, strain specific markers for ecosystem services.
  40. Agricultural microbes – over the course of a growing season, pathogens in livestock agriculture, microbes transferred from livestock through manure applition to soil that then runoff to water supplies…
  41. Finally, as the last part of my perspective on big data, I wanted to talk about the theme of this workshop? Is more data better? For me, the answer is always yes. I always want more data. This is largely attributable to the fact that I have a lot of experience working with this data and the resources to play with it. But that is not always the case. So what are the challenges of big data to the microbiologist or biologist?
  42. Syntactic incompatibility The first 90% of bioinformatics: your IDs are different from my IDs. Semantic incompatibility The second 90% of bioinformatics: what does “gene” mean in your database? Unstructured data
  43. More challenging is the emerging role a microbiologist now has to fill and the changing teams we are now involved in. I’m asked to play all these roles in various projects I’m involved in. And definitely, I’m asked to communicate to people in all these roles and they are asked to communicate with me. This communication can be challenging.
  44. For example, if you asked us all to describe a tire swing building project, you’d undoubtably get many varied descriptions
  45. Communication and social obstacles are the most difficult,
  46. The need to share and participate in interdiscipinary research come along with a culture of needing to demonstrate individual impact
  47. Total reproducibility of all figures – one button Change the dataset, redo entire analysis on your own data
  48. Total reproducibility of all figures – one button Change the dataset, redo entire analysis on your own data
  49. Journey begins with TG 2008
  50. 6 years later…