A data intensive future: how
can biology best take
advantage of the coming data
deluge?
C. Titus Brown
ctbrown@ucdavis.edu
Associate Professor, UC Davis
Choose your own adventure:
Either you believe that all this “Big Data” stuff is nonsense
and/or overblown:
• Please help me out by identifying my misconceptions!
Or, you are interested in strategies and techniques for working
with lots of data, in which case:
• I hope to make some useful technical and social/cultural points.
The obligatory slide about
abundant sequencing data.
http://www.genome.gov/sequencingcosts/
Also see: https://biomickwatson.wordpress.com/2015/03/25/the-cost-of-sequencing-is-still-going-down/
Three general uses for
abundant sequencing data.
• Computational hypothesis falsification.
• Model comparison or evaluation of sufficiency.
• Hypothesis generation.
http://ivory.idyll.org/blog/2015-what-to-do-with-sequencing-data.html
My lab’s goals re “data
intensive biology”
• Build open tools and evaluate approaches for moving quickly from raw-ish data to hypotheses.
• Work with collaborators to identify emerging challenges that are preventing them from doing their science.
• Train peers in data analysis techniques.
Investigating soil microbial
communities
• 95% or more of soil microbes cannot be cultured in lab.
• Very little transport in soil and sediment => slow mixing rates.
• Estimates of immense diversity:
  - Billions of microbial cells per gram of soil.
  - Million+ microbial species per gram of soil (Gans et al, 2005)
  - One observed lower bound for genomic sequence complexity => 26 Gbp (Amazon Rain Forest Microbial Observatory)
N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS
http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html
“By 'soil' we understand (Vil'yams, 1931) a loose surface
layer of earth capable of yielding plant crops. In the
physical sense the soil represents a complex disperse
system consisting of three phases: solid, liquid, and
gaseous.”
Microbes live in & on:
• Surfaces of aggregate particles;
• Pores within microaggregates;
Questions to address
• Role of soil microbes in nutrient cycling:
  - How does agricultural soil differ from native soil?
  - How do soil microbial communities respond to climate perturbation?
• Genome-level questions:
  - What kind of strain-level heterogeneity is present in the population?
  - What are the phage and viral populations & dynamics?
  - What species are where, and how much is shared between different geographical locations?
Must use culture
independent approaches
• Many reasons why you can’t or don’t want to culture: cross-feeding, niche specificity, dormancy, etc.
• If you want to get at underlying function, 16S analysis alone is not sufficient.
Single-cell sequencing & shotgun metagenomics
are two common ways to investigate complex
microbial communities.
Shotgun metagenomics
• Collect samples;
• Extract DNA;
• Feed into sequencer;
• Computationally analyze.
Wikipedia: Environmental shotgun sequencing.png
“Sequence it all and let the
bioinformaticians sort it out”
Great Prairie Grand Challenge - SAMPLING LOCATIONS
2008
A “Grand Challenge” dataset
(DOE/JGI)
[Figure: bar chart of basepairs of sequencing (Gbp; axis 0-600) per sample, by platform (GAII, HiSeq). Samples: Iowa, continuous corn; Iowa, native prairie; Kansas, cultivated corn; Kansas, native prairie; Wisconsin, continuous corn; Wisconsin, native prairie; Wisconsin, restored prairie; Wisconsin, switchgrass.]
For comparison: Rumen (Hess et al., 2011), 268 Gbp; Rumen k-mer filtered, 111 Gbp; MetaHIT (Qin et al., 2011), 578 Gbp; NCBI nr database, 37 Gbp.
Total: 1,846 Gbp soil metagenome
Why do we need so much data?!
• 20-40x coverage is necessary; 100x is ~sufficient.
• Mixed population sampling => sensitivity driven by lowest abundance.
• For example, for E. coli at a 1/1000 dilution, you would need approximately 100x coverage of a 5 Mbp genome at 1/1000 abundance, or 500 Gbp of sequence! (For soil, the estimate is 50 Tbp.)
• Sequencing is straightforward; data analysis is not.
“$1000 genome with $1m analysis”
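The E. coli arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and its parameters are made up for illustration, and it assumes uniform random sampling, ignoring the coverage variation that real shotgun data shows.

```python
def required_sequencing_bp(genome_size_bp, target_coverage, one_in_n):
    """Total bases to sequence so that a genome present at 1/one_in_n
    of the sample reaches the target coverage (uniform sampling assumed)."""
    return genome_size_bp * target_coverage * one_in_n


# Numbers from the slide: 5 Mbp genome, 100x coverage, 1/1000 dilution.
total_bp = required_sequencing_bp(5_000_000, 100, 1000)
print(f"{total_bp / 1e9:.0f} Gbp")
```

The same function makes the "sensitivity driven by lowest abundance" point concrete: halving the abundance of the rarest genome of interest doubles the sequencing needed.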
Great Prairie Grand
Challenge - goals
• How much of the source metagenome can we reconstruct from ~300-600 Gbp+ of shotgun sequencing? (Largest data set ever sequenced, ~2010.)
• What can we learn about soil from looking at the reconstructed metagenome? (See list of questions)
(For complex ecological and evolutionary systems, we’re just
starting to get past the first question. More on that later.)
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions,
please email: journals.permissions@oup.com
De novo assembly scales with size of data, not
size of (meta)genome.
Why do assemblers scale
badly?
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
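A toy simulation can illustrate why errors dominate. The sketch below is an assumption-laden illustration, not an assembler: it uses a random 2 kbp "genome", substitution errors only, forward-strand reads, and the number of distinct k-mers as a stand-in for graph memory. All names and parameter values are invented for the example.

```python
import random

BASES = "ACGT"


def count_distinct_kmers(reads, k):
    """Number of distinct k-mers across all reads (proxy for graph size)."""
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            seen.add(read[i:i + k])
    return len(seen)


def simulate_reads(genome, n_reads, read_len, error_rate, rng):
    """Sample reads uniformly; substitute each base with probability error_rate."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in BASES if b != base])
        reads.append("".join(read))
    return reads


rng = random.Random(0)
genome = "".join(rng.choice(BASES) for _ in range(2000))
K = 21

clean = simulate_reads(genome, 2000, 100, 0.0, rng)    # ~100x, no errors
noisy = simulate_reads(genome, 2000, 100, 0.01, rng)   # ~100x, 1% errors

clean_kmers = count_distinct_kmers(clean, K)
noisy_kmers = count_distinct_kmers(noisy, K)
print(clean_kmers, noisy_kmers)
```

Without errors the distinct k-mer count is capped by the genome itself; with errors it keeps growing as more reads are added, which is the scaling behavior the slide describes.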
Our problem, in a nutshell:
We had so much data that we couldn’t
compute on it.
(This was, and is, a common problem in non-
model systems.)
Our solution: abundance
normalization (diginorm)
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
Random sampling => deep sampling
needed
Typically 10-100x needed for robust recovery (30-300 Gbp for human)
Actual coverage varies widely from the average.
Low coverage introduces unavoidable breaks.
But! Shotgun sequencing is very redundant!
Lots of the high coverage simply isn’t needed.
(unnecessary data)
Digital normalization
Contig assembly now scales with richness, not diversity.
Most samples can be assembled on commodity computers.
(information) (data)
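The slides do not spell out the algorithm, so here is a minimal sketch of the streaming idea as I understand it: keep a read only if the median abundance of its k-mers, counted over the reads kept so far, is still below a cutoff. The function names, the default `k` and `cutoff`, and the exact dictionary counter are assumptions for illustration; a production implementation would use a memory-efficient probabilistic counter and would also handle reverse complements, which this sketch ignores.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Single pass over reads; discard reads whose region is already well covered."""
    counts = Counter()   # k-mer abundances among the reads kept so far
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue     # read shorter than k: dropped in this sketch
        # Median k-mer abundance estimates the coverage of this read so far.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Tiny demo: ten copies of one read, two of another, cutoff of 3.
demo_reads = ["ACGT"] * 10 + ["TTTT"] * 2
demo_kept = digital_normalization(demo_reads, k=4, cutoff=3)
print(demo_kept)
```

In the demo the over-sampled read is capped at the cutoff while the rare read is kept in full, which is the "throw away redundant data, keep the information" behavior described above.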
Diginorm is widely useful:
1. Assembly of the H. contortus parasitic nematode
genome, a “high polymorphism/variable coverage”
problem.
(Schwarz et al., 2013; pmid 23985341)
2. Reference-free assembly of the lamprey (P. marinus)
transcriptome, a “big assembly” problem. (in prep)
3. Osedax symbiont metagenome, a “contaminated
metagenome” problem (Goffredi et al, 2013; pmid
24225886)
Changes the way analyses scale.
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
Question: does this approach
negatively affect results? (No.)
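[Figure: Venn diagram comparing four assemblies (Diginorm V/O, Raw V/O, Diginorm trinity, Raw trinity). Individual region counts are not recoverable from the extracted text; the largest is 13563.]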
Evaluation of Molgula occulta transcriptome assembly approaches.
Lowe et al., 2014, https://peerj.com/preprints/505/
Back to soil - what about the assembly results
for Iowa corn and prairie??
Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding
2.5 bill       | 4.5 mill                 | 19%               | 5.3 mill
3.5 bill       | 5.9 mill                 | 22%               | 6.8 mill
Adina Howe
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
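One reading that reproduces the "~1200 bacterial genomes" figure is to add the two assembly totals and divide by a typical bacterial genome size. The 5 Mbp genome size is an assumption (it matches the E. coli figure used earlier in the talk); the slides do not state how the number was derived.

```python
# Assembly totals from the table (2.5 and 3.5 billion bp).
ASSEMBLY_TOTALS_BP = [2_500_000_000, 3_500_000_000]

# Assumption: ~5 Mbp per bacterial genome (not stated on the slide).
ASSUMED_BACTERIAL_GENOME_BP = 5_000_000
HUMAN_GENOME_BP = 3_000_000_000  # "~3 billion bp" from the slide

total_assembled_bp = sum(ASSEMBLY_TOTALS_BP)
bacterial_equivalents = total_assembled_bp // ASSUMED_BACTERIAL_GENOME_BP
human_equivalents = total_assembled_bp / HUMAN_GENOME_BP

print(bacterial_equivalents, human_equivalents)
```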
Resulting contigs are low
coverage.
Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
So, for soil:
• We really do need quite a bit more data to comprehensively sample gene content of agricultural soil;
• But at least now we can assemble what we already have.
• Estimate required sequencing depth at 50 Tbp;
• Now also have 2-8 Tbp from Amazon Rain Forest Microbial Observatory.
• …still not saturated coverage, but getting closer.
Biogeography: Iowa sample
overlap?
Corn and prairie De Bruijn graphs have 51% overlap.
Corn Prairie
Suggests that at greater depth, samples may have similar genomic content.
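The slides do not say how the 51% overlap was computed. Since the nodes of a De Bruijn graph are k-mers, one plausible way to quantify overlap between two samples is the fraction of distinct k-mers they share. The sketch below uses that definition (shared k-mers over the union); the function names, the tiny `k`, and the toy sequences are all hypothetical.

```python
def kmer_set(sequences, k):
    """Distinct k-mers (De Bruijn graph nodes) across a collection of sequences."""
    nodes = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            nodes.add(seq[i:i + k])
    return nodes


def shared_fraction(sample_a, sample_b, k):
    """Shared k-mers divided by all k-mers seen in either sample."""
    nodes_a = kmer_set(sample_a, k)
    nodes_b = kmer_set(sample_b, k)
    union = nodes_a | nodes_b
    return len(nodes_a & nodes_b) / len(union) if union else 0.0


# Toy stand-ins for the corn and prairie read sets.
corn = ["ACGTAC"]
prairie = ["CGTACG"]
overlap = shared_fraction(corn, prairie, k=4)
print(overlap)
```

Because rare k-mers are the last to be sampled, a measure like this tends to rise with sequencing depth, which is consistent with the suggestion that deeper sampling may reveal more shared content.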
Blocking problem: we don’t know what
most genes do!
Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding
2.5 bill       | 4.5 mill                 | 19%               | 5.3 mill
3.5 bill       | 5.9 mill                 | 22%               | 6.8 mill
Howe et al, 2014; pmid 24632729
Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Reminder: the real challenge
is understanding
• We have gotten distracted by shiny toys: sequencing!! Data!!
• Data is now plentiful! But:
• We typically have no knowledge of what > 50% of an environmental metagenome “means”, functionally.
http://ivory.idyll.org/blog/2014-function-of-unknown-genes.html
Data integration as a next
challenge
In 5-10 years, we will have nigh-infinite data.
(Genomic, transcriptomic, proteomic,
metabolomic, …?)
How do we explore these data sets?
Registration, cross-validation, integration with
models…
Carbon cycling in the ocean -
“DeepDOM” cruise, Kujawinski & Longnecker et al.
Integrating many different data types
to build understanding.
Figure 2. Summary of challenges associated with the data integration in the proposed project.
“DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism
vs physical parameters – potential collab.
Data/analysis lifecycle
A few thoughts on next
steps.
• Enable scientists with better tools.
• Train a bioinformatics “middle class.”
• Accelerate science via the open science “network effect”.
That is… what now?
Once you have all this data, what do you do?
“Business as usual simply cannot work.”
- David Haussler, 2014
Looking at millions to billions of (human)
genomes in the next 5-10 years.
Enabling scientists with
better tools -
Build robust, flexible computational frameworks
for data exploration, and make them open and
remixable.
Develop theory, algorithms, & software together,
and train people in its use.
(Oh, and stop pretending that we can develop
“black boxes” that will give you the right answer.)
Education and training - towards a
bioinformatics “middle class”
Biology is underprepared for data-intensive investigation.
We must teach and train the next generations.
=> Build a cohort of “data intensive biologists” who can use
data and tools as an intrinsic and unremarkable part of their
research.
~10-20 workshops / year, novice -> masterclass; open
materials.
dib-training.rtfd.org/
Can open science trigger a
“network effect”?
http://prasoondiwakar.com/wordpress/trivia/the-network-effect
The open science “network
effect”
If we have open tools, and trained users,
then what remains to hold us back?
Access to data.
The data deluge is here – it’s
just somewhat hidden.
I actually think this graph should be much steeper.
Tackling data availability…
In 5-10 years, we will have nigh-infinite data.
(Genomic, transcriptomic, proteomic,
metabolomic, …?)
We currently have no good way of querying,
exploring, investigating, or mining these data
sets, especially across multiple locations.
Moreover, most data is unavailable until after
publication, and often it must then be “curated”
to become useful.
Pre-publication data sharing?
There is no obvious reason to make data available prior
to publication of its analysis.
There is no immediate reward for doing so.
Neither is there much systematized reward for doing
so.
(Citations and kudos feel good, but are cold comfort.)
Worse, there are good reasons not to do so.
If you make your data available, others can take
advantage of it…
This bears some similarity to
the Prisoners’ Dilemma:
Where “confession” is not
sharing your data.
Note: I’m not a game theorist
(but some of my best friends
are).
(Leighton Pritchard modification of
http://www.acting-man.com/?p=34313)
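The dilemma can be made concrete with a toy payoff table. The numbers below are the usual textbook-style Prisoners' Dilemma payoffs, chosen purely for illustration; they are not from the slide's figure and do not measure anything about real data sharing.

```python
# Hypothetical payoffs for (my_choice, their_choice); higher is better for me.
PAYOFF = {
    ("share", "share"): 3,        # both benefit from pooled data
    ("share", "withhold"): 0,     # I get scooped
    ("withhold", "share"): 5,     # I use their data and keep mine
    ("withhold", "withhold"): 1,  # status quo
}
CHOICES = ("share", "withhold")


def best_response(their_choice):
    """My payoff-maximizing choice given what the other lab does."""
    return max(CHOICES, key=lambda mine: PAYOFF[(mine, their_choice)])


def is_dominant(choice):
    """True if the choice is the best response whatever the other lab does."""
    return all(best_response(other) == choice for other in CHOICES)


print(is_dominant("withhold"), is_dominant("share"))
```

Under these assumed payoffs, withholding is the dominant strategy even though mutual sharing leaves both labs better off than mutual withholding, which is the tension the following slides try to address.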
So, how do we get academics to
share their data!?
Well, what are people doing now?
Two successful “systems” (send me more!!)
1. Oceanographic research
2. Biomedical research
1. Research cruises are
expensive!
In oceanography,
individual researchers cannot
afford to set up a cruise.
So, they form scientific consortia.
These consortia have data sharing
and preprint sharing agreements.
(I’m told it works pretty well (?))
2. Some data makes more sense
when you have more data
Omberg et al., Nature Genetics, 2013.
Sage Bionetworks et al.:
Organize a consortium to generate
data;
Standardize data generation;
Share via common platform;
Store results, provenance, analysis
descriptions, and source code;
Run a leaderboard for a subset of
analyses;
Win!
This “walled garden” model
is interesting!
“Compete” on analysis, not on data.
Some notes -
• Sage model requires ~similar data in common format;
• Common analysis platform then becomes immediately useful;
• Data is ~easily re-usable by participants;
• Publication of data becomes straightforward;
• Both models are centralized and coordinated. :(
So: can we drive data sharing via a decentralized
model, e.g. a distributed graph database?
[Diagram components: Compute server (Galaxy? Arvados?); Web interface + API; Data/Info; Raw data sets; Public servers; "Walled garden" server; Private server; Graph query layer; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI).]
ivory.idyll.org/blog/2014-moore-ddd-award.html
My larger research vision:
100% buzzword compliant™
Enable and incentivize sharing by providing
immediate utility; frictionless sharing.
Permissionless innovation for e.g. new data
mining approaches.
Plan for poverty with federated infrastructure
built on open & cloud.
Solve people’s current problems, while
remaining agile for the future.
ivory.idyll.org/blog/2014-moore-ddd-award.html
Thanks!
Please contact me at ctbrown@ucdavis.edu!
Soil collaborators: Tiedje (MSU), Jansson (PNNL), Tringe (JGI/DOE)

Contenu connexe

Tendances

ABIcurator.doc
ABIcurator.docABIcurator.doc
ABIcurator.docbutest
 
Gene Wiki and Wikimedia Foundation SPARQL workshop
Gene Wiki and Wikimedia Foundation SPARQL workshopGene Wiki and Wikimedia Foundation SPARQL workshop
Gene Wiki and Wikimedia Foundation SPARQL workshopBenjamin Good
 
Large scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biologyLarge scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biologyMaté Ongenaert
 
NRNB Annual Report 2011
NRNB Annual Report 2011NRNB Annual Report 2011
NRNB Annual Report 2011Alexander Pico
 
Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to BioinformaticsLeighton Pritchard
 
Big data in biology
Big data in biologyBig data in biology
Big data in biologyOmkar Reddy
 
Gene Wiki and Mark2Cure update for BD2K
Gene Wiki and Mark2Cure update for BD2KGene Wiki and Mark2Cure update for BD2K
Gene Wiki and Mark2Cure update for BD2KBenjamin Good
 
Technology R&D Theme 3: Multi-scale Network Representations
Technology R&D Theme 3: Multi-scale Network RepresentationsTechnology R&D Theme 3: Multi-scale Network Representations
Technology R&D Theme 3: Multi-scale Network RepresentationsAlexander Pico
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible researchYannick Wurm
 
Trends In Genomics
Trends In GenomicsTrends In Genomics
Trends In GenomicsSaul Kravitz
 
ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKpetermurrayrust
 
NCI systems epidemiology 03012019
NCI systems epidemiology 03012019NCI systems epidemiology 03012019
NCI systems epidemiology 03012019Chirag Patel
 
DNA Testing: Living Longer Via Personal Genomics
DNA Testing: Living Longer Via Personal GenomicsDNA Testing: Living Longer Via Personal Genomics
DNA Testing: Living Longer Via Personal GenomicsMelanie Swan
 
Personal Genomes: what can I do with my data?
Personal Genomes: what can I do with my data?Personal Genomes: what can I do with my data?
Personal Genomes: what can I do with my data?Melanie Swan
 
Bioc strucvariant seattle_11_09
Bioc strucvariant seattle_11_09Bioc strucvariant seattle_11_09
Bioc strucvariant seattle_11_09Sean Davis
 

Tendances (20)

Big Data Field Museum
Big Data Field MuseumBig Data Field Museum
Big Data Field Museum
 
ABIcurator.doc
ABIcurator.docABIcurator.doc
ABIcurator.doc
 
Gene Wiki and Wikimedia Foundation SPARQL workshop
Gene Wiki and Wikimedia Foundation SPARQL workshopGene Wiki and Wikimedia Foundation SPARQL workshop
Gene Wiki and Wikimedia Foundation SPARQL workshop
 
Biomarker-Vol9_reduced
Biomarker-Vol9_reducedBiomarker-Vol9_reduced
Biomarker-Vol9_reduced
 
Large scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biologyLarge scale machine learning challenges for systems biology
Large scale machine learning challenges for systems biology
 
NRNB Annual Report 2011
NRNB Annual Report 2011NRNB Annual Report 2011
NRNB Annual Report 2011
 
Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to Bioinformatics
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
Big data in biology
Big data in biologyBig data in biology
Big data in biology
 
Gene Wiki and Mark2Cure update for BD2K
Gene Wiki and Mark2Cure update for BD2KGene Wiki and Mark2Cure update for BD2K
Gene Wiki and Mark2Cure update for BD2K
 
Technology R&D Theme 3: Multi-scale Network Representations
Technology R&D Theme 3: Multi-scale Network RepresentationsTechnology R&D Theme 3: Multi-scale Network Representations
Technology R&D Theme 3: Multi-scale Network Representations
 
2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research2014 11-13-sbsm032-reproducible research
2014 11-13-sbsm032-reproducible research
 
Trends In Genomics
Trends In GenomicsTrends In Genomics
Trends In Genomics
 
ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UK
 
NCI systems epidemiology 03012019
NCI systems epidemiology 03012019NCI systems epidemiology 03012019
NCI systems epidemiology 03012019
 
DNA Testing: Living Longer Via Personal Genomics
DNA Testing: Living Longer Via Personal GenomicsDNA Testing: Living Longer Via Personal Genomics
DNA Testing: Living Longer Via Personal Genomics
 
Personal Genomes: what can I do with my data?
Personal Genomes: what can I do with my data?Personal Genomes: what can I do with my data?
Personal Genomes: what can I do with my data?
 
Bioc strucvariant seattle_11_09
Bioc strucvariant seattle_11_09Bioc strucvariant seattle_11_09
Bioc strucvariant seattle_11_09
 
Use of data
Use of dataUse of data
Use of data
 
Bio Informatics
Bio InformaticsBio Informatics
Bio Informatics
 

En vedette

2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithmsc.titus.brown
 
AMD Putting Server Virtualization to Work
AMD Putting Server Virtualization to WorkAMD Putting Server Virtualization to Work
AMD Putting Server Virtualization to WorkJames Price
 
Cross-Border Transactions from a U.S. Perspective
Cross-Border Transactions from a U.S. PerspectiveCross-Border Transactions from a U.S. Perspective
Cross-Border Transactions from a U.S. PerspectiveKegler Brown Hill + Ritter
 
ITP Instance Management Process V2
ITP Instance Management Process V2ITP Instance Management Process V2
ITP Instance Management Process V2Mahesh Vallampati
 
وظائف القيادة
وظائف القيادةوظائف القيادة
وظائف القيادةAhmad Darwish
 
BlackBerry Clinique-Short Review OS 7.1
BlackBerry Clinique-Short Review OS 7.1BlackBerry Clinique-Short Review OS 7.1
BlackBerry Clinique-Short Review OS 7.1Khomeini Mujahid
 
Ele 2009 Opening Pvu
Ele 2009 Opening PvuEle 2009 Opening Pvu
Ele 2009 Opening PvuPiet van Vugt
 
Catalyst Eye Tracking: Bing vs Google
Catalyst Eye Tracking: Bing vs GoogleCatalyst Eye Tracking: Bing vs Google
Catalyst Eye Tracking: Bing vs GoogleJennifer Hsieh
 
The Great Murder
The Great MurderThe Great Murder
The Great Murdergranilla
 
Dock It Customer Intro 14 Aug 09
Dock It Customer Intro 14 Aug 09Dock It Customer Intro 14 Aug 09
Dock It Customer Intro 14 Aug 09DockIT
 
Undangan (Kak Melly n Kak Dicky)
Undangan (Kak Melly n Kak Dicky)Undangan (Kak Melly n Kak Dicky)
Undangan (Kak Melly n Kak Dicky)@rtNya
 
PPP Project Development Fund Initiative-PbyR
PPP Project Development Fund Initiative-PbyRPPP Project Development Fund Initiative-PbyR
PPP Project Development Fund Initiative-PbyRAkwu OKOLO
 
Seniors u00 e9lu00e9ment_u00e9conomique_indispensable11-1
Seniors u00 e9lu00e9ment_u00e9conomique_indispensable11-1Seniors u00 e9lu00e9ment_u00e9conomique_indispensable11-1
Seniors u00 e9lu00e9ment_u00e9conomique_indispensable11-1André Thépin
 
Theoretical framework d1 2016 11-18
Theoretical framework d1 2016 11-18Theoretical framework d1 2016 11-18
Theoretical framework d1 2016 11-18Zafar Ahmad
 

En vedette (20)

City of San Antonio passes Social Host Ordinance December 15, 2016
City of San Antonio passes Social Host Ordinance December 15, 2016City of San Antonio passes Social Host Ordinance December 15, 2016
City of San Antonio passes Social Host Ordinance December 15, 2016
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
 
AMD Putting Server Virtualization to Work
AMD Putting Server Virtualization to WorkAMD Putting Server Virtualization to Work
AMD Putting Server Virtualization to Work
 
Theguesswho
TheguesswhoTheguesswho
Theguesswho
 
Cross-Border Transactions from a U.S. Perspective
Cross-Border Transactions from a U.S. PerspectiveCross-Border Transactions from a U.S. Perspective
Cross-Border Transactions from a U.S. Perspective
 
ITP Instance Management Process V2
ITP Instance Management Process V2ITP Instance Management Process V2
ITP Instance Management Process V2
 
Underage Drinking Parties in San Antonio 2016
Underage Drinking Parties in San Antonio 2016Underage Drinking Parties in San Antonio 2016
Underage Drinking Parties in San Antonio 2016
 
وظائف القيادة
وظائف القيادةوظائف القيادة
وظائف القيادة
 
Tips And Tricks For Photos
Tips And Tricks For PhotosTips And Tricks For Photos
Tips And Tricks For Photos
 
Portfolio
PortfolioPortfolio
Portfolio
 
BlackBerry Clinique-Short Review OS 7.1
BlackBerry Clinique-Short Review OS 7.1BlackBerry Clinique-Short Review OS 7.1
BlackBerry Clinique-Short Review OS 7.1
 
Ele 2009 Opening Pvu
Ele 2009 Opening PvuEle 2009 Opening Pvu
Ele 2009 Opening Pvu
 
Catalyst Eye Tracking: Bing vs Google
Catalyst Eye Tracking: Bing vs GoogleCatalyst Eye Tracking: Bing vs Google
Catalyst Eye Tracking: Bing vs Google
 
The Great Murder
The Great MurderThe Great Murder
The Great Murder
 
Dock It Customer Intro 14 Aug 09
Dock It Customer Intro 14 Aug 09Dock It Customer Intro 14 Aug 09
Dock It Customer Intro 14 Aug 09
 
Undangan (Kak Melly n Kak Dicky)
Undangan (Kak Melly n Kak Dicky)Undangan (Kak Melly n Kak Dicky)
Undangan (Kak Melly n Kak Dicky)
 
PPP Project Development Fund Initiative-PbyR
PPP Project Development Fund Initiative-PbyRPPP Project Development Fund Initiative-PbyR
PPP Project Development Fund Initiative-PbyR
 
RealTimeSchool
RealTimeSchoolRealTimeSchool
RealTimeSchool
 
Seniors u00 e9lu00e9ment_u00e9conomique_indispensable11-1
Seniors u00 e9lu00e9ment_u00e9conomique_indispensable11-1Seniors u00 e9lu00e9ment_u00e9conomique_indispensable11-1
Seniors u00 e9lu00e9ment_u00e9conomique_indispensable11-1
 
Theoretical framework d1 2016 11-18
Theoretical framework d1 2016 11-18Theoretical framework d1 2016 11-18
Theoretical framework d1 2016 11-18
 

Similaire à 2015 mcgill-talk

2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grcc.titus.brown
 
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...c.titus.brown
 
Iowa State Bioinformatics BCB Symposium 2018 - There and Back Again
Iowa State Bioinformatics BCB Symposium 2018 - There and Back AgainIowa State Bioinformatics BCB Symposium 2018 - There and Back Again
Iowa State Bioinformatics BCB Symposium 2018 - There and Back AgainAdina Chuang Howe
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsmikaelhuss
 
Rick Stevens: Prospects for a Systematic Exploration of Earths Microbial Dive...
Rick Stevens: Prospects for a Systematic Exploration of Earths Microbial Dive...Rick Stevens: Prospects for a Systematic Exploration of Earths Microbial Dive...
Rick Stevens: Prospects for a Systematic Exploration of Earths Microbial Dive...GigaScience, BGI Hong Kong
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynotec.titus.brown
 
Metagenomics and it’s applications
Metagenomics and it’s applicationsMetagenomics and it’s applications
Metagenomics and it’s applicationsSham Sadiq
 
metagenomicsanditsapplications-161222180924.pdf
metagenomicsanditsapplications-161222180924.pdfmetagenomicsanditsapplications-161222180924.pdf
metagenomicsanditsapplications-161222180924.pdfVisheshMishra20
 
Introduction to Gene Mining Part A: BLASTn-off!
Introduction to Gene Mining Part A: BLASTn-off!Introduction to Gene Mining Part A: BLASTn-off!
Introduction to Gene Mining Part A: BLASTn-off!adcobb
 
Job Talk Iowa State University Ag Bio Engineering
Job Talk Iowa State University Ag Bio EngineeringJob Talk Iowa State University Ag Bio Engineering
Job Talk Iowa State University Ag Bio EngineeringAdina Chuang Howe
 
2011Field talk at iEVOBIO 2011
2011Field talk at iEVOBIO 20112011Field talk at iEVOBIO 2011
2011Field talk at iEVOBIO 2011MIBBI Checklists
 

Similaire à 2015 mcgill-talk (20)

2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
 
2014 sage-talk
2014 sage-talk2014 sage-talk
2014 sage-talk
 
2013 alumni-webinar
2013 alumni-webinar2013 alumni-webinar
2013 alumni-webinar
 
Pathogen Genome Data
Pathogen Genome DataPathogen Genome Data
Pathogen Genome Data
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
2014 nyu-bio-talk
2014 nyu-bio-talk2014 nyu-bio-talk
2014 nyu-bio-talk
 
Big data nebraska
Big data nebraskaBig data nebraska
Big data nebraska
 
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
 
Iowa State Bioinformatics BCB Symposium 2018 - There and Back Again
Iowa State Bioinformatics BCB Symposium 2018 - There and Back AgainIowa State Bioinformatics BCB Symposium 2018 - There and Back Again
Iowa State Bioinformatics BCB Symposium 2018 - There and Back Again
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomics
 
Rick Stevens: Prospects for a Systematic Exploration of Earths Microbial Dive...
Rick Stevens: Prospects for a Systematic Exploration of Earths Microbial Dive...Rick Stevens: Prospects for a Systematic Exploration of Earths Microbial Dive...
Rick Stevens: Prospects for a Systematic Exploration of Earths Microbial Dive...
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynote
 
Metagenomics and it’s applications
Metagenomics and it’s applicationsMetagenomics and it’s applications
Metagenomics and it’s applications
 
metagenomicsanditsapplications-161222180924.pdf
metagenomicsanditsapplications-161222180924.pdfmetagenomicsanditsapplications-161222180924.pdf
metagenomicsanditsapplications-161222180924.pdf
 
2014 ucl
2014 ucl2014 ucl
2014 ucl
 
Introduction to Gene Mining Part A: BLASTn-off!
Introduction to Gene Mining Part A: BLASTn-off!Introduction to Gene Mining Part A: BLASTn-off!
Introduction to Gene Mining Part A: BLASTn-off!
 
Job Talk Iowa State University Ag Bio Engineering
Job Talk Iowa State University Ag Bio EngineeringJob Talk Iowa State University Ag Bio Engineering
Job Talk Iowa State University Ag Bio Engineering
 
2011Field talk at iEVOBIO 2011
2011Field talk at iEVOBIO 20112011Field talk at iEVOBIO 2011
2011Field talk at iEVOBIO 2011
 
2014 mmg-talk
2014 mmg-talk2014 mmg-talk
2014 mmg-talk
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 

Plus de c.titus.brown

Plus de c.titus.brown (20)

2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbug
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streaming
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
2014 abic-talk
2014 abic-talk2014 abic-talk
2014 abic-talk
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
 
2014 wcgalp
2014 wcgalp2014 wcgalp
2014 wcgalp
 
2014 moore-ddd
2014 moore-ddd2014 moore-ddd
2014 moore-ddd
 
2015 mcgill-talk

  • 1. A data intensive future: how can biology best take advantage of the coming data deluge? C. Titus Brown ctbrown@ucdavis.edu Associate Professor, UC Davis
  • 2. Choose your own adventure: Either you believe that all this “Big Data” stuff is nonsense and/or overblown:  Please help me out by identifying my misconceptions! Or, you are interested in strategies and techniques for working with lots of data, in which case:  I hope to make some useful technical and social/cultural points.
  • 3. The obligatory slide about abundant sequencing data. http://www.genome.gov/sequencingcosts/ Also see: https://biomickwatson.wordpress.com/2015/03/25/the-cost-of-sequencing-is-still-going-down/
  • 4. Three general uses for abundant sequencing data.  Computational hypothesis falsification.  Model comparison or evaluation of sufficiency.  Hypothesis generation. http://ivory.idyll.org/blog/2015-what-to-do-with-sequencing-data.html
  • 5. My lab’s goals re “data intensive biology”  Build open tools and evaluate approaches for moving quickly from raw-ish data to hypotheses.  Work with collaborators to identify emerging challenges that are preventing them from doing their science.  Train peers in data analysis techniques.
  • 6. Investigating soil microbial communities  95% or more of soil microbes cannot be cultured in lab.  Very little transport in soil and sediment => slow mixing rates.  Estimates of immense diversity:  Billions of microbial cells per gram of soil.  Million+ microbial species per gram of soil (Gans et al, 2005)  One observed lower bound for genomic sequence complexity => 26 Gbp (Amazon Rain Forest Microbial Observatory)
  • 7. N. A. Krasil'nikov, SOIL MICROORGANISMS AND HIGHER PLANTS http://www.soilandhealth.org/01aglibrary/010112krasil/010112krasil.ptII.html “By 'soil' we understand (Vil'yams, 1931) a loose surface layer of earth capable of yielding plant crops. In the physical sense the soil represents a complex disperse system consisting of three phases: solid, liquid, and gaseous.” Microbes live in & on: • Surfaces of aggregate particles; • Pores within microaggregates;
  • 8. Questions to address  Role of soil microbes in nutrient cycling:  How does agricultural soil differ from native soil?  How do soil microbial communities respond to climate perturbation?  Genome-level questions:  What kind of strain-level heterogeneity is present in the population?  What are the phage and viral populations & dynamics?  What species are where, and how much is shared between different geographical locations?
  • 9. Must use culture independent approaches  Many reasons why you can’t or don’t want to culture: cross-feeding, niche specificity, dormancy, etc.  If you want to get at underlying function, 16S analysis alone is not sufficient. Single-cell sequencing & shotgun metagenomics are two common ways to investigate complex microbial communities.
  • 10. Shotgun metagenomics  Collect samples;  Extract DNA;  Feed into sequencer;  Computationally analyze. Wikipedia: Environmental shotgun sequencing.png “Sequence it all and let the bioinformaticians sort it out”
  • 11. Great Prairie Grand Challenge - sampling locations, 2008
  • 12. A “Grand Challenge” dataset (DOE/JGI). [Bar chart: base pairs of sequencing (Gbp), by platform (GAII, HiSeq), for eight samples: Iowa continuous corn; Iowa native prairie; Kansas cultivated corn; Kansas native prairie; Wisconsin continuous corn; Wisconsin native prairie; Wisconsin restored prairie; Wisconsin switchgrass. Reference levels: NCBI nr database, 37 Gbp; rumen k-mer filtered, 111 Gbp; rumen (Hess et al., 2011), 268 Gbp; MetaHIT (Qin et al., 2011), 578 Gbp.] Total: 1,846 Gbp soil metagenome.
  • 13. Why do we need so much data?!  20-40x coverage is necessary; 100x is ~sufficient.  Mixed population sampling => sensitivity driven by lowest abundance.  For example, to get approximately 100x coverage of a 5 Mb E. coli genome present at 1/1000 dilution, you would need 500 Gbp of sequence! (For soil, estimate is 50 Tbp)  Sequencing is straightforward; data analysis is not. “$1000 genome with $1m analysis”
  • 14. Great Prairie Grand Challenge - goals  How much of the source metagenome can we reconstruct from ~300-600 Gbp+ of shotgun sequencing? (Largest data set ever sequenced, ~2010.)  What can we learn about soil from looking at the reconstructed metagenome? (See list of questions)
  • 15. Great Prairie Grand Challenge - goals  How much of the source metagenome can we reconstruct from ~300-600 Gbp+ of shotgun sequencing? (Largest data set ever sequenced, ~2010.)  What can we learn about soil from looking at the reconstructed metagenome? (See list of questions) (For complex ecological and evolutionary systems, we’re just starting to get past the first question. More on that later.)
  • 16. Conway T C , Bromage A J Bioinformatics 2011;27:479-486 © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com De novo assembly scales with size of data, not size of (meta)genome.
  • 17. Why do assemblers scale badly? Memory usage ~ “real” variation + number of errors; number of errors ~ size of data set.
  • 18. Our problem, in a nutshell: We had so much data that we couldn’t compute on it. (This was, and is, a common problem in non- model systems.)
  • 19. Our solution: abundance normalization (diginorm). (Figure: Conway T C, Bromage A J, Bioinformatics 2011;27:479-486.)
  • 20. Random sampling => deep sampling needed Typically 10-100x needed for robust recovery (30-300 Gbp for human)
  • 21. Actual coverage varies widely from the average. Low coverage introduces unavoidable breaks.
  • 22. But! Shotgun sequencing is very redundant! Lots of the high coverage simply isn’t needed. (unnecessary data)
  • 29. Contig assembly now scales with richness, not diversity. Most samples can be assembled on commodity computers. (information) (data)
  • 30. Diginorm is widely useful: 1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem. (Schwarz et al., 2013; pmid 23985341) 2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem. (in prep) 3. Osedax symbiont metagenome, a “contaminated metagenome” problem (Goffredi et al, 2013; pmid 24225886)
  • 31. Changes the way analyses scale. (Figure: Conway T C, Bromage A J, Bioinformatics 2011;27:479-486.)
  • 32. Question: does this approach negatively affect results? (No.) [Venn diagram of four assemblies: Diginorm V/O, Raw V/O, Diginorm trinity, Raw trinity.] Evaluation of Molgula occulta transcriptome assembly approaches. Lowe et al., 2014, https://peerj.com/preprints/505/
  • 33. Putting it in perspective: total equivalent of ~1200 bacterial genomes; human genome ~3 billion bp. Back to soil - what about the assembly results for Iowa corn and prairie?? (Adina Howe)
      Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding
      2.5 bill       | 4.5 mill                 | 19%               | 5.3 mill
      3.5 bill       | 5.9 mill                 | 22%               | 6.8 mill
  • 34. Resulting contigs are low coverage. Figure 11: Coverage (median basepair) distribution of assembled contigs from soil metagenomes.
  • 35. So, for soil:  We really do need quite a bit more data to comprehensively sample gene content of agricultural soil;  But at least now we can assemble what we already have.  Estimate required sequencing depth at 50 Tbp;  Now also have 2-8 Tbp from Amazon Rain Forest Microbial Observatory.  …still not saturated coverage, but getting closer.
  • 36. Biogeography: Iowa sample overlap? Corn and prairie De Bruijn graphs have 51% overlap. [Venn diagram: Corn, Prairie.] Suggests that at greater depth, samples may have similar genomic content.
  • 37. Putting it in perspective: total equivalent of ~1200 bacterial genomes; human genome ~3 billion bp. Blocking problem: we don’t know what most genes do! (Howe et al, 2014; pmid 24632729)
      Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding
      2.5 bill       | 4.5 mill                 | 19%               | 5.3 mill
      3.5 bill       | 5.9 mill                 | 22%               | 6.8 mill
  • 38. Reminder: the real challenge is understanding  We have gotten distracted by shiny toys: sequencing!! Data!!  Data is now plentiful! But:  We typically have no knowledge of what > 50% of an environmental metagenome “means”, functionally. http://ivory.idyll.org/blog/2014-function-of-unknown-genes.html
  • 39. Data integration as a next challenge In 5-10 years, we will have nigh-infinite data. (Genomic, transcriptomic, proteomic, metabolomic, …?) How do we explore these data sets? Registration, cross-validation, integration with models…
  • 40. Carbon cycling in the ocean - “DeepDOM” cruise, Kujawinski & Longnecker et al.
  • 41. Integrating many different data types to build understanding. Figure 2. Summary of challenges associated with the data integration in the proposed project. “DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs physical parameters – potential collab.
  • 43. A few thoughts on next steps.  Enable scientists with better tools.  Train a bioinformatics “middle class.”  Accelerate science via the open science “network effect”.
  • 44. That is… what now? Once you have all this data, what do you do? “Business as usual simply cannot work.” - David Haussler, 2014. Looking at millions to billions of (human) genomes in the next 5-10 years.
  • 45. Enabling scientists with better tools - Build robust, flexible computational frameworks for data exploration, and make them open and remixable. Develop theory, algorithms, & software together, and train people in their use. (Oh, and stop pretending that we can develop “black boxes” that will give you the right answer.)
  • 46. Education and training - towards a bioinformatics “middle class” Biology is underprepared for data-intensive investigation. We must teach and train the next generations. => Build a cohort of “data intensive biologists” who can use data and tools as an intrinsic and unremarkable part of their research. ~10-20 workshops / year, novice -> masterclass; open materials. dib-training.rtfd.org/
  • 47. Can open science trigger a “network effect”? http://prasoondiwakar.com/wordpress/trivia/the-network-effect
  • 48. The open science “network effect” If we have open tools, and trained users, then what remains to hold us back? Access to data.
  • 49. The data deluge is here – it’s just somewhat hidden. I actually think this graph should be much steeper.
  • 50. Tackling data availability… In 5-10 years, we will have nigh-infinite data. (Genomic, transcriptomic, proteomic, metabolomic, …?) We currently have no good way of querying, exploring, investigating, or mining these data sets, especially across multiple locations. Moreover, most data is unavailable until after publication, and often it must then be “curated” to become useful.
  • 51. Pre-publication data sharing? There is no obvious reason to make data available prior to publication of its analysis. There is no immediate reward for doing so. Neither is there much systematized reward for doing so. (Citations and kudos feel good, but are cold comfort.) Worse, there are good reasons not to do so. If you make your data available, others can take advantage of it…
  • 52. This bears some similarity to the Prisoners’ Dilemma: Where “confession” is not sharing your data. Note: I’m not a game theorist (but some of my best friends are). (Leighton Pritchard modification of http://www.acting-man.com/?p=34313)
  • 53. So, how do we get academics to share their data!? Well, what are people doing now? Two successful “systems” (send me more!!) 1. Oceanographic research 2. Biomedical research
  • 54. 1. Research cruises are expensive! In oceanography, individual researchers cannot afford to set up a cruise. So, they form scientific consortia. These consortia have data sharing and preprint sharing agreements. (I’m told it works pretty well (?))
  • 55. 2. Some data makes more sense when you have more data Omberg et al., Nature Genetics, 2013. Sage Bionetworks et al.: Organize a consortium to generate data; Standardize data generation; Share via common platform; Store results, provenance, analysis descriptions, and source code; Run a leaderboard for a subset of analyses; Win!
  • 56. This “walled garden” model is interesting! “Compete” on analysis, not on data.
  • 57. Some notes -  Sage model requires ~similar data in common format;  Common analysis platform then becomes immediately useful;  Data is ~easily re-usable by participants;  Publication of data becomes straightforward;  Both models are centralized and coordinated. :(
  • 58. So: can we drive data sharing via a decentralized model, e.g. a distributed graph database? Compute server (Galaxy? Arvados?) Web interface + API Data/ Info Raw data sets Public servers "Walled garden" server Private server Graph query layer Upload/submit (NCBI, KBase) Import (MG-RAST, SRA, EBI) ivory.idyll.org/blog/2014-moore-ddd-award.html
  • 59. My larger research vision: 100% buzzword compliant™. Enable and incentivize sharing by providing immediate utility; frictionless sharing. Permissionless innovation for e.g. new data mining approaches. Plan for poverty with federated infrastructure built on open & cloud. Solve people’s current problems, while remaining agile for the future. ivory.idyll.org/blog/2014-moore-ddd-award.html
  • 60. Thanks! Please contact me at ctbrown@ucdavis.edu! Soil collaborators: Tiedje (MSU), Jansson (PNNL), Tringe (JGI/DOE)
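
The depth estimates on slides 13, 20 and 21 can be checked with a few lines of arithmetic. The sketch below is not from the talk: the function names are hypothetical, and it assumes idealized shotgun sampling, in which reads are drawn uniformly at random and each community member is sampled in proportion to its abundance, so that per-base depth is roughly Poisson distributed. Real libraries are biased, so the numbers are best read as optimistic.

```python
import math


def required_bp(genome_size_bp, target_coverage, dilution_factor=1):
    """Total bases to sequence so one genome reaches target_coverage.

    dilution_factor is the inverse of the genome's relative abundance
    (1000 for a 1/1000 dilution). Assumes proportional sampling.
    """
    return genome_size_bp * target_coverage * dilution_factor


def prob_depth_below(mean_coverage, min_depth):
    """P(depth < min_depth) at one base, assuming Poisson(mean_coverage)."""
    return sum(
        math.exp(-mean_coverage) * mean_coverage ** i / math.factorial(i)
        for i in range(min_depth)
    )


# Slide 13: 100x coverage of a 5 Mb genome at 1/1000 dilution.
print(f"{required_bp(5_000_000, 100, 1000) / 1e9:.0f} Gbp")  # 500 Gbp

# Slide 20: 10-100x of a ~3 Gbp human genome.
print(f"{required_bp(3_000_000_000, 10) / 1e9:.0f} to "
      f"{required_bp(3_000_000_000, 100) / 1e9:.0f} Gbp")

# Slide 21: even at a healthy average, some bases fall below a usable depth.
for mean in (1, 5, 10, 30):
    print(f"mean {mean:>2}x: P(uncovered) = {prob_depth_below(mean, 1):.2e}, "
          f"P(depth < 5) = {prob_depth_below(mean, 5):.2e}")
```

Under these assumptions the uncovered fraction falls off as e^(-coverage), which is one way to see why low-abundance community members, not the average, set the sequencing budget.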

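Slides 19 to 29 describe digital normalization only at a high level. As a minimal sketch, assuming the commonly described formulation (keep a read only if the median count of its k-mers, among the reads kept so far, is still below a cutoff), the idea looks roughly like this. The function names, the value of k and the cutoff are illustrative choices rather than the talk's settings, and an exact in-memory counter stands in for the fixed-memory approximate counting a real implementation would need at this scale.

```python
import random
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Single pass over reads; discard reads whose region is already well covered."""
    counts = Counter()  # k-mer counts over the reads kept so far
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        # The median k-mer count is a rough estimate of this read's coverage.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Toy data: ~200x coverage of a random 500 bp "genome", error-free reads.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(500))
reads = []
for _ in range(2000):
    start = rng.randrange(len(genome) - 50 + 1)
    reads.append(genome[start:start + 50])

kept = digital_normalization(reads, k=15, cutoff=10)
print(f"kept {len(kept)} of {len(reads)} reads")
```

Because the decision for each read depends only on the counts accumulated so far, the procedure streams through the data once, and its memory use tracks the number of distinct k-mers retained rather than the total number of reads.
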
Editor's notes

  1. Fly-over country (that I live in)
  2. A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, the number of edges in the graph that come from the underlying genome will plateau once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, the number of edges they contribute scales linearly with the number of sequence reads. Once enough sequence reads are present to have enough coverage to clearly distinguish true edges (which come from the underlying genome), they will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
  3. High coverage is essential.
  4. High coverage is essential.
  5. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  6. Passionate about training; necessary for advancement of field; also deeply self-interested because I find out what the real problems are. (“Some people can do assembly” is not “everyone can do assembly”)
  7. Analyze data in cloud; import and export important; connect to other databases.
  8. Work with other Moore DDD folk on the data mining aspect. Start with cross validation, move to more sophisticated in-server implementations.
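
The scaling argument in note 2 (and on slide 17) can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the analysis behind the figure: the genome is random, errors are independent substitutions at a fixed per-base rate, distinct k-mers stand in for graph edges, and every name and parameter here is hypothetical.

```python
import random


def kmer_set(seq, k):
    """Set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def simulate(num_reads, genome, read_len=50, k=15, error_rate=0.01, seed=1):
    """Return (true, spurious) distinct k-mer counts seen in simulated reads."""
    rng = random.Random(seed)
    true_kmers = kmer_set(genome, k)
    seen = set()
    for _ in range(num_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Substitution error: replace with a different base.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        seen |= kmer_set("".join(read), k)
    return len(seen & true_kmers), len(seen - true_kmers)


genome_rng = random.Random(0)
genome = "".join(genome_rng.choice("ACGT") for _ in range(5000))

print("reads  true_kmers  spurious_kmers")
for n in (500, 1000, 2000, 4000):
    real, spurious = simulate(n, genome)
    print(f"{n:>5}  {real:>10}  {spurious:>14}")
```

The true k-mer count is bounded by the genome's own k-mer set, while the spurious count keeps climbing as reads are added, which is the pattern the note describes.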