A data intensive future:
How can biology take full advantage of
the coming data deluge?
C. Titus Brown
School of Veterinary Medicine;
Genome Center & Data Science Initiative
1/22/16
#titusplantz
Slides are on slideshare.net/c.titus.brown/
(My one plant collaboration)
Helping Shamoni Maheshwari (Comai Lab)
w/analysis of ChIP-seq data from Arabidopsis.
Outline
0. Background: what’s coming?
1. Research: what do we do with infinite data?
2. Development: software and infrastructure.
3. Open science & reproducibility.
4. Training
0. Background
What is going to be happening in the next 5
years with biological data generation?
DNA sequencing rates continue
to grow.
Stephens et al., 2015 - 10.1371/journal.pbio.1002195
(2015 was a good year)
Oxford Nanopore sequencing
Slide via Torsten Seeman
Nanopore technology
Slide via Torsten Seeman
Scaling up --
Slide via Torsten Seeman
http://ebola.nextflu.org/
“Fighting Ebola With a Palm-Sized DNA Sequencer”
See: http://www.theatlantic.com/science/archive/2015/09/ebola-sequencer-dna-minion/405466/
“DeepDOM” cruise: examination
of dissolved organic matter &
microbial metabolism vs physical
parameters – potential collab.
Via Elizabeth Kujawinski
Another challenge beyond volume and velocity – variety.
Data integration.
Figure 2. Summary of challenges associated with the data integration in the proposed project.
Figure via E. Kujawinski
CRISPR
The challenge with genome editing is fast
becoming what to edit rather than how to do it.
A point for reflection…
Increasingly, the best guide to the next 10 years
of biology is science fiction ...
1. Research
Working with ~infinite amounts of data, and
doing something effective with it.
Shotgun sequencing and coverage
“Coverage” is simply the average number of reads that overlap
each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top
through all of the reads.
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (30-300 Gbp for human)
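The coverage arithmetic above can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function names are hypothetical, the ~3 Gbp human genome size is an assumption implied by the 30-300 Gbp range, and the uncovered-fraction estimate assumes reads land uniformly at random (a Poisson approximation).

```python
import math

HUMAN_GENOME_BP = 3_000_000_000  # assumption: ~3 Gbp, consistent with 30-300 Gbp at 10-100x


def mean_coverage(num_reads, read_length, genome_size):
    """Average number of reads overlapping each base: total bases sequenced / genome size."""
    return num_reads * read_length / genome_size


def bases_needed(target_coverage, genome_size):
    """Total bases that must be sequenced to reach a target average coverage."""
    return target_coverage * genome_size


def fraction_uncovered(coverage):
    """Expected fraction of bases hit by no read, under a Poisson sampling assumption."""
    return math.exp(-coverage)


for cov in (10, 100):
    gbp = bases_needed(cov, HUMAN_GENOME_BP) / 1e9
    print(f"{cov}x coverage needs about {gbp:.0f} Gbp")

# Random sampling leaves gaps at low depth: at 1x roughly a third of bases are missed.
print(f"uncovered at 1x: {fraction_uncovered(1):.2f}")
```

Under these assumptions, average coverage well above 1x is needed before nearly every base is sampled at least once, which is the point of "random sampling => deep sampling needed".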
Digital normalization
Computational problems now scale with information
content rather than data set size.
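The idea behind digital normalization can be sketched as follows. This is a minimal illustration, assuming the median k-mer abundance formulation: a read is kept only while the median count of its k-mers, among reads kept so far, is below a coverage cutoff. It is not the khmer implementation. It uses a plain in-memory counter where khmer uses a memory-efficient probabilistic counting structure, it ignores reverse complements, and the function names and parameter values are chosen for illustration only.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer in a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def normalize_by_median(reads, k=20, cutoff=20):
    """Keep a read only if the median count of its k-mers so far is below the cutoff."""
    counts = Counter()  # exact counts; a stand-in for a probabilistic counter
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads contribute to the counts
            kept.append(read)
    return kept


# 100 identical reads collapse to `cutoff` copies; a novel read is still kept.
reads = ["ACGTTGCAAG"] * 100 + ["TTTTCCCCGG"]
kept = normalize_by_median(reads, k=4, cutoff=5)
print(len(kept))  # → 6
```

Because redundant reads are discarded once their region is already well covered, the retained data grows with the amount of novel sequence rather than with the raw number of reads.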
Most samples can be reconstructed via de
novo assembly on commodity computers.
A local collaboration:
The horse genome &
transcriptome
Tamer Mansour w/Bellone, Finno, Penedo, &
Murray labs.
Input data
Tissue      Library            Length  #samples  #frag (M)  #bp (Gb)
BrainStem   PE fr.firststrand  101     8         166.73     33.68
Cerebellum  PE fr.firststrand  100     24        411.48     82.3
Muscle      PE fr.firststrand  126     12        301.94     76.08
Retina      PE fr.unstranded   81      2         20.3       3.28
SpinalCord  PE fr.firststrand  101     16        403        81.4
Skin        PE fr.unstranded   81      2         18.54      3
            SE fr.unstranded   81      2         16.57      1.34
            SE fr.unstranded   95      3         105.51     10.02
Embryo ICM  PE fr.unstranded   100     3         126.32     25.26
            SE fr.unstranded   100     3         115.21     11.52
Embryo TE   PE fr.unstranded   100     3         129.84     25.96
            SE fr.unstranded   100     3         102.26     10.23
Total                                  81        1917.7     364.07
equCabs current status -
NCBI Annotation
Total no. of genes: 25565
Protein-coding genes: 19686
Feature         Acc  Annotation  GFF    RefSeq DB
Coding RNA      NM   BestRefSeq  764    1097
Coding RNA      XM   Gnomon      31578  31346
Non-coding RNA  NR   BestRefSeq  348    726
Non-coding RNA  XR   Gnomon      3311   3310
Total                            36001  36479
32342 coding transcripts encoded by 19686 genes
(average 1.6 transcripts per gene)
There are 3034 pseudogenes
(with no annotated transcripts)
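The transcript counts on this slide can be cross-checked with a little arithmetic. A minimal sketch, using only numbers from the annotation table; the variable names are illustrative, and treating the NM and XM rows of the GFF column as the coding transcripts is an assumption that happens to reproduce the 32342 figure.

```python
# GFF-column counts from the annotation table, keyed by accession prefix
gff_counts = {"NM": 764, "XM": 31578, "NR": 348, "XR": 3311}
protein_coding_genes = 19686

# Assumption: coding transcripts = NM (BestRefSeq) + XM (Gnomon)
coding_transcripts = gff_counts["NM"] + gff_counts["XM"]
transcripts_per_gene = coding_transcripts / protein_coding_genes

print(coding_transcripts)               # → 32342
print(round(transcripts_per_gene, 1))   # → 1.6
print(sum(gff_counts.values()))         # → 36001
```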
Status       Count
Reviewed     4
Validated    267
Provisional  540
Predicted    7
Inferred     279
Tamer Mansour
[Workflow diagram; box labels: Library prep; Read trimming; Mapping to ref; Merge rep.; Trans Ass.; Merge by Tiss.; Predict ORF; Variant Ana; Update dbvar; Haplotype ass; Pool/diginorm; Predict ncRNA; Filter & Compare Ass.; filter knowns; Compare to public ann.; Merge All Ass.; Mapping to ref; Trans Ass.]
Tamer Mansour
Digital normalization & (e.g.)
horse transcriptome
The computational demands for Cufflinks
- Read binning (processing time)
- Construction of gene models (processing time & memory utilization), driven by
the number of genes, the number of splice junctions, the number of reads per locus,
sequencing errors, and locus complexity such as gene overlap and multiple isoforms
Diginorm
- Significant reduction of binning time
- Relative increase of the resources
required for gene model construction
with merging more samples and tissues
- ? false recombinant isoforms
Tamer Mansour
Effect of digital normalization
** Should be very valuable for detection of ncRNA
Tamer Mansour
The ORF problem
Hestand et al 2014: “we identified 301,829 positions with SNPs or small indels within
these transcripts relative to EquCab2. Interestingly, 780 variants extend the open reading
frame of the transcript and appear to be small errors in the equine reference genome”
Tamer Mansour
We merged the assemblies into six tissue-specific transcription
profiles for cerebellum, brainstem, spinal cord, retina, muscle and
skin. The final merger of all assemblies overlaps with 63% and 73% of
NCBI and Ensembl loci, respectively, capturing about 72% and 81%
of their coding bases. Comparing our assembly to the most recent
transcriptome annotation shows ~85% overlapping loci. In addition, at
least 40% of our annotated loci represent novel transcripts.
Tamer Mansour
2. Software and infrastructure
Alas, practical data analysis depends on
software and computers, which leads to
depressingly practical considerations for
gentleperson scientists.
Software
It’s all well and good to develop new data
analysis approaches, but their utility is greater
when they are implemented in usable software.
Writing, maintaining, and progressing research
software is hard.
The khmer software package
• Demo implementation of research data structures &
algorithms;
• 10.5k lines of C++ code, 13.7k lines of Python code;
• khmer v2.0 has 87% statement coverage under test;
• ~3-4 developers, 50+ contributors, ~1000s of users (?)
• Developed as a “true” open source project.
The khmer software package, Crusoe et al., 2015. http://f1000research.com/articles/4-900/v1
Challenges:
Research vs stability!
Stable software for users, & platform for future
research;
vs research “culture”
(funding and careers)
Infrastructure issues
Suppose that we have a nice ecosystem of bioinformatics &
data analysis tools.
Where and how do we run them?
Consider:
1. Biologists hate funding computational infrastructure.
2. Researchers are generally incompetent at building and
maintaining usable infrastructure.
3. Centralized infrastructure fails in the face of infinite data.
Decentralized infrastructure for bioinformatics?
[Diagram; labels: Compute server (Galaxy? Arvados?); Web interface + API; Data/Info; Raw data sets; Public servers; "Walled garden" server; Private server; Graph query layer; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI)]
ivory.idyll.org/blog/2014-moore-ddd-award.html
3. Open science and
reproducibility
In my experience, most researchers* cannot
replicate their own computational analyses, much
less reproduce those published by anyone else.
*This doesn’t apply to anyone in this
audience; you’re all outliers!
IPython Notebook: data + code =>
IPython Notebook
To reproduce our papers:
git clone <khmer> && python setup.py install
git clone <pipeline>
cd pipeline
wget <data> && tar xzf <data>
make && cd ../notebook && make
cd ../ && make
This is standard process in the lab --
Our papers now have:
• Source hosted on github;
• Data hosted there or on AWS;
• Long running data analysis =>
‘make’
• Graphing and data digestion =>
IPython Notebook (also in
github)
Zhang et al. doi: 10.1371/journal.pone.0101271
Literate graphing & interactive
exploration
Camille Scott
Why do we do this?
• We work faster and more reliably.
• We can build on our own (and others’)
research.
• Robust computational research, released
early, gives us a competitive advantage.
4. Training
Methods and tools do little without a trained
hand wielding them, and a trained eye
examining the results.
Perspectives on training
• Prediction: The single biggest challenge
facing biology over the next 20 years is the
lack of data analysis training (see: NIH DIWG
report)
• Data analysis is not turning the crank; it is an
intellectual exercise on par with
experimental design or paper writing.
• Training is systematically undervalued in
academia (!?)
UC Davis and training
My goal here is to support the coalescence and
growth of a local community of practice around
“data intensive biology”.
Summer NGS workshop (2010-2017)
General parameters:
• Regular intensive workshops, half-day or longer.
• Aimed at research practitioners (grad students & more
senior); open to all (including outside community).
• Novice (“zero entry”) on up.
• Low cost for students.
• Leveraging global training initiatives:
Thus far & near future
~12 workshops on bioinformatics in 2015.
Trying out Q1 & Q2 2016:
• Half-day intro workshops (27 planned);
• Week-long advanced workshops;
• Co-working hours (“data therapy”).
dib-training.readthedocs.org/
The End.
• If you think 5-10 years out, we face significant
practical issues for data analysis in biology.
• We need new algorithms/data structures,
AND good implementations, AND better
computational practice, AND training.
(It’s a pretty good time to be doing biology.)
Thanks for listening!
Please contact me at ctbrown@ucdavis.edu!

User Guide: Orion™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 

Dernier (20)

STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptx
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensor
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentation
 
Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Four Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptFour Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.ppt
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 

2016 davis-plantbio

  • 1. A data intensive future: How can biology take full advantage of the coming data deluge? C. Titus Brown, School of Veterinary Medicine; Genome Center & Data Science Initiative. 1/22/16. #titusplantz. Slides are on slideshare.net/c.titus.brown/
  • 2. (My one plant collaboration) Helping Shamoni Maheshwari (Comai Lab) w/analysis of ChIP-seq data from Arabidopsis.
  • 3. Outline 0. Background: what’s coming? 1. Research: what do we do with infinite data? 2. Development: software and infrastructure. 3. Open science & reproducibility. 4. Training
  • 4. 0. Background What is going to be happening in the next 5 years with biological data generation?
  • 5. DNA sequencing rates continue to grow. Stephens et al., 2015 - 10.1371/journal.pbio.1002195
  • 6. (2015 was a good year)
  • 13. “Fighting Ebola With a Palm-Sized DNA Sequencer” See: http://www.theatlantic.com/science/archive/2015/09/ebola-sequencer-dna-minion/405466/
  • 14. “DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs physical parameters – potential collab. Via Elizabeth Kujawinski. Another challenge beyond volume and velocity – variety.
  • 15. Data integration. Figure 2. Summary of challenges associated with the data integration in the proposed project. Figure via E. Kujawinski
  • 16. CRISPR: The challenge with genome editing is fast becoming what to edit rather than how to do it.
  • 17. A point for reflection… Increasingly, the best guide to the next 10 years of biology is science fiction ...
  • 18. 1. Research Working with ~infinite amounts of data, and doing something effective with it.
  • 19. Shotgun sequencing and coverage. “Coverage” is simply the average number of reads that overlap each true base in the genome. Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
  • 20. Random sampling => deep sampling needed Typically 10-100x needed for robust recovery (30-300 Gbp for human)
  • 27. Computational problems now scale with information content rather than data set size. Most samples can be reconstructed via de novo assembly on commodity computers.
  • 28. A local collaboration: The horse genome & transcriptome Tamer Mansour w/Bellone, Finno, Penedo, & Murray labs.
  • 29. Input data
        Tissue       Library             Length   #samples   #frag (M)   #bp (Gb)
        BrainStem    PE fr.firststrand   101      8          166.73      33.68
        Cerebellum   PE fr.firststrand   100      24         411.48      82.3
        Muscle       PE fr.firststrand   126      12         301.94      76.08
        Retina       PE fr.unstranded    81       2          20.3        3.28
        SpinalCord   PE fr.firststrand   101      16         403         81.4
        Skin         PE fr.unstranded    81       2          18.54       3
                     SE fr.unstranded    81       2          16.57       1.34
                     SE fr.unstranded    95       3          105.51      10.02
        Embryo ICM   PE fr.unstranded    100      3          126.32      25.26
                     SE fr.unstranded    100      3          115.21      11.52
        Embryo TE    PE fr.unstranded    100      3          129.84      25.96
                     SE fr.unstranded    100      3          102.26      10.23
        Total                                     81         1917.7      364.07
  • 30. equCabs current status - NCBI Annotation
        Total no. of genes: 25565; protein-coding genes: 19686
        Feature          Acc   Annotation   GFF     RefSeq DB
        Coding RNA       NM    BestRefSeq   764     1097
        Coding RNA       XM    Gnomon       31578   31346
        Non-coding RNA   NR    BestRefSeq   348     726
        Non-coding RNA   XR    Gnomon       3311    3310
        Total                               36001   36479
        32342 coding transcripts encoded by 19686 genes (average 1.6 transcripts per gene). There are 3034 pseudogenes (with no annotated transcripts).
        Status        Count
        Reviewed      4
        Validated     267
        Provisional   540
        Predicted     7
        Inferred      279
        Tamer Mansour
  • 31. Pipeline diagram (box labels): Library prep; Read trimming; Mapping to ref; Merge rep.; Trans Ass.; Merge by Tiss.; Predict ORF; Variant Ana; Update dbvar; Haplotype ass; Pool/diginorm; Predict ncRNA; Filter & Compare Ass.; filter knowns; Compare to public ann.; Merge All Ass.; Mapping to ref; Trans Ass. Tamer Mansour
  • 32. Digital normalization & (e.g.) horse transcriptome
        The computational demands for Cufflinks:
        - Read binning (processing time)
        - Construction of gene models: number of genes, number of splicing junctions, number of reads per locus, sequencing errors, and complexity of the locus such as gene overlap and multiple isoforms (processing time & memory utilization)
        Diginorm:
        - Significant reduction of binning time
        - Relative increase of the resources required for gene model construction with merging more samples and tissues
        - ? false recombinant isoforms
        Tamer Mansour
  • 33. Effect of digital normalization ** Should be very valuable for detection of ncRNA Tamer Mansour
  • 34. The ORF problem Hestand et al 2014: “we identified 301,829 positions with SNPs or small indels within these transcripts relative to EquCab2. Interestingly, 780 variants extend the open reading frame of the transcript and appear to be small errors in the equine reference genome” Tamer Mansour
  • 35. We merged the assemblies into six tissue-specific transcription profiles for cerebellum, brainstem, spinal cord, retina, muscle and skin. The final merger of all assemblies overlaps with 63% and 73% of NCBI and Ensembl loci, respectively, capturing about 72% and 81% of their coding bases. Comparing our assembly to the most recent transcriptome annotation shows ~85% overlapping loci. In addition, at least 40% of our annotated loci represent novel transcripts. Tamer Mansour
  • 36. 2. Software and infrastructure Alas, practical data analysis depends on software and computers, which leads to depressingly practical considerations for gentleperson scientists.
  • 37. Software It’s all well and good to develop new data analysis approaches, but their utility is greater when they are implemented in usable software. Writing, maintaining, and progressing research software is hard.
  • 38. The khmer software package • Demo implementation of research data structures & algorithms; • 10.5k lines of C++ code, 13.7k lines of Python code; • khmer v2.0 has 87% statement coverage under test; • ~3-4 developers, 50+ contributors, ~1000s of users (?) • Developed as a “true” open source project. The khmer software package, Crusoe et al., 2015. http://f1000research.com/articles/4-900/v1
  • 39. Challenges: Research vs stability! Stable software for users, & platform for future research; vs research “culture” (funding and careers)
  • 40. Infrastructure issues Suppose that we have a nice ecosystem of bioinformatics & data analysis tools. Where and how do we run them? Consider: 1. Biologists hate funding computational infrastructure. 2. Researchers are generally incompetent at building and maintaining usable infrastructure. 3. Centralized infrastructure fails in the face of infinite data.
  • 41. Decentralized infrastructure for bioinformatics? Diagram labels: Compute server (Galaxy? Arvados?); Web interface + API; Data/Info; Raw data sets; Public servers; "Walled garden" server; Private server; Graph query layer; Upload/submit (NCBI, KBase); Import (MG-RAST, SRA, EBI). ivory.idyll.org/blog/2014-moore-ddd-award.html
  • 42. 3. Open science and reproducibility In my experience, most researchers* cannot replicate their own computational analyses, much less reproduce those published by anyone else. *This doesn’t apply to anyone in this audience; you’re all outliers!
  • 43. IPython Notebook: data + code => IPython Notebook
  • 44. To reproduce our papers:
        git clone <khmer> && python setup.py install
        git clone <pipeline>
        cd pipeline
        wget <data> && tar xzf <data>
        make && cd ../notebook && make
        cd ../ && make
  • 45. This is standard process in lab -- Our papers now have: • Source hosted on github; • Data hosted there or on AWS; • Long running data analysis => ‘make’ • Graphing and data digestion => IPython Notebook (also in github) Zhang et al. doi: 10.1371/journal.pone.0101271
  • 46. Literate graphing & interactive exploration Camille Scott
  • 47. Why do we do this? • We work faster and more reliably. • We can build on our own (and others’) research. • Robust computational research, released early, gives us a competitive advantage.
  • 48. 4. Training: Methods and tools do little without a trained hand wielding them, and a trained eye examining the results.
  • 49. Perspectives on training • Prediction: The single biggest challenge facing biology over the next 20 years is the lack of data analysis training (see: NIH DIWG report) • Data analysis is not turning the crank; it is an intellectual exercise on par with experimental design or paper writing. • Training is systematically undervalued in academia (!?)
  • 50. UC Davis and training My goal here is to support the coalescence and growth of a local community of practice around “data intensive biology”.
  • 51. Summer NGS workshop (2010-2017)
  • 52. General parameters: • Regular intensive workshops, half-day or longer. • Aimed at research practitioners (grad students & more senior); open to all (including outside community). • Novice (“zero entry”) on up. • Low cost for students. • Leveraging global training initiatives:
  • 53. Thus far & near future ~12 workshops on bioinformatics in 2015. Trying out Q1 & Q2 2016: • Half-day intro workshops (27 planned); • Week-long advanced workshops; • Co-working hours (“data therapy”). dib-training.readthedocs.org/
  • 54. The End. • If you think 5-10 years out, we face significant practical issues for data analysis in biology. • We need new algorithms/data structures, AND good implementations, AND better computational practice, AND training. (It’s a pretty good time to be doing biology.)
  • 55. Thanks for listening! Please contact me at ctbrown@ucdavis.edu!
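
The coverage arithmetic on slides 19 and 20 can be sketched as follows. This is a minimal illustration, not code from the talk: the function names are hypothetical, and the ~3 Gbp human genome size is an assumption inferred from the slide's "30-300 Gbp for human" at 10-100x.

```python
def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def bases_needed(target_coverage: float, genome_size: int) -> float:
    """Total bases to sequence to reach a target average coverage."""
    return target_coverage * genome_size


# Assumption: a human genome of roughly 3 Gbp.
HUMAN_GENOME_BP = 3_000_000_000

for target in (10, 100):
    gbp = bases_needed(target, HUMAN_GENOME_BP) / 1e9
    print(f"{target}x coverage needs about {gbp:.0f} Gbp")
```

Because shotgun reads are sampled at random, this average hides local variation, which is why the slide asks for deep sampling rather than coverage just above 1x.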

Editor's notes

  1. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  2. Cell lineages of mammalian pre-implantation development: trophectoderm (TE), epiblast (EPI) & primitive endoderm (PE). The first cell fate decision segregates the TE from the inner cell mass (ICM). Prior to implantation, ICM gives rise to PE, which is a monolayer separating the blastocoel from the cluster of pluripotent EPI cells. The EPI forms the future fetus, the TE develops into the fetal placenta, and the PE becomes the visceral and parietal endoderm of the yolk sacs. The ICM/EPI is the source of pluripotent embryonic stem cells (ESCs).
  3. Library prep: remove barcodes, check the data quality, adjust the encoding of quality scores, unify the sample names. Adaptor removal & error trimming: mild Trimmomatic. Mapping: TopHat, using the UCSC reference genome; reference-based using RefSeq annotation. Add @RG (to allow variant calling) & merge technical duplicates. Transcript assembly: per sample, Cufflinks: a) reference independent: the main assembly to discover new genes and/or new models of existing genes; b) RefSeq guided: to assess our ability to enhance this set of genes. Merging assemblies: Cuffmerge. Pool then diginorm => filter known by back-mapping to final total transcriptome => map, then reference (our trans) guided transcriptome.
  4. Analyze data in cloud; import and export important; connect to other databases.
  5. Lure them in with bioinformatics and then show them that Michigan, in the summertime, is quite nice!
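
Slide 32 and note 3 refer to digital normalization ("diginorm") without spelling out the procedure. As a rough sketch of the general idea: reads are streamed once, and a read is kept only if the median count of its k-mers seen so far is below a cutoff. Everything below is illustrative and makes several assumptions: the function names are hypothetical, an exact `Counter` stands in for the memory-efficient counting that the khmer package provides, reverse complements are ignored, and the `k` and `cutoff` values are arbitrary.

```python
from collections import Counter
from statistics import median


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only if its median k-mer count so far is below cutoff."""
    counts = Counter()  # exact counts; a real tool would use a compact sketch
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


# Toy input: one sequence repeated 50 times plus one unrelated read.
reads = ["ACGTACGTAC"] * 50 + ["TTTTGGGGCC"]
kept = digital_normalization(reads, k=4, cutoff=5)
print(len(reads), "reads in,", len(kept), "reads kept")
```

In this toy input, only three of the 50 identical reads are retained before their median k-mer count reaches the cutoff, while the unrelated read is kept. That is the sense in which redundant, high-coverage data is discarded and low-coverage data is preserved.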