Denis C. Bauer | Bioinformatics | @allPowerde
08 July 2014
CSIRO COMPUTATIONAL INFORMATICS
Population-scale high-throughput sequencing
data analysis
ByMelody
Talk Overview
• Background: CSIRO/Omics Project
• Methods: NGS Data Processing on HPC/Cloud
• Research Outcome: Cancer and Microbes in Colorectal Cancer
Denis Bauer | @allPowerde
62% of our people hold
university degrees
2000 doctorates
500 masters
With our university
partners, we develop
650 postgraduate
research students
Top 1% of global research
institutions in 14 of 22 research
fields
Top 0.1% in 4 research fields
CSIRO: Who we are
[Map of CSIRO sites: Darwin, Alice Springs, Geraldton (2 sites), Atherton, Townsville (2 sites), Rockhampton, Toowoomba, Gatton, Myall Vale, Narrabri, Mopra, Parkes, Griffith, Belmont, Geelong, Hobart, Sandy Bay, Wodonga, Newcastle, Armidale (2 sites), Perth (3 sites), Adelaide (2 sites), Sydney (5 sites), Canberra (7 sites), Murchison, Cairns, Irymple, Melbourne (5 sites), Werribee (2 sites), Brisbane (6 sites), Bribie Island]
People: 6500
Divisions: 13
Locations: 58
Flagships: 11
Budget: $1B+
The Commonwealth Scientific and Industrial Research Organisation
Our business units
12 Research Divisions
11 National Research Flagships
+ National Research Facilities and Collections
+ Transformational Capability Platforms
Sectors: Food, Health & Life Science Industries; Environment; Manufacturing, Materials & Minerals; Energy; Information & Communications
Our track record: top inventions
1. FAST WLAN (Wireless Local Area Network)
2. POLYMER BANKNOTES
3. RELENZA FLU TREATMENT
4. EXTENDED WEAR CONTACTS
5. AEROGARD
6. TOTAL WELLBEING DIET
7. RAFT POLYMERISATION
8. BARLEYMAX
9. SELF TWISTING YARN
10. SOFTLY WASHING LIQUID
Part 1: The ‘omics project
The goal of the project is to investigate
susceptibility to colorectal cancer in the
context of obesity and the gut
microbiome
Data from Pilot Study
Full Cohort: 500 (178 to date) individuals from colorectal resection at the John Hunter Hospital, Newcastle Private
Hospital and Royal Newcastle Centre (surgeons Dr Brian Draganic, Dr Peter Pockney & Dr Steve Smith)
organized by Dr Desma Grice and Prof Rodney Scott (University of Newcastle)
• Objective: capture genomic variants reliably in tumour, normal
and adipose tissue.
• Sequence effort:
• 12 tumour -> 6 lanes (2-plex)
• 12 normal -> 3 lanes (4-plex)
• 12 adipose -> 3 lanes (4-plex)
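The lane counts in the list above follow from dividing the number of samples by the multiplexing level. A minimal sketch of that arithmetic, assuming every lane is filled to its plex level where possible; the helper name `lanes_needed` is hypothetical and not part of the talk:

```shell
#!/usr/bin/env bash
# lanes_needed SAMPLES PLEX: lanes required when PLEX barcoded samples share a lane.
lanes_needed() {
  local samples=$1 plex=$2
  # Integer ceiling division, so a partly filled lane still counts as a lane.
  echo $(( (samples + plex - 1) / plex ))
}

echo "tumour:  $(lanes_needed 12 2) lanes"
echo "normal:  $(lanes_needed 12 4) lanes"
echo "adipose: $(lanes_needed 12 4) lanes"
```

The lower plex level for tumour halves the number of samples per lane, which is where the additional depth for low-cellularity samples comes from.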
Considerations before sequencing: Undersampling
More depth is needed due to potentially low cellularity in the tumour sample.
[Figure: sequencing depth of the tumour sample compared with the normal sample, with the additional depth marked]
• Objective: process samples avoiding confounding factors
Considerations before sequencing: Flowcell design
[Figure: two flowcell layouts for lean (L1, L2) and obese (O1, O2) samples of normal, adipose and tumour tissue. In one layout each sample is sequenced on one lane; in the other the samples are multiplexed (4-plex) and sequenced over 3 lanes.]
Subject every sample to the same lane and flowcell effects by multiplexing (labelling every sample with an identifying barcode).
• Population-scale sequencing with more samples than Illumina barcodes: an imbalanced
flowcell design will split samples and pair the halves with different partners (e.g.
Lean Subject 1.1 + Obese Subject 1.1; Lean Subject 1.2 + Obese Subject 3.2)
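The pairing in this example can be generated rather than written by hand. The sketch below is one way to do it, under the assumption that part 1 pairs lean subject i with obese subject i and part 2 shifts the obese partner by half the group size; that rule reproduces the four-subject example on the slide but is not stated in the talk, and `lane_partner` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# lane_partner SUBJECT PART N: index of the obese partner for lean SUBJECT.
# Part 1 keeps the same index; part 2 shifts the partner by N/2 (assumed rule).
lane_partner() {
  local subject=$1 part=$2 n=$3
  local offset=0
  if [ "$part" -eq 2 ]; then
    offset=$(( n / 2 ))
  fi
  echo $(( (subject - 1 + offset) % n + 1 ))
}

n=4      # subjects per group in the slide's example
lane=1
for part in 1 2; do
  for subject in $(seq 1 "$n"); do
    partner=$(lane_partner "$subject" "$part" "$n")
    echo "Lane${lane}: L${subject}.${part} + O${partner}.${part}"
    lane=$(( lane + 1 ))
  done
done
```

Because each half of a sample meets a different partner on a different lane, lane effects are not tied to one particular lean/obese pair.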
Considerations for Omics Proj.: Flowcell design
[Figure: flowcell layout with samples split into two parts. Lanes 1-4 hold part 1: L1.1 + O1.1, L2.1 + O2.1, L3.1 + O3.1, L4.1 + O4.1. Lanes 5-8 hold part 2: L1.2 + O3.2, L2.2 + O4.2, L3.2 + O1.2, L4.2 + O2.2. Normal and adipose are 4-plex, tumour is 2-plex; 12 lanes in total.]
L = Lean, O = Obese; L1.1 = lean individual 1, part 1 (of 2)
Auer PL, Doerge RW. Statistical design and analysis of RNA sequencing data. Genetics. 2010 PMID: 20439781
Blue Monster says
Design your experiment with project-specific pitfalls in mind
Auer PL et al. Statistical design and analysis of RNA sequencing data.
Genetics. 2010 PMID: 20439781
Part 2: NGS Data Processing
Minimize project set-up overhead
while providing easily adaptable processing modules
for NGS analysis on high-performance compute clusters/cloud
Resource consumption for Variant Calling
Script submission to the scheduler: qsub -t 1-36 task.qsub
Resource request in the script: #PBS -l nodes=2:ppn=8
36 samples (2.7 TB of data) on average require
128 hours CPU time (ste = 15)
77 GB RAM (ste = 0.34)
[Figure: resource consumption per task (average task, mapping, recalibration, transcripts, annotation, variant) for DNAseq and RNAseq, showing CPU (hours), real time (hours) and memory (GB)]
High-Performance-Compute
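An array submission such as `qsub -t 1-36 task.qsub` implies that each task works out which sample it owns. A minimal sketch of what such a `task.qsub` could look like, assuming a Torque-style scheduler that exports the array index as `PBS_ARRAYID`; the sample names and the `pick_sample` helper are hypothetical and not taken from the talk:

```shell
#!/usr/bin/env bash
#PBS -l nodes=2:ppn=8

# pick_sample INDEX NAME...: print the INDEX-th name (1-based).
pick_sample() {
  local index=$1
  shift
  local samples=("$@")
  echo "${samples[$((index - 1))]}"
}

# Hypothetical sample list; a real run would list all 36 inputs.
SAMPLES=(sample01.fastq sample02.fastq sample03.fastq)

# Torque sets PBS_ARRAYID for array jobs; default to 1 when run by hand.
TASK_ID="${PBS_ARRAYID:-1}"
echo "task ${TASK_ID} processes $(pick_sample "$TASK_ID" "${SAMPLES[@]}")"
```

Other schedulers name the index variable differently, so the variable lookup is the line most likely to need changing.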
doi:10.1038/nbt.2421
Tailored processing for different sequencing applications
Wet-lab Protocols Production Informatics
Variant
Calling
Methylation
Sites
Gene
Expression
Despite different approaches
we want to use the same
processing framework!
Wish list for a framework
reusability
cutting edge
data security
HPC environment
reproducibility
robustness
adaptability
knowledge transfer (publication)
efficient
Assess
experimental
success
quickly
DEMO - files
Project X
  fastq
    Exp1
      Run1_read1.fastq
      Run2_read1.fastq
    Exp2
      Run3_read1.fastq
We can start from raw fastq files: here
3 files (Run1-3) in 2 different
conditions (Exp1-2)
DEMO – setting up config file
#********************
# Data
#********************
declare -a DIR; DIR=( Exp1 Exp2 )
#********************
# Tasks
#********************
RUNMAPPINGBOWTIE2="1" # mapping with bowtie2
#********************
# Paths
#********************
# reference genome
FASTA=/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa
We specify the folders NGSANE
should run on and what to do (here:
bowtie2 mapping). We can also
specify project-specific settings (here:
use iGenomes)
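Because the config file is plain bash, its settings can be read by sourcing it. The sketch below shows how the `DIR` array and the task switch might be turned into a list of work items; it illustrates the idea only and is not NGSANE's own code, and `list_tasks` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Settings copied from the demo config; in practice they would be loaded with
#   source config.txt
declare -a DIR; DIR=( Exp1 Exp2 )
RUNMAPPINGBOWTIE2="1"

# list_tasks: print one "<task> <folder>" line per enabled task and data folder.
list_tasks() {
  local dir
  for dir in "${DIR[@]}"; do
    if [ -n "${RUNMAPPINGBOWTIE2:-}" ]; then
      echo "bowtie2 ${dir}"
    fi
  done
}

list_tasks
```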
DEMO – dry run
bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt
[NGSANE] Trigger mode: [empty] (dry run)
[NOTE] Folders: Exp1 Exp2
[Task] bowtie2
[NOTE] setup enviroment
[TODO] Exp1/Run1_read1.fastq
[TODO] Exp1/Run2_read1.fastq
[TODO] Exp2/Run3_read1.fastq
[NOTE] proceeding with job scheduling...
[NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
[NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
[NOTE] make Exp2/bowtie2/Run3.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp2/Run3_read1.fastq -o /NGSANEDEMO/Exp2/bowtie2 --rgsi Exp2
We run NGSANE in dry-run mode to check
which jobs it would submit
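The dry-run idea is easy to reproduce in any wrapper script: print each command unless the caller explicitly arms the run. A minimal sketch of that pattern, not NGSANE's implementation; `submit_job` and the `[DRY RUN]` prefix are hypothetical:

```shell
#!/usr/bin/env bash
# submit_job MODE COMMAND...: run COMMAND only when MODE is "armed",
# otherwise print what would have been run.
submit_job() {
  local mode=$1
  shift
  if [ "$mode" = "armed" ]; then
    "$@"
  else
    echo "[DRY RUN] $*"
  fi
}

submit_job ""    echo "mapping Exp1/Run1_read1.fastq"
submit_job armed echo "mapping Exp1/Run1_read1.fastq"
```

Making the dry run the default, as the trigger script does with its empty mode, means a forgotten argument never launches jobs by accident.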
DEMO – submit
bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt armed
[NGSANE] Trigger mode: armed
Double check! Then type safetyoff and hit enter to launch the job: safetyoff
... take cover!
[NOTE] Folders: Exp1 Exp2
[Task] bowtie2
[NOTE] setup environment
[TODO] Exp1/Run1_read1.fastq
[TODO] Exp1/Run2_read1.fastq
[TODO] Exp2/Run3_read1.fastq
[NOTE] proceeding with job scheduling...
[NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
Jobnumber 2424899
[NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
Jobnumber 2424900
[NOTE] make Exp2/bowtie2/Run3.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp2/Run3_read1.fastq -o /NGSANEDEMO/Exp2/bowtie2 --rgsi Exp2
Jobnumber 2424901
We submit HPC jobs. Check out the
returned qsub identifiers.
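The returned identifiers can be collected from the trigger output for later use, for example to query the scheduler. A small sketch, assuming the output has been captured as text and that each identifier follows the word `Jobnumber` as in the log above; `job_ids` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# job_ids: read trigger output on stdin and print the number after each "Jobnumber".
job_ids() {
  awk '$1 == "Jobnumber" { print $2 }'
}

# Example input taken from the demo log.
printf '%s\n' \
  '[NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy' \
  'Jobnumber 2424899' \
  '[NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy' \
  'Jobnumber 2424900' | job_ids
```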
DEMO – scheduler
bau04c@burnet-login:/NGSANEDEMO> qstat -u bau04c
burnet-srv.idpx.hpsc.csiro.au:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
2424899.burnet-s bau04c normal NGs_bowtie2_RunM 9085 1 2 -- 00:05 R 00:00
2424900.burnet-s bau04c normal NGs_bowtie2_RunM 9178 1 2 -- 00:05 R 00:00
2424901.burnet-s bau04c normal NGs_bowtie2_RunM 9353 1 2 -- 00:05 R 00:00
Three HPC jobs run in parallel because there
were three fastq files. There is no limit on the
number of files processed in parallel, so scaling
up to populations is easy.
DEMO – report
bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt html
[NGSANE] Trigger mode: html
>>>>> Generate HTML report
>>>>> startdate Fri Jan 24 08:02:37 EST 2014
>>>>> hostname burnet-login
>>>>> makeSummary.sh -k /NGSANEDEMO/config.txt
--R --R version 3.0.0 (2013-04-03) -- "Masked Marvel”
--Python--Python 2.7.2
QC - bowtie2
>>>>> Generate HTML report - FINISHED
>>>>> enddate Fri Jan 24 08:02:39 EST 2014
More report examples
Now create the HTML overview page
to check whether the jobs finished successfully and
what the results are (bowtie2:
mapping statistics)
DEMO - files
Project X
  Summary HTML
  Exp1
    Bowtie
      Run1.bam
      Run2.bam
  Exp2
    Bowtie
      Run3.bam
  fastq
    Exp1
      Run1_read1.fastq
      Run2_read1.fastq
    Exp2
      Run3_read1.fastq
The resulting file structure: every
experiment has a folder with the tasks
as subfolders and in them the results
(here: bam files)
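The layout described here maps each input file to `<experiment>/<task>/<run>.bam`. A sketch of that mapping, assuming inputs are named `<run>_read1.fastq` as in the demo; the `result_path` helper is hypothetical, and the file names NGSANE actually writes may carry extra suffixes:

```shell
#!/usr/bin/env bash
# result_path FASTQ TASK: derive the result location for one input file.
result_path() {
  local fastq=$1 task=$2
  local experiment run
  experiment=$(basename "$(dirname "$fastq")")   # e.g. Exp1
  run=$(basename "$fastq" .fastq)                # e.g. Run1_read1
  run=${run%_read1}                              # e.g. Run1
  echo "${experiment}/${task}/${run}.bam"
}

result_path fastq/Exp1/Run1_read1.fastq bowtie2
result_path fastq/Exp2/Run3_read1.fastq bowtie2
```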
NGSANE currently supports
• Transfer data (smbclient)
• Quality control (GATK, FastQC, RNA-SeQC, custom summaries, user code)
• Trimming (Cutadapt, Trimgalore, Trimmomatic)
• Mapping (BWA, Bowtie1, Bowtie2, Tophat)
• Transcript quantification (cufflinks, htseq, bedtools)
• Variant calling (GATK, samtools)
• Variant annotation (annovar)
• 3D genome structure (Hicup, fit-hi-c, Hiclib, Homer)
For details see https://github.com/BauerLab/ngsane/wiki/How-to-use-the-virtual-machine
Blue Monster says
Analyze your data to be reproducible
and well documented with tools that
scale well to larger datasets
Buske FA et al. NGSANE: a lightweight production
informatics framework for high-throughput data analysis.
Bioinformatics. 2014 PMID: 24470576
Part 3: Combining Omics Data
Seeing the full picture requires taking all
information into account
Result overview: traditional differential analysis
1. 722 genes differentially expressed (DE) between tumour and normal
• QC: We have good concordance with genes known to be up/down regulated in CRC
2. 841 differentially methylated (DM) genomic regions, mostly hypermethylated
• QC: good concordance with previously reported gut methylation profile
[Figure: scatter plots of tumour FPKM against normal FPKM on logarithmic axes, highlighting known DE genes and known DM locations. Sources: Fernandez et al. Genome Res. 2012; CSIRO in-house]
Microbial Population: traditional population survey
Paul Greenfield
Data integration
(image credit: Francis Tabary)
DNA methylation: Blood signatures in Adipose and Gut samples
Tim Peters
Some gut/adipose samples have blood-like signatures.
Exonseq: blood-signatures stem from a blood-plasma protein
[Figure: per-gene scatter plots of expression (count/total) against contamination (%) for individuals 2, 4, 7, 12, 14, 19, 20, 40, 50, 57, 59 and 62, marked by status (lean or obese). Panels: ADM2, COL6A3, FNIP1, HAAO, HBA1, HBA2, HBB, HGB1, HGB2, MALT1, each annotated with a correlation coefficient.]
Contamination by ADM2, a gene expressed in blood plasma
Plasma protein ADM2 makes up most of
the human material in the digesta (number
of reads mapping to human genome)
Medical History: Blood potentially resulting from medication
CARTIA: 14, 50, 57
WARFARIN: 40
ASPIRIN: 59, 7
COPLAVIX: 12
No anti-clotting drug: 2, 62, 4
No medication: 19, 20
Wilcoxon rank sum test p-value = 0.02
Anti-thrombosis drugs
significantly enriched in
individuals with human
material in digesta.
Microbial data: Blood “liking” opportunistic bacteria are enriched
in contaminated samples
E. coli, Salmonella, etc.
Opportunistic pathogens that respond to inflammation
and bleeding.
Bacterial marker for low-level
chronic gut bleeding?
Blue Monster says
Integrating different ‘omics data is
still a challenge.
Three things to remember
• Good experimental design is necessary (even) in sequencing experiments
• Reproducible, documented data analysis is key (e.g. NGSANE, a lightweight, flexible tool for large-scale sequence data analysis on high-performance systems and Amazon’s elastic cloud)
• Promising research opportunities are in the integration of multiple high-throughput data sources
COMPUTATIONAL INFORMATICS
Thank you
Computational Informatics
Denis C. Bauer
t +61 2 9123 4567
e Denis.Bauer@csiro.au
w www.csiro.au/bioinformatics
Buske et al.,
Bioinformatics,
Jan 2014
More talks online: http://www.slideshare.net/allPowerde
Twitter: @allPowerde
Fabian A. Buske
Susan Clark
Hugh French
Martin Smith
Garvan Institute of Medical
Research, Sydney, Australia
Robert Dunne
Tim Peters
Paul Greenfield
Piotr Szul
Tomasz Bednarz
Computational Informatics,
CSIRO, Australia
Garry Hannan
Animal, Food and Health Sciences,
CSIRO, Australia
Rodney Scott
University of Newcastle, Australia
Funding:
National Health and Medical
Research Council;
National Breast Cancer
Foundation;
CSIRO's Transformational
Capability Platform;
CSIRO’s IM&T;
Science and Industry Endowment
Fund
http://www.genome-engineering.com.au/

Contenu connexe

Tendances

A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...Amazon Web Services
 
The Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceThe Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceRobert Grossman
 
Jan2016 dnanexus giab uses andrew carroll
Jan2016 dnanexus giab uses andrew carrollJan2016 dnanexus giab uses andrew carroll
Jan2016 dnanexus giab uses andrew carrollGenomeInABottle
 
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Spark Summit
 
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaScaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaDatabricks
 
VariantSpark - a Spark library for genomics
VariantSpark - a Spark library for genomicsVariantSpark - a Spark library for genomics
VariantSpark - a Spark library for genomicsLynn Langit
 
Genome-scale Big Data Pipelines
Genome-scale Big Data PipelinesGenome-scale Big Data Pipelines
Genome-scale Big Data PipelinesLynn Langit
 
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio  Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFabricio  Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFlávio Codeço Coelho
 
Tin-Lap Lee: GDSAP- A Galaxy-based platform for large-scale genomics analysis
Tin-Lap Lee: GDSAP- A Galaxy-based platform for large-scale genomics analysisTin-Lap Lee: GDSAP- A Galaxy-based platform for large-scale genomics analysis
Tin-Lap Lee: GDSAP- A Galaxy-based platform for large-scale genomics analysisGigaScience, BGI Hong Kong
 
Genome simulation and applications
Genome simulation and applicationsGenome simulation and applications
Genome simulation and applicationsHari Prasad
 
Using Supercomputers and Supernetworks to Explore the Ocean of Life
Using Supercomputers and Supernetworks to Explore the Ocean of LifeUsing Supercomputers and Supernetworks to Explore the Ocean of Life
Using Supercomputers and Supernetworks to Explore the Ocean of LifeLarry Smarr
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesIan Foster
 
Aug2015 analysis team spiral genetics
Aug2015 analysis team spiral geneticsAug2015 analysis team spiral genetics
Aug2015 analysis team spiral geneticsGenomeInABottle
 
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysisComputational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysiscursoNGS
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...QIAGEN
 
New data from giab genomes pacbio ccs
New data from giab genomes   pacbio ccsNew data from giab genomes   pacbio ccs
New data from giab genomes pacbio ccsGenomeInABottle
 

Tendances (20)

A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...
 
The Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceThe Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data Science
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
Jan2016 dnanexus giab uses andrew carroll
Jan2016 dnanexus giab uses andrew carrollJan2016 dnanexus giab uses andrew carroll
Jan2016 dnanexus giab uses andrew carroll
 
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
 
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaScaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
 
VariantSpark - a Spark library for genomics
VariantSpark - a Spark library for genomicsVariantSpark - a Spark library for genomics
VariantSpark - a Spark library for genomics
 
Genome-scale Big Data Pipelines
Genome-scale Big Data PipelinesGenome-scale Big Data Pipelines
Genome-scale Big Data Pipelines
 
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio  Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFabricio  Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis
 
Tin-Lap Lee: GDSAP- A Galaxy-based platform for large-scale genomics analysis
Tin-Lap Lee: GDSAP- A Galaxy-based platform for large-scale genomics analysisTin-Lap Lee: GDSAP- A Galaxy-based platform for large-scale genomics analysis
Tin-Lap Lee: GDSAP- A Galaxy-based platform for large-scale genomics analysis
 
Genome simulation and applications
Genome simulation and applicationsGenome simulation and applications
Genome simulation and applications
 
Jan2016 pac bio giab
Jan2016 pac bio giabJan2016 pac bio giab
Jan2016 pac bio giab
 
Using Supercomputers and Supernetworks to Explore the Ocean of Life
Using Supercomputers and Supernetworks to Explore the Ocean of LifeUsing Supercomputers and Supernetworks to Explore the Ocean of Life
Using Supercomputers and Supernetworks to Explore the Ocean of Life
 
Opportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architecturesOpportunities for X-Ray science in future computing architectures
Opportunities for X-Ray science in future computing architectures
 
Genetic data storage
Genetic data storageGenetic data storage
Genetic data storage
 
Aug2015 analysis team spiral genetics
Aug2015 analysis team spiral geneticsAug2015 analysis team spiral genetics
Aug2015 analysis team spiral genetics
 
Computational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysisComputational infrastructure for NGS data analysis
Computational infrastructure for NGS data analysis
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
 
DNA as Storage Medium
DNA as Storage MediumDNA as Storage Medium
DNA as Storage Medium
 
New data from giab genomes pacbio ccs
New data from giab genomes   pacbio ccsNew data from giab genomes   pacbio ccs
New data from giab genomes pacbio ccs
 

En vedette

Centralizing sequence analysis
Centralizing sequence analysisCentralizing sequence analysis
Centralizing sequence analysisDenis C. Bauer
 
STAR: Recombination site prediction
STAR: Recombination site predictionSTAR: Recombination site prediction
STAR: Recombination site predictionDenis C. Bauer
 
Qbi Centre for Brain genomics (Informatics side)
Qbi Centre for Brain genomics (Informatics side)Qbi Centre for Brain genomics (Informatics side)
Qbi Centre for Brain genomics (Informatics side)Denis C. Bauer
 
Differential gene expression
Differential gene expressionDifferential gene expression
Differential gene expressionDenis C. Bauer
 
Allelic Imbalance for Pre-capture Whole Exome Sequencing
Allelic Imbalance for Pre-capture Whole Exome SequencingAllelic Imbalance for Pre-capture Whole Exome Sequencing
Allelic Imbalance for Pre-capture Whole Exome SequencingDenis C. Bauer
 
Cell differentiation and differential gene expression
Cell differentiation and differential gene expressionCell differentiation and differential gene expression
Cell differentiation and differential gene expressionStephanie Beck
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysismikaelhuss
 

En vedette (9)

Centralizing sequence analysis
Centralizing sequence analysisCentralizing sequence analysis
Centralizing sequence analysis
 
STAR: Recombination site prediction
STAR: Recombination site predictionSTAR: Recombination site prediction
STAR: Recombination site prediction
 
Trip Report Seattle
Trip Report SeattleTrip Report Seattle
Trip Report Seattle
 
Qbi Centre for Brain genomics (Informatics side)
Qbi Centre for Brain genomics (Informatics side)Qbi Centre for Brain genomics (Informatics side)
Qbi Centre for Brain genomics (Informatics side)
 
Differential gene expression
Differential gene expressionDifferential gene expression
Differential gene expression
 
Allelic Imbalance for Pre-capture Whole Exome Sequencing
Allelic Imbalance for Pre-capture Whole Exome SequencingAllelic Imbalance for Pre-capture Whole Exome Sequencing
Allelic Imbalance for Pre-capture Whole Exome Sequencing
 
Cell differentiation and differential gene expression
Cell differentiation and differential gene expressionCell differentiation and differential gene expression
Cell differentiation and differential gene expression
 
Part I : Introduction to Protein Structure
Part I : Introduction to Protein StructurePart I : Introduction to Protein Structure
Part I : Introduction to Protein Structure
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
 

Similaire à Population-scale high-throughput sequencing data analysis

How novel compute technology transforms life science research
How novel compute technology transforms life science researchHow novel compute technology transforms life science research
How novel compute technology transforms life science researchDenis C. Bauer
 
ProteomeXchange: data deposition and data retrieval made easy
ProteomeXchange: data deposition and data retrieval made easyProteomeXchange: data deposition and data retrieval made easy
ProteomeXchange: data deposition and data retrieval made easyJuan Antonio Vizcaino
 
Runtime Performance Optimizations for an OpenFOAM Simulation
Runtime Performance Optimizations for an OpenFOAM SimulationRuntime Performance Optimizations for an OpenFOAM Simulation
Runtime Performance Optimizations for an OpenFOAM SimulationFisnik Kraja
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNeo4j
 
Blue Waters and Resource Management - Now and in the Future
 Blue Waters and Resource Management - Now and in the Future Blue Waters and Resource Management - Now and in the Future
Blue Waters and Resource Management - Now and in the Futureinside-BigData.com
 
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...Customer Case Study: How Novel Compute Technology Transforms Medical and Life...
Customer Case Study: How Novel Compute Technology Transforms Medical and Life...Amazon Web Services
 
Open-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitOpen-source from/in the enterprise: the RDKit
Open-source from/in the enterprise: the RDKitGreg Landrum
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
 
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facilityinside-BigData.com
 
MVAPICH: How a Bunch of Buckeyes Crack Tough Nuts
MVAPICH: How a Bunch of Buckeyes Crack Tough NutsMVAPICH: How a Bunch of Buckeyes Crack Tough Nuts
MVAPICH: How a Bunch of Buckeyes Crack Tough Nutsinside-BigData.com
 
Getting the best of Linked Data and Property Graphs: rdf2neo and the KnetMine...
Getting the best of Linked Data and Property Graphs: rdf2neo and the KnetMine...Getting the best of Linked Data and Property Graphs: rdf2neo and the KnetMine...
Getting the best of Linked Data and Property Graphs: rdf2neo and the KnetMine...Rothamsted Research, UK
 
IGUANA: A Generic Framework for Benchmarking the Read-Write Performance of Tr...
IGUANA: A Generic Framework for Benchmarking the Read-Write Performance of Tr...IGUANA: A Generic Framework for Benchmarking the Read-Write Performance of Tr...
IGUANA: A Generic Framework for Benchmarking the Read-Write Performance of Tr...Lixi Conrads
 
Exascale Deep Learning for Climate Analytics
Exascale Deep Learning for Climate AnalyticsExascale Deep Learning for Climate Analytics
Exascale Deep Learning for Climate Analyticsinside-BigData.com
 
Big data for SAS programmers
Big data for SAS programmersBig data for SAS programmers
Big data for SAS programmersKevin Lee
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
 
Practical Chaos Engineering
Practical Chaos EngineeringPractical Chaos Engineering
Practical Chaos EngineeringSIGHUP
 
EUGM 2014 - Serge P. Parel (Exquiron): Farewell, PipelinePilot : Migrating th...
EUGM 2014 - Serge P. Parel (Exquiron): Farewell, PipelinePilot : Migrating th...EUGM 2014 - Serge P. Parel (Exquiron): Farewell, PipelinePilot : Migrating th...
EUGM 2014 - Serge P. Parel (Exquiron): Farewell, PipelinePilot : Migrating th...ChemAxon
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIGeorge Markomanolis
 

Population-scale high-throughput sequencing data analysis

  • 1. Denis C. Bauer | Bioinformatics | @allPowerde 08 July 2014 CSIRO COMPUTATIONAL INFORMATICS Population-scale high-throughput sequencing data analysis ByMelody
  • 2. Talk Overview • Background: CSIRO/Omics Project • Methods: NGS Data Processing on HPC/Cloud • Research Outcome: Cancer and Microbes in Colorectal Cancer Denis Bauer | @allPowerde
  • 3. CSIRO: Who we are. The Commonwealth Scientific and Industrial Research Organisation. People: 6500. Divisions: 13. Locations: 58. Flagships: 11. Budget: $1B+. 62% of our people hold university degrees (2000 doctorates, 500 masters). With our university partners, we develop 650 postgraduate research students. Top 1% of global research institutions in 14 of 22 research fields; top 0.1% in 4 research fields. Sites: Darwin, Alice Springs, Geraldton (2 sites), Atherton, Townsville (2 sites), Rockhampton, Toowoomba, Gatton, Myall Vale, Narrabri, Mopra, Parkes, Griffith, Belmont, Geelong, Hobart, Sandy Bay, Wodonga, Newcastle, Armidale (2 sites), Perth (3 sites), Adelaide (2 sites), Sydney (5 sites), Canberra (7 sites), Murchison, Cairns, Irymple, Melbourne (5 sites), Werribee (2 sites), Brisbane (6 sites), Bribie Island.
  • 4. Our business units: 12 Research Divisions, 11 National Research Flagships, plus National Research Facilities and Collections, plus Transformational Capability Platforms. Groups: FOOD, HEALTH & LIFE SCIENCE INDUSTRIES; ENVIRONMENT; MANUFACTURING, MATERIALS & MINERALS; ENERGY; INFORMATION & COMMUNICATIONS.
  • 5. Our track record: top inventions 1. FAST WLAN (Wireless Local Area Network) 2. POLYMER BANKNOTES 3. RELENZA FLU VACCINE 4. EXTENDED WEAR CONTACTS 5. AEROGARD 6. TOTAL WELLBEING DIET 7. RAFT POLYMERISATION 8. BARLEYMAX 9. SELF TWISTING YARN 10. SOFTLY WASHING LIQUID
  • 6. Part 1: The ‘omics project. The goal of the project is to investigate the susceptibility to colorectal cancer in the context of obesity and the gut microbiome.
  • 7. Data from Pilot Study. Full Cohort: 500 (178 to date) individuals from colorectal resection at the John Hunter Hospital, Newcastle Private Hospital and Royal Newcastle Centre (surgeons Dr Brian Draganic, Dr Peter Pockney & Dr Steve Smith), organized by Dr Desma Grice and Prof Rodney Scott (University of Newcastle).
  • 8. Considerations before sequencing: Undersampling • Objective: capture genomic variances reliably in tumour, normal and adipose. • Sequence effort: • 12 tumour -> 6 lanes (2-plex) • 12 normal -> 3 lanes (4-plex) • 12 adipose -> 3 lanes (4-plex) • More depth needed due to potentially low cellularity in the tumour sample. [Figure: additional depth for the tumour sample compared with the normal sample]
  • 9. Considerations before sequencing: Flowcell design • Objective: process samples avoiding confounding factors. Subject every sample to the same lane and flowcell effects by multiplexing (labelling every sample with an identifying barcode). [Figure: normal, adipose and tumour samples from L1, L2, O1 and O2, each group 4-plexed and sequenced over 3 lanes, contrasted with sequencing each sample on one lane]
  • 10. Considerations for Omics Proj.: Flowcell design • Population-scale sequencing with more samples than Illumina barcodes: imbalanced flowcell design will split samples and pair the halves with different partners (e.g. LeanSubj1.1 + Obese Subject 1.1; LeanSubj1.2 + Obese Subject 3.2). [Figure: normal, adipose and tumour samples across lanes. Lane1: L1.1, O1.1. Lane2: L2.1, O2.1. Lane3: L3.1, O3.1. Lane4: L4.1, O4.1. Lane5: L1.2, O3.2. Lane6: L2.2, O4.2. Lane7: L3.2, O1.2. Lane8: L4.2, O2.2. 4-plex, 4-plex and 2-plex; 12 lanes.] L = Lean, O = Obese, L1.1 = Lean individual 1, part 1 (of 2). Auer PL, Doerge RW. Statistical design and analysis of RNA sequencing data. Genetics. 2010 PMID: 20439781
  • 11. Blue Monster says: Design your experiment with project-specific pitfalls in mind. Auer PL et al. Statistical design and analysis of RNA sequencing data. Genetics. 2010 PMID: 20439781
  • 12. Part 2: NGS Data Processing. Minimize project set-up overhead while providing easily adaptable processing modules for NGS analysis on high-performance-compute clusters/cloud.
  • 13. Resource consumption for Variant Calling. Script: #PBS -l nodes=2:ppn=8. Submission: qsub -t 1-36 task.qsub. Scheduler. High-Performance-Compute. Resource consumption: 36 samples (2.7T data) on average require 128 hours CPU time (ste = 15) and 77 GB RAM (ste = 0.34). [Figure: average CPU (hours), real time (hours) and memory (GB) per task (mapping, recalibration, transcripts, annotation, variant) for DNAseq and RNAseq]
  • 14. Tailored processing for different sequencing applications: Variant Calling, Methylation Sites, Gene Expression. Wet-lab Protocols, Production Informatics. Despite different approaches we want to use the same processing framework! doi:10.1038/nbt.2421
  • 15. Wish list for a framework: reusability, cutting edge, data security, HPC environment, reproducibility, robustness, adaptability, knowledge transfer (publication), efficient.
  • 19. DEMO - files. We can start from raw fastq files: here 3 files (Run1-3) in 2 different conditions (Exp1-2).
      Project X
        fastq
          Exp1
            Run1_read1.fastq
            Run2_read1.fastq
          Exp2
            Run3_read1.fastq
  • 20. DEMO – setting up config file. We specify the folders NGSANE should run on and what to do (here: bowtie2 mapping). We can also specify project-specific settings (here: use iGenomes).
      #********************
      # Data
      #********************
      declare -a DIR; DIR=( Exp1 Exp2 )
      #********************
      # Tasks
      #********************
      RUNMAPPINGBOWTIE2="1" # mapping with bowtie2
      #********************
      # Paths
      #********************
      # reference genome
      FASTA=/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa
  • 21. DEMO – dry run. We run NGSANE in dry-run mode to test what jobs it would submit.
      bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt
      [NGSANE] Trigger mode: [empty] (dry run)
      [NOTE] Folders: Exp1 Exp2
      [Task] bowtie2
      [NOTE] setup enviroment
      [TODO] Exp1/Run1_read1.fastq
      [TODO] Exp1/Run2_read1.fastq
      [TODO] Exp2/Run3_read1.fastq
      [NOTE] proceeding with job scheduling...
      [NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy
      [ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
      [NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy
      [ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
      [NOTE] make Exp2/bowtie2/Run3.asd.bam.dummy
      [ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp2/Run3_read1.fastq -o /NGSANEDEMO/Exp2/bowtie2 --rgsi Exp2
  • 22. DEMO – submit. We submit HPC jobs. Check out the returned qsub identifiers.
      bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt armed
      [NGSANE] Trigger mode: armed
      Double check! Then type safetyoff and hit enter to launch the job: safetyoff
      ... take cover!
      [NOTE] Folders: Exp1 Exp2
      [Task] bowtie2
      [NOTE] setup environment
      [TODO] Exp1/Run1_read1.fastq
      [TODO] Exp1/Run2_read1.fastq
      [TODO] Exp2/Run3_read1.fastq
      [NOTE] proceeding with job scheduling...
      [NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy
      [ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
      Jobnumber 2424899
      [NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy
      [ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
      Jobnumber 2424900
      [NOTE] make Exp2/bowtie2/Run3.asd.bam.dummy
      [ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f /NGSANEDEMO/fastq/Exp2/Run3_read1.fastq -o /NGSANEDEMO/Exp2/bowtie2 --rgsi Exp2
      Jobnumber 2424901
  • 23. DEMO – scheduler. Three HPC jobs run in parallel because there were three fastq files. But there is no limit to the number of files to process in parallel: easy scale-up to populations.
      bau04c@burnet-login:/NGSANEDEMO> qstat -u bau04c
      burnet-srv.idpx.hpsc.csiro.au:
                                                                                 Req'd  Req'd   Elap
      Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
      -------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
      2424899.burnet-s     bau04c      normal   NGs_bowtie2_RunM   9085     1      2     -- 00:05 R 00:00
      2424900.burnet-s     bau04c      normal   NGs_bowtie2_RunM   9178     1      2     -- 00:05 R 00:00
      2424901.burnet-s     bau04c      normal   NGs_bowtie2_RunM   9353     1      2     -- 00:05 R 00:00
  • 24. DEMO – report. Now create the HTML overview page to check whether the jobs finished successfully and what the results are (bowtie2: mapping statistics). More report examples.
      bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt html
      [NGSANE] Trigger mode: html
      >>>>> Generate HTML report
      >>>>> startdate Fri Jan 24 08:02:37 EST 2014
      >>>>> hostname burnet-login
      >>>>> makeSummary.sh -k /NGSANEDEMO/config.txt
      --R --
      R version 3.0.0 (2013-04-03) -- "Masked Marvel"
      --Python--
      Python 2.7.2
      QC - bowtie2
      >>>>> Generate HTML report - FINISHED
      >>>>> enddate Fri Jan 24 08:02:39 EST 2014
  • 25. DEMO - files. The resulting file structure: every experiment has a folder with the tasks as subfolders and in them the results (here: bam files).
      Project X
        Summary HTML
        Exp1
          Bowtie
            Run1.bam
            Run2.bam
        Exp2
          Bowtie
            Run3.bam
        fastq
          Exp1
            Run1_read1.fastq
            Run2_read1.fastq
          Exp2
            Run3_read1.fastq
  • 26. NGSANE currently supports • Transfer data (smbclient) • Quality Control (GATK, FastQC, RNA-SeQC, custom summaries, user code) • Trimming (Cutadapt, Trimgalore, Trimmomatic) • Mapping (BWA, Bowtie1, Bowtie2, Tophat) • Transcript Quantification (cufflinks, htseq, bedtools) • Variant calling (GATK, samtools) • Variant annotation (annovar) • 3D Genome structure (Hicup, fit-hi-c, Hiclib, Homer)
  • 27. For details see https://github.com/BauerLab/ngsane/wiki/How-to-use-the-virtual-machine
  • 28. Blue Monster says: Analyze your data to be reproducible and well documented with tools that scale well to larger datasets. Buske FA et al. NGSANE: a lightweight production informatics framework for high-throughput data analysis. Bioinformatics. 2014 PMID: 24470576
  • 29. Part 3: Combining Omics Data. Seeing the full picture requires taking all information into account.
  • 30. Result overview: traditional differential analysis 1. 722 genes differentially expressed (DE) between tumour and normal • QC: We have good concordance with genes known to be up/down regulated in CRC 2. 841 differentially methylated (DM) genomic regions, mostly hypermethylated • QC: good concordance with previously reported gut methylation profile. Sources shown: Fernandez et al. Genome Res. 2012; CSIRO in-house. [Figure: tumour FPKM against normal FPKM on log axes, with known DE genes and known DM locations highlighted]
  • 31. Microbial Population: traditional population survey (Paul Greenfield)
  • 32. Data integration (image credit: Francis Tabary)
  • 33. DNA methylation: Blood signatures in Adipose and Gut samples (Tim Peters). Some gut/adipose samples have blood-like signatures.
  • 34. Exonseq: blood-signatures stem from a blood-plasma protein. Contamination by ADM2, a gene expressed in blood plasma. Plasma protein ADM2 makes up most of the human material in the digesta (number of reads mapping to human genome). [Figure: per-gene expression (count/total) against contamination (%) for ADM2, COL6A3, FNIP1, HAAO, HBA1, HBA2, HBB, HGB1, HGB2 and MALT1; per-panel correlations range from −0.59 to 0.78; points are individuals 2, 4, 7, 12, 14, 19, 20, 40, 50, 57, 59 and 62, marked by status (lean, obese)]
  • 35. Medical History: Blood potentially resulting from medication. CARTIA: individuals 14, 50, 57. WARFARIN: 40. ASPIRIN: 59, 7. COPLAVIX: 12. No anti-clotting drug: 2, 62, 4. No medication: 19, 20. Wilcoxon rank sum test p-value = 0.02. Anti-thrombosis drugs significantly enriched in individuals with human material in digesta.
  • 36. Microbial data: Blood-“liking” opportunistic bacteria are enriched in contaminated samples. E. coli, Salmonella etc. are opportunistic pathogens that respond to inflammation and bleeding. Bacterial marker for low-level chronic gut bleeding?
  • 37. Blue Monster says: Integrating different ‘omics data is still a challenge.
  • 38. Three things to remember • Good experimental design is necessary (even) in sequencing experiments • Reproducible, documented data analysis is key (e.g. NGSANE, a lightweight flexible tool for large-scale sequence data analysis on high-performance systems and Amazon’s elastic cloud) • Promising research opportunities are in the integration of multiple high-throughput data sources
  • 39. Thank you. Computational Informatics. Denis C. Bauer, t +61 2 9123 4567, e Denis.Bauer@csiro.au, w www.csiro.au/bioinformatics. Buske et al., Bioinformatics, Jan 2014. More talks online: http://www.slideshare.net/allPowerde. Twitter: @allPowerde. Fabian A. Buske, Susan Clark, Hugh French, Martin Smith (Garvan Institute of Medical Research, Sydney, Australia). Robert Dunne, Tim Peters, Paul Greenfield, Piotr Szul, Tomasz Bednarz (Computational Informatics, CSIRO, Australia). Garry Hannan (Animal Food and Health Science, CSIRO, Australia). Rodney Scott (University of Newcastle, Australia). Funding: National Health and Medical Research Council; National Breast Cancer Foundation; CSIRO's Transformational Capability Platform; CSIRO’s IM&T; Science and Industry Endowment Fund. http://www.genome-engineering.com.au/
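
The lane pairing described on slide 10 can be sketched in a few lines of shell. This is only an illustration under stated assumptions: four lean and four obese subjects, each sample split into two parts, and a fixed partner shift of two for the second part, which reproduces the pairing the slide lists for Lane1 to Lane8. The function name `lane_layout` is made up for this sketch, and tissue types and plex levels are not modelled.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the lane pairing on slide 10.
# Assumptions: 4 lean (L) and 4 obese (O) subjects, each split into parts .1 and .2.
# lane_layout is a hypothetical name, not part of any tool mentioned in the talk.
lane_layout() {
  local n=4 i partner
  for ((i = 1; i <= n; i++)); do
    # Part 1: lean subject i shares a lane with obese subject i.
    echo "Lane${i}: L${i}.1 O${i}.1"
  done
  for ((i = 1; i <= n; i++)); do
    # Part 2: shift the obese partner by two, so each half meets a different partner.
    partner=$(( (i + 1) % n + 1 ))
    echo "Lane$((i + n)): L${i}.2 O${partner}.2"
  done
}

lane_layout
```

Because every subject appears on two lanes with two different partners, a lane effect cannot be mistaken for a lean versus obese difference, which is the point the slide makes with reference to Auer and Doerge.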

Editor's notes

  1. Staff # as at 30 June 2012 = 6492 (FTE = 5720). 2011-12 budget = $1.2 billion. Some specifics about us: CSIRO is Australia’s national science agency. We are a mission-directed, large-scale, multidisciplinary research and development organisation. Since 1926, we have been in the business of applying scientific knowledge to the big issues facing Australia and increasingly the world. Globally we are recognised as one of the top 10 applied research organisations. We bring together the best scientists in the world and teams of professionals to work together to help create industries, national wealth, a healthy environment and improved living standards. We have delivered many innovations that have positively impacted on the daily lives of Australians and billions of others around the world. In terms of our vital statistics, we generate annual revenues of over A$1 billion. We have around 6,500 people in more than 50 locations across Australia. We lead 11 National Research Flagships addressing major challenges like water, climate, health, manufacturing and mineral resources. This includes our two new flagships: Biosecurity, Digital Productivity and Services (June 2012). We are a leading Australian patenting organisation with over 3,500 patents (granted and pending) and manage an IP portfolio of over 150 revenue bearing licenses. While our research enjoys a high global ranking in terms of publication and citation rates, it’s our focus on creating positive impact from science at scale that sets us apart from others in the Australian innovation system. We do science with purpose. We do it well. We make a difference.
  2. CSIRO operates in a matrix. This is to ensure we have the flexibility we need to be able to provide the right mix of skills and talent for major projects; pulling the right people from all across the organisation to form multidisciplinary teams. We understand that often the most successful science comes from crossing boundaries, and working in a matrix structure gives us the ability to do this and therefore help us to deliver impact for Australia. We organise ourselves into 5 research groups. Within these we have 11 National Research Flagships (the two most recent being approved in June 2012 – Biosecurity, Digital Productivity and Services), plus core research portfolios, and 12 research Divisions as at 1 July 2012. We also have Transformational Capability Platforms and several national research facilities and collections. There are equivalent organisations to CSIRO in a number of countries like India and South Africa albeit with slightly narrower missions. CSIRO’s research spans agriculture, food, manufacturing, materials, energy, minerals, health, ICT and the climate, water and environmental domains. This is important to understand because increasingly the solutions to the major challenges we face are being found across sectors and at the interface of different scientific disciplines. So one of the hidden benefits of being large and multidisciplinary that we have discovered, is that you can more readily assemble the teams and partnerships necessary to deliver the scale of impact required to address the big questions facing humanity.
  3. While we are a research organization, we have successfully commercialized a number of our products. The most famous is the Wi-Fi protocol, which is now in every device using wireless technology, like your laptop or phone. Closer to my area of research is BARLEYmax, a high-fibre cereal specifically developed to reduce the risk of bowel cancer. We have a strong track record of commercial success. Our work has impacted the daily lives of Australians and those around the world. These are some of our top inventions.
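  4. CPU time: 94.45 + 24.57 + 7.02 + 2.25 = 128.29 hours; standard error: 12.97 + 0.90 + 1.31 + 0.05 = 15.23 (11%). Memory: 49.63 + 16.53 + 0.971 + 9.64 = 76.77 GB; standard error: 0.01 + 0.01 + 0.165 + 0.15 = 0.335. These sums match the rounded figures quoted on the resource consumption slide (128 hours CPU time, ste = 15; 77 GB RAM, ste = 0.34).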
  5. There is more than genome sequencing. Protocols evolve, so pipelines need to be adjusted too.
  6. Process data in streams. Command line efficiency. With scale comes structure.
  7. Currently used for processing of biological data but can be adapted to work with other data sources
  8. http://www.nscblog.com/miscellaneous/419/
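
The dry-run and armed modes shown in the demo slides (20 to 23) follow a pattern that can be sketched generically in shell: list one job per fastq file, and only hand the jobs to the scheduler when explicitly armed. The sketch below is not NGSANE's implementation. The function `plan_jobs`, the `SUBMIT` placeholder (which prints a `qsub` line instead of submitting anything) and the temporary demo folder are all assumptions made for illustration; only the `DIR` array and the `fastq/ExpN/RunN_read1.fastq` layout are taken from the slides.

```shell
#!/usr/bin/env bash
# Illustrative sketch only, not NGSANE's code.
DIR=(Exp1 Exp2)        # folders to process, as in the demo config file
SUBMIT="echo qsub"     # placeholder: prints the command rather than submitting it

# plan_jobs ROOT [armed]: one job per *_read1.fastq file under ROOT/fastq/<dir>.
plan_jobs() {
  local root=$1 mode=${2:-} dir fq
  for dir in "${DIR[@]}"; do
    for fq in "$root/fastq/$dir"/*_read1.fastq; do
      [ -e "$fq" ] || continue            # skip folders without matching files
      if [ "$mode" = "armed" ]; then
        $SUBMIT "$fq"                     # hand the file to the (placeholder) scheduler
      else
        echo "[TODO] $dir/${fq##*/}"      # dry run: only report what would be done
      fi
    done
  done
}

# Build a throwaway copy of the demo layout and do a dry run.
demo=$(mktemp -d)
mkdir -p "$demo/fastq/Exp1" "$demo/fastq/Exp2"
touch "$demo/fastq/Exp1/Run1_read1.fastq" "$demo/fastq/Exp1/Run2_read1.fastq" \
      "$demo/fastq/Exp2/Run3_read1.fastq"
plan_jobs "$demo"
```

Defaulting to the dry run means a mistyped folder or pattern shows up as a missing `[TODO]` line before any compute time is spent; calling `plan_jobs "$demo" armed` switches to the submit branch.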