This document provides an overview of a presentation on population-scale high-throughput sequencing data analysis. It discusses:
1) The background and goals of the CSIRO/Omics Project which aims to investigate colorectal cancer susceptibility using sequencing data from 500 individuals.
2) Methods for processing large-scale NGS data on high-performance computing clusters and cloud infrastructure using the NGSANE framework, which allows processing modules to be run in parallel.
3) Preliminary research outcomes identifying cancer-associated and microbiome changes from analysis of colorectal cancer and control samples.
Population-scale high-throughput sequencing data analysis
1. Denis C. Bauer | Bioinformatics | @allPowerde
08 July 2014
CSIRO COMPUTATIONAL INFORMATICS
2. Talk Overview
• Background: CSIRO/Omics Project
• Methods: NGS Data Processing on HPC/Cloud
• Research Outcome: Cancer and Microbes in Colorectal Cancer
Denis Bauer | @allPowerde
3. 62% of our people hold university degrees: 2000 doctorates, 500 masters
With our university partners, we develop 650 postgraduate research students
Top 1% of global research institutions in 14 of 22 research fields
Top 0.1% in 4 research fields
CSIRO: Who we are
[Map of CSIRO sites: Darwin, Alice Springs, Geraldton (2 sites), Atherton, Townsville (2 sites), Rockhampton, Toowoomba, Gatton, Myall Vale, Narrabri, Mopra, Parkes, Griffith, Belmont, Geelong, Hobart, Sandy Bay, Wodonga, Newcastle, Armidale (2 sites), Perth (3 sites), Adelaide (2 sites), Sydney (5 sites), Canberra (7 sites), Murchison, Cairns, Irymple, Melbourne (5 sites), Werribee (2 sites), Brisbane (6 sites), Bribie Island]
People: 6500 | Divisions: 13 | Locations: 58 | Flagships: 11 | Budget: $1B+
The Commonwealth Scientific and Industrial Research Organisation
4. Our business units
12 Research Divisions
11 National Research Flagships
+ National Research Facilities and Collections
+ Transformational Capability Platforms
Research groups:
• FOOD, HEALTH & LIFE SCIENCE INDUSTRIES
• ENVIRONMENT
• MANUFACTURING, MATERIALS & MINERALS
• ENERGY
• INFORMATION & COMMUNICATIONS
5. Our track record: top inventions
1. FAST WLAN (Wireless Local Area Network)
2. POLYMER BANKNOTES
3. RELENZA FLU TREATMENT
4. EXTENDED WEAR CONTACTS
5. AEROGARD
6. TOTAL WELLBEING DIET
7. RAFT POLYMERISATION
8. BARLEYMAX
9. SELF TWISTING YARN
10. SOFTLY WASHING LIQUID
6. Part 1: The ‘omics project
The goal of the project is to investigate the susceptibility to colorectal cancer in the context of obesity and the gut microbiome
7. Data from Pilot Study
Full cohort: 500 individuals (178 to date) undergoing colorectal resection at the John Hunter Hospital, Newcastle Private Hospital and Royal Newcastle Centre (surgeons Dr Brian Draganic, Dr Peter Pockney & Dr Steve Smith), organized by Dr Desma Grice and Prof Rodney Scott (University of Newcastle)
8. Considerations before sequencing: Undersampling
• Objective: capture genomic variation reliably in tumour, normal and adipose tissue.
• Sequence effort:
• 12 tumour -> 6 lanes (2-plex)
• 12 normal -> 3 lanes (4-plex)
• 12 adipose -> 3 lanes (4-plex)
More depth needed due to potentially low cellularity in the tumour sample
[Figure: sequencing depth of the tumour sample compared with the normal sample, showing the additional depth allocated to the tumour]
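The lane counts on this slide follow from dividing the number of samples by the multiplexing level. A minimal sketch of that arithmetic is given here; the function names are illustrative, and the relative-depth figure assumes that every lane yields the same number of reads and that pooled samples share a lane evenly, which the slide does not state.

```python
def lanes_needed(n_samples, plex):
    """Lanes required when `plex` barcoded samples share one lane."""
    return -(-n_samples // plex)  # ceiling division


def relative_depth(plex, reference_plex):
    """Per-sample depth relative to a sample pooled at `reference_plex`.

    Assumes equal read yield per lane and even pooling (an assumption).
    """
    return reference_plex / plex


# Sequence effort from the slide: (number of samples, samples per lane)
effort = {"tumour": (12, 2), "normal": (12, 4), "adipose": (12, 4)}
lanes = {tissue: lanes_needed(n, plex) for tissue, (n, plex) in effort.items()}
total_lanes = sum(lanes.values())
tumour_vs_normal = relative_depth(plex=2, reference_plex=4)
```

Under these assumptions, each tumour sample receives twice the per-sample depth of a normal or adipose sample, which is the extra depth the slide asks for to compensate for low tumour cellularity.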
9. Considerations before sequencing: Flowcell design
• Objective: process samples avoiding confounding factors
[Figure: flowcell layouts for samples L1, L2, O1 and O2. One layout sequences each sample on one lane; the other pools normal, adipose and tumour samples as 4-plex libraries that are sequenced over 3 lanes]
Subject every sample to the same lane and flowcell effects by multiplexing (labelling every sample with an identifying barcode)
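The difference between the two layouts can be checked mechanically. The sketch below is an illustration only: it treats the first letter of each sample label (L or O) as the sample's group, which is an assumption about the figure, and it calls a design confounded when some lane holds samples from a single group, so that a lane effect could not be told apart from a group effect.

```python
def one_sample_per_lane(samples):
    """Unblocked design: every sample is sequenced on its own lane."""
    return {f"lane{i + 1}": [s] for i, s in enumerate(samples)}


def multiplexed(samples, n_lanes):
    """Blocked design: the barcoded pool of all samples is loaded on every lane."""
    return {f"lane{i + 1}": list(samples) for i in range(n_lanes)}


def is_confounded(design, group_of):
    """True if any lane contains samples from only one group."""
    return any(
        len({group_of[s] for s in lane_samples}) < 2
        for lane_samples in design.values()
    )


samples = ["L1", "L2", "O1", "O2"]
group_of = {s: s[0] for s in samples}  # assumed grouping: L vs O

unblocked = one_sample_per_lane(samples)
blocked = multiplexed(samples, n_lanes=3)
```

With these definitions, `is_confounded(unblocked, group_of)` is true and `is_confounded(blocked, group_of)` is false, because every lane in the multiplexed layout carries all samples.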
11. Blue Monster says
Design your experiment with project-specific pitfalls in mind
Auer PL et al. Statistical design and analysis of RNA sequencing data. Genetics. 2010 PMID: 20439781
12. Part 2: NGS Data Processing
Minimize project set-up overhead while providing easily adaptable processing modules for NGS analysis on high-performance compute clusters/cloud
13. Resource consumption for Variant Calling
High-Performance-Compute: Script, Submission, Scheduler
#PBS -l nodes=2:ppn=8
qsub -t 1-36 task.qsub
Resource consumption
Processing 36 samples (2.7 TB of data) requires on average
128 hours CPU time (standard error = 15)
77 GB RAM (standard error = 0.34)
[Figure: resource consumption of the average task (mapping, recalibration, transcripts, annotation, variant) for DNAseq and RNAseq; panels show CPU (hours), real time (hours) and memory (GB)]
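The resource request and the 36-task array submission shown on this slide can be combined in one job script. The sketch below only builds the script text. It assumes a Torque-style scheduler, where `-t` declares a job array and each task reads its index from `PBS_ARRAYID`; other PBS flavours use different names. The command `./process_sample.sh` is a hypothetical placeholder, not part of the talk.

```python
def pbs_array_script(n_tasks, nodes, ppn, command):
    """Build a Torque-style array job script as a string.

    `command` is a placeholder for the per-sample processing step; it
    receives the array index as its only argument.
    """
    lines = [
        "#!/bin/bash",
        f"#PBS -l nodes={nodes}:ppn={ppn}",  # resources per task, as on the slide
        f"#PBS -t 1-{n_tasks}",              # one array task per sample
        f"{command} ${{PBS_ARRAYID}}",       # index selects the sample to process
    ]
    return "\n".join(lines) + "\n"


script = pbs_array_script(n_tasks=36, nodes=2, ppn=8, command="./process_sample.sh")
```

Writing the array range into the script instead of passing it to `qsub` keeps the submission parameters in one file, which helps when a run has to be repeated.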
14. Tailored processing for different sequencing applications
doi:10.1038/nbt.2421
Wet-lab Protocols | Production Informatics
Applications: Variant Calling, Methylation Sites, Gene Expression
Despite different approaches we want to use the same processing framework!
15. Wish list for a framework
• reusability
• cutting edge
• data security
• HPC environment
• reproducibility
• robustness
• adaptability
• knowledge transfer (publication)
• efficient
19. DEMO - files
Project X
  fastq
    Exp1
      Run1_read1.fastq
      Run2_read1.fastq
    Exp2
      Run3_read1.fastq
We can start from raw fastq files: here 3 files (Run1-3) in 2 different conditions (Exp1-2)
20. DEMO – setting up config file
#********************
# Data
#********************
declare -a DIR; DIR=( Exp1 Exp2 )
#********************
# Tasks
#********************
RUNMAPPINGBOWTIE2="1" # mapping with bowtie2
#********************
# Paths
#********************
# reference genome
FASTA=/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa
We specify the folders NGSANE should run on and what to do (here: bowtie2 mapping). We can also specify project-specific settings (here: use iGenomes)
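The three kinds of entries in this config (folders, task switches, paths) can be read back with a few lines of code, for example to check a config before launching. This is a sketch only and not how NGSANE itself reads the file: the parser and its name `parse_config` are illustrative, and the rule that a task is enabled when a variable starting with `RUN` equals "1" is inferred from the single switch shown on the slide.

```python
import re

CONFIG = """\
declare -a DIR; DIR=( Exp1 Exp2 )
RUNMAPPINGBOWTIE2="1"    # mapping with bowtie2
FASTA=/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa
"""


def parse_config(text):
    """Return (folders, enabled task switches, other settings)."""
    folders, tasks, settings = [], [], {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        array = re.search(r"\bDIR=\(\s*(.*?)\s*\)", line)
        if array:  # bash array holding the folder names
            folders = array.group(1).split()
            continue
        assignment = re.fullmatch(r'(\w+)="?([^"]*)"?', line)
        if assignment:
            name, value = assignment.groups()
            if name.startswith("RUN"):  # assumed convention for task switches
                if value == "1":
                    tasks.append(name)
            else:
                settings[name] = value
    return folders, tasks, settings


folders, tasks, settings = parse_config(CONFIG)
```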
21. DEMO – dry run
bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt
[NGSANE] Trigger mode: [empty] (dry run)
[NOTE] Folders: Exp1 Exp2
[Task] bowtie2
[NOTE] setup environment
[TODO] Exp1/Run1_read1.fastq
[TODO] Exp1/Run2_read1.fastq
[TODO] Exp2/Run3_read1.fastq
[NOTE] proceeding with job scheduling...
[NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
[NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
[NOTE] make Exp2/bowtie2/Run3.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp2/Run3_read1.fastq -o /NGSANEDEMO/Exp2/bowtie2 --rgsi Exp2
We run NGSANE in dry run to test what jobs it would submit
22. DEMO – submit
bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt armed
[NGSANE] Trigger mode: armed
Double check! Then type safetyoff and hit enter to launch the job: safetyoff
... take cover!
[NOTE] Folders: Exp1 Exp2
[Task] bowtie2
[NOTE] setup environment
[TODO] Exp1/Run1_read1.fastq
[TODO] Exp1/Run2_read1.fastq
[TODO] Exp2/Run3_read1.fastq
[NOTE] proceeding with job scheduling...
[NOTE] make Exp1/bowtie2/Run1.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp1/Run1_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
Jobnumber 2424899
[NOTE] make Exp1/bowtie2/Run2.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp1/Run2_read1.fastq -o /NGSANEDEMO/Exp1/bowtie2 --rgsi Exp1
Jobnumber 2424900
[NOTE] make Exp2/bowtie2/Run3.asd.bam.dummy
[ JOB] /apps/gi/ngsane/0.4.0.1//mods/bowtie2.sh -k /NGSANEDEMO/config.txt -f
/NGSANEDEMO/fastq/Exp2/Run3_read1.fastq -o /NGSANEDEMO/Exp2/bowtie2 --rgsi Exp2
Jobnumber 2424901
We submit HPC jobs. Check out the returned qsub identifiers.
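The dry-run and armed modes shown in the last two slides follow a common pattern: build the list of jobs first, and hand it to the scheduler only when explicitly asked. The sketch below imitates that pattern and is not NGSANE code. The `_read1.fastq` suffix is taken from the demo files, the `submit` callback stands in for a real `qsub` call, and the job numbers are placeholders.

```python
import tempfile
from pathlib import Path

READ_SUFFIX = "_read1.fastq"  # naming convention assumed from the demo files


def plan_jobs(project, folders, task="bowtie2"):
    """Return one (fastq, output_dir) pair per input file under project/fastq."""
    jobs = []
    for folder in folders:
        for fastq in sorted((project / "fastq" / folder).glob("*" + READ_SUFFIX)):
            jobs.append((fastq, project / folder / task))
    return jobs


def trigger(project, folders, mode="", submit=None):
    """Print the plan; pass jobs to `submit` only when mode is 'armed'."""
    job_ids = []
    for fastq, out_dir in plan_jobs(project, folders):
        print("[TODO]", fastq.relative_to(project / "fastq"))
        if mode == "armed":
            job_ids.append(submit(fastq, out_dir))
    return job_ids


# Recreate the demo layout in a temporary directory and run both modes.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    for name in ("Exp1/Run1", "Exp1/Run2", "Exp2/Run3"):
        path = project / "fastq" / (name + READ_SUFFIX)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    dry_run = trigger(project, ["Exp1", "Exp2"])
    fake_ids = iter(range(1, 100))  # placeholder job numbers, not real qsub ids
    armed = trigger(project, ["Exp1", "Exp2"], "armed", lambda f, o: next(fake_ids))
```

Keeping planning separate from submission means the dry run exercises exactly the code path that the armed run will use.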
23. DEMO – scheduler
bau04c@burnet-login:/NGSANEDEMO> qstat -u bau04c
burnet-srv.idpx.hpsc.csiro.au:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
2424899.burnet-s bau04c normal NGs_bowtie2_RunM 9085 1 2 -- 00:05 R 00:00
2424900.burnet-s bau04c normal NGs_bowtie2_RunM 9178 1 2 -- 00:05 R 00:00
2424901.burnet-s bau04c normal NGs_bowtie2_RunM 9353 1 2 -- 00:05 R 00:00
Three HPC jobs run in parallel because there were three fastq files. But there is no limit to the number of files to process in parallel: easy scale-up to populations.
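When many files are processed in parallel, it helps to summarise the scheduler listing instead of reading it line by line. The sketch below parses the three job rows shown on this slide. It assumes the 11-column layout of that listing, with the state in the tenth column, and would need adjusting for other `qstat` formats.

```python
import re

QSTAT = """\
2424899.burnet-s bau04c normal NGs_bowtie2_RunM 9085 1 2 -- 00:05 R 00:00
2424900.burnet-s bau04c normal NGs_bowtie2_RunM 9178 1 2 -- 00:05 R 00:00
2424901.burnet-s bau04c normal NGs_bowtie2_RunM 9353 1 2 -- 00:05 R 00:00
"""


def job_states(qstat_text):
    """Map job number to state letter for every job row in the listing."""
    states = {}
    for line in qstat_text.splitlines():
        fields = line.split()
        # Job rows have 11 columns and start with "<number>.<server>".
        if len(fields) == 11 and re.match(r"\d+\.", fields[0]):
            states[fields[0].split(".")[0]] = fields[9]
    return states


states = job_states(QSTAT)
running = [job for job, state in states.items() if state == "R"]
```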
24. DEMO – report
bau04c@burnet-login:/NGSANEDEMO> trigger.sh config.txt html
[NGSANE] Trigger mode: html
>>>>> Generate HTML report
>>>>> startdate Fri Jan 24 08:02:37 EST 2014
>>>>> hostname burnet-login
>>>>> makeSummary.sh -k /NGSANEDEMO/config.txt
--R --R version 3.0.0 (2013-04-03) -- "Masked Marvel"
--Python--Python 2.7.2
QC - bowtie2
>>>>> Generate HTML report - FINISHED
>>>>> enddate Fri Jan 24 08:02:39 EST 2014
More report examples
Now create the HTML overview page to check whether the jobs finished successfully and what the results are (bowtie2: mapping statistics)
25. DEMO - files
Project X
  Summary HTML
  Exp1
    Bowtie
      Run1.bam
      Run2.bam
  Exp2
    Bowtie
      Run3.bam
  fastq
    Exp1
      Run1_read1.fastq
      Run2_read1.fastq
    Exp2
      Run3_read1.fastq
The resulting file structure: every experiment has a folder with the tasks as subfolders and in them the results (here: bam files)
26. NGSANE currently supports
• Transfer data (smbclient)
• Quality Control (GATK, FastQC, RNA-SeQC, custom summaries, user code)
• Trimming (Cutadapt, Trimgalore, Trimmomatic)
• Mapping (BWA, Bowtie1, Bowtie2, Tophat)
• Transcript Quantification (cufflinks, htseq, bedtools)
• Variant calling (GATK, samtools)
• Variant annotation (annovar)
• 3D Genome structure (Hicup, fit-hi-c, Hiclib, Homer)
27. For details see https://github.com/BauerLab/ngsane/wiki/How-to-use-the-virtual-machine
28. Blue Monster says
Analyze your data to be reproducible and well documented with tools that scale well to larger datasets
Buske FA et al. NGSANE: a lightweight production informatics framework for high-throughput data analysis. Bioinformatics. 2014 PMID: 24470576
29. Part 3: Combining Omics Data
Seeing the full picture requires taking all information into account
30. Result overview: traditional differential analysis
1. 722 genes differentially expressed (DE) between tumour and normal
• QC: We have good concordance with genes known to be up/down regulated in CRC
2. 841 differentially methylated (DM) genomic regions, mostly hypermethylated
• QC: good concordance with previously reported gut methylation profile
[Figure: two log-scale scatter plots of tumour FPKM against normal FPKM; legend: known DE gene, known DM locations; sources: Fernandez et al. Genome Res. 2012, CSIRO in-house]
33. DNA methylation: Blood signatures in Adipose and Gut samples
Tim Peters
Some gut/adipose samples have blood-like signatures.
35. Medical History: Blood potentially resulting from medication
CARTIA: 14, 50, 57
WARFARIN: 40
ASPIRIN: 59, 7
COPLAVIX: 12
No anti-clotting drug: 2, 62, 4
No medication: 19, 20
Wilcoxon rank sum test p-value = 0.02
Anti-thrombosis drugs significantly enriched in individuals with human material in digesta.
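For readers unfamiliar with the test named on this slide, the sketch below computes an exact two-sided Wilcoxon rank-sum p-value by enumerating all possible rank assignments. It is a generic illustration and not the analysis behind the reported p-value: the slide does not give the measured values, so the two lists are invented numbers, and the implementation assumes no tied values and small groups.

```python
from itertools import combinations


def rank_sum_exact(x, y):
    """Two-sided exact Wilcoxon rank-sum p-value (assumes no tied values)."""
    pooled = sorted(x + y)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n, total = len(x), len(pooled)
    observed = sum(rank[value] for value in x)
    expected = n * (total + 1) / 2  # mean rank sum under the null hypothesis
    # Enumerate every way the ranks could have been split between the groups.
    sums = [sum(c) for c in combinations(range(1, total + 1), n)]
    extreme = sum(1 for s in sums if abs(s - expected) >= abs(observed - expected))
    return extreme / len(sums)


# Invented illustrative values (NOT study data): some per-sample measurement
# in individuals with and without anti-thrombosis medication.
with_drug = [0.31, 0.12, 0.27, 0.09]
without_drug = [0.02, 0.05, 0.01]
p_value = rank_sum_exact(with_drug, without_drug)
```

Exact enumeration is only practical for small groups. For larger cohorts a library implementation with a normal approximation and tie handling would be the usual choice.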
36. Microbial data: Blood “liking” opportunistic bacteria are enriched in contaminated samples
E. coli and Salmonella etc.: opportunistic pathogens that respond to inflammation and bleeding
Bacterial marker for low-level chronic gut bleeding?
38. Three things to remember
• Good experimental design is necessary (even) in sequencing experiments
• Reproducible, documented data analysis is key (e.g. NGSANE, a lightweight flexible tool for large-scale sequence data analysis on high-performance systems and Amazon’s elastic cloud)
• Promising research opportunities are in the integration of multiple high-throughput data sources
39. COMPUTATIONAL INFORMATICS
Thank you
Computational Informatics
Denis C. Bauer
t +61 2 9123 4567
e Denis.Bauer@csiro.au
w www.csiro.au/bioinformatics
Buske et al., Bioinformatics, Jan 2014
More talks online: http://www.slideshare.net/allPowerde
Twitter: @allPowerde
Fabian A. Buske, Susan Clark, Hugh French, Martin Smith: Garvan Institute of Medical Research, Sydney, Australia
Robert Dunne, Tim Peters, Paul Greenfield, Piotr Szul, Tomasz Bednarz: Computational Informatics, CSIRO, Australia
Garry Hannan: Animal Food and Health Science, CSIRO, Australia
Rodney Scott: University of Newcastle, Australia
Funding: National Health and Medical Research Council; National Breast Cancer Foundation; CSIRO's Transformational Capability Platform; CSIRO's IM&T; Science and Industry Endowment Fund
http://www.genome-engineering.com.au/
Editor's notes
Staff # as at 30 June 2012 = 6492 (FTE = 5720)
2011-12 budget = $1.2 billion
--------------------
Some specifics about us:
CSIRO is Australia’s national science agency. We are a mission-directed, large-scale, multidisciplinary research and development organisation.
Since 1926, we have been in the business of applying scientific knowledge to the big issues facing Australia and increasingly the world. Globally we are recognised as one of the top 10 applied research organisations.
We bring together the best scientists in the world and teams of professionals to work together to help create industries, national wealth, a healthy environment and improved living standards.
We have delivered many innovations that have positively impacted on the daily lives of Australians and billions of others around the world.
In terms of our vital statistics, we generate annual revenues of over A$1 billion. We have around 6,500 people in more than 50 locations across Australia. We lead 11 National Research Flagships addressing major challenges like water, climate, health, manufacturing and mineral resources. This includes our two new flagships (June 2012): Biosecurity, and Digital Productivity and Services. We are a leading Australian patenting organisation with over 3,500 patents (granted and pending) and manage an IP portfolio of over 150 revenue-bearing licenses.
While our research enjoys a high global ranking in terms of publication and citation rates it’s our focus on creating positive impact from science at scale that sets us apart from others in the Australian innovation system.
We do science with purpose. We do it well. We make a difference.
CSIRO operates in a matrix. This is to ensure we have the flexibility we need to be able to provide the right mix of skills and talent for major projects; pulling the right people from all across the organisation to form multidisciplinary teams. We understand that often the most successful science comes from crossing boundaries, and working in a matrix structure gives us the ability to do this and therefore help us to deliver impact for Australia.
We organise ourselves into 5 research groups. Within these we have 11 National Research Flagships (the two most recent being approved in June 2012 – Biosecurity, Digital Productivity and Services), plus core research portfolios, and 12 research Divisions as at 1 July 2012. We also have Transformational Capability Platforms and several national research facilities and collections.
There are equivalent organisations to CSIRO in a number of countries like India and South Africa albeit with slightly narrower missions. CSIRO’s research spans agriculture, food, manufacturing, materials, energy, minerals, health, ICT and the climate, water and environmental domains.
This is important to understand because increasingly the solutions to the major challenges we face are being found across sectors and at the interface of different scientific disciplines. So one of the hidden benefits of being large and multidisciplinary that we have discovered, is that you can more readily assemble the teams and partnerships necessary to deliver the scale of impact required to address the big questions facing humanity.
While we are a research organization, we have been successful at commercializing a couple of our products. The most famous is the Wi-Fi technology which is now in every device using wireless networking, like your laptop or phone. Closer to my area of research is BARLEYmax, a cereal which is high in fibre and was specifically developed to reduce the risk of bowel cancer.
We have a strong track record of commercial success. Our work has impacted the daily lives of Australians and those around the world. These are some of our top inventions.