Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010

Computing for the Analysis
of Genomic Data at CRS4

Chris Jones
24th March 2010

1
Thursday 25 March 2010

Who is Chris Jones?

• 10 years of particle physics research at Oxford
and CERN in Geneva
• Strong interest in the use of computers to do
things, especially science, BETTER
• The ’70s brought digital detectors and
massive waves of new data to particle physics,
causing exciting major changes in the use of, and
attitude towards, computers
• 20 years of innovating, building, developing and
running services in the CERN Computer Centre
Facility

2

Wellcome Trust Genome Campus

• Escaped on sabbatical to European
Bioinformatics Institute – EBI
• Strong links to Sanger Institute
• And to Roche – Roche Genetics IT Plan
• Founded the PRISM Forum

3

Why Sequence Genomes?

• I hope Francesco has explained that very well
• Genomic sequence is the most fundamental
information, the starting point, when you look at
how living things work…
• And studies of “genotype” versus “phenotype” can
bring us an understanding of the origins of
disease which has been completely out of reach
until now
• The technology is just becoming available…

5

DNA sequence and genes look
like…
cacaattacttccacaaatgcagtt
gaagcttctactcttcttgcatagg
taacctgagtcggagcagttttcct
cgtggcttcatctttggtgctggat
cttcagcataccaatttgaaggtgc
agtaaacgaaggcggtagaggacca
agtatttgggataccttcacccata
aatatccagaaaaaataagggatgg
aagcaatgcagacatcacggttgc
6

The Human Genome

• The nucleotide bases are:
a - adenine, c - cytosine, g - guanine, t - thymine
• It took 15 years to produce the first human genome sequence,
which was released between 2003 and 2005
• There are 3 × 10^9 bases, or 3 Gigabases, in the human genome
• Pine trees have ~10 times more bases! Why?
• Do not confuse Gb (gigabits), GB (gigabytes) and Gbases (gigabases, also written Gb)!

7
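To make the bits / bytes / bases distinction concrete, here is a minimal sketch of the arithmetic. It assumes the most compact encoding, 2 bits per base (enough for the four symbols a, c, g, t and nothing else); that encoding is an illustration, not something the slides specify.

```python
# Sketch: how 3 Gbases translate into gigabits and gigabytes.
# Assumption (not from the slides): each base is stored in 2 bits,
# which is the minimum needed to distinguish a, c, g and t.

GENOME_BASES = 3e9     # 3 Gbases in the human genome (from the slide)
BITS_PER_BASE = 2      # assumed packed encoding
BITS_PER_BYTE = 8


def packed_size(bases, bits_per_base=BITS_PER_BASE):
    """Return (bits, bytes) needed to store `bases` nucleotides."""
    bits = bases * bits_per_base
    return bits, bits / BITS_PER_BYTE


genome_bits, genome_bytes = packed_size(GENOME_BASES)
print(f"{GENOME_BASES / 1e9:.0f} Gbases "
      f"= {genome_bits / 1e9:.0f} gigabits "
      f"= {genome_bytes / 1e9:.2f} gigabytes")
```

So the same genome is 3 Gbases, 6 gigabits or 0.75 gigabytes under this assumption, three different numbers that are easy to mix up.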

Genome Analyzer IIx

• In Building 3 (Edificio 3)
• Two GAIIx machines
• Each of which delivers:
− 40 Gbases / run
− Paired-end reads
− 4 Gbases / day
• but which are complex,
forefront
technology...
8

Genome Analyzer IIx
Preparation Workflow

Sample Prep

Pipeline Analysis

9

Genome Analyzer IIx
FlowCell

• 8 lanes
• 120 tiles per lane (2 columns of 60 tiles)
• 4 pictures per tile (one each for the A, T, G and C fluorophores)
• On each tile ~220k clusters

10

How much data per run?

• 7.3 MBytes image data per tile * 120 tiles * 8
lanes = 7 000 MBytes = 7 GigaBytes
• * 4 images per cycle (one per base) * read length (say 100 cycles) = 2
800 GBytes or 2.8 TeraBytes (TB)
• * 2 for the paired end = 5.6 TBytes
• A run of ~1 week on both machines results
in 11.2 TeraBytes of image data

11
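The chain of multiplications on this slide can be checked with a short sketch. All figures are taken from the slides; the function and constant names are illustrative only.

```python
# Sketch: image data produced per sequencing run, following the slide's
# chain of factors. Names are illustrative; the numbers are the slide's.

MB_PER_TILE = 7.3        # image data per tile, in MBytes
TILES_PER_LANE = 120
LANES = 8
IMAGES_PER_CYCLE = 4     # one picture per base (A, T, G, C)
CYCLES = 100             # read length ("say 100")
PAIRED_END_FACTOR = 2    # both ends of each fragment are read
MACHINES = 2             # two GAIIx machines


def image_data_mb(cycles=CYCLES, paired=True, machines=MACHINES):
    """Total image data in MBytes for one run on `machines` sequencers."""
    one_picture_of_flowcell = MB_PER_TILE * TILES_PER_LANE * LANES
    one_read = one_picture_of_flowcell * IMAGES_PER_CYCLE * cycles
    one_machine = one_read * (PAIRED_END_FACTOR if paired else 1)
    return one_machine * machines


total_tb = image_data_mb() / 1e6   # decimal units: 1 TB = 1 000 000 MB
print(f"Image data per run on both machines: {total_tb:.1f} TB")
```

Changing `cycles` shows how directly the data volume scales with read length, for example for the 115-cycle experiment mentioned later.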

Keeping the raw data?

• If we run for ~40 weeks a year we have
nearly 0.5 PetaBytes (1 PB = 10^15 Bytes or
1 000 000 000 000 000 Bytes)
• But if we throw the images away there is no
chance to recover more Sequence Data
from the images when a better (promised)
algorithm comes along…
• So biology now faces the problem the
physicists faced 35 years ago
12
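The "nearly 0.5 PetaBytes" figure follows from the per-run volume. A minimal sketch, assuming one run per week (the slides say a run takes "~1 week") and decimal units:

```python
# Sketch: raw image data accumulated per year if nothing is thrown away.
# Assumption: one run per week on both machines, 1 PB = 1000 TB.

TB_PER_RUN_BOTH_MACHINES = 11.2   # from the previous slide
WEEKS_PER_YEAR = 40               # "~40 weeks a year"
RUNS_PER_WEEK = 1                 # assumed from "a run of ~1 week"


def yearly_petabytes(tb_per_run=TB_PER_RUN_BOTH_MACHINES,
                     weeks=WEEKS_PER_YEAR,
                     runs_per_week=RUNS_PER_WEEK):
    """Raw image data per year, in PetaBytes."""
    return tb_per_run * weeks * runs_per_week / 1000.0


print(f"Raw image data per year: {yearly_petabytes():.3f} PB")
```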

Genome Analyzer IIx
Cluster generation

• Attach single molecules to surface
• Amplify to form clusters

10^3 molecules / µm

2.2·10^5 molecules/tile

13

Genome Analyzer IIx
Base Calling

• The identity of each base of each cluster is read off from
sequential images (cycle by cycle)

15
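As a rough illustration of reading bases off cycle by cycle, the sketch below picks, for one cluster, the brightest of the four fluorescence channels in each cycle. This is a deliberately simplified toy, not the algorithm used by the Illumina pipeline; the channel order, the intensity values and the use of "N" for a cycle with no signal are all assumptions.

```python
# Toy base caller: for each cycle, call the base whose channel is brightest.
# Assumptions (not from the slides): channels are ordered A, C, G, T,
# intensities are plain numbers, and a cycle with no signal is called "N".

CHANNELS = "ACGT"


def call_bases(intensities):
    """Turn per-cycle channel intensities for one cluster into a read.

    `intensities` is a list with one entry per cycle; each entry holds
    four numbers, one per channel in CHANNELS order.
    """
    read = []
    for cycle in intensities:
        best = max(range(len(CHANNELS)), key=lambda i: cycle[i])
        read.append(CHANNELS[best] if cycle[best] > 0 else "N")
    return "".join(read)


# Hypothetical intensities for one cluster over four cycles.
cluster = [
    (0.9, 0.1, 0.0, 0.1),   # brightest channel: A
    (0.1, 0.8, 0.1, 0.0),   # brightest channel: C
    (0.0, 0.1, 0.1, 0.7),   # brightest channel: T
    (0.0, 0.0, 0.0, 0.0),   # no signal: blank call
]
print(call_bases(cluster))
```

Keeping the raw images matters precisely because this step can be redone: a better calling algorithm applied to the same intensities may recover more sequence.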

Illumina Pipeline

ACTGCTATCTT
TCGATTCGTAC
TGCTAGGCACC
ATCGCATTTCA
GGACGTCCTGC
TAGGCACCATC
GCATCTCCATC

18

Experiment Timeline

Timing for a 115-cycle experiment on the GA IIx:

• GA IIx start: Day 1
• Illumina Pipeline: Day 10
• BWA and Yun Li workflow: Day 13
• Quality-check tools: Day 15

19

How much computing?

• A software pipeline has been implemented at CRS4 to run the analysis
steps automatically after a sequencing run ends
• 40 Gbases per run
• 370,000,000 sequences
• 4 samples per flowcell
• 7,000,000 megabytes (7 TB) of raw data produced per run
• 5 days for processing sequence data on the cluster

• A huge load for the computer centre

21

How much computing?

22

Quality Control

• We realised we needed an audit by external experts
of how well we were doing (or how badly)
• We asked experts from the Sanger Institute and from
Cancer Research, Cambridge, UK
• We developed a Quality check process:
− Qualitative and quantitative evaluation of Illumina
summary file parameters
− Evaluation of sequence quality (avg. number of
“blank” base calls)
− Evaluation of coverage / holes
− Evaluation of known/all SNPs found ratio
• This has been very successful

23
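Three of the checks listed above can be sketched in a few lines. This is only an illustration of what such metrics might compute, not the CRS4 quality-check code; it assumes that a "blank" base call is written as "N", that coverage is available as a per-position depth list, and that SNPs are compared by identifier. The sample identifiers are hypothetical.

```python
# Sketch of three quality-check metrics. Assumptions (not from the slides):
# blank calls appear as "N", coverage is a list of per-position depths,
# and SNPs are compared by identifier.


def mean_blank_calls(reads):
    """Average number of blank ("N") base calls per read."""
    if not reads:
        return 0.0
    return sum(read.upper().count("N") for read in reads) / len(reads)


def coverage_holes(depth, min_depth=1):
    """Return half-open (start, end) intervals where depth < min_depth."""
    holes, start = [], None
    for pos, d in enumerate(depth):
        if d < min_depth:
            if start is None:
                start = pos          # a hole begins here
        elif start is not None:
            holes.append((start, pos))
            start = None
    if start is not None:            # hole runs to the end of the region
        holes.append((start, len(depth)))
    return holes


def known_snp_ratio(found_snps, known_snps):
    """Fraction of the SNPs found that are already known."""
    found = set(found_snps)
    if not found:
        return 0.0
    return len(found & set(known_snps)) / len(found)


print(mean_blank_calls(["ACGT", "ANNT"]))
print(coverage_holes([2, 0, 0, 3, 1, 0]))
print(known_snp_ratio(["snp1", "snp2", "snpX"], ["snp1", "snp2", "snp3"]))
```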

Quality Check:
Weekly Team Meeting

• Qualitative and quantitative evaluation of
Illumina summary file parameters:
− Based on Sanger QC protocol
− Quantitative examination of run results
− Qualitative inspection of plots

24

Summary of results

• In October 2008 we foresaw 6 Gbases per run per machine
• We started at the end of February 2009
• We started a Quality Control initiative in Sept. 2009
• We have continuously improved the number of bases per run through:
− Upgrades of machines
− Preparation of samples (reagents, PCR)
− Increasing number of cycles
− New algorithms for image processing and base-calling, and
better alignment software
− Quality control

27

Activity summary - statistics

• 67 samples sequenced and aligned
• 6 samples currently running on the GAs
• Average coverage of samples 2.98X
• ~800 Gbases of raw data
• ~590 Gbases of aligned data

30
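The coverage figure can be cross-checked roughly from the other numbers on the slide: coverage is the aligned bases per sample divided by the genome size. The sketch below uses the rounded totals, so it only approximates the quoted 2.98X, which was presumably computed from exact per-sample figures.

```python
# Sketch: approximate average coverage from the rounded slide totals.
# Coverage = aligned bases per sample / genome size.

HUMAN_GENOME_GBASES = 3.0   # 3 Gbases in the human genome


def mean_coverage(aligned_gbases, samples, genome_gbases=HUMAN_GENOME_GBASES):
    """Average depth of coverage per sample."""
    return aligned_gbases / samples / genome_gbases


print(f"{mean_coverage(590, 67):.2f}X")
```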

Imputation

• Program from Gonçalo Abecasis and Serena Sanna
• Very powerful tool in the analysis of population genetics
• Extrapolates from measured data to infer genomic
variations that have not been measured
• Excellent e-Science: using the computer to do better
science
• This certainly merits a seminar to itself

31
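To give a feel for the idea of inferring unmeasured variations, here is a toy sketch. It is not the program mentioned above and is far simpler than real imputation software, which relies on statistical models over many reference haplotypes. The toy simply finds the reference haplotype that best matches the measured positions and copies its alleles into the unmeasured ones. All sequences and the "?" marker are hypothetical.

```python
# Toy imputation: fill unmeasured sites ("?") in a target haplotype by
# copying from the best-matching haplotype in a fully measured reference
# panel. Purely illustrative; real tools use statistical models.


def impute(target, panel):
    """Return `target` with each "?" replaced by the allele of the
    reference haplotype that agrees best at the measured sites."""

    def mismatches(ref):
        # Count disagreements only where the target was actually measured.
        return sum(1 for t, r in zip(target, ref) if t != "?" and t != r)

    best = min(panel, key=mismatches)
    return "".join(r if t == "?" else t for t, r in zip(target, best))


# Hypothetical reference panel and a sparsely measured target.
panel = ["ACGTAC", "ACTTGC", "GCGTAA"]
target = "A?GT?C"
print(impute(target, panel))
```

The denser and more representative the reference panel, the more of the unmeasured variation can be recovered this way, which is why deep sequencing of some individuals helps the analysis of many others.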

Plans and Visions

• Illumina has announced its latest sequencers, which will
measure 200 Gbases in a run of 8 days
• 5 times our current performance in 20% less time
• Easy to predict 400 or 600 Gbases: 10 to 15 times as
much data per run
• For the plans to sequence 2000 Sardinians together with
NIH and with the University of Michigan at Ann Arbor, and also for other
requests from the Park and from Sardinia, we would like
to acquire some of these new machines

32
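The "5 times our current performance in 20% less time" claim can be reproduced from figures given earlier in the talk. The current run length is not stated directly here; the sketch derives it from 40 Gbases per run at 4 Gbases per day, which is an inference rather than a quoted number.

```python
# Sketch: comparing the announced sequencers with the current GAIIx.
# The current run length (10 days) is derived from 40 Gbases/run at
# 4 Gbases/day; it is not quoted directly in the slides.

CURRENT_GBASES_PER_RUN = 40
CURRENT_GBASES_PER_DAY = 4
NEW_GBASES_PER_RUN = 200
NEW_RUN_DAYS = 8

current_run_days = CURRENT_GBASES_PER_RUN / CURRENT_GBASES_PER_DAY
per_run_factor = NEW_GBASES_PER_RUN / CURRENT_GBASES_PER_RUN
time_saving = 1 - NEW_RUN_DAYS / current_run_days
per_day_factor = (NEW_GBASES_PER_RUN / NEW_RUN_DAYS) / CURRENT_GBASES_PER_DAY

print(f"Data per run: x{per_run_factor:.0f}")
print(f"Run time: {time_saving:.0%} shorter")
print(f"Data per day: x{per_day_factor:.2f}")
```

Per day of machine time the gain is therefore larger than the factor of 5 per run, and the load on storage and computing grows accordingly.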

My personal view

• This is an opportunity for Sardinia to do frontier science on a world stage
• It exploits the Sardinian genomic heritage and its increased “signal to
noise” to find the origins and mechanisms of diseases that affect people
around the world,
• and which ultimately cost Sardinia (and the rest of humanity) a lot of
money
• It is driven by a predominantly Sardinian team doing excellent work
• It necessarily binds together the strong computer centre of CRS4 and
modern digital sequencing technology to build a forefront Sequencing
Facility
• If we don’t do this now we will lose a golden opportunity for ever
• Where else would you set up such a Facility?

33

Thank you for your attention!

34
