"Computing for the Analysis of Genomic data al CRS4" (Chris Jones) presentation at CRS4 Research Center. CRS4 Staff Meeting 24-03-2010 (Pula, Sardinia, Italy)
Chris Jones - CRS4 Staff Meeting - Pula (Italy) 24-03-2010
1. Computing for the Analysis
of Genomic Data at CRS4
Chris Jones
24th March 2010
Thursday 25 March 2010
8. Who is Chris Jones?
• 10 years of particle physics research at Oxford
and CERN in Geneva
• Strong interest in the use of computers to do
things, especially science, BETTER
• The ’70s brought digital detectors and
massive waves of new data to particle physics,
causing exciting major changes in the use of, and
attitude towards, computers
• 20 years of innovating, building, developing and
running services in the CERN Computer Centre
Facility
14. Wellcome Trust Genome Campus
• Escaped on sabbatical to European
Bioinformatics Institute – EBI
• Strong links to Sanger Institute
• And to Roche – Roche Genetics IT Plan
• Founded the PRISM Forum
15. Why Sequence Genomes?
• I hope Francesco has explained that very well
• Genomic sequence is the most fundamental
information, the starting point, when you look at
how living objects work…
• And studies of “genotype” versus “phenotype” can
bring us an understanding of the origins of
disease which has been completely out of reach
until now
• The technology is just becoming available…
16. DNA sequence and genes look like…
cacaattacttccacaaatgcagtt
gaagcttctactcttcttgcatagg
taacctgagtcggagcagttttcct
cgtggcttcatctttggtgctggat
cttcagcataccaatttgaaggtgc
agtaaacgaaggcggtagaggacca
agtatttgggataccttcacccata
aatatccagaaaaaataagggatgg
aagcaatgcagacatcacggttgc
23. The Human Genome
• The nucleotide bases are:
a - adenine, c - cytosine, g - guanine, t - thymine
• It took 15 years to produce the first human genome sequence
• Which was released between 2003 and 2005
• There are 3 × 10^9 bases, or 3 Gigabases, in the human genome
• Pine trees have ~10 times more bases! Why?
• Do not confuse Gb (gigabits), GB (gigabytes) and Gbases (also written Gb)!
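The warning about units can be made concrete with a small calculation. The sketch below is illustrative only: it assumes an idealised encoding of 2 bits per base (enough for a, c, g and t, with no quality values or ambiguity codes), which is not stated on the slide, and the function name `genome_storage` is hypothetical.

```python
# Illustrative sketch: the same genome expressed in gigabases, gigabits and gigabytes.
GENOME_BASES = 3 * 10**9   # 3 Gbases, the figure quoted on the slide
BITS_PER_BASE = 2          # assumption: four symbols (a, c, g, t) need 2 bits each


def genome_storage(bases: int, bits_per_base: int = BITS_PER_BASE) -> dict:
    """Return the size of a sequence in three easily confused units."""
    bits = bases * bits_per_base
    return {
        "gigabases": bases / 1e9,
        "gigabits": bits / 1e9,
        "gigabytes": bits / 8 / 1e9,
    }


sizes = genome_storage(GENOME_BASES)
for unit, value in sizes.items():
    print(f"{unit}: {value}")
```

Under that assumption, 3 Gbases is 6 gigabits but only 0.75 gigabytes, so the three "G" units differ by factors of 2 and 8.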
24. Genome Analyzer IIx
In Edificio 3 (Building 3)
Two GAIIx machines
Each of which delivers:
40 Gbases / run
Paired-end reads
4 Gbases / day
but which are complex,
forefront technology...
32. How much data per run?
• 7.3 MBytes of image data per tile * 120 tiles * 8
lanes = 7 000 MBytes = 7 GigaBytes
• * 4 bases (one image per base) * read length (say 100
cycles) = 2 800 GBytes or 2.8 TeraBytes (TB)
• * 2 for the paired end = 5.6 TBytes
• A run of ~1 week on both machines results
in 11.2 TeraBytes of image data
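The arithmetic on this slide can be reproduced step by step. The sketch below uses only the figures quoted above and decimal units (1 TB = 10^6 MB); the constant and function names are made up for illustration.

```python
# Sketch of the per-run image data estimate, using the figures from the slide.
MB_PER_TILE = 7.3        # image data per tile
TILES_PER_LANE = 120
LANES = 8
IMAGES_PER_CYCLE = 4     # one image per base
READ_LENGTH = 100        # cycles per read ("say 100")
PAIRED_END_FACTOR = 2
MACHINES = 2
MB_PER_TB = 10**6        # decimal units


def image_data_per_run_tb(machines: int = 1) -> float:
    """Image data in TB produced by one paired-end run on `machines` sequencers."""
    per_image_set_mb = MB_PER_TILE * TILES_PER_LANE * LANES            # ~7 000 MB
    per_read_mb = per_image_set_mb * IMAGES_PER_CYCLE * READ_LENGTH    # ~2.8 TB
    per_run_mb = per_read_mb * PAIRED_END_FACTOR                       # ~5.6 TB
    return per_run_mb * machines / MB_PER_TB


one_machine_tb = image_data_per_run_tb(1)
both_machines_tb = image_data_per_run_tb(MACHINES)
print(f"one machine: {one_machine_tb:.1f} TB, both machines: {both_machines_tb:.1f} TB")
```

The unrounded product is slightly above the slide's figures because the slide rounds 7 008 MB down to 7 000 MB before multiplying.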
33. Keeping the raw data?
• If we run for ~40 weeks a year we have
nearly 0.5 PetaBytes (1 PB = 10^15 Bytes or
1 000 000 000 000 000 Bytes)
• But if we throw the images away there is no
chance to recover more sequence data
from the images when a better (promised)
algorithm comes along…
• So biology now faces the problem the
physicists faced 35 years ago
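The "nearly 0.5 PetaBytes" figure follows from the weekly volume on the previous slide. A minimal sketch, assuming 11.2 TB of images per week from both machines and decimal units; the variable names are illustrative.

```python
# Sketch: yearly raw image volume if every run's images are kept.
TB_PER_WEEK = 11.2        # both machines, from the per-run estimate
WEEKS_PER_YEAR = 40       # "~40 weeks a year"
BYTES_PER_TB = 10**12
BYTES_PER_PB = 10**15

yearly_bytes = TB_PER_WEEK * WEEKS_PER_YEAR * BYTES_PER_TB
yearly_pb = yearly_bytes / BYTES_PER_PB
print(f"{yearly_pb:.3f} PB per year")
```

This gives a little under 0.45 PB per year, which the slide rounds to "nearly 0.5".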
34. Genome Analyzer IIx
Cluster generation
Attach single molecules to surface
Amplify to form clusters
10^3 molecules / µm
2.2·10^5 molecules/tile
35. Genome Analyzer IIx
Base Calling
• The identity of each base of each cluster is read off from
sequential images (cycle by cycle)
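The idea of reading bases off cycle by cycle can be sketched in a few lines. This is a deliberately naive illustration, not the Illumina base caller: it assumes each cycle yields four intensities per cluster in the order A, C, G, T and simply picks the brightest channel, whereas real pipelines also correct for effects such as cross-talk between channels and phasing. The function name and the example intensities are invented.

```python
# Naive base-calling sketch: one base per cycle, chosen as the brightest channel.
BASES = "ACGT"  # assumed channel order


def call_bases(intensities):
    """intensities: one (A, C, G, T) intensity tuple per cycle for a single cluster."""
    read = []
    for cycle in intensities:
        brightest = max(range(len(BASES)), key=lambda i: cycle[i])
        read.append(BASES[brightest])
    return "".join(read)


# Made-up intensities for a three-cycle read.
example = [
    (0.90, 0.10, 0.05, 0.10),
    (0.10, 0.20, 0.80, 0.10),
    (0.00, 0.10, 0.10, 0.70),
]
print(call_bases(example))
```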
37. Experiment Timeline
Timing for a 115-cycle experiment on GA IIx:
Day 1: GA IIx start
Day 10: Illumina Pipeline
Day 13: BWA and Yun Li workflow
Day 15: Quality-Check Tools
38. How much computing?
A software pipeline has been implemented at CRS4 to perform these
operations automatically after a sequencing run ends
40 Gbases per run
370,000,000 sequences
4 samples per flowcell
7,000,000 megabytes (7 TB) of raw data produced per run
5 days for processing sequence data on the cluster
A huge load for the computer centre
44. Quality Control
We realised we needed an audit by external experts
of how well we were doing (or how badly)
We asked experts from the Sanger Institute and from
Cancer Research, Cambridge, UK
We developed a quality check process:
− Qualitative and quantitative evaluation of Illumina
summary file parameters
− Evaluation of sequence quality (avg. number of
“blank” base calls)
− Evaluation of coverage / holes
− Evaluation of the ratio of known SNPs to all SNPs found
• This has been very successful
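Two of the checks listed above are simple enough to sketch. The code below is a hypothetical illustration, not the CRS4 quality check tools: it assumes "blank" base calls appear as the character N in the read strings and that SNPs are identified by (chromosome, position) pairs; both conventions and all names are assumptions.

```python
# Sketch of two QC metrics: average blank calls per read, and known/all SNP ratio.
def avg_blank_calls(reads, blank="N"):
    """Average number of blank base calls per read (assumes blanks are written as 'N')."""
    if not reads:
        return 0.0
    return sum(read.upper().count(blank) for read in reads) / len(reads)


def known_snp_ratio(found_snps, known_snps):
    """Fraction of the SNPs found that are already in a set of known SNPs."""
    found = set(found_snps)
    if not found:
        return 0.0
    return len(found & set(known_snps)) / len(found)


# Made-up example data.
reads = ["ACGTN", "NNACG", "ACGTA"]
found = [("chr1", 100), ("chr1", 200), ("chr2", 50), ("chr2", 75)]
known = [("chr1", 100), ("chr2", 50), ("chr3", 5)]

print(avg_blank_calls(reads))
print(known_snp_ratio(found, known))
```

A rising blank-call average or a falling known-SNP ratio from one run to the next would be the kind of signal such a check is meant to surface.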
45. Quality Check: Weekly Team Meeting
Qualitative and quantitative evaluation of
Illumina summary file parameters:
− Based on Sanger QC protocol
− Quantitative examination of run results
− Qualitative inspection of plots
46. Summary of results
In October 2008 we foresaw 6 Gbases per run per machine
We started at the end of February 2009
We started a Quality Control initiative in September 2009
We have continuously improved the number of bases per run through:
− Upgrades of machines
− Preparation of samples (reagents, PCR)
− Increasing number of cycles
− New algorithms for image processing and base-calling,
and better alignment software
− Quality control
48. Activity summary - statistics
67 samples sequenced and aligned
6 samples currently running on the GAs
Average coverage of samples 2.98X
~800 Gbases of raw data
~590 Gbases of aligned data
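Coverage is the number of sequenced bases divided by the genome size, so the figures above can be cross-checked. A rough sketch, assuming a 3 Gbase human genome and that the aligned bases are spread evenly over the 67 samples (an assumption, since per-sample totals are not given); the function name is illustrative.

```python
# Sketch: mean per-sample coverage estimated from the activity statistics.
GENOME_SIZE_GBASES = 3.0   # human genome size quoted earlier in the talk


def mean_coverage(total_gbases, samples, genome_gbases=GENOME_SIZE_GBASES):
    """Average fold coverage per sample, assuming bases are split evenly."""
    if samples <= 0 or genome_gbases <= 0:
        raise ValueError("samples and genome size must be positive")
    return total_gbases / samples / genome_gbases


aligned_coverage = mean_coverage(590, 67)   # from ~590 Gbases of aligned data
raw_coverage = mean_coverage(800, 67)       # from ~800 Gbases of raw data
print(f"aligned: {aligned_coverage:.2f}X, raw: {raw_coverage:.2f}X")
```

The aligned estimate comes out at roughly 2.9X, close to the 2.98X on the slide; the small difference is expected because the totals are rounded and the slide presumably averages per-sample values.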
49. Imputation
• Program from Gonçalo Abecasis and Serena Sanna
• Very powerful tool in the analysis of population genetics
• Extrapolate measured data to infer more genomic
variations that you have not measured
• Excellent e-Science, use the computer to do better
science
• This certainly merits a seminar to itself
50. Plans and Visions
• Illumina has announced its latest sequencers, which will
measure 200 Gbases in a run of 8 days
• 5 times our current performance in 20% less time
• Easy to predict 400 or 600 Gbases: 10 to 15 times as
much data per run
• For the plans to sequence 2000 Sardinians together with
NIH and with the University at Ann Arbor, and also for other
requests from the Park and from Sardinia, we would like
to acquire some of these new machines
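The "5 times in 20% less time" comparison can be checked against the GAIIx figures given earlier. The sketch below assumes a current run of 40 Gbases over 10 days, which is derived from the quoted 40 Gbases per run and 4 Gbases per day rather than stated directly; names are illustrative.

```python
# Sketch: comparing current and announced sequencer throughput.
def gbases_per_day(gbases_per_run, days_per_run):
    return gbases_per_run / days_per_run


CURRENT_RUN = (40, 10)     # Gbases, days (10 days derived from 40 Gbases at 4 Gbases/day)
ANNOUNCED_RUN = (200, 8)   # Gbases, days, as announced by Illumina

current_rate = gbases_per_day(*CURRENT_RUN)
announced_rate = gbases_per_day(*ANNOUNCED_RUN)
data_ratio = ANNOUNCED_RUN[0] / CURRENT_RUN[0]
time_saving = 1 - ANNOUNCED_RUN[1] / CURRENT_RUN[1]
rate_ratio = announced_rate / current_rate

print(f"{data_ratio:.0f}x the data, {time_saving:.0%} less time, "
      f"{rate_ratio:.2f}x the daily throughput")
```

Five times the data in four fifths of the time works out to a bit more than six times the daily throughput, which is what the downstream storage and computing would have to absorb.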
58. My personal view
• This is an opportunity for Sardinia to do frontier science on a world stage
• It exploits the Sardinian genomic heritage and its increased “signal to
noise” to find the origins and mechanisms of diseases that affect people
around the world,
• and which ultimately cost Sardinia (and the rest of humanity) a lot of
money
• It is driven by a predominantly Sardinian team doing excellent work
• It necessarily binds together the strong computer centre of CRS4 and
modern digital sequencing technology to build a forefront Sequencing
Facility
• If we don’t do this now we will lose a golden opportunity forever
• Where else would you set up such a Facility?
59. Thank you for your attention!