This presentation discusses the challenges of analyzing next-generation sequencing (NGS) data and the potential of in-memory technology for real-time analysis. Conventional analysis of genome data takes hours for alignment and variant calling and up to weeks for the analysis of annotations in world-wide databases. The Hasso Plattner Institute reports an in-memory approach that performs alignment and variant calling on about 10 GB of FASTQ data (about 45 million reads) from the 1000 Genomes Project in about 45 minutes and supports interactive, real-time exploration of the results. The approach builds on an in-memory database (SAP HANA) with a combined column and row store, so sequencing data can be stored and queried without disk access.
Real-time Analysis of Next Generation Sequencing Data
1. Real-time Analysis of Next Generation Sequencing Data
World Health Summit
Oct 24, 2012
Prof. Dr. Christoph Meinel
Matthieu Schapranow
Hasso Plattner Institute
2. Genome Sequencing:
Do you have enough time?
Image taken from http://portal.ccg.uni-koeln.de/ccg/assets/images/3730.jpg
Real-time Analysis of NGS Data, Prof. Dr. Meinel, Schapranow, World Health Summit, Oct 24, 2012
3. The Archon Genomics X Prize
“$10 million will be awarded to the first team
to rapidly, accurately and economically
sequence 100 whole human genomes to an
unprecedented level of accuracy.”
(Archon Genomics X Prize, 2012)
4. Agenda
■ Conventional Medicine
■ Personalized Medicine
■ Challenges of Genome Data Analysis
■ High-Performance In-Memory Genome Project
5. Conventional Medicine
[Chart: share of women and of men who will develop cancer vs. will never develop cancer, and share of chemotherapies that fail vs. work; axis 0% to 100%]
Source: American Cancer Society, Surveillance Research, 2012
6. Personalized Medicine
“Personalized medicine aims at treating patients
specifically based on their individual dispositions, e.g.
genetic or environmental factors”
(K. Jain, Textbook of Personalized Medicine. Springer, 2009)
Enhanced by: World-wide medical research activities
Limiting factor: Research results heterogeneously formatted in distributed databases
7. Personalized Medicine
[Diagram labels: Patient suffering from cancer; Conventional therapy; Treatment decision; DNA sequencing; Analysis of genomic data]
DNA Sequencing
• Quantity: 3.2 billion base pairs
• Data size: 1-20 GB
Analysis of Genomic Data
• Quantity:
  • Known mutations: 80M
  • Distinct genes: 20k-25k
  • Proteins: 50k-300k
• Data sizes:
  • Alignment: 5-10 GB
  • Variants: 10-100 GB
[Chart: duration of personalized medicine in days (axis 0 to 40), as of today vs. supported by HPI]
8. Challenges of Genome Data Analysis
Analysis of Genomic Data
                       Alignment and Variant Calling   Analysis of Annotations in World-wide DBs
Bound To               CPU Performance                 Memory Capacity
Duration               Hours                           Weeks
HPI                    Minutes                         Real-time
In-Memory Technology   Multi-Core                      Partitioning & Compression
10. High-Performance In-Memory Genome Project
Real-time Analysis of Genome Data
11. High-Performance In-Memory Genome Project
It is your time
■ ~10 GB of FASTQ files, i.e. ~45M reads, from the 1000 Genomes Project
■ Aligners: BWA, Bowtie, Bowtie2, TMAP
■ ~400k-700k variants detected
■ ~45 min for alignment and variant calling
■ Analysis of results: interactive exploration in real time
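As a rough illustration of the figures on this slide, the following Python sketch counts reads in FASTQ text and derives the average throughput implied by ~45M reads in ~45 minutes. It is not the HPI pipeline: the function names `count_fastq_reads` and `reads_per_second` are hypothetical, and the sketch assumes well-formed, unwrapped four-line FASTQ records.

```python
import io
from typing import TextIO


def count_fastq_reads(handle: TextIO) -> int:
    """Count reads in FASTQ text, assuming unwrapped four-line records."""
    n_lines = 0
    for i, line in enumerate(handle):
        # In a four-line record the first line is the header and starts with '@'.
        if i % 4 == 0 and not line.startswith("@"):
            raise ValueError(f"line {i + 1}: expected FASTQ header starting with '@'")
        n_lines += 1
    if n_lines % 4 != 0:
        raise ValueError("truncated FASTQ: line count is not a multiple of 4")
    return n_lines // 4


def reads_per_second(n_reads: int, minutes: float) -> float:
    """Average number of reads processed per second over the given duration."""
    return n_reads / (minutes * 60.0)


sample = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n"
print(count_fastq_reads(io.StringIO(sample)))   # 2
print(round(reads_per_second(45_000_000, 45)))  # 16667
```

Under these assumptions, the reported run corresponds to roughly 16,700 reads aligned and variant-called per second on average.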
12. High-Performance In-Memory Genome Project
Architecture
13. High-Performance In-Memory Genome Project
In-Memory Technology
[Figure: building blocks of in-memory technology (SAP HANA)]
■ Combined column and row store
■ Minimal projections
■ Any attribute as index
■ Insert only for time travel
■ Multi-core/parallelization
■ Bulk load
■ Active/passive data store
■ Lightweight compression
■ Partitioning
■ Dynamic multi-threading within nodes
■ Analytics on historical data
■ SQL interface on columns & rows
■ No aggregate tables
■ Single and multi-tenancy
■ Reduction of layers
■ Object to relational mapping
■ On-the-fly extensibility
■ Text retrieval and extraction
■ Map reduce
■ Group key
■ No disk
[Further figure labels: read event repositories, up to 8,000 read event notifications per second; verification services, up to 2,000 requests per second; discovery service]
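To make the column store, lightweight compression, and insert-only ideas named on this slide more concrete, here is a minimal Python sketch of a dictionary-encoded, append-only column. It is a toy model under stated assumptions and not how SAP HANA is implemented; the class `DictEncodedColumn` and its methods are hypothetical names chosen for illustration.

```python
from typing import Dict, List


class DictEncodedColumn:
    """Toy insert-only column with dictionary (lightweight) compression."""

    def __init__(self) -> None:
        self.dictionary: List[str] = []   # distinct values, in first-seen order
        self._ids: Dict[str, int] = {}    # value -> position in the dictionary
        self.value_ids: List[int] = []    # one small integer per row

    def append(self, value: str) -> None:
        # Insert only: rows are appended, never updated in place.
        vid = self._ids.get(value)
        if vid is None:
            vid = len(self.dictionary)
            self.dictionary.append(value)
            self._ids[value] = vid
        self.value_ids.append(vid)

    def scan_equals(self, value: str) -> List[int]:
        """Return the row positions whose value equals `value`."""
        vid = self._ids.get(value)
        if vid is None:
            return []
        return [row for row, v in enumerate(self.value_ids) if v == vid]


chrom = DictEncodedColumn()
for c in ["chr1", "chr2", "chr1", "chrX", "chr1"]:
    chrom.append(c)
print(chrom.dictionary)           # ['chr1', 'chr2', 'chrX']
print(chrom.value_ids)            # [0, 1, 0, 2, 0]
print(chrom.scan_equals("chr1"))  # [0, 2, 4]
```

Columns with few distinct values, such as chromosome names, compress well this way, and a scan compares small integers instead of strings.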
14. High-Performance In-Memory Genome Project
Hardware Characteristics at FSOC-Lab
■ 1,000 core cluster at
Hasso Plattner Institute with
25 TB main memory
■ Consists of 25 nodes, each:
□ 40 cores
□ 1 TB main memory
□ Intel® Xeon® E7-4870
□ 2.40 GHz
□ 30 MB Cache
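The cluster totals follow directly from the per-node figures. The short Python check below reproduces them and adds the main memory available per core; the constant names are illustrative, and the conversion 1 TB = 1024 GB is an assumption.

```python
NODES = 25
CORES_PER_NODE = 40
MEMORY_TB_PER_NODE = 1

total_cores = NODES * CORES_PER_NODE
total_memory_tb = NODES * MEMORY_TB_PER_NODE
print(total_cores, total_memory_tb)  # 1000 25

# Main memory per core in GB, assuming 1 TB = 1024 GB.
memory_gb_per_core = MEMORY_TB_PER_NODE * 1024 / CORES_PER_NODE
print(memory_gb_per_core)  # 25.6
```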
15. What to take home?
■ Sequencing machines become faster, smaller,
cheaper, and generate immense data sets in
heterogeneous formats
■ Information technology is the key to exploring and
analyzing these big data sets
■ Parallelization reduces time for processing of genome data
■ In-memory technology enables real-time analysis and interactive
exploration of genome data
■ We integrate research results from international research databases in a
single knowledge base
“Let’s identify genomic roots and optimal treatments before
the patient wakes up from anaesthesia”
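The parallelization point on this slide can be sketched in a few lines of Python: partition the reads, process each partition independently, and merge the partial results. This is an illustrative sketch only, not the HPI implementation; `partition`, `gc_counts`, and `parallel_gc_fraction` are hypothetical helpers, and GC content merely stands in for a real per-read analysis step.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def partition(items: Sequence[str], n_parts: int) -> List[Sequence[str]]:
    """Split items into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def gc_counts(reads: Sequence[str]) -> Tuple[int, int]:
    """Return (number of G/C bases, total bases) for one chunk."""
    gc = sum(r.count("G") + r.count("C") for r in reads)
    total = sum(len(r) for r in reads)
    return gc, total


def parallel_gc_fraction(reads: Sequence[str], workers: int = 4) -> float:
    # Threads keep the sketch portable; CPU-bound work in CPython would
    # normally use processes or native code to benefit from multiple cores.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(gc_counts, partition(reads, workers)))
    gc = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return gc / total if total else 0.0


reads = ["ACGT", "GGCC", "ATAT", "CGCG", "TTTT"]
print(parallel_gc_fraction(reads, workers=2))  # 0.5
```

Because each partition is processed independently and only small partial sums are merged, the same pattern extends to more workers without changing the result.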
16. Thank you for your interest!
Keep in contact with us.
Prof. Dr. Christoph Meinel
office-meinel@hpi.uni-potsdam.de
http://www.hpi.uni-potsdam.de/meinel/team/christoph_meinel.html
Matthieu-P. Schapranow, M.Sc.
schapranow@hpi.uni-potsdam.de
http://j.mp/schapranow
Hasso Plattner Institute
Enterprise Platform & Integration Concepts
Matthieu-P. Schapranow
August-Bebel-Str. 88
14482 Potsdam, Germany