Best practices at BGI for the Challenges in the Era of Big Genomics Data
1. Xing Xu, Ph.D
Director of Cloud Computing Product
Challenges in the Era of Big Genomic Data and Our Practices in BGI
2. Topics for Today
About BGI
Challenges and Solutions
Data transfer
Cloud Computing
Computational Algorithms and Infrastructure
Data Storage
3. BGI
The world's largest genome sequencing center
Started with the Human Genome Project in 1999 with only a few sequencers.
Now more than 150 sequencers, 6 TB/day sequencing throughput.
MODEL                  INSTALLATION
ABI 3730XL             16
Roche 454              1
ABI SOLiD 4            27
Solexa GA IIx          6
Illumina HiSeq 2000    135
4. BGI
The world's largest genome sequencing center
The largest computing and storage center for genomics in China
- 20,000+ CPU cores
- 19 NVIDIA GPUs
- 220+ Tflops peak performance
- 17 PB data storage
- Storage and computation capability has increased 10,000-fold
- Still increasing …
5. BGI
The world's largest genome sequencing center
The largest computing and storage center for genomics in China
One of the world's leading research institutes in genomics
Since 2007,
- 253 papers in high-impact journals
- Including 47 in Nature and its sub-journals, 9 in Science, 2 in Cell, and 1 in NEJM, with 42 as first and/or corresponding authors
- 369 patent applications
- 254 software authorship
6. BGI
The world's largest genome sequencing center
The largest computing and storage center for genomics in China
One of the world's leading research institutes in genomics
BGI has the sequencing capacity, hardware resources and software proficiency to be one of the strongest end-to-end service providers in the world for NGS sequencing, data analysis and data interpretation.
9. Challenges for Handling Big Data
Exponential growth of data volume
Complicated data analysis process
Widely distributed data
Images from omicsmaps.com
10. Challenges and Solutions
Data transfer
Cloud Computing
Computational Algorithms and Infrastructure
Data Management
11. Solutions for data transfer
Data transfer
Solution I: Hard drive shipment (w/ Fedex)
12. Solutions for data transfer
Data transfer
Solution I: Hard drive shipment (w/ Fedex)
Solution II: High Speed Data Transfer
13. Solutions for data transfer: High speed data transfer
Demonstrated 10 Gbps ultra-high-speed data exchange with UC Davis and NCBI in June 2012.
14. Solutions for data transfer: High speed data transfer
Demonstrated 10 Gbps ultra-high-speed data exchange with UC Davis and NCBI in June.
A 24 GB file was transferred from China to the US in 30 seconds (~8 Gbit/s).
Right software: Aspera FASP data transfer protocol
Right infrastructure: 10 Gb link between the US and China
Right technology: RAM disk, IPv6
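The transfer figures above are simple to check by hand. The sketch below, with hypothetical helper names and decimal units (1 GB = 8 Gbit) assumed, computes the average rate of the 24 GB demonstration and the time needed to move one day of sequencing output (6 TB, from the earlier slide) over a fully used 10 Gbit/s link. Averaged over the whole 30 seconds the demonstration works out to 6.4 Gbit/s; the slide does not say how its ~8 Gbit/s figure was measured.

```python
def throughput_gbit_per_s(size_gb: float, seconds: float) -> float:
    """Average throughput in Gbit/s for size_gb gigabytes (decimal units)."""
    return size_gb * 8 / seconds


def transfer_time_s(size_gb: float, link_gbit_per_s: float, efficiency: float = 1.0) -> float:
    """Seconds needed to move size_gb gigabytes over a link.

    efficiency is an assumed fraction of the link rate that is actually used.
    """
    return size_gb * 8 / (link_gbit_per_s * efficiency)


# The demonstration on the slide: 24 GB in 30 seconds.
demo_rate = throughput_gbit_per_s(24, 30)

# One day of sequencing output (6 TB = 6000 GB) over a fully used 10 Gbit/s link.
daily_output_s = transfer_time_s(6000, 10)

print(f"24 GB in 30 s averages {demo_rate:.1f} Gbit/s")
print(f"6 TB at 10 Gbit/s takes {daily_output_s / 60:.0f} minutes")
```

At lower real-world efficiency the daily transfer time grows proportionally, which is one reason the deck goes on to argue for not moving the data at all.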
15. Solutions for data transfer
Data transfer
Solution I: Hard drive shipment (w/ Fedex)
Solution II: High Speed Data Transfer
[Diagram: one Aspera server connected to three Aspera clients. Labels: software license; expensive physical bandwidth; free; BGI; clients]
Bottleneck on the client site
Not a good solution for sharing
16. Solutions for data transfer
Data transfer
Solution I: Hard drive shipment (w/ Fedex)
Solution II: High Speed Data Transfer
Solution III: Don’t move the data (Cloud Computing)
17. Solutions for cloud
Data transfer
Solution I: Hard drive shipment (w/ Fedex)
Solution II: High Speed Data Transfer
Solution III: Don’t move the data (Cloud Computing)
Cloud Computing
EasyGenomics, a Software as a Service (SaaS) platform for NGS data analysis
18. EasyGenomics™
EasyGenomics is a Software as a Service (SaaS) bioinformatics platform for research and applications.
- Algorithms, workflows, reports
- Computational resources
- Database, data management
- Web portal, simple UI
- High speed connection
19. A typical use case
A typical use case of EasyGenomics and the customer's local computational resources. The double-line items are the customer's data or resources. The single-line items are results and data within BGI and the EasyGenomics platform. The widths of the arrows represent the sizes of the data flows (not to scale).
[Diagram label: Customers' Local Resources]
20. Bioinformatics Workflow
Four steps:
Upload, Create a Sample, Perform Analyses, Download Results
Algorithms:
Carefully chosen, tested and optimized
Workflows:
Whole Genome Resequencing, Exome Resequencing, RNA-Seq,
small RNA, ncRNA, and De novo Assembly
26. What's new?
An internal version of EG is running automatically as a production system.
It integrates the new data delivery portal of the sequencing service.
Aspera FASP download
Accessible to all workflows on EasyGenomics
27. You can choose to deliver data to the EasyGenomics platform
Configuration file
30. Solutions for cloud
Data transfer
Solution I: Hard drive shipment (w/ Fedex)
Solution II: High Speed Data Transfer
Solution III: Don’t move the data (Cloud Computing)
Cloud Computing
EasyGenomics, A SaaS platform for NGS data analysis
Two paths for the future cloud solution
31. Two paths for the future cloud solution
Software as a Service (SaaS) to Platform as a Service (PaaS)
To give flexibility to research users:
Add their own tools (any tools)
Integrate their own workflows (different combinations of modules)
One-Click SaaS solution
To give an automated solution to clinical users:
Automated solution for repetitive work
Fulfill very specific functions
32. Solutions for
Algorithm and Infrastructure
Data transfer
Solution I: Hard drive shipment (w/ Fedex)
Solution II: High Speed Data Transfer
Solution III: Don’t move the data (Cloud Computing)
Cloud Computing
EasyGenomics, A SaaS platform for NGS data analysis
Two paths for the future cloud solution
Algorithm and Infrastructure
Scale up with Hadoop / MapReduce: Hecate (de novo assembly tool), Gaea (resequencing pipeline)
36. Fast and Scalable
[Bar chart: analysis time, old pipeline vs. cloud-based pipeline]
Old pipeline: two weeks
Cloud-based pipeline: within 15 hrs (120 cores)
Data: human 60X whole-genome resequencing
• The Hadoop implementation provides great scalability.
• Simply by providing more resources, the analysis can finish much faster.
37. SOAP-GaeaAlignment (1 human sample from the 1000 Genomes Project)
Software              Mapping Rate    Confident Mapping Rate (MAPQ>=10)
Stampy                85.93%          70.00%
SOAP2                 79.14%          79.14%
Novoalign             82.53%          79.74%
BWA                   91.54%          84.78%
Bowtie                81.15%          81.15%
SOAP-GaeaAlignment    91.75%          85.20%
It's not only FAST, but also ACCURATE
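The two columns of the comparison can be read as fractions of all reads: reads that were mapped at all, and reads mapped with MAPQ >= 10. A minimal sketch of that calculation follows, assuming each read is represented by its MAPQ value or None when unmapped. The function name and toy values are illustrative and are not taken from the benchmark.

```python
def mapping_rates(mapq_values, threshold=10):
    """Return (mapping rate, confident mapping rate) as fractions of all reads.

    mapq_values holds one entry per read: an integer MAPQ for a mapped read,
    or None for an unmapped read.
    """
    total = len(mapq_values)
    if total == 0:
        raise ValueError("no reads given")
    mapped = sum(1 for q in mapq_values if q is not None)
    confident = sum(1 for q in mapq_values if q is not None and q >= threshold)
    return mapped / total, confident / total


# Toy data: ten reads, two unmapped, three mapped with MAPQ below 10.
toy_mapq = [60, 37, 0, 9, 10, None, 25, None, 3, 60]
rate, confident_rate = mapping_rates(toy_mapq)
print(f"mapping rate {rate:.0%}, confident mapping rate {confident_rate:.0%}")
```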
38. Assembly
Constructing de Bruijn graph
Solving tiny repeats
Merging bubbles
Scaffolding
Merging contigs
SOAP-Hecate: Distributed de novo Genome Assembly
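As an illustration of the first step named above, here is a minimal single-machine sketch of de Bruijn graph construction. It is not SOAP-Hecate's distributed implementation: the function names are hypothetical, reads are assumed error-free, and the repeat-solving and bubble-merging steps are not modeled.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Follow edges from start while each node has exactly one distinct successor."""
    contig = start
    node = start
    seen = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch point
        node = successors.pop()
        if node in seen:
            break  # stop on a cycle
        seen.add(node)
        contig += node[-1]
    return contig


# Three overlapping toy reads drawn from the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unbranched(graph, "ATG")
print(contig)
```

The graph grows with the number of distinct k-mers, which is why memory becomes the limiting factor for a human genome on a single server.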
39. Contig Extension
Scaffolding
Gap closing
Scalability
                  SOAPdenovo v2    SOAP-Hecate v2.5 (84 cores)    SOAP-Hecate v2.5 (180 cores)
Data Size         670 GB           670 GB                         670 GB
No. of Servers    1                7                              15
Time              59 hours         59 hours                       38 hours
Memory Size       400 GB x 1       24 GB x 7                      24 GB x 15
Mode              Centralized      Distributed                    Distributed
*80X human whole genome
SOAP-Hecate is scalable and uses much less memory
Performance
Assembler          Scaffold N50
SOAP-Hecate        26,570,829
SOAPdenovo         117,000
ALLPATHS           211,000
Phusion2, phrap    495,000
Meraculous         486,000
ABySS              144,300
Tested on simulated data from Assemblathon 1 (Earl, Bradnam et al.
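Scaffold N50, the metric in the performance comparison, is the length L such that scaffolds of length L or longer contain at least half of all assembled bases. A short sketch of the calculation on made-up lengths (not the Assemblathon data):

```python
def n50(lengths):
    """Return the N50 of a list of positive sequence lengths.

    Sort lengths from longest to shortest and return the length at which the
    running total first reaches half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy scaffold lengths summing to 300; 80 + 70 = 150 reaches the halfway point.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```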
40. Solutions for Algorithms
Data transfer
Solution I: Hard drive shipment (w/ Fedex)
Solution II: High Speed Data Transfer
Solution III: Don’t move the data (Cloud Computing)
Cloud Computing
EasyGenomics, A SaaS platform for NGS data analysis
Two paths for the future cloud solution
Algorithm and Infrastructure
Scale up with Hadoop / MapReduce: Hecate (de novo assembly tool), Gaea (resequencing pipeline)
GPU based acceleration: SOAP3 (aligner), GSNP (SNP caller), GAMA (population genetics tool)
41. SOAP3: ~20X speed up from SOAP2
SOAP -> SOAP2 (2008): 20-30X -> SOAP3 (2011, GPU version): 10-30X
[Chart: Total Time (second), SOAP2 vs. SOAP3, Human and Zebrafish; bar labels as extracted: 1893.45, 10671.39, 211.53, 819.81]
[Chart: Speedup, Human and Zebrafish; bar labels as extracted: 14.12, 14.6]
[Chart: Alignment Ratio (%), SOAP2 vs. SOAP3, Human and Zebrafish; bar labels as extracted: 84.2, 64.49, 88.29, 76.55]
Collaboration with the University of Hong Kong
43. Solutions for
Data Management
Data transfer
Solution I: Hard drive shipment (w/ Fedex)
Solution II: High Speed Data Transfer
Solution III: Don’t move the data (Cloud Computing)
Cloud Computing
EasyGenomics, A SaaS platform for NGS data analysis
Two paths for the future cloud solution
Algorithm and Infrastructure
Scale up with Hadoop / MapReduce
GPU based acceleration
Data Management
Data management in BGI
44. Paradigm Shift
Traditional Model
Business: Determines what question to ask
IT: Structures the data to answer that question
Big Data Model
IT: Delivers a platform to enable creative discovery
Business: Explores what questions could be asked
46. BGI Data Pyramid
Layers (base to top): iRODS (Data), Database (Information), Data Mining (Knowledge), Health/Clinical APP (Decision)
• Data Preservation
• Data Retrieval
• Data Sharing
• BGI-SNP
• BGI-SV
• BGI-GaP
• Disease: HGVD/PMRD
• Systems Biology
• Drug Discovery
• Diagnosis of Genetic Diseases
• Drug of Choice
48. iRODS - integrated Rule-Oriented Data System
*Access data with a web-based browser, the iRODS GUI, or command-line clients.
renci.org
49. Data Flow - iRODS
[Data-flow diagram; labels: Sequencer, Raw Data, Data Analysis, Analyzed Data, Data Warehousing, Personalized Analysis, Clinical Diagnosis, iRODS, Metadata, LIMS, Public Resources, BGI-DB, Knowledge Base, Variant (Gene), Disease, Drug]
iRODS-based Data Management
• Contents: raw data, analyzed data and related metadata
• Data backup
• Fully integrated with LIMS
• Able to search and access any data according to the metadata from the BGI data standard, e.g. project, sample, cohort, phenotype, QC, etc.
• Federation: integrate separate iRODS zones
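iRODS attaches metadata to data objects as attribute-value-unit (AVU) triples, which is what makes searches by project, sample or cohort possible. The sketch below is a toy in-memory stand-in for that idea and does not use the iRODS API. The class, the paths, and the attribute values are all hypothetical.

```python
from collections import defaultdict


class MetadataCatalog:
    """Toy in-memory index of attribute-value-unit (AVU) metadata on data objects."""

    def __init__(self):
        # logical path -> set of (attribute, value, unit) triples
        self._avus = defaultdict(set)

    def add_avu(self, path, attribute, value, unit=""):
        """Attach one AVU triple to the data object at path."""
        self._avus[path].add((attribute, value, unit))

    def search(self, **criteria):
        """Return sorted paths whose metadata match every attribute=value pair."""
        hits = []
        for path, avus in self._avus.items():
            pairs = {(attribute, value) for attribute, value, _ in avus}
            if all((a, v) in pairs for a, v in criteria.items()):
                hits.append(path)
        return sorted(hits)


catalog = MetadataCatalog()
catalog.add_avu("/bgiZone/raw/S001.fq.gz", "project", "P1")
catalog.add_avu("/bgiZone/raw/S001.fq.gz", "sample", "S001")
catalog.add_avu("/bgiZone/raw/S002.fq.gz", "project", "P1")
catalog.add_avu("/bgiZone/raw/S002.fq.gz", "sample", "S002")
catalog.add_avu("/bgiZone/analyzed/S001.vcf", "project", "P1")
catalog.add_avu("/bgiZone/analyzed/S001.vcf", "sample", "S001")
catalog.add_avu("/bgiZone/analyzed/S001.vcf", "qc", "pass")

print(catalog.search(project="P1", sample="S001"))
```

In a real deployment the catalog lives in the iRODS metadata database and is queried through the iRODS clients rather than a Python dictionary.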
50. Data Flow – BGI-DB
[Data-flow diagram as on slide 49]
BGI-DB
• A locus-specific database (LSDB) for all variants identified by BGI
• Manage all basic information generated from data analysis pipelines
• Link all detailed information about individual samples to each variant
• Easy to query information from samples with certain commonality (such as same phenotype, same cohort, etc.)
• Provide the raw information for further data mining steps
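The idea of linking sample details to each variant and querying by shared phenotype can be sketched with three relational tables. The schema, table names, and rows below are hypothetical illustrations in SQLite, not BGI-DB's actual design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variant (
    variant_id INTEGER PRIMARY KEY,
    chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, gene TEXT
);
CREATE TABLE sample (
    sample_id TEXT PRIMARY KEY, cohort TEXT, phenotype TEXT
);
-- One row per (variant, sample) pair in which the variant was observed.
CREATE TABLE observation (
    variant_id INTEGER REFERENCES variant(variant_id),
    sample_id TEXT REFERENCES sample(sample_id),
    genotype TEXT,
    PRIMARY KEY (variant_id, sample_id)
);
""")
conn.executemany("INSERT INTO variant VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "chr1", 1000, "A", "G", "GENE_A"),
    (2, "chr2", 2000, "C", "T", "GENE_B"),
])
conn.executemany("INSERT INTO sample VALUES (?, ?, ?)", [
    ("S1", "cohort1", "phenotype_x"),
    ("S2", "cohort1", "phenotype_x"),
    ("S3", "cohort2", "phenotype_y"),
])
conn.executemany("INSERT INTO observation VALUES (?, ?, ?)", [
    (1, "S1", "0/1"), (1, "S2", "1/1"), (1, "S3", "0/1"), (2, "S3", "0/1"),
])


def variants_for_phenotype(connection, phenotype):
    """List each variant with the number of samples of that phenotype carrying it."""
    return connection.execute(
        """
        SELECT v.chrom, v.pos, v.ref, v.alt, v.gene, COUNT(*) AS carriers
        FROM observation o
        JOIN variant v ON v.variant_id = o.variant_id
        JOIN sample s ON s.sample_id = o.sample_id
        WHERE s.phenotype = ?
        GROUP BY v.variant_id
        ORDER BY v.variant_id
        """,
        (phenotype,),
    ).fetchall()


rows = variants_for_phenotype(conn, "phenotype_x")
print(rows)
```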
51. Data Flow – BGI-DW & BGI-KB
[Data-flow diagram as on slide 49]
BGI Data Warehousing & Knowledge Base
• BGI data warehousing (BGI-DW) consists of a series of secondary databases related to variants, diseases and drugs
• BGI knowledge base (BGI-KB) stores and manages the knowledge obtained through mining BGI-DB, BGI-DW and other public resources
• Periodically and automatically updated
• Provide APIs for the bioinformaticians to query the information and generate individualized reports
52. Data Flow - Success Story
[Data-flow diagram as on slide 49]
Diagnosis for Monogenic Disease
• Group samples into cohorts based on their phenotypes
• Calculate variant frequencies from certain cohorts and save them into the allele frequency database
• Query the allele frequency database to filter out common variants and identify disease-causal variants
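The frequency-based filtering described on this slide can be sketched in a few lines: compute alternate-allele frequencies in a cohort, store them, and keep only those patient variants that are rare or absent in that store. The data layout, the diploid assumption, and the 1% cutoff are illustrative assumptions, not BGI's actual pipeline settings.

```python
from collections import Counter


def allele_frequencies(cohort_genotypes):
    """Compute the alternate-allele frequency of each variant in one cohort.

    cohort_genotypes maps sample id -> {variant: alt allele count (0, 1 or 2)}.
    Variants missing from a sample are treated as homozygous reference,
    and every sample is assumed to be diploid.
    """
    alt_counts = Counter()
    for genotypes in cohort_genotypes.values():
        alt_counts.update(genotypes)
    total_alleles = 2 * len(cohort_genotypes)
    return {variant: count / total_alleles for variant, count in alt_counts.items()}


def rare_variants(patient_variants, frequency_db, max_frequency=0.01):
    """Keep patient variants that are absent from, or rare in, the frequency database."""
    return [v for v in patient_variants if frequency_db.get(v, 0.0) <= max_frequency]


# Toy cohort of four samples (made-up variants).
cohort = {
    "S1": {"chr1:1000:A>G": 1, "chr2:2000:C>T": 2},
    "S2": {"chr1:1000:A>G": 1},
    "S3": {"chr2:2000:C>T": 1},
    "S4": {},
}
frequency_db = allele_frequencies(cohort)

patient = ["chr1:1000:A>G", "chr7:5000:G>A"]
candidates = rare_variants(patient, frequency_db)
print(frequency_db)
print(candidates)
```

Variants that survive the filter are candidates only; the slide's diagnosis step still has to decide which of them is causal.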
53. Summary of Our Practice
in IT infrastructure
Data transfer
Solution I: Hard drive shipment (w/ Fedex)
Solution II: High Speed Data Transfer
Solution III: Don’t move the data (Cloud Computing)
Cloud Computing
EasyGenomics, A SaaS platform for NGS data analysis
Two paths for the future cloud solution
Algorithm and Infrastructure
Scale up with Hadoop / MapReduce
GPU based acceleration
Data Management
Using the iRODS file system to manage big data
54. Acknowledgement
Development Team
Dev: Ming Jiang, Yongsheng Chen, Can Long, Jiasheng Wu, etc.
Flex Lab: Yan Li (Hecate), Zhi Zhang (GAEA, iRODS), etc.
GPU Lab: Bingqiang Wang, etc.
Test & QA Team
Xin Guan, Jingjuan Liu, etc.
PMO & IT Operation
Wenjun Zeng, Litong Lai, Jing Tian, etc.
Product Team
Xing Xu, Jing Guo, Fang Fang etc.
Other BGI Teams
Collaborators:
University of Hong Kong (HKU)
Hong Kong University of Science and Technology (HKUST)
NVIDIA
Aspera
RENCI
Tianjin Supercomputing Center
Editor's notes
At the heart of EasyGenomics is our Bioinformatics Core. 5 workflows with carefully chosen algorithms, tested and optimized. Filtering, QC Report, Alignment along with other supporting features.
Then we chose Hadoop. The Hadoop Streaming framework enabled us to quickly parallelize commonly used algorithms. The distributed storage system, HDFS, ensured the safety of our data. In addition, the MapReduce framework enabled us to further improve the accuracy and performance of our modules.
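To illustrate the map, shuffle, reduce pattern that Hadoop Streaming applies to line-oriented input, here is a single-process sketch that counts aligned reads per chromosome from tab-separated, SAM-like records. It is not Gaea's code: the function names and toy records are hypothetical, and on a real cluster the mapper and reducer would run as separate processes reading standard input.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit (chromosome, 1) for every aligned record in SAM-like input."""
    for line in lines:
        if not line.strip() or line.startswith("@"):  # skip blank and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        chrom = fields[2]  # third SAM column: reference sequence name
        if chrom != "*":  # "*" marks a read with no reference assigned
            yield chrom, 1


def reducer(pairs):
    """Sum the counts per key; pairs must arrive sorted by key, as after a shuffle."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(count for _, count in group)


def run_job(lines):
    """Simulate map -> shuffle/sort -> reduce in one process."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


toy_records = [
    "@HD\tVN:1.0",
    "read1\t0\tchr1\t100\t60",
    "read2\t0\tchr2\t250\t37",
    "read3\t4\t*\t0\t0",
    "read4\t16\tchr1\t900\t12",
]
counts = run_job(toy_records)
print(counts)
```

Because each mapper only sees its own slice of the input and each reducer only its own keys, adding nodes shortens the run without changing the code.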
We integrated all the standard resequencing steps into the cloud-based framework, which can greatly accelerate the data analysis.
Besides parallelizing existing algorithms, we also designed new approaches based on the distributed frameworks. Benefiting from their capacity for big data processing, we developed new models for mapping and variant detection.
This cloud-based pipeline is already applied in cancer and human disease studies within BGI. It can reduce the analysis time from two weeks to two days. Additionally, the figure at the bottom shows the speedups of our applications, which means we can control the analysis time by choosing the size of the cluster. With a larger cluster, it would be even faster.
This is the comparison on a human sample from the 1000 Genomes Project. Our mapping tool, GaeaAlignment, returns the highest mapping rate.
Hecate distributes the graph simplification algorithm across the entire cluster of compute nodes. It turns memory usage into fast distributed I/O work, which enables the assembly work
Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. (Source: Wikipedia)
SOAP: first-generation short read alignment tool
SOAP2 (2008): 20 to 30 times faster than SOAP, less memory
- Collaboration between BGI & University of Hong Kong
- Compressed indexing: bidirectional BWT (2BWT)
SOAP3 (2011): 10 to 30 times faster than SOAP2
- Collaboration with the University of Hong Kong
- GPU's parallel processing power
- CPU memory: increase from a few to tens of GB
- GPU-based indexing: GPU-2BWT
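The compressed indexing mentioned in these notes builds on the Burrows-Wheeler transform (BWT). The sketch below shows a plain, single-direction BWT with backward search for exact matches. It is a teaching example with a naive construction, not the bidirectional 2BWT or the GPU-2BWT index used by SOAP2 and SOAP3, and the function names are hypothetical.

```python
def bwt_index(text):
    """Build the BWT of text plus the tables needed for backward search."""
    text += "$"  # unique terminator, sorts before letters in ASCII
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    bwt = "".join(rotation[-1] for rotation in rotations)

    # first_row[c]: number of characters in the text that sort before c
    first_row = {}
    total = 0
    for char in sorted(set(bwt)):
        first_row[char] = total
        total += bwt.count(char)

    # occ[c][i]: occurrences of c in bwt[:i]
    occ = {char: [0] * (len(bwt) + 1) for char in first_row}
    for i, char in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == char else 0)
    return bwt, first_row, occ


def count_occurrences(pattern, first_row, occ, text_length):
    """Count exact matches of pattern by backward search over the BWT."""
    low, high = 0, text_length + 1  # half-open range over all sorted rotations
    for char in reversed(pattern):
        if char not in first_row:
            return 0
        low = first_row[char] + occ[char][low]
        high = first_row[char] + occ[char][high]
        if low >= high:
            return 0
    return high - low


genome = "ACGTACGTAC"
bwt, first_row, occ = bwt_index(genome)
print(bwt)
print(count_occurrences("ACG", first_row, occ, len(genome)))
```

Each pattern character costs two table lookups regardless of genome size, which is the property that makes BWT-based indexes attractive for short-read alignment.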