Xing Xu, Ph.D
Director of Cloud Computing Product
Challenges in the Era of Big Genomic
Data and Our Practices in BGI
Topics for Today
 About BGI
 Challenges and Solutions
Data transfer
Cloud Computing
Computational Algorithms and Infrastructure
Data Storage
BGI
 The world's largest genome sequencing center
Started with the Human Genome Project in 1999 with only a
few sequencers.
Now more than 150 sequencers and 6 TB/day of sequencing
throughput.
MODEL                  INSTALLATION
ABI 3730XL             16
Roche 454              1
ABI SOLiD 4            27
Solexa GA IIx          6
Illumina HiSeq 2000    135
BGI
 The largest computing and storage center for
genomics in China
- 20,000+ CPU cores
- 19 NVIDIA GPUs
- 220+ Tflops peak performance
- 17 PB data storage
- Storage and computation capacity have
increased 10,000-fold
- Still increasing …
BGI
 One of the world's leading research institutes in
genomics
Since 2007,
- 253 papers in high-impact journals
- Including 47 in Nature and its sub-journals,
9 in Science, 2 in Cell, and 1 in NEJM, with
42 as first and/or corresponding authors
- 369 patent applications
- 254 software authorship
BGI
BGI has the sequencing capacity, hardware resources
and software proficiency to be one of the strongest
end-to-end service providers in the world for NGS
sequencing, data analysis and data interpretation.
Challenges for
Handling Big Data
 Exponential growth of data amount
 Complicated data analysis process
 Widely distributed data
Images from omicsmaps.com
BGI
Challenges and Solutions
 Data transfer
 Cloud Computing
 Computational Algorithms and Infrastructure
 Data Management
Solutions for data transfer
 Data transfer
Solution I: Hard drive shipment (w/ FedEx)
Solution II: High Speed Data Transfer
Solutions for data transfer:
High speed data transfer
 Demonstrated 10 Gbps ultra-high-speed data
exchange with UC Davis and NCBI in June 2012.
 A 24 GB file was transferred from China to the US
in 30 seconds (~8 Gbit/s).
Right software: Aspera FASP data transfer protocol
Right infrastructure: 10 Gb link between the US and China
Right technology: RAM disk, IPv6
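The transfer figures above can be checked with simple arithmetic. The sketch below assumes decimal units (1 GB = 8 Gbit); the function names are illustrative only. Under that assumption, 24 GB in exactly 30 seconds averages 6.4 Gbit/s, while a sustained ~8 Gbit/s would move 24 GB in about 24 seconds, so the "30 seconds" on the slide is presumably a rounded upper bound.

```python
def throughput_gbit_per_s(size_gigabytes: float, seconds: float) -> float:
    """Average throughput in Gbit/s, assuming decimal units (1 GB = 8 Gbit)."""
    return size_gigabytes * 8 / seconds


def transfer_seconds(size_gigabytes: float, gbit_per_s: float) -> float:
    """Time in seconds needed to move a file at a sustained rate."""
    return size_gigabytes * 8 / gbit_per_s


# Figures taken from the slide: 24 GB, 30 s, ~8 Gbit/s.
print(throughput_gbit_per_s(24, 30))  # average rate if the transfer took a full 30 s
print(transfer_seconds(24, 8))        # time implied by a sustained 8 Gbit/s
```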
Solutions for data transfer
Solution II: High Speed Data Transfer
[Diagram labels: Aspera Server (BGI); Aspera Clients (free)]
 Software license
 Expensive physical bandwidth
 Bottleneck on the client site
 Not a good solution for sharing
Solution III: Don’t move the data (Cloud Computing)
 Cloud Computing
EasyGenomics, A Software as a Service (SaaS) platform
for NGS data analysis
EasyGenomics™
EasyGenomics is a Software as a Service (SaaS)
bioinformatics platform for research and applications.
[Diagram components:]
• Algorithms, Workflows, Reports
• Computational Resources
• Database, Data management
• Web portal, Simple UI
• High speed connection
A typical use case
[Figure: a typical use case combining EasyGenomics with a customer's local computational resources.
Double-lined items are the customer's data or resources. Single-lined items are
results and data within BGI and the EasyGenomics platform. Arrow widths
represent the sizes of the data flows (not to scale).]
Bioinformatics Workflow
 Four steps:
Upload, Create a Sample, Perform Analyses, Download Results
 Algorithms:
Carefully chosen, tested and optimized
 Workflows:
Whole Genome Resequencing, Exome Resequencing, RNA-Seq,
small RNA, ncRNA, and De novo Assembly
Homepage
[Screenshot callouts: four task portals; status of recent work; warnings and logging; navigation tabs]
Sequencing Quality Report
Mapping Report
Create an Analysis
[Screenshot callout: selected sample(s)]
• One selected sample => Single Analysis
• Multiple selected samples => Batch Analyses
Create an Analysis
[Screenshot callouts: selectable modules; predefined settings shortcut]
What’s new?
 An internal version of EasyGenomics (EG) runs
automatically as a production system.
 It integrates the new data delivery portal of
the sequencing service.
Aspera FASP download
Accessible to all workflows on EasyGenomics
You can choose to deliver data
to the EasyGenomics platform
Import Data from
Sequencing Service
[Screenshot callout: configuration file]
Import Data from
Sequencing Service
[Screenshot callout: imported samples]
Two paths for
the future cloud solution
 From Software as a Service (SaaS) to Platform as a
Service (PaaS)
To give research users flexibility:
Add their own tools (any tools)
Integrate their own workflows (different combinations of
modules)
 One-Click SaaS solution
To give clinical users an automated solution:
Automate repetitive work
Fulfill very specific functions
Solutions for
Algorithm and Infrastructure
 Algorithm and Infrastructure
Scale up with Hadoop / MapReduce: Hecate (de novo
Assembly tool), Gaea (Resequencing pipeline)
• Fast Parallel Framework: Hadoop Streaming
• Reliable Storage System: HDFS
• Scalable Map/Reduce framework
Step                       Tools
Raw Data                   -
QC                         SOAP-GaeaQC
Mapping                    SOAPaligner, BWA, Bowtie, SOAP-GaeaAlignment
Remove PCR duplications    SOAP-GaeaMarkDuplicate
Realignment                SOAP-GaeaRealignment
Identify Variations        SNP: SOAPsnp, SOAP-GaeaSNP, SAMtools
                           InDel: Dindel, SOAP-GaeaIndel
Selection & Annotation     -
SOAP-Gaea: Hadoop based
resequencing pipeline
[Diagram labels: Reads; Reference; Key / Value; Position; Map; Aligning; Reduce]
 Distributed Indexing for load balancing
 Flexible splitting tolerates more mismatches
 Dynamic Programming for robust gap alignment
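One way to read the diagram labels and bullets above is that the map step emits key-value pairs keyed by a candidate reference position, and the reduce step performs the alignment for each position. The in-memory sketch below illustrates that general pattern only. It is not the SOAP-Gaea implementation: the toy reference, the seed length, the function names, and the mismatch-count scoring (used here in place of gapped dynamic programming) are all assumptions made for illustration.

```python
from collections import defaultdict

REFERENCE = "ACGTACGTTAGCCGATAGCTAGGCTTACGATCG"  # toy reference, made up
SEED_LEN = 8                                     # hypothetical seed length


def build_seed_index(reference, k):
    """Index every k-mer of the reference by its start position."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read_id, read, index, k):
    """Map: emit (candidate position, (read id, read)) for each seed hit."""
    for pos in index.get(read[:k], []):
        yield pos, (read_id, read)


def reduce_position(pos, reads, reference):
    """Reduce: score every read sent to this position by counting mismatches."""
    results = []
    for read_id, read in reads:
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        results.append((read_id, pos, mismatches))
    return results


def run(reads):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    index = build_seed_index(REFERENCE, SEED_LEN)
    shuffled = defaultdict(list)
    for read_id, read in reads.items():
        for pos, value in map_read(read_id, read, index, SEED_LEN):
            shuffled[pos].append(value)
    alignments = []
    for pos in sorted(shuffled):
        alignments.extend(reduce_position(pos, shuffled[pos], REFERENCE))
    return alignments


# One exact read and one read with a single mismatch in its last base.
demo_reads = {"r1": REFERENCE[4:16], "r2": REFERENCE[16:27] + "G"}
print(run(demo_reads))
```

Keying by position means that all reads competing for the same region meet in the same reducer, which is what allows the work to be spread across nodes.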
SOAP-Gaea: Hadoop based
resequencing pipeline
[Bar chart: run time of the old pipeline (two weeks) versus the cloud-based pipeline (within 15 hrs on 120 cores)]
Data: Human 60X whole genome re-sequencing
Fast and Scalable
• The Hadoop implementation provides great scalability.
• Simply by providing more resources, the analysis can finish much
faster.
 SOAP-GaeaAlignment (1 human sample from the 1000 Genomes Project)
Software              Mapping Rate    Confident Mapping Rate (MAPQ>=10)
Stampy                85.93%          70.00%
SOAP2                 79.14%          79.14%
Novoalign             82.53%          79.74%
BWA                   91.54%          84.78%
Bowtie                81.15%          81.15%
SOAP-GaeaAlignment    91.75%          85.20%
It’s not only FAST,
but also ACCURATE
SOAP-Hecate: Distributed
de novo Genome Assembly
[Diagram labels, Assembly: Constructing de Bruijn Graph; Solving Tiny Repeats; Merging Bubbles; Merging Contigs; Scaffolding]
[Diagram labels: Contig Extension; Scaffolding; Gap closing]
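The first assembly step named above, constructing a de Bruijn graph, can be sketched in a few lines. This is a minimal single-machine illustration under stated assumptions, not SOAP-Hecate's distributed implementation: the function names, the toy reads, and k = 4 are made up, and the walk simply follows unambiguous edges without handling repeats, bubbles, or sequencing errors.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Walk forward from start while the path is unambiguous."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for nxt in successors:
            indegree[nxt] += 1
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if indegree[nxt] != 1 or nxt in seen:
            break  # branch point or cycle: stop extending
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three overlapping toy reads drawn from one short made-up sequence.
toy_reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = de_bruijn_graph(toy_reads, k=4)
print(extend_contig(graph, "ATG"))
```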
SOAP-Hecate is scalable and
uses much less memory
 Scalability
                  SOAPdenovo v2    SOAP-Hecate v2.5    SOAP-Hecate v2.5
                                   (84 cores)          (180 cores)
Data Size         670GB            670GB               670GB
No. of Servers    1                7                   15
Time              59 hours         59 hours            38 hours
Memory Size       400*1            24*7                24G*15
Mode              Centralized      Distributed         Distributed
*80X human whole genome
 Performance
Assembler          Scaffold N50
SOAP-Hecate        26,570,829
SOAPdenovo         117,000
ALLPATH            211,000
Phusion2, phrap    495,000
Meraculous         486,000
ABySS              144,300
Tested on simulated data from Assemblathon 1 (Earl, Bradnam et al.
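For readers unfamiliar with the metric in the table above, N50 is the length L such that scaffolds of length L or longer together cover at least half of the total assembled length. A minimal sketch follows; the function name and the example lengths are illustrative and are not data from the slides.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 requires at least one length")


# Made-up scaffold lengths: total 290, and 80 + 70 = 150 covers half.
print(n50([80, 70, 50, 40, 30, 20]))
```

A larger scaffold N50 means that more of the assembly sits in long scaffolds, which is why it is commonly used to compare assemblers on the same data.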
Solutions for Algorithms
GPU based acceleration: SOAP3 (Aligner), GSNP (SNP
caller), GAMA (Population genetics tool)
SOAP3: ~20X speed up
from SOAP2
[Timeline: SOAP; SOAP2 (2008), 20-30x; SOAP3 (2011), GPU version, 10-30X]
[Bar chart: Total Time (seconds), SOAP2 vs SOAP3, Human and Zebrafish; data labels: 1893.45, 10671.39, 211.53, 819.81]
[Bar chart: Speedup, Human and Zebrafish; data labels: 14.12, 14.6]
[Bar chart: Alignment Ratio (%), SOAP2 vs SOAP3, Human and Zebrafish; data labels: 84.2, 64.49, 88.29, 76.55]
Collaboration with the University of Hong Kong
GSNP: 50X faster than
CPU-based SOAPsnp
[Bar charts, log scale: elapsed time (sec.), GSNP vs SOAPsnp. Ch. 1: 527 vs 21879. Ch. 21: 73 vs 3675]
 The elapsed time of all steps is included.
 GSNP is around 50x faster than single-thread
CPU-based SOAPsnp.
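The "around 50x" figure can be checked against the elapsed times shown in the charts. The sketch below assumes that, within each chart, the first value belongs to GSNP and the second to SOAPsnp (inferred from the order of the axis labels); the dictionary layout and function name are illustrative.

```python
# Elapsed times in seconds, read off the two charts (pairing is an assumption).
ELAPSED_SECONDS = {
    "Ch. 1": {"GSNP": 527, "SOAPsnp": 21879},
    "Ch. 21": {"GSNP": 73, "SOAPsnp": 3675},
}


def speedup(baseline_seconds, accelerated_seconds):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_seconds / accelerated_seconds


for chromosome, times in ELAPSED_SECONDS.items():
    ratio = speedup(times["SOAPsnp"], times["GSNP"])
    print(f"{chromosome}: {ratio:.1f}x")
```

Under that pairing the ratios come out at roughly 41.5x for chromosome 1 and 50.3x for chromosome 21, consistent with the headline figure as an approximate upper value.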
Solutions for
Data Management
 Data Management
Data management in BGI
Paradigm Shift
Traditional Model
• Business: determines what question to ask
• IT: structures the data to answer that question
Big Data Model
• IT: delivers a platform to enable creative discovery
• Business: explores what questions could be asked
Information Pyramid
[Pyramid levels: Value; Decision; Knowledge; Information; Data]
[Side labels: Element; Meaning; Context; Application; Achievement]
[Transitions: Organizing; Refining; Summarizing; Utilizing]
BGI Data Pyramid
iRODS (Data)
• Data Preservation
• Data Retrieval
• Data Sharing
Database (Information)
• BGI-SNP
• BGI-SV
• BGI-GaP
Data Mining (Knowledge)
• Disease: HGVD/PMRD
• Systems Biology
• Drug Discovery
Health/Clinical APP (Decision)
• Diagnosis of Genetic Diseases
• Drug of Choice
Data Flow
[Diagram labels: Sequencer; Raw Data; Data Analysis; Analyzed Data; Data Warehousing; Personalized Analysis; Clinical Diagnosis; iRODS; Metadata; LIMS; Public Resources; BGI-DB; Knowledge Base: Variant (Gene), Disease, Drug]
iRODS - integrated Rule-Oriented
Data System
*Access data with a web-based browser, the iRODS GUI, or command-line clients.
Image from renci.org
Data Flow - iRODS
[Diagram: the BGI data flow diagram, with iRODS highlighted]
iRODS-based Data Management
• Contents: raw data, analyzed data and related metadata
• Data backup
• Fully integrated with LIMS
• Able to search and access any data according to the metadata from
BGI data standard, e.g. project, sample, cohort, phenotype, QC, etc.
• Federation: integrate separate iRODS zones
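The metadata-driven search described above can be pictured as filtering a catalog of data objects by attribute values. The sketch below is a plain in-memory illustration and does not use the iRODS API or any iRODS client library: the `DataObject` class, the `search` function, the paths, and the attribute values are all hypothetical; only the attribute names (project, sample, cohort, phenotype, QC) come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """A stored file plus the metadata attached to it (illustrative only)."""
    path: str
    metadata: dict = field(default_factory=dict)


def search(catalog, **criteria):
    """Return objects whose metadata matches every given attribute value."""
    return [
        obj for obj in catalog
        if all(obj.metadata.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical catalog entries; paths and values are made up.
catalog = [
    DataObject("/zoneA/raw/s1.fq.gz", {"project": "P1", "sample": "S1", "QC": "pass"}),
    DataObject("/zoneA/raw/s2.fq.gz", {"project": "P1", "sample": "S2", "QC": "fail"}),
    DataObject("/zoneB/raw/s3.fq.gz", {"project": "P2", "sample": "S3", "QC": "pass"}),
]

for obj in search(catalog, project="P1", QC="pass"):
    print(obj.path)
```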
Data Flow - BGI-DB
[Diagram: the BGI data flow diagram, with BGI-DB highlighted]
BGI-DB
• A locus-specific database (LSDB) for all variants identified by BGI
• Manage all basic information generated from data analysis pipelines
• Link all detailed information about individual samples to each variant
• Easy to query information from samples with certain commonality
(such as same phenotype, same cohort, etc.)
• Provide the raw information for further data mining steps
Data Flow - BGI-DW & BGI-KB
[Diagram: the BGI data flow diagram, with Data Warehousing and Knowledge Base highlighted]
BGI Data Warehousing & Knowledge Base
• BGI data warehousing (BGI-DW) consists of a series of secondary databases related to
variants, diseases and drugs
• BGI knowledge base (BGI-KB) stores and manages the knowledge obtained through
mining BGI-DB, BGI-DW and other public resources
• Periodically and automatically updated
• Provide APIs for the bioinformaticians to query the information and generate
individualized reports
Data Flow - Success Story
[Diagram: the BGI data flow diagram, annotated with the steps below]
Diagnosis for Monogenic Disease
• Group samples into cohorts based on their phenotypes
• Calculate variant frequencies from certain cohorts and save
them into the allele frequency database
• Query the allele frequency database to filter out common
variants and identify disease-causal variants
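The frequency-based filtering in this example can be sketched as two small functions: one that computes allele frequencies for a cohort, and one that keeps only a patient's rare variants. This is a minimal illustration under stated assumptions, not BGI's pipeline: the function names, the variant identifiers, the diploid assumption, and the 1% default threshold are all made up for the example.

```python
from collections import Counter


def allele_frequencies(cohort_genotypes):
    """Compute per-variant allele frequencies for a cohort.

    cohort_genotypes maps sample id -> {variant id: alternate allele count},
    where the count is 0, 1, or 2 (diploid samples assumed).
    """
    counts = Counter()
    for genotypes in cohort_genotypes.values():
        counts.update(genotypes)  # adds the allele counts per variant
    total_alleles = 2 * len(cohort_genotypes)
    return {variant: n / total_alleles for variant, n in counts.items()}


def rare_variants(patient_variants, frequency_db, max_frequency=0.01):
    """Keep variants that are absent from or rare in the frequency database."""
    return [
        variant for variant in patient_variants
        if frequency_db.get(variant, 0.0) <= max_frequency
    ]


# Hypothetical cohort of four samples and one patient.
cohort = {
    "s1": {"var_common": 1, "var_medium": 2},
    "s2": {"var_common": 1},
    "s3": {"var_common": 2},
    "s4": {},
}
frequency_db = allele_frequencies(cohort)
print(rare_variants(["var_common", "var_novel"], frequency_db))
```

Variants that survive the filter are candidates only; identifying the causal variant still requires the phenotype and knowledge-base steps described in the slides.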
Summary of Our Practice
in IT infrastructure
 Data transfer
Solution I: Hard drive shipment (w/ FedEx)
Solution II: High Speed Data Transfer
Solution III: Don’t move the data (Cloud Computing)
 Cloud Computing
EasyGenomics, A SaaS platform for NGS data analysis
Two paths for the future cloud solution
 Algorithm and Infrastructure
Scale up with Hadoop / MapReduce
GPU based acceleration
 Data Management
Using the iRODS file system to manage big data
Acknowledgement
 Development Team
Dev: Ming Jiang, Yongsheng Chen, Can Long, Jiasheng Wu, etc.
Flex Lab: Yan Li (Hecate), Zhi Zhang (GAEA, iRODS), etc.
GPU Lab: Bingqiang Wang, etc.
 Test & QA Team
Xin Guan, Jingjuan Liu, etc.
 PMO & IT Operation
Wenjun Zeng, Litong Lai, Jing Tian, etc.
 Product Team
Xing Xu, Jing Guo, Fang Fang etc.
 Other BGI Teams
 Collaborators:
University of Hong Kong (HKU)
Hong Kong University of Science and Technology (HKUST)
NVIDIA
Aspera
RENCI
Tianjin Supercomputing Center


Big Data and OSS at IBM
Big Data and OSS at IBMBig Data and OSS at IBM
Big Data and OSS at IBM
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
 
Making BD Work~TIAS_20150622
Making BD Work~TIAS_20150622Making BD Work~TIAS_20150622
Making BD Work~TIAS_20150622
 
Network Engineering for High Speed Data Sharing
Network Engineering for High Speed Data SharingNetwork Engineering for High Speed Data Sharing
Network Engineering for High Speed Data Sharing
 
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an...
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Many Task Applications for Grids and Supercomputers
Many Task Applications for Grids and SupercomputersMany Task Applications for Grids and Supercomputers
Many Task Applications for Grids and Supercomputers
 
Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus - Northwestern CGI Workshop 4-21-2011Bionimbus - Northwestern CGI Workshop 4-21-2011
Bionimbus - Northwestern CGI Workshop 4-21-2011
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...Advanced data science algorithms applied to scalable stream processing by Dav...
Advanced data science algorithms applied to scalable stream processing by Dav...
 
2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf2023comp90024_Spartan.pdf
2023comp90024_Spartan.pdf
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010Hadoop for Scientific Workloads__HadoopSummit2010
Hadoop for Scientific Workloads__HadoopSummit2010
 
Processing chains with OGC Web Processing Services to process satellite data ...
Processing chains with OGC Web Processing Services to process satellite data ...Processing chains with OGC Web Processing Services to process satellite data ...
Processing chains with OGC Web Processing Services to process satellite data ...
 
Cloudgene - A MapReduce based Workflow Management System
Cloudgene - A MapReduce based Workflow Management SystemCloudgene - A MapReduce based Workflow Management System
Cloudgene - A MapReduce based Workflow Management System
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
 
Big data & hadoop framework
Big data & hadoop frameworkBig data & hadoop framework
Big data & hadoop framework
 

Dernier

SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, YardstickSaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, Yardsticksaastr
 
George Lever - eCommerce Day Chile 2024
George Lever -  eCommerce Day Chile 2024George Lever -  eCommerce Day Chile 2024
George Lever - eCommerce Day Chile 2024eCommerce Institute
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxraffaeleoman
 
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...henrik385807
 
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Delhi Call girls
 
Microsoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AIMicrosoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AITatiana Gurgel
 
Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Vipesco
 
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesVVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesPooja Nehwal
 
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfCTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfhenrik385807
 
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night EnjoyCall Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night EnjoyPooja Nehwal
 
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Kayode Fayemi
 
Mathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptxMathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptxMoumonDas2
 
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxMohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxmohammadalnahdi22
 
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )Pooja Nehwal
 
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024eCommerce Institute
 
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Hasting Chen
 
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceDelhi Call girls
 
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Pooja Nehwal
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaKayode Fayemi
 
Presentation on Engagement in Book Clubs
Presentation on Engagement in Book ClubsPresentation on Engagement in Book Clubs
Presentation on Engagement in Book Clubssamaasim06
 

Dernier (20)

SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, YardstickSaaStr Workshop Wednesday w/ Lucas Price, Yardstick
SaaStr Workshop Wednesday w/ Lucas Price, Yardstick
 
George Lever - eCommerce Day Chile 2024
George Lever -  eCommerce Day Chile 2024George Lever -  eCommerce Day Chile 2024
George Lever - eCommerce Day Chile 2024
 
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptxChiulli_Aurora_Oman_Raffaele_Beowulf.pptx
Chiulli_Aurora_Oman_Raffaele_Beowulf.pptx
 
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
 
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
 
Microsoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AIMicrosoft Copilot AI for Everyone - created by AI
Microsoft Copilot AI for Everyone - created by AI
 
Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510Thirunelveli call girls Tamil escorts 7877702510
Thirunelveli call girls Tamil escorts 7877702510
 
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara ServicesVVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
VVIP Call Girls Nalasopara : 9892124323, Call Girls in Nalasopara Services
 
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfCTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
 
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night EnjoyCall Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
Call Girl Number in Khar Mumbai📲 9892124323 💞 Full Night Enjoy
 
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
 
Mathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptxMathematics of Finance Presentation.pptx
Mathematics of Finance Presentation.pptx
 
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptxMohammad_Alnahdi_Oral_Presentation_Assignment.pptx
Mohammad_Alnahdi_Oral_Presentation_Assignment.pptx
 
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
WhatsApp 📞 9892124323 ✅Call Girls In Juhu ( Mumbai )
 
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
 
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
Re-membering the Bard: Revisiting The Compleat Wrks of Wllm Shkspr (Abridged)...
 
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 93 Noida Escorts >༒8448380779 Escort Service
 
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
 
If this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New NigeriaIf this Giant Must Walk: A Manifesto for a New Nigeria
If this Giant Must Walk: A Manifesto for a New Nigeria
 
Presentation on Engagement in Book Clubs
Presentation on Engagement in Book ClubsPresentation on Engagement in Book Clubs
Presentation on Engagement in Book Clubs
 

Best practices at BGI for the Challenges in the Era of Big Genomics Data

  • 1. Xing Xu, Ph.D Director of Cloud Computing Product Challenges in the Era of Big Genomic Data and Our Practices in BGI
  • 2. Topics for Today  About BGI  Challenges and Solutions Data transfer Cloud Computing Computational Algorithms and Infrastructure Data Storage 2
  • 3. BGI  The world largest genome sequencing center Started with Human Genome Project in 1999 with only a few sequencers. Now more than 150 sequencers, 6 TB/day sequencing throughput. MODEL ABI 3730XL Roche 454 ABI SOLiD 4 Solexa GA IIx Illumina HiSeq 2000 INSTALLATION 16 1 27 6 135
  • 4. BGI  The world largest genome sequencing center  The largest computing and storage center for genomics in China - 20,000+ CPU cores - 19 NVIDIA GPUs - 220+ Tflops peak performance - 17 PB data storage - The storage and computation capability increase by 10000 folds! - Still increasing …
  • 5. BGI  The world largest genome sequencing center  The largest computing and storage center for genomics in China  One of world leading research institutes in Genomics Since 2007, - 253 papers in high-impact journals - Including 47 in Nature and its sub-journals, 9 in Science,2 in Cell, and 1 in NEJM, with 42 first and/or corresponding authors - 369 patent applications - 254 software authorship
  • 6. BGI  The world largest genome sequencing center  The largest computing and storage center for genomics in China  One of world leading research institutes in Genomics BGI has the sequencing capacity, hardware resource and software proficiency to be the one of the strongest end-to-end service providers in the world for NGS sequencing, data analysis and data interpretation.
  • 7. Challenges for Handling Big Data  Exponential growth of data amount 7
  • 8. Challenges for Handling Big Data  Exponential growth of data amount  Complicated data analysis process 8
  • 9. Challenges for Handling Big Data  Exponential growth of data amount  Complicated data analysis process  Widely distributed data Images from omicsmaps.com 9 BGI
  • 10. Challenges and Solutions  Data transfer  Cloud Computing  Computational Algorithms and Infrastructure  Data Management 10
  • 11. Solutions for data transfer  Data transfer Solution I: Hard drive shipment (w/ Fedex) 11
  • 12. Solutions for data transfer  Data transfer Solution I: Hard drive shipment (w/ Fedex) Solution II: High Speed Data Transfer 12 High speed data transfer
  • 13. Solutions for data transfer: High speed data transfer  Demonstrated 10Gbps ultra high speed data exchange with UC Davis and NCBI in June, 2012.
  • 14. Solutions for data transfer: High speed data transfer  Demonstrated 10Gbps ultra high speed data exchange with UC Davis and NCBI in June.  A 24GB file was transferred from China to US in 30 seconds (~8Gbits/s). Right software: Aspera FASP data transfer protocol. Right infrastructure: 10Gb link between US and China. Right technology: RAM Disk, IPv6
  • 15. Solutions for data transfer  Data transfer Solution I: Hard drive shipment (w/ Fedex) Solution II: High Speed Data Transfer 15 Aspera Server Aspera Client Aspera Client Aspera Client  Software license  Expensive physical bandwidth Free BGI Clients  Bottleneck on the client site  Not a good solution of sharing
  • 16. Solutions for data transfer  Data transfer Solution I: Hard drive shipment (w/ Fedex) Solution II: High Speed Data Transfer Solution III: Don’t move the data (Cloud Computing) 16
  • 17. Solutions for cloud  Data transfer Solution I: Hard drive shipment (w/ Fedex) Solution II: High Speed Data Transfer Solution III: Don’t move the data (Cloud Computing)  Cloud Computing EasyGenomics, A Software as a Service (SaaS) platform for NGS data analysis 17
  • 18. EasyGenomics™ EasyGenomics is a Software as a Service (SaaS) bioinformatics platform for research and applications. Algorithms, Workflows, Reports; Computational Resources; Database, Data management; Web portal, Simple UI; High speed connection
  • 19. A typical use case: A normal use case of EasyGenomics and customers’ local computational resources. The double line items are customers’ data or resources. The single line items are results and data within BGI and the EasyGenomics platform. The widths of arrows represent the sizes of data flows (not in real proportion). Customers’ Local Resources
  • 20. Bioinformatics Workflow  Four steps: Upload, Create a Sample, Perform Analyses, Download Results  Algorithms: Carefully chosen, tested and optimized  Workflows: Whole Genome Resequencing, Exome Resequencing, RNA-Seq, small RNA, ncRNA, and De novo Assembly
  • 21. Homepage Four task portals Status of recent works Warning and Logging Navigation Tabs
  • 24. Create an Analysis Selected sample(s) •One selected sample => Single Analysis •Multiple selected samples => Batch Analyses
  • 26. What’s new?  An internal version of EG is running automatically as a production system.  It integrates the new data delivery portal of sequencing service. Aspera FASP download. Accessible to all workflows on EasyGenomics 26
  • 27. You can choose to deliver data to EasyGenomics platform 27 Configuration file
  • 29. Import Data from Sequencing Service 29 Imported Samples
  • 30. Solutions for cloud  Data transfer Solution I: Hard drive shipment (w/ Fedex) Solution II: High Speed Data Transfer Solution III: Don’t move the data (Cloud Computing)  Cloud Computing EasyGenomics, A SaaS platform for NGS data analysis Two paths for the future cloud solution 30
  • 31. Two paths for the future cloud solution  Software as a Service (SaaS) to Platform as a Service (PaaS) To give the flexibility to research users: Add their own tools (any tools) Integrate their own workflows (different combinations of modules)  One-Click SaaS solution To give the automated solution for clinical users: Automated solution for repetitive works Fulfill very specific functions 31
  • 32. Solutions for Algorithm and Infrastructure  Data transfer Solution I: Hard drive shipment (w/ Fedex) Solution II: High Speed Data Transfer Solution III: Don’t move the data (Cloud Computing)  Cloud Computing EasyGenomics, A SaaS platform for NGS data analysis Two paths for the future cloud solution  Algorithm and Infrastructure Scale up with Hadoop / MapReduce: Hecate (de novo Assembly tool), Gaea (Resequencing pipeline) 32
  • 33. • Fast Parallel Framework: Hadoop Streaming • Reliable Storage System: HDFS • Scalable Map/Reduce framework
  • 34. SOAP-Gaea: Hadoop based resequencing pipeline
        Raw Data
        QC: SOAP-GaeaQC
        Mapping: SOAPaligner, BWA, BOWTIE, SOAP-GaeaAlignment
        Remove PCR duplications: SOAP-GaeaMarkDuplicate
        Realignment: SOAP-GaeaRealignment
        Identify Variations: SNP: SOAPsnp, SOAP-GaeaSNP, SAMtools; InDel: Dindel, SOAP-GaeaIndel
        Selection & Annotation
  • 35. Reads Reference Key Value Position Map Aligning Reduce  Distributed Indexing for load balancing  Flexible splitting tolerates more mismatches  Dynamic Programming for robust gap alignment SOAP-Gaea: Hadoop based resequencing pipeline
  • 36. Fast and Scalable. Data: Human 60X whole genome Re-sequencing. [Bar chart: Old Pipeline (two weeks) vs. Cloud-based pipeline (within 15 hrs, 120 cores)] • The Hadoop implementation provides great scalability. • Simply by providing more resources, the analysis can finish much faster.
  • 37.  SOAP-GaeaAlignment (1 human sample in 1000genome). It’s not only FAST, but also ACCURATE
        Software             Mapping Rate   Confident Mapping Rate (MAPQ>=10)
        Stampy               85.93%         70.00%
        SOAP2                79.14%         79.14%
        Novoalign            82.53%         79.74%
        BWA                  91.54%         84.78%
        Bowtie               81.15%         81.15%
        SOAP-GaeaAlignment   91.75%         85.20%
  • 38. SOAP-Hecate: Distributed de novo Genome Assembly. Assembly: Constructing de Bruijn Graph, Solving Tiny Repeats, Merging Bubbles, Scaffolding, Merging Contigs
  • 39. SOAP-Hecate is scalable and uses much less memory. Contig Extension, Scaffolding, Gap closing
         Scalability (*80X human whole genome)
                         SOAPdenovo v2   SOAP-Hecate v2.5 (84 cores)   SOAP-Hecate v2.5 (180 cores)
        Data Size        670GB           670GB                         670GB
        No. of Servers   1               7                             15
        Time             59 hours        59 hours                      38 hours
        Memory Size      400*1           24*7                          24G*15
        Mode             Centralized     Distributed                   Distributed
         Performance
        Assembler          Scaffold N50
        SOAP-Hecate        26,570,829
        SOAPdenovo         117,000
        ALLPATH            211,000
        Phusion2, phrap    495,000
        Meraculous         486,000
        ABySS              144,300
        Tested on simulated data from Assemblathon 1(Earl, Bradnam et al.
  • 40. Solutions for Algorithms  Data transfer Solution I: Hard drive shipment (w/ Fedex) Solution II: High Speed Data Transfer Solution III: Don’t move the data (Cloud Computing)  Cloud Computing EasyGenomics, A SaaS platform for NGS data analysis Two paths for the future cloud solution  Algorithm and Infrastructure Scale up with Hadoop / MapReduce: Hecate (de novo Assembly tool), Gaea (Resequencing pipeline) GPU based acceleration: SOAP3 (Aligner), GSNP(SNP caller), GAMA (Population genetics tool) 40
  • 41. SOAP3: ~20X speed up from SOAP2. SOAP; SOAP2 (2008), 20-30x; SOAP3 (2011), 10-30X, GPU version. [Charts: total time (seconds), speedup, and alignment ratio (%) for SOAP2 and SOAP3 on human and zebra fish data] Collaboration from University of Hong Kong
  • 42. GSNP: 50X faster than its CPU based SOAPsnp. Elapsed time (sec.), Ch. 1: GSNP 527, SOAPsnp 21879. Elapsed time (sec.), Ch. 21: GSNP 73, SOAPsnp 3675.  The elapsed time of all steps is included.  GSNP is around 50x faster than single-thread CPU-based SOAPsnp.
  • 43. Solutions for Data Management  Data transfer Solution I: Hard drive shipment (w/ Fedex) Solution II: High Speed Data Transfer Solution III: Don’t move the data (Cloud Computing)  Cloud Computing EasyGenomics, A SaaS platform for NGS data analysis Two paths for the future cloud solution  Algorithm and Infrastructure Scale up with Hadoop / MapReduce GPU based acceleration  Data Management Data management in BGI 43
  • 44. Paradigm Shift Traditional Model Business Determine what question to ask IT Structures the data to answer that question Big Data Model IT Delivers a platform to enable creative discovery Business Explores what questions could be asked
  • 46. BGI Data Pyramid iRODS (Data) Database (Information) Data Mining (Knowledge) Health/Clinical APP (Decision) • Data Preservation • Data Retrieval • Data Sharing • BGI-SNP • BGI-SV • BGI-GaP • Disease: HGVD/PMRD • Systems Biology • Drug Discovery • Diagnosis of Genetic Diseases • Drug of Choice
  • 47. iRODS Sequencer Raw Data Data Analysis Analyzed Data Data Warehousing Personalized Analysis Clinical Diagnosis Data Flow Knowledge Base Metadata LIMS Public Resources BGI-DB Variant (Gene) Disease Drug
  • 48. iRODS - integrated Rule Oriented Data System. *Access data with Web-based Browser or iRODS GUI or Command Line clients. renci.org
  • 49. iRODS Sequencer Raw Data Data Analysis Analyzed Data Data Warehousing Personalized Analysis Clinical Diagnosis Data Flow - iRODS Knowledge Base Metadata LIMS Public Resources BGI-DB Variant (Gene) Disease Drug iRODS-based Data Management • Contents: raw data, analyzed data and related metadata • Data backup • Fully integrated with LIMS • Able to search and access any data according to the metadata from BGI data standard, e.g. project, sample, cohort, phenotype, QC, etc. • Federation: integrate separate iRODS zones
  • 50. Variant (Gene) Disease Drug iRODS Sequencer Raw Data Data Analysis Analyzed Data Data Warehousing Personalized Analysis Clinical Diagnosis Data Flow – BGI-DB Knowledge Base Metadata LIMS Public Resources BGI-DB BGI-DB • A locus-specific database (LSDB) for all variants identified by BGI • Manage all basic information generated from data analysis pipelines • Link all detailed information about individual samples to each variant • Easy to query information from samples with certain commonality (such as same phenotype, same cohort, etc.) • Provide the raw information for further data mining steps
  • 51. iRODS Sequencer Raw Data Data Analysis Analyzed Data Data Warehousing Personalized Analysis Clinical Diagnosis Data Flow – BGI-DW & BGI-KB Knowledge Base Metadata LIMS Public Resources BGI-DB Variant (Gene) Disease Drug BGI Data Warehousing & Knowledge Base • BGI data warehousing (BGI-DW) consists of a series of secondary databases related to variants, diseases and drugs • BGI knowledge base (BGI-KB) stores and manages the knowledge obtained through mining BGI-DB, BGI-DW and other public resources • Periodically and automatically updated • Provide APIs for the bioinformaticians to query the information and generate individualized reports
  • 52. iRODS Sequencer Raw Data Data Analysis Analyzed Data Data Warehousing Personalized Analysis Clinical Diagnosis Data Flow - Successful Story Knowledge Base Metadata LIMS Public Resources BGI-DB Query the allele frequency database to filter out common variants and identify disease- causal variants Calculate variant frequencies from certain cohorts and save them into the allele frequency database Diagnosis for Monogenic Disease Group samples into cohorts based on their phenotypes Variant (Gene) Disease Drug
  • 53. Summary of Our Practice in IT infrastructure  Data transfer Solution I: Hard drive shipment (w/ Fedex) Solution II: High Speed Data Transfer Solution III: Don’t move the data (Cloud Computing)  Cloud Computing EasyGenomics, A SaaS platform for NGS data analysis Two paths for the future cloud solution  Algorithm and Infrastructure Scale up with Hadoop / MapReduce GPU based acceleration  Data Management Using iRODS file system to manage big data 53
  • 54. Acknowledgement  Development Team Dev: Ming Jiang, Yongsheng Chen, Can Long, Jiasheng Wu, etc. Flex Lab: Yan Li (Hecate), Zhi Zhang(GAEA, iRODS) etc. GPU Lab: Bingqiang Wang etc.  Test & QA Team Xin Guan, Jingjuan Liu, etc.  PMO & IT Operation Wenjun Zeng, Litong Lai, Jing Tian, etc.  Product Team Xing Xu, Jing Guo, Fang Fang etc.  Other BGI Teams  Collaborators: University of Hong Kong (HKU) Hong Kong University of Science and Technology (HKUST) Nvidia - Aspera RENCI - TianJing Supercomputing center
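
The "Successful Story" on slide 52 (group samples into cohorts by phenotype, calculate variant frequencies per cohort, then query those frequencies to filter out common variants) can be sketched roughly as below. This is only an illustration under stated assumptions, not BGI's implementation: the function names, the variant ID strings, the 5% cutoff, and the use of the carrier fraction (share of samples carrying a variant) as a stand-in for a true allele frequency are all hypothetical.

```python
from collections import defaultdict


def cohort_carrier_fractions(samples):
    """Group samples by phenotype and compute, per cohort, the fraction of
    samples carrying each variant (a simplified stand-in for allele frequency).

    samples: list of dicts with keys 'phenotype' (str) and 'variants' (set of IDs).
    """
    cohorts = defaultdict(list)
    for sample in samples:
        cohorts[sample["phenotype"]].append(sample["variants"])

    fractions = {}
    for phenotype, variant_sets in cohorts.items():
        counts = defaultdict(int)
        for variants in variant_sets:
            for variant in variants:
                counts[variant] += 1
        fractions[phenotype] = {
            variant: count / len(variant_sets) for variant, count in counts.items()
        }
    return fractions


def filter_common_variants(patient_variants, reference_fractions, max_fraction=0.05):
    """Keep only variants that are rare (or unseen) in the reference cohort."""
    return sorted(
        v for v in patient_variants
        if reference_fractions.get(v, 0.0) <= max_fraction
    )


# Hypothetical toy data: four control samples and one patient.
samples = [
    {"phenotype": "control", "variants": {"chr1:100A>G", "chr2:200C>T"}},
    {"phenotype": "control", "variants": {"chr1:100A>G"}},
    {"phenotype": "control", "variants": {"chr1:100A>G", "chr3:300G>A"}},
    {"phenotype": "control", "variants": set()},
]
fractions = cohort_carrier_fractions(samples)
candidates = filter_common_variants({"chr1:100A>G", "chr7:700T>C"}, fractions["control"])
print(candidates)
```

In this toy run the variant seen in three of four controls is dropped and the unseen one is kept as a candidate; a real pipeline would count alleles per genotype and draw frequencies from a much larger database.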

Editor's notes

  1. At the heart of EasyGenomics is our Bioinformatics Core. 5 workflows with carefully chosen algorithms, tested and optimized. Filtering, QC Report, Alignment along with other supporting features.
  2. Then we chose Hadoop. The Hadoop Streaming framework enabled us to quickly parallelize commonly used algorithms. The distributed storage system, HDFS, ensured the safety of our data. In addition, the Map/Reduce framework enabled us to further improve the accuracy and performance of our modules.
  3. We integrated all the standard re-sequencing steps into the cloud-based framework, which can greatly accelerate the data analysis.
  4. Besides parallelizing existing algorithms, we also designed new approaches based on the distributed frameworks. Benefiting from their capacity for big data processing, we developed new models for mapping and variation detection.
  5. This cloud-based pipeline is already applied in cancer and human disease studies within BGI. It can reduce the analysis time from two weeks to two days. Additionally, the figure at the bottom shows the speedups of our applications, which means we can control the analysis time by choosing the size of the cluster. It could be even faster.
  6. This is the comparison on a human sample from the 1000 Genomes Project. Our mapping tool, GaeaAlignment, returns the highest mapping rate.
  7. Hecate distributes the graph simplification algorithm across the entire cluster of compute nodes. It converts memory usage into fast distributed I/O work, which enables the assembly work.
  8. Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. (Source: Wikipedia)
     SOAP: first-generation short read alignment tool.
     SOAP2 (2008): 20 to 30 times faster than SOAP, less memory. Collaboration between BGI & University of Hong Kong. Compressed indexing: bidirectional BWT (2BWT).
     SOAP3 (2011): 10 to 30 times faster than SOAP2. Collaboration from University of Hong Kong. GPU’s parallel processing power. CPU memory: increase from a few to tens of GB. GPU-based indexing: GPU-2BWT.
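
The map/reduce idea described in notes 2 to 4, and on slide 35 (reads are mapped to a key of reference position, then reduced per position), can be sketched in plain Python as below. This is a minimal in-process illustration under loud assumptions: it is not SOAP-Gaea's algorithm, it uses exact string matching rather than mismatch-tolerant seeding or dynamic programming, it does not run on Hadoop, and the toy reference, read names, and function names are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical toy reference sequence; a real job would use an indexed genome.
REFERENCE = "ACGTACGTTAGCCGATAGCTAGGCTA"


def map_read(read_id, sequence, reference=REFERENCE):
    """Map step: emit (position, read_id) for every exact match of the read."""
    start = reference.find(sequence)
    while start != -1:
        yield start, read_id
        start = reference.find(sequence, start + 1)


def reduce_position(position, read_ids):
    """Reduce step: summarize all reads that landed on one reference position."""
    ids = sorted(read_ids)
    return position, len(ids), ids


def run_job(reads):
    """Run map, then sort by key (the 'shuffle'), then reduce each key group."""
    emitted = [pair for read_id, seq in reads for pair in map_read(read_id, seq)]
    emitted.sort(key=itemgetter(0))
    return [
        reduce_position(position, [read_id for _, read_id in group])
        for position, group in groupby(emitted, key=itemgetter(0))
    ]


reads = [("r1", "ACGT"), ("r2", "GGCTA"), ("r3", "ACGTT")]
for position, depth, read_ids in run_job(reads):
    print(position, depth, read_ids)
```

Because each read is mapped independently and each position is reduced independently, both steps can be spread over more machines, which is the property the notes credit for the pipeline's scalability.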