Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Zhong Wang, Ph.D.
Group Lead, Genome Analysis
05/23/2019
DOE JGI: A brief history
• 1999-2007
• 2008-now: JGI as the DOE sequencing center dedicated to plants and microbes.
Our Mission
DOE JGI serves as a genomic user facility in support of the DOE missions: bioenergy, carbon cycling, and biogeochemistry
• Walnut Creek 1999-2019
• Berkeley, CA
• 250 employees
• $70M annual budget
Our sequencer lineups
• Short-read technologies (Illumina): MiSeq, NextSeq 500, HiSeq 2500, NovaSeq 6000
• Long-read technologies: PacBio RSII, PacBio Sequel, Oxford Nanopore MinION, PromethION
• 200 Tb sequencing data in FY18
Genomics big data is not typical big data
• Unstructured
• Volume, variety
• Veracity increases during analytics
A metagenome is the genome of a microbial community
A 10-second "intimate kiss" = 80 million bacteria
Metagenomics questions: Who is there? What do they do? How do they interact?
Microbial communities are “dark matter”
Number of species:
• Cow: ~6,000
• Human: ~1,000
• Soil: >100,000
>90% of the species haven’t been seen before
Metagenome sequencing and assembly
Steps: Harvest microbes → Extract DNA → Shear & sequence → Assembly
Materials: Microbial community → Metagenome DNA → Short reads → Reconstructed genomes
The metagenome assembly problem
Library of books → Shredded library → “Reconstructed” library
Genome ~= book; metagenome ~= library
Sequencing ~= sampling the pieces and reading them
Scale is an enemy
[Figure: dataset size in gigabases (Gb), log scale from 1 to 1,000,000, for the categories Typical, Human, Cow, Ocean, Soil]
Complexity is another…
• Remove contaminants, sequencing errors
• Overlap graph
• de Bruijn graph
• Contigs or clusters
• Repetitive elements
• Homologous genes
• Horizontally transferred genes
The ideal solution and the failed ones
Ideal solution
✓ Easy to develop
✓ Robust
✓ Scales to big data
✓ Efficient
BigMem
• Easy to develop
• Expensive
• Does not scale
MPI
• Fast
• Hard to develop
• Not robust
Hadoop
• Easy to develop
• Scales
• Slow
Addressing big data: Apache Spark
• New scalable programming paradigm
• Compatible with Hadoop-supported storage systems
• Improves efficiency through:
  • In-memory computing primitives
  • General computation graphs
• Improves usability through:
  • Rich APIs in Java, Scala, Python
  • Interactive shell
✓ Scales to big data
✓ Efficient
✓ Easy to develop
✓ Robust
Goal: Metagenome read clustering
Read clustering can reduce the metagenome problem to single-genome problems
• Parallel processing
• Individualized optimization
Reads → Read clusters
Algorithm
Read graph containing all reads
• Node: read
• Edge: number of k-mers two reads share
• A k-mer is to a read what a word is to a sentence
Steps
1. K-mer mapping of reads
2. Graph construction and edge reduction
3. Graph partitioning: Label Propagation Algorithm (LPA)
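The first two steps named on this slide (k-mer mapping, then graph construction and edge reduction) can be sketched on a single machine in plain Python. This is only an illustration under stated assumptions, not the Spark implementation from the talk: the function names `kmers`, `map_kmers_to_reads`, and `build_read_graph`, the `min_shared` threshold, and the toy reads are all hypothetical, and reverse complements and sequencing errors are ignored.

```python
from collections import defaultdict
from itertools import combinations


def kmers(read, k):
    """Return the set of all length-k substrings (k-mers) of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def map_kmers_to_reads(reads, k):
    """Step 1: index each k-mer by the IDs of the reads that contain it."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for kmer in kmers(seq, k):
            index[kmer].add(read_id)
    return index


def build_read_graph(index, min_shared=1):
    """Step 2: weight each read pair by its number of shared k-mers, then
    drop edges below min_shared (an assumed, simple form of edge reduction)."""
    weights = defaultdict(int)
    for read_ids in index.values():
        for a, b in combinations(sorted(read_ids), 2):
            weights[(a, b)] += 1
    return {pair: w for pair, w in weights.items() if w >= min_shared}


# Toy reads (made up for illustration): reads 0 and 1 overlap, as do 2 and 3.
reads = ["ACGTACGA", "GTACGATT", "TTTTGGGG", "TTGGGGCC"]
graph = build_read_graph(map_kmers_to_reads(reads, k=4), min_shared=2)
print(graph)
```

With these toy reads, reads 0 and 1 share three 4-mers and reads 2 and 3 share three 4-mers, so the graph has two edges of weight 3 and no edge between the two pairs. In a distributed setting the k-mer index would be built with a shuffle keyed on the k-mer rather than an in-memory dictionary.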
Clustering performance on long reads
Read length = 500-20,000
Short reads? Not so much
Read length = 150
Can long reads come to the rescue?
Hybrid clustering
A tradeoff between cost and performance
[Figure: mean cluster size (K), #reads (M), and #clusters (y-axis, 0 to 250) versus percent of long reads used (x-axis, 0% to 100%)]
Short-read only: there is still a way out
More samples, better results: one vs 50
More data, better results: clustering success depends on coverage
Can we scale to big data?
Hardware and software environments
                          Customized   EMR         Bridge
Nodes                     20           20          8
Cores per node (total)    8 (160)      8 (160)     28 (224)
Memory per node (total)   64 (1280)    61 (1220)   128 (1024)
Hadoop                    2.7.3        2.7.3       2.7.2
Spark                     2.1.1        2.2.0       2.1.0
A quick reminder…
Read graph containing all reads
• Node: read
• Edge: number of k-mers two reads share
• A k-mer is to a read what a word is to a sentence
Steps
1. K-mer mapping of reads (KMR)
2. Graph construction and edge reduction (Edges)
3. Graph partitioning: Label Propagation Algorithm (LPA)
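The LPA stage can be illustrated with a small single-machine sketch. The slides do not give the details of the Spark implementation, so everything here is an assumption made for illustration: the function name `label_propagation`, the asynchronous update order, the tie-breaking rule (smallest label wins), and the toy edge weights.

```python
from collections import defaultdict


def label_propagation(edges, max_iters=20):
    """Cluster the nodes of a weighted undirected graph by label propagation.

    edges maps (node_a, node_b) -> weight. Every node starts with its own
    label, then repeatedly adopts the label carrying the largest total edge
    weight among its neighbours (ties go to the smallest label).
    """
    neighbours = defaultdict(dict)
    for (a, b), w in edges.items():
        neighbours[a][b] = w
        neighbours[b][a] = w

    labels = {node: node for node in neighbours}
    for _ in range(max_iters):
        changed = False
        for node in sorted(neighbours):
            score = defaultdict(int)
            for other, w in neighbours[node].items():
                score[labels[other]] += w
            best = min(score, key=lambda lab: (-score[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:  # converged: no node changed its label
            break
    return labels


# Toy read graph (hypothetical weights): reads 0-1-2 are linked, as are 3-4.
edges = {(0, 1): 3, (1, 2): 2, (3, 4): 4}
labels = label_propagation(edges)

clusters = defaultdict(list)
for read_id, label in sorted(labels.items()):
    clusters[label].append(read_id)
print(sorted(clusters.values()))
```

On this toy graph the reads fall into two clusters, one holding reads 0, 1, and 2 and one holding reads 3 and 4. Each resulting cluster could then be processed independently, which is the point of reducing the metagenome problem to single-genome problems.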
Scale to bigger data volume on a 20-node cluster
[Figure: execution time (mins, 0 to 800) versus data size (20 to 100 GB) for KMR, Edges, LPA, and Total]
Increasing nodes on a 50G dataset
[Figure: execution time (mins, 0 to 500) versus number of nodes (25 to 100) on the 50G dataset for KMR, Edges, LPA, and Total]
Fine-tune parallelism
[Figure: execution time (mins, 0 to 350) versus Spark default parallelism (log10, 1 to 8) for the 50G and 20G datasets]
Dataset complexity vs performance
[Figure: execution time (mins, 0 to 160) by stage (KMR, Edges, LPA) for the Human Iso-Seq Alzheimer (PacBio) and Cow Rumen (Illumina) datasets; data labels shown: 146.33 and 44.5]
Platform comparison: Clouds and HPC
                          Customized   EMR         Bridge
Nodes                     20           20          8
Cores per node (total)    8 (160)      8 (160)     28 (224)
Memory per node (total)   64 (1280)    61 (1220)   128 (1024)
Time (min)                106          105         126
Now we have a big hammer…
Clustering for identifying genome contaminants
• Russula: 70 Mb
• Bradyrhizobium: 7.2 Mb
• Collimonas: 5.3 Mb
Targeting big metagenome projects
• Dr. Morgan-Kiss @ Miami University
• Dr. Slonczewski @ Kenyon University
• Two lakes, 1.2 Tbp
Acknowledgements
Spark Team
Lizhen Shi @ FSU
Xiandong Meng
Kexue Li, Lili Wang and Li Deng @ Shanghai U
Kurt Labutti
Elizabeth Tseng @ PacBio
Lisa Gerhardt, Evan Racah @ NERSC
Yong Qin, Gary Jung, Greg Kurtzer, Bernard Li @ HPC
Philip Blood, Bryon Gill @ PSC

