BioPig: Hadoop-based Analytic Toolkit
for Next-Generation Sequence Data
Zhong Wang, Ph.D.
Computational Biology Staff Scientist
Cellulase
The deep metagenome approach to discover
cellulases for biofuel research
Large data, large reward
http://www.cazy.org/
Only 1% shared
(>=95% identity)
50% validated activity
Science. 2011 Jan 28;331(6016):463-7.
Sequence data
More data would be even better
Rumen (2009): 17 Gb
Rumen (2010): 250 Gb
Rumen (2012): 1000 Gb
But can analysis keep up with data growth?
Ideal solutions for the terabase problem
1. Scalable to 1 Tb?
2. Performance (within hours)?
High-Mem cluster
Input/Output (IO), Memory
MP/MPI solution: k-mer counting
Figure, steps 1 to 4: Raw data; Data slices; Each node/core has data and table slices; Count table
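The slide's figure is not recoverable from the text, so the following is only a minimal single-process sketch of the idea it names: reads are cut into data slices, and each worker owns one slice of the count table. The function names, the CRC32-based ownership rule, and the worker count are assumptions for illustration, not the MP/MPI implementation described in the talk.

```python
from collections import Counter
from zlib import crc32


def kmers(read, k):
    """Yield every k-mer (substring of length k) in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def owner(kmer, n_workers):
    """Pick the worker that owns this k-mer's slice of the count table.

    CRC32 is an assumed choice; any hash that is stable across workers works.
    """
    return crc32(kmer.encode("ascii")) % n_workers


def partitioned_count(reads, k, n_workers):
    # Cut the raw data into one slice of reads per worker.
    data_slices = [reads[w::n_workers] for w in range(n_workers)]
    # Each worker starts with an empty slice of the count table.
    table_slices = [Counter() for _ in range(n_workers)]
    # Each worker scans its reads and routes every k-mer to its owner.
    # In a real MPI program this routing would be a message exchange.
    for data_slice in data_slices:
        for read in data_slice:
            for kmer in kmers(read, k):
                table_slices[owner(kmer, n_workers)][kmer] += 1
    return table_slices


def merge(table_slices):
    """Combine the per-worker table slices into one count table."""
    total = Counter()
    for table in table_slices:
        total.update(table)
    return total


reads = ["attagc", "gttagg"]
slices = partitioned_count(reads, k=4, n_workers=4)
counts = merge(slices)
```

Because every k-mer has exactly one owner, no two table slices ever hold the same key, so merging is a plain union and no single worker needs the full table in memory.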
MP/MPI performance
MPI version
412 Gb, 4.5B reads
2.7 hours on 128x24 cores
NERSC Hopper II
MP threaded version
268 Gb, 3B reads
5 days on 32 cores
High-Mem Cluster
Fast, scalable
Problems:
• Requires experienced software engineers
• Six months of development time
• One node fails, all fail
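A rough worked comparison of the two runs, as a sketch only. It assumes "Gb" means gigabases, "128x24 cores" means 3072 cores, and "5 days" means 120 wall-clock hours. The two runs used different datasets and machines, so the numbers are indicative at best.

```python
def gb_per_core_hour(gigabases, hours, cores):
    """Throughput normalised by total core-hours consumed."""
    return gigabases / (hours * cores)


# MPI version: 412 Gb in 2.7 hours on 128 x 24 cores (assumed 3072 cores).
mpi_cores = 128 * 24
mpi_per_core = gb_per_core_hour(412, 2.7, mpi_cores)   # about 0.05 Gb per core-hour
mpi_wall = 412 / 2.7                                   # about 153 Gb per hour

# Threaded version: 268 Gb in 5 days (assumed 120 hours) on 32 cores.
threaded_hours = 5 * 24
threaded_per_core = gb_per_core_hour(268, threaded_hours, 32)  # about 0.07 Gb per core-hour
threaded_wall = 268 / threaded_hours                           # about 2.2 Gb per hour

# How much faster the MPI run is in wall-clock terms.
wall_clock_speedup = mpi_wall / threaded_wall
```

Under these assumptions the per-core efficiency of the two versions is similar, while the MPI run is more than 60 times faster in wall-clock time because it uses far more cores.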
Hadoop/MapReduce framework
• Google MapReduce
– Data-parallel programming model for processing petabyte-scale data
– Generally has a map step and a reduce step
• Apache Hadoop
– Distributed file system (HDFS) and job handling for scalability and robustness
– Data locality brings compute to the data, avoiding the network transfer bottleneck
Programmability: Hadoop vs Pig
Example task: finding the top 5 websites young people visit
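The side-by-side Hadoop and Pig listings from the slide are not in the text. As a stand-in, here is a minimal Python sketch of the data flow the task implies (filter, join, group, count, order, limit). The record layouts, the sample data, and the 18 to 25 age range for "young" are assumptions for illustration.

```python
from collections import Counter


def top_sites(users, visits, min_age=18, max_age=25, n=5):
    """Return the n most visited sites among users in the given age range.

    users  : iterable of (name, age) pairs      (assumed layout)
    visits : iterable of (name, url) pairs      (assumed layout)
    """
    # Filter: keep only users in the assumed "young" age range.
    young = {name for name, age in users if min_age <= age <= max_age}
    # Join, group and count: tally visits made by users who passed the filter.
    clicks = Counter(url for name, url in visits if name in young)
    # Order and limit: the n sites with the most visits.
    return clicks.most_common(n)


users = [("amy", 22), ("bob", 41), ("cat", 19)]
visits = [
    ("amy", "a.example"),
    ("cat", "a.example"),
    ("bob", "b.example"),
    ("cat", "c.example"),
    ("amy", "a.example"),
]
top = top_sites(users, visits)
```

Each step here corresponds to one relational operator, which is the style of program a high-level dataflow language like Pig lets an analyst write without handling parallelism directly.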
BioPig: design goals
• Flexible
– Every dataset is unique, and data analysts have domain knowledge that is essential to optimize the analysis
– Pluggable modules that analysts can use to build custom analytic pipelines
• High-Level
– A domain-specific language enables data analysts to create custom pipelines
– Hides the details of parallelism (too complex for most people)
• Scalable
– Leverages data parallelism to speed up analytics
– Integrates external tools and applications where necessary
– Scales from 1 to hundreds of compute nodes with minimal effort and linear scalability
• Robust
– Data and computation are replicated across nodes to combat failures
BioPIG
Runs on any hardware supporting Hadoop
• JGI Titanium (commodity Hadoop cluster)
– Up to 20 nodes, each with 16 cores, 32 GB RAM, 1.799 GHz, 1G Ethernet
• NERSC Magellan Cloud Testbed
– Up to 200 nodes, each with 8 cores, 24 GB RAM, 2.67 GHz Nehalem processors, 10 Gbit InfiniBand, GPFS
• Amazon AWS
– Elastic MapReduce with cluster compute nodes (23 GB of memory, 2 x Intel quad-core “Nehalem” architecture, 1690 GB of instance storage, 10G Ethernet)
BioPig Modules
• Blast
• Input/Output (FASTA, FASTQ)
• K-mer Counter
• Assembly
How k-mer count is implemented
Load:
<id1, header, 'attagc'>
<id2, header, 'gttagg'>
Mapper:
<id1, 'atta'>, <id1, 'ttag'>
<id2, 'gtta'>, <id2, 'ttag'>
Shuffle/sort:
<'atta', id1>, <'ttag', id1, id2>
<'gtta', id2>, <'tagg', id2>
Reducer:
<'atta', 1>, <'ttag', 2>
<'gtta', 1>, <'tagg', 1>
Merge:
<'atta', 3>, <'ttag', 2>
<'gtta', 2>, <'tagg', 1>
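The stages above can be sketched in plain Python on the slide's two reads. This is a single-process illustration, not BioPig or Hadoop code. The function names are hypothetical, k = 4 is inferred from the 4-letter k-mers on the slide, and the second partial result fed to the merge step is invented to show how the slide's final counts could arise from more than one reducer. The code emits every k-mer, so 'tagc' also appears even though the slide omits it.

```python
from collections import defaultdict

K = 4  # inferred from the 4-letter k-mers shown on the slide


def mapper(read_id, sequence, k=K):
    """Emit one (read_id, kmer) pair for every k-mer in the read."""
    return [(read_id, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


def shuffle(pairs):
    """Group read ids by k-mer and sort by key, like the shuffle/sort stage."""
    groups = defaultdict(list)
    for read_id, kmer in pairs:
        groups[kmer].append(read_id)
    return dict(sorted(groups.items()))


def reducer(groups):
    """Count how many times each k-mer was seen."""
    return {kmer: len(ids) for kmer, ids in groups.items()}


def merge(*partials):
    """Sum the counts from several reducers into one table."""
    total = defaultdict(int)
    for partial in partials:
        for kmer, n in partial.items():
            total[kmer] += n
    return dict(total)


# Load: (id, header, sequence) records as on the slide.
records = [("id1", "header", "attagc"), ("id2", "header", "gttagg")]

pairs = [pair for rid, _header, seq in records for pair in mapper(rid, seq)]
groups = shuffle(pairs)
counts = reducer(groups)

# Hypothetical output of a second reducer, invented for illustration.
other_partial = {"atta": 2, "gtta": 1}
merged = merge(counts, other_partial)
```

With the invented second partial, the merged counts for 'atta', 'ttag', 'gtta' and 'tagg' match the slide's Merge row.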
A 7-liner BioPig script for k-mer counting
Rumen metagenome gene discovery pipeline
1. Read preprocess (remove artifacts)
2. pigBlast (blast reads against known cellulases)
3. pigAssembler (assemble reads into contigs)
4. pigExtender (extend contigs into full-length enzymes)
Cloud solution to large data
• BioPig-Blaster, BioPig-Assembler, BioPig-Extender
• BioPig: 61 lines of code
• MPI-extender: ~12,000 lines (vs 31 in BioPig)
Flexibility, Programmability, Scalability
Conclusions
Hadoop-based BioPig shows great potential for scalable analysis of very large sequence data. It is robust and easy to use.
Challenges in application
• IO optimization, e.g., reducing data copying
• Some problems do not easily fit into the map/reduce framework, e.g., graph-based algorithms
• Integration into existing frameworks, e.g., Galaxy
Acknowledgement
• Karan Bhatia
• Henrik Nordberg
• Kai Wang
• Rob Egan
• Alex Sczyrba
• Jeremy Brand @JGI/NERSC
• Shane Cannon @NERSC

BioPig for scalable analysis of big sequencing data
