Hadoop-BAM: Directly manipulating BAM on Hadoop
Aleksi Kallio
CSC - IT Center for Science, Finland
BOSC 2011, July 16, Vienna
Background
  • Chipster 2.0: seamless integration of analysis tools, computing clusters and visualizations through a user-friendly interface
  • With NGS data, the "seamless" part gets really hard...
  • Use Hadoop to improve user experience
  • Hadoop-BAM: a small side product that might prove useful to quite a few people
Problem definition
Problem definition (it gets worse...)
  • You don't only need to store data, you also have to do something with it
  • Pipelines take a long time to run
  • And in real life you don't run your pipelines once, but often tweak and rerun and rerun...
Enter: Hadoop
  • Map-reduce is a framework for processing terabytes of data in a distributed way
  • Hadoop is an open source implementation of Google's map-reduce framework
  • NGS data has a lot in common with web logs, which were the original motivation for map-reduce
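To make the map-reduce idea concrete, here is a minimal in-process sketch in Python. It is not Hadoop and not part of the talk: the function names (`map_read`, `reduce_counts`, `run_map_reduce`) and the toy SAM-like lines are assumptions made for illustration. The mapper emits one `(reference name, 1)` pair per read, and the reducer sums the values for each key, which is the same shape of computation Hadoop distributes across a cluster.

```python
from collections import defaultdict


def map_read(sam_line):
    """Map step: emit (reference name, 1) for one tab-separated SAM-like line."""
    fields = sam_line.split("\t")
    yield fields[2], 1  # column 3 of a SAM record is RNAME


def reduce_counts(key, values):
    """Reduce step: sum all counts emitted for one reference name."""
    return key, sum(values)


def run_map_reduce(records, mapper, reducer):
    """Toy single-process driver: map every record, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Hypothetical input: three reads, two of them aligned to chr1.
reads = [
    "r1\t0\tchr1\t100",
    "r2\t0\tchr2\t50",
    "r3\t16\tchr1\t300",
]
reads_per_reference = run_map_reduce(reads, map_read, reduce_counts)
print(reads_per_reference)
```

In a real Hadoop job the grouping step is the distributed shuffle, and mappers run in parallel on separate chunks of the input.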
Map-reduce framework
Hadoop and map-reduce
Possible solutions
Hadoop-BAM
  • Small and simple Java library
  • Throw it into your Hadoop installation
  • BAM! Your BAM files are accessible by Hadoop map-reduce functions
What does it do?
  • Gives you the Picard SAM API
  • Hadoop splits data into chunks, and special care is needed at chunk boundaries
  • Hadoop-BAM handles chunk boundaries behind the scenes
Detecting BAM record boundaries
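The details of this slide did not survive, so the following is only a sketch of one ingredient of the chunk-boundary problem, not Hadoop-BAM's actual algorithm. A BAM file is a sequence of BGZF blocks, each a gzip member whose extra field carries a `BC` subfield with the block size. A worker handed an arbitrary byte offset can scan forward for a plausible block header and then check that the next header also looks valid. All function names are hypothetical, and the code assumes the `BC` subfield comes first in the extra field.

```python
import struct
import zlib

BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip magic, deflate method, FEXTRA flag set


def make_bgzf_block(payload):
    """Build one BGZF block so the example needs no input file."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream
    deflated = compressor.compress(payload) + compressor.flush()
    total_size = 12 + 6 + len(deflated) + 8  # header + extra + data + trailer
    header = BGZF_MAGIC + struct.pack("<IBBH", 0, 0, 0xFF, 6)  # MTIME, XFL, OS, XLEN
    extra = b"BC" + struct.pack("<HH", 2, total_size - 1)  # SLEN, BSIZE
    trailer = struct.pack("<II", zlib.crc32(payload) & 0xFFFFFFFF, len(payload))
    return header + extra + deflated + trailer


def looks_like_block_start(buf, pos):
    """Check the fixed header fields of a BGZF block at offset pos."""
    if pos + 18 > len(buf) or buf[pos:pos + 4] != BGZF_MAGIC:
        return False
    xlen = struct.unpack_from("<H", buf, pos + 10)[0]
    slen = struct.unpack_from("<H", buf, pos + 14)[0]
    return xlen >= 6 and buf[pos + 12:pos + 14] == b"BC" and slen == 2


def block_size(buf, pos):
    """Total size in bytes of the block at pos (BSIZE stores size minus one)."""
    return struct.unpack_from("<H", buf, pos + 16)[0] + 1


def first_block_start(buf, split_start, chain=2):
    """First offset >= split_start where `chain` plausible blocks follow each other."""
    for pos in range(split_start, len(buf)):
        cursor, plausible = pos, True
        for _ in range(chain):
            if cursor == len(buf):  # reached the end of the data cleanly
                break
            if not looks_like_block_start(buf, cursor):
                plausible = False
                break
            cursor += block_size(buf, cursor)
        if plausible:
            return pos
    return None


# Three small blocks; a split that starts mid-block should snap to the next block.
blocks = [make_bgzf_block(p) for p in (b"read-A" * 20, b"read-B" * 20, b"read-C" * 20)]
data = b"".join(blocks)
offsets = [0, len(blocks[0]), len(blocks[0]) + len(blocks[1])]
print(first_block_start(data, 1), offsets[1])
```

Checking a chain of headers rather than a single one reduces the chance of mistaking compressed payload bytes for a header. Finding where a BAM record starts inside the decompressed data is a separate step that this sketch does not cover.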
Example: Preprocessing for Chipster genome browser
  • How to allow interactive browsing, with zooming in and out, for large BAM files?
  • Sampling can be used, but it is either slow or inaccurate
  • Preprocess data and produce summaries at different levels (mipmapping)
  • Implemented on top of Hadoop-BAM
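The mipmapping idea can be sketched as a pyramid of summaries, where each level halves the resolution of the one below it. The talk does not say which statistic Chipster summarizes, so this example assumes a simple read count per genomic bin. The function names and the sample positions are hypothetical.

```python
from collections import Counter


def base_level(positions, bin_size):
    """Finest summary level: number of read start positions per fixed-size bin."""
    counts = Counter(pos // bin_size for pos in positions)
    n_bins = max(counts) + 1 if counts else 0
    return [counts.get(i, 0) for i in range(n_bins)]


def build_pyramid(level0):
    """Coarser levels: each one sums adjacent pairs of bins from the level below."""
    levels = [level0]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        if len(prev) % 2:  # pad odd-length levels so bins pair up
            prev = prev + [0]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return levels


# Hypothetical read start positions on one chromosome, summarized in 100 bp bins.
positions = [5, 12, 17, 101, 250, 251, 399]
pyramid = build_pyramid(base_level(positions, 100))
for zoom, level in enumerate(pyramid):
    print(zoom, level)
```

A browser can then pick the level whose bin width matches the current zoom, so the amount of data fetched per screen stays roughly constant regardless of how far the user has zoomed out.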
Result looks nice
Benchmarking
  • Take 50 GB of data from 1000 Genomes
  • Run on a cluster of 112 AMD Opteron 2.6 GHz nodes (1344 cores) with an Infiniband interconnect
Scalability results
Scalability results (cont.)
  • Did sorting and summarizing
  • Fairly nice scaling for the processing step
  • No scaling for import and export
  • Lesson: avoid moving data in and out of Hadoop
  • So having to convert data from BAM to something else would be bad
Future plans
Conclusions
  • Cloud computing is not a free lunch: tools, algorithms and data formats need to be adapted
  • Hadoop-BAM library available under the MIT license: http://sourceforge.net/projects/hadoop-bam/
  • Contact: matti.niemenmaa@aalto.fi
Acknowledgements
  • Matti Niemenmaa, André Schumacher, Keijo Heljanko (Aalto University, Department of Information and Computer Science)
  • Petri Klemelä, Eija Korpelainen (CSC - IT Center for Science)
  • TIVIT Cloud Software program for funding