2. Background: Chipster 2.0: seamless integration of analysis tools, computing clusters and visualizations through a user-friendly interface. With NGS data, the "seamless" part gets really hard. Use Hadoop to improve the user experience. Hadoop-BAM: a small side product that might prove useful to quite a few people.
4. Problem definition (it gets worse...): You don't only need to store data, you also have to do something with it. Pipelines take a long time to run. And in real life you don't run your pipelines just once: you tweak and rerun them again and again.
5. Enter: Hadoop. Map-reduce is a framework for processing terabytes of data in a distributed way. Hadoop is an open source implementation of Google's map-reduce framework. NGS data has a lot in common with web logs, which were the original motivation for map-reduce.
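To make the map-reduce idea concrete, here is a minimal single-process sketch in plain Python. It does not use the Hadoop API: the names `map_read`, `reduce_counts`, `run_mapreduce` and the bin width `BIN_SIZE` are hypothetical, and reads are assumed to be simple `(chromosome, position)` pairs. In Hadoop the same three phases (map, group by key, reduce) would run distributed over many machines.

```python
from collections import defaultdict

BIN_SIZE = 1000  # hypothetical bin width in base pairs


def map_read(read):
    """Map step: emit one (key, value) pair per aligned read."""
    chrom, pos = read
    yield (chrom, pos // BIN_SIZE), 1


def reduce_counts(key, values):
    """Reduce step: combine all values that share a key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key) and reduce in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Toy input: (chromosome, start position) for four reads.
reads = [("chr1", 120), ("chr1", 980), ("chr1", 1500), ("chr2", 40)]
coverage = run_mapreduce(reads, map_read, reduce_counts)
```

Because each read is mapped independently and each key is reduced independently, both steps can be spread over many workers, which is what makes the model fit log-like data such as aligned reads.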
9. Hadoop-BAM: Small and simple Java library. Throw it into your Hadoop installation. BAM! Your BAM files are accessible by Hadoop map-reduce functions.
10. What does it do? Gives you the Picard SAM API. Hadoop splits data into chunks, and special care is needed at chunk boundaries. Hadoop-BAM handles chunk boundaries behind the scenes.
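The chunk-boundary problem can be illustrated with a simplified sketch. This is not Hadoop-BAM's actual code: real BAM files are compressed binary data, whereas this example assumes newline-terminated text records as a stand-in, and the function names `read_split` and `read_all` are hypothetical. The rule shown is that a record belongs to the chunk in which its first byte lies, so a reader skips a partial record at the start of its chunk and reads past the end of its chunk to finish the last one.

```python
def read_split(data: bytes, start: int, end: int):
    """Yield the records whose first byte lies in data[start:end]."""
    if start == 0:
        pos = 0
    else:
        # Skip the record that began in an earlier chunk: the first record
        # owned by this chunk starts right after the first newline found
        # at or after index start - 1.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return
        pos = newline + 1
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        # The record may extend beyond `end`; it is still read in full here.
        yield data[pos:stop]
        pos = stop + 1


def read_all(data: bytes, chunk_size: int):
    """Read every chunk independently and concatenate the results."""
    records = []
    for start in range(0, len(data), chunk_size):
        end = min(start + chunk_size, len(data))
        records.extend(read_split(data, start, end))
    return records


data = b"read1\nread22\nread333\nr4\n"
records = read_all(data, chunk_size=7)
```

Whatever the chunk size, every record is returned exactly once and in order, even though most chunk boundaries fall in the middle of a record.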
12. Example: Preprocessing for the Chipster genome browser. How to allow interactive browsing, with zooming in and out, for large BAM files? Sampling can be used, but it is either slow or inaccurate. Instead, preprocess the data and produce summaries at different levels (mipmapping). Implemented on top of Hadoop-BAM.
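The mipmapping idea can be sketched as follows. The slides do not say what the summaries contain, so this example assumes they are read counts per genomic bin; the function name `build_summary_levels` and the parameters `base_bin`, `factor` and `levels` are hypothetical. Each level is derived from the previous, finer one by merging `factor` adjacent bins, so a browser can pick the level that matches the current zoom without touching the raw reads.

```python
from collections import Counter


def build_summary_levels(read_starts, base_bin=100, factor=4, levels=3):
    """Return a list of {bin_index: read_count} dicts, finest level first."""
    # Level 0: count reads in bins of base_bin base pairs.
    finest = Counter(pos // base_bin for pos in read_starts)
    pyramid = [dict(finest)]
    # Each further level merges `factor` neighbouring bins of the level below.
    for _ in range(levels - 1):
        coarser = Counter()
        for bin_index, count in pyramid[-1].items():
            coarser[bin_index // factor] += count
        pyramid.append(dict(coarser))
    return pyramid


# Toy input: start positions of five reads on one chromosome.
pyramid = build_summary_levels([0, 50, 150, 420, 1650])
```

With the defaults, level 0 uses 100 bp bins, level 1 uses 400 bp bins and level 2 uses 1600 bp bins; the total read count is the same at every level.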
16. Scalability results (cont.): Did sorting and summarizing. Fairly nice scaling for the processing step. No scaling for import and export. Lesson: avoid moving data in and out of Hadoop. So having to convert data from BAM to something else would be bad.
18. Conclusions: Cloud computing is not a free lunch: tools, algorithms and data formats need to be adapted. The Hadoop-BAM library is available under the MIT license: http://sourceforge.net/projects/hadoop-bam/ Contact: matti.niemenmaa@aalto.fi
19. Acknowledgements: Matti Niemenmaa, André Schumacher, Keijo Heljanko (Aalto University, Department of Information and Computer Science); Petri Klemelä, Eija Korpelainen (CSC - IT Center for Science); TIVIT Cloud Software program for funding.