SlideShare une entreprise Scribd logo
1  sur  20
Delivering Bioinformatics MapReduce
Applications in the Cloud
Lukas Forer*, Tomislav Lipić**, Sebastian Schönherr*, Hansi Weißensteiner*,
Davor Davidović**, Florian Kronenberg* and Enis Afgan**
* Division of Genetic Epidemiology, Medical University of Innsbruck, Austria
** Center for Informatics and Computing, Ruđer Bošković Institute, Zagreb, Croatia
Page 2
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Introduction
• Cooperation between
– Division of Genetic Epidemiology, Innsbruck,
Austria
– Ruđer Bošković Institute (RBI), Zagreb, Croatia
• Project
• Aim: Developing a Bioinformatics Platform for the Cloud
Page 3
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Motivation: Bioinformatics
• More and more data is produced
• Next-Generation Sequencing (NGS)
– Allows us to sequence the complete human
genome (3.2 billion positions)
– Decreasing Costs  Increasing Data
Production
Bottleneck is no longer the data production in the
laboratory, but the analysis!
Page 4
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Motivation: Next Generation Sequencing
• NGS Data Characteristics
– Data size in terabyte scale
– Independent text data rows
– Batch processing required for analysis files
• MapReduce
– One possible candidate for batch processing
– scalable approach to manage and process large data
efficiently
“High potential for MapReduce in Genomics”
Page 5
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
MapReduce in Bioinformatics
Hadoop
MapReduce
libraries for
Bioinformatics
Hadoop BAM
Manipulation of aligned next-generation sequencing data (supports BAM, SAM, FASTQ,
FASTA, QSEQ, BCF, and VCF)
SeqPig
Processing NGS data with Apache Pig; Presenting UDFs for frequent tasks; using Hadoop-
BAM
BioPig Processing NGS data with Apache Pig; Presenting UDFs
Biodoop
MapReduce suite for sequence alignments / manipulation of aligned records; written in
Python
DNA -
Alignment
algorithms
based on
Hadoop
CloudBurst
Based on RMAP (seed-and-extend algorithm) Map: Extracting k-mers of reference, non-
overlapping k-mers of reads (as keys) Reduce: End-to-end alignments of seeds
Seal
Based on BWA (version 0.5.9) Map: Alignment using BWA (on a previously created internal
file format) Reduce: Remove duplicates (optional)
Crossbow
Based on Bowtie / SOAPsnp
Map: Executing Bowtie on chunks
Reduce: SNP calling using SOAPsnp
RNA - Analysis
based on
Hadoop
MyRNA Pipeline for calculating differential gene expression in RNA; including Bowtie
FX RNA-Seq analysis tool
Eoulsan RNA-Seq analysis tool
Non-Hadoop
based
Approaches
GATK
MapReduce-like framework including a rich set of tools for quality assurance, alignment and
variant calling; not based on Hadoop MapReduce
Page 6
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Problem 1: Complex Analysis Pipelines
• Bioinformatics MapReduce Applications
– available only on a per-tool basis
– cover one aspect of a larger data analysis
pipeline
– Hard to use for scientists without background
in Computer Science
Needed: System which enables building
MapReduce workflows
Page 7
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Cloudgene: Overview
• Cloudgene
– MapReduce based Workflow System
– assists scientists in executing and monitoring
workflows graphically
– data management (import/export)
– workflow parameter track→ reproducibility
• Supports Apache’s Big-Data stack
S. Schönherr, L. Forer, H. Weißensteiner, F. Kronenberg, G. Specht, and A. Kloss-Brandstätter. Cloudgene: a graphical execution platform
for MapReduce programs on private and public clouds. BMC Bioinformatics, 13(1):200, Jan. 2012.
Page 8
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Cloudgene: Workflow Composition
• Reuse existing Apache Hadoop programs
• No adaptations in source code needed
• All meta data about tasks and workflows
are defined in one single file
– the manifest file
MapReduce
program
Manifest File
for Cloudgene
Page 9
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Cloudgene: Workflow Composition
• Manifest file defines the workflow
Page 10
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Cloudgene: Interface
Used
Parameters
Download links
to result files
Page 11
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Problem 2: Missing Infrastructure
• Cloudgene requires a functional compatible cluster
– Small/Medium sized research institutes can hardly afford
own clusters
• Possible Solution: Cloud Computing
– Rent computer hardware from different providers (e.g.
AWS, HP)
– Use resources and services on demand
Needed: System which enables delivering
MapReduce clusters in the cloud
Page 12
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
CloudMan: Overview
• Enables launching/managing a analysis
platform on a cloud infrastructure via a web
browser
– Delivers a scalable cluster-in-the-cloud
• Preconfigure a number of applications
– Workflow System: Galaxy
– Batch scheduler: Sun Grid Engine (SGE)
– MapReduce using Hadoop and SGE
integration
Afgan E, Chapman B, Taylor J. CloudMan as a platform for tool, data, and analysis distribution. BMC Bioinformatics
2012;
Page 13
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Configure the Cluster
Page 14
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Manage the Cluster
Page 15
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Use the Cluster
Page 16
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
CloudMan-as-a-Platform
• CloudMan implements a service dependency
management framework
• Enables creation of user-specific cloud platforms
Idea: Implementing Cloudgene as a CloudMan
service
Page 17
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Cloudgene meets CloudMan
Cloudgene
A Bioinformatics
MapReduce workflow
framework
CloudMan
A cloud manager for
delivering application
execution environments
CloudMan platform
OpenNebulaOpenNebula
AWS OpenStack Eucalyptus
SGE Hadoop HTCondor ...
GalaxyCloudgene
MapReduce tools Traditional tools
Batch
Cluster jobs
Page 18
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
New Opportunities
• One single data analysis cloud platform
– MapReduce and SGE integrated
– Additionally, more big data computational
models can be supported in future
CloudMan CloudMan
Page 19
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Conclusion
• MapReduce
– simple but effective programming model
– well suited for the Cloud
– still limited to a small number of highly qualified domain experts
• Cloudgene
– as a possible MapReduce Workflow Manager
• CloudMan
– as a possible Cloud Manager
• Implementing Cloudgene as a CloudMan service
– Enables domain experts to incorporate and make use of
additional big data models
Page 20
21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr
Acknowledge
• CloudMan
– Enis Afgan
– Tomislav Lipić
– Davor Davidović
• Cloudgene
– Lukas Forer
– Sebastian Schönherr
– Hansi Weißensteiner
– Florian Kronenberg

Contenu connexe

Tendances

Hourglass: a Library for Incremental Processing on Hadoop
Hourglass: a Library for Incremental Processing on HadoopHourglass: a Library for Incremental Processing on Hadoop
Hourglass: a Library for Incremental Processing on HadoopMatthew Hayes
 
Programming Modes and Performance of Raspberry-Pi Clusters
Programming Modes and Performance of Raspberry-Pi ClustersProgramming Modes and Performance of Raspberry-Pi Clusters
Programming Modes and Performance of Raspberry-Pi ClustersAM Publications
 
GPU accelerated Large Scale Analytics
GPU accelerated Large Scale AnalyticsGPU accelerated Large Scale Analytics
GPU accelerated Large Scale AnalyticsSuleiman Shehu
 
Summary of the Deployment Scenarios and Functional Requirements
Summary of the Deployment Scenarios and Functional RequirementsSummary of the Deployment Scenarios and Functional Requirements
Summary of the Deployment Scenarios and Functional RequirementsArchiver
 
DESY / XFEL Deployment Scenarios
DESY / XFEL Deployment Scenarios  DESY / XFEL Deployment Scenarios
DESY / XFEL Deployment Scenarios Archiver
 
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONS
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONSBUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONS
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONSijccsa
 
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)Robert Grossman
 
How Data Volume Affects Spark Based Data Analytics on a Scale-up Server
How Data Volume Affects Spark Based Data Analytics on a Scale-up ServerHow Data Volume Affects Spark Based Data Analytics on a Scale-up Server
How Data Volume Affects Spark Based Data Analytics on a Scale-up ServerAhsan Javed Awan
 
Gisruk2013 addy edit2
Gisruk2013 addy edit2Gisruk2013 addy edit2
Gisruk2013 addy edit2Addy Pope
 
Performance Characterization of In-Memory Data Analytics on a Modern Cloud Se...
Performance Characterization of In-Memory Data Analytics on a Modern Cloud Se...Performance Characterization of In-Memory Data Analytics on a Modern Cloud Se...
Performance Characterization of In-Memory Data Analytics on a Modern Cloud Se...Ahsan Javed Awan
 
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
A Survey on Data Mapping Strategy for data stored in the storage cloud  111A Survey on Data Mapping Strategy for data stored in the storage cloud  111
A Survey on Data Mapping Strategy for data stored in the storage cloud 111NavNeet KuMar
 
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...BigData_Europe
 
rasdaman: from barebone Arrays to DataCubes
rasdaman: from barebone Arrays to DataCubesrasdaman: from barebone Arrays to DataCubes
rasdaman: from barebone Arrays to DataCubesEUDAT
 
IRJET- Big Data-A Review Study with Comparitive Analysis of Hadoop
IRJET- Big Data-A Review Study with Comparitive Analysis of HadoopIRJET- Big Data-A Review Study with Comparitive Analysis of Hadoop
IRJET- Big Data-A Review Study with Comparitive Analysis of HadoopIRJET Journal
 
Archiver omc cern_deployment_scenarios_technical_details
Archiver omc cern_deployment_scenarios_technical_detailsArchiver omc cern_deployment_scenarios_technical_details
Archiver omc cern_deployment_scenarios_technical_detailsArchiver
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkIRJET Journal
 
IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED
 
Journées ABES 2014 - Projet CIB - Uwe Rich
Journées ABES 2014 - Projet CIB - Uwe Rich Journées ABES 2014 - Projet CIB - Uwe Rich
Journées ABES 2014 - Projet CIB - Uwe Rich ABES
 
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions csandit
 

Tendances (19)

Hourglass: a Library for Incremental Processing on Hadoop
Hourglass: a Library for Incremental Processing on HadoopHourglass: a Library for Incremental Processing on Hadoop
Hourglass: a Library for Incremental Processing on Hadoop
 
Programming Modes and Performance of Raspberry-Pi Clusters
Programming Modes and Performance of Raspberry-Pi ClustersProgramming Modes and Performance of Raspberry-Pi Clusters
Programming Modes and Performance of Raspberry-Pi Clusters
 
GPU accelerated Large Scale Analytics
GPU accelerated Large Scale AnalyticsGPU accelerated Large Scale Analytics
GPU accelerated Large Scale Analytics
 
Summary of the Deployment Scenarios and Functional Requirements
Summary of the Deployment Scenarios and Functional RequirementsSummary of the Deployment Scenarios and Functional Requirements
Summary of the Deployment Scenarios and Functional Requirements
 
DESY / XFEL Deployment Scenarios
DESY / XFEL Deployment Scenarios  DESY / XFEL Deployment Scenarios
DESY / XFEL Deployment Scenarios
 
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONS
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONSBUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONS
BUILDING A PRIVATE HPC CLOUD FOR COMPUTE AND DATA-INTENSIVE APPLICATIONS
 
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
 
How Data Volume Affects Spark Based Data Analytics on a Scale-up Server
How Data Volume Affects Spark Based Data Analytics on a Scale-up ServerHow Data Volume Affects Spark Based Data Analytics on a Scale-up Server
How Data Volume Affects Spark Based Data Analytics on a Scale-up Server
 
Gisruk2013 addy edit2
Gisruk2013 addy edit2Gisruk2013 addy edit2
Gisruk2013 addy edit2
 
Performance Characterization of In-Memory Data Analytics on a Modern Cloud Se...
Performance Characterization of In-Memory Data Analytics on a Modern Cloud Se...Performance Characterization of In-Memory Data Analytics on a Modern Cloud Se...
Performance Characterization of In-Memory Data Analytics on a Modern Cloud Se...
 
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
A Survey on Data Mapping Strategy for data stored in the storage cloud  111A Survey on Data Mapping Strategy for data stored in the storage cloud  111
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
 
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
Apache Big_Data Europe event: "Integrators at work! Real-life applications of...
 
rasdaman: from barebone Arrays to DataCubes
rasdaman: from barebone Arrays to DataCubesrasdaman: from barebone Arrays to DataCubes
rasdaman: from barebone Arrays to DataCubes
 
IRJET- Big Data-A Review Study with Comparitive Analysis of Hadoop
IRJET- Big Data-A Review Study with Comparitive Analysis of HadoopIRJET- Big Data-A Review Study with Comparitive Analysis of Hadoop
IRJET- Big Data-A Review Study with Comparitive Analysis of Hadoop
 
Archiver omc cern_deployment_scenarios_technical_details
Archiver omc cern_deployment_scenarios_technical_detailsArchiver omc cern_deployment_scenarios_technical_details
Archiver omc cern_deployment_scenarios_technical_details
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
 
IJSRED-V2I3P43
IJSRED-V2I3P43IJSRED-V2I3P43
IJSRED-V2I3P43
 
Journées ABES 2014 - Projet CIB - Uwe Rich
Journées ABES 2014 - Projet CIB - Uwe Rich Journées ABES 2014 - Projet CIB - Uwe Rich
Journées ABES 2014 - Projet CIB - Uwe Rich
 
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
Big Graph : Tools, Techniques, Issues, Challenges and Future Directions
 

En vedette

الهوية الرقمية على مواقع التواصل الاجتماعي
الهوية الرقمية على مواقع التواصل الاجتماعيالهوية الرقمية على مواقع التواصل الاجتماعي
الهوية الرقمية على مواقع التواصل الاجتماعيFatma Esa
 
Caravane Bio [Mohammed Benbouida, AMBS, Morocco]
Caravane Bio [Mohammed Benbouida, AMBS, Morocco]Caravane Bio [Mohammed Benbouida, AMBS, Morocco]
Caravane Bio [Mohammed Benbouida, AMBS, Morocco]UNESCO Venice Office
 
Kallio Chipster Bosc2009
Kallio Chipster Bosc2009Kallio Chipster Bosc2009
Kallio Chipster Bosc2009bosc
 
استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I)
 استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I) استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I)
استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I)Prof. Tafida Ghanem
 
Supporting bioinformatics applications with hybrid multi-cloud services
Supporting bioinformatics applications with hybrid multi-cloud servicesSupporting bioinformatics applications with hybrid multi-cloud services
Supporting bioinformatics applications with hybrid multi-cloud servicesAhmed Abdullah
 
Lt npsti process-and_forms_april_2011
Lt npsti process-and_forms_april_2011Lt npsti process-and_forms_april_2011
Lt npsti process-and_forms_april_2011Mosab-Khayat
 
Bioinformatics lecture 1
Bioinformatics lecture 1Bioinformatics lecture 1
Bioinformatics lecture 1Hamid Ur-Rahman
 
تسويق خدمات المعلومات
تسويق خدمات المعلوماتتسويق خدمات المعلومات
تسويق خدمات المعلوماتu083125
 
الثقافة المعلوماتية في الجامعات مكتبة جامعة 6 أكتوبر نوفمبر 2012م
الثقافة المعلوماتية في الجامعات   مكتبة جامعة 6 أكتوبر نوفمبر 2012مالثقافة المعلوماتية في الجامعات   مكتبة جامعة 6 أكتوبر نوفمبر 2012م
الثقافة المعلوماتية في الجامعات مكتبة جامعة 6 أكتوبر نوفمبر 2012مProf. Sherif Shaheen
 
From Sunset To Sunrise
From Sunset To SunriseFrom Sunset To Sunrise
From Sunset To Sunriseguest11219b
 
الثقافة التقنية والمواطنة الالكترونية
الثقافة التقنية والمواطنة الالكترونيةالثقافة التقنية والمواطنة الالكترونية
الثقافة التقنية والمواطنة الالكترونيةNazzal Th. Alenezi
 
دور القطاع الخاص في تعزيز مفاهيم الثقافة المعلوماتية و المعرفية
دور القطاع الخاص في تعزيز مفاهيم الثقافة المعلوماتية و المعرفيةدور القطاع الخاص في تعزيز مفاهيم الثقافة المعلوماتية و المعرفية
دور القطاع الخاص في تعزيز مفاهيم الثقافة المعلوماتية و المعرفيةe-Marefa
 

En vedette (20)

الهوية الرقمية على مواقع التواصل الاجتماعي
الهوية الرقمية على مواقع التواصل الاجتماعيالهوية الرقمية على مواقع التواصل الاجتماعي
الهوية الرقمية على مواقع التواصل الاجتماعي
 
Caravane Bio [Mohammed Benbouida, AMBS, Morocco]
Caravane Bio [Mohammed Benbouida, AMBS, Morocco]Caravane Bio [Mohammed Benbouida, AMBS, Morocco]
Caravane Bio [Mohammed Benbouida, AMBS, Morocco]
 
Kallio Chipster Bosc2009
Kallio Chipster Bosc2009Kallio Chipster Bosc2009
Kallio Chipster Bosc2009
 
استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I)
 استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I) استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I)
استراتيجيات العلوم والتكنولوجيا والتجديد العالمية المعاصرة (ST&I)
 
مهارات+1
مهارات+1مهارات+1
مهارات+1
 
Supporting bioinformatics applications with hybrid multi-cloud services
Supporting bioinformatics applications with hybrid multi-cloud servicesSupporting bioinformatics applications with hybrid multi-cloud services
Supporting bioinformatics applications with hybrid multi-cloud services
 
Dr Justin Schonfeld - Bioinformatics Applications
Dr Justin Schonfeld - Bioinformatics ApplicationsDr Justin Schonfeld - Bioinformatics Applications
Dr Justin Schonfeld - Bioinformatics Applications
 
Lt npsti process-and_forms_april_2011
Lt npsti process-and_forms_april_2011Lt npsti process-and_forms_april_2011
Lt npsti process-and_forms_april_2011
 
Present
PresentPresent
Present
 
e justice
e justice e justice
e justice
 
Dr. Dario Lijtmaer - Data Sharing/Collaboration and Publication using BOLD
Dr. Dario Lijtmaer - Data Sharing/Collaboration and Publication using BOLDDr. Dario Lijtmaer - Data Sharing/Collaboration and Publication using BOLD
Dr. Dario Lijtmaer - Data Sharing/Collaboration and Publication using BOLD
 
Bioinformatics lecture 1
Bioinformatics lecture 1Bioinformatics lecture 1
Bioinformatics lecture 1
 
Brin bws13 quiz mmc
Brin bws13 quiz mmcBrin bws13 quiz mmc
Brin bws13 quiz mmc
 
Visual Studio
Visual StudioVisual Studio
Visual Studio
 
تسويق خدمات المعلومات
تسويق خدمات المعلوماتتسويق خدمات المعلومات
تسويق خدمات المعلومات
 
الثقافة المعلوماتية في الجامعات مكتبة جامعة 6 أكتوبر نوفمبر 2012م
الثقافة المعلوماتية في الجامعات   مكتبة جامعة 6 أكتوبر نوفمبر 2012مالثقافة المعلوماتية في الجامعات   مكتبة جامعة 6 أكتوبر نوفمبر 2012م
الثقافة المعلوماتية في الجامعات مكتبة جامعة 6 أكتوبر نوفمبر 2012م
 
From Sunset To Sunrise
From Sunset To SunriseFrom Sunset To Sunrise
From Sunset To Sunrise
 
الثقافة التقنية والمواطنة الالكترونية
الثقافة التقنية والمواطنة الالكترونيةالثقافة التقنية والمواطنة الالكترونية
الثقافة التقنية والمواطنة الالكترونية
 
دور القطاع الخاص في تعزيز مفاهيم الثقافة المعلوماتية و المعرفية
دور القطاع الخاص في تعزيز مفاهيم الثقافة المعلوماتية و المعرفيةدور القطاع الخاص في تعزيز مفاهيم الثقافة المعلوماتية و المعرفية
دور القطاع الخاص في تعزيز مفاهيم الثقافة المعلوماتية و المعرفية
 
ABT 609 PPT
ABT 609 PPTABT 609 PPT
ABT 609 PPT
 

Similaire à Delivering Bioinformatics MapReduce Applications in the Cloud

Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Rese...
Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Rese...Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Rese...
Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Rese...Lukas Forer
 
Cloudgene - A MapReduce based Workflow Management System
Cloudgene - A MapReduce based Workflow Management SystemCloudgene - A MapReduce based Workflow Management System
Cloudgene - A MapReduce based Workflow Management SystemLukas Forer
 
Apache Big_Data Europe event: "Demonstrating the Societal Value of Big & Smar...
Apache Big_Data Europe event: "Demonstrating the Societal Value of Big & Smar...Apache Big_Data Europe event: "Demonstrating the Societal Value of Big & Smar...
Apache Big_Data Europe event: "Demonstrating the Societal Value of Big & Smar...BigData_Europe
 
L Forer - Cloudgene: an execution platform for MapReduce programs in public a...
L Forer - Cloudgene: an execution platform for MapReduce programs in public a...L Forer - Cloudgene: an execution platform for MapReduce programs in public a...
L Forer - Cloudgene: an execution platform for MapReduce programs in public a...Jan Aerts
 
BigDataEurope @BDVA Summit2016 2: Societal Pilots
BigDataEurope @BDVA Summit2016 2: Societal PilotsBigDataEurope @BDVA Summit2016 2: Societal Pilots
BigDataEurope @BDVA Summit2016 2: Societal PilotsBigData_Europe
 
How to use NCI's national repository of big spatial data collections
How to use NCI's national repository of big spatial data collectionsHow to use NCI's national repository of big spatial data collections
How to use NCI's national repository of big spatial data collectionsARDC
 
Bonazzi commons bd2 k ahm 2016 v2
Bonazzi commons bd2 k ahm 2016 v2Bonazzi commons bd2 k ahm 2016 v2
Bonazzi commons bd2 k ahm 2016 v2Vivien Bonazzi
 
Serverless stream processing of Debezium data change events with Knative | De...
Serverless stream processing of Debezium data change events with Knative | De...Serverless stream processing of Debezium data change events with Knative | De...
Serverless stream processing of Debezium data change events with Knative | De...Red Hat Developers
 
What's New in Cytoscape
What's New in CytoscapeWhat's New in Cytoscape
What's New in CytoscapeKeiichiro Ono
 
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterParalyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterIOSR Journals
 
SC1 Workshop 2 General Introduction to BDE
SC1 Workshop 2 General Introduction to BDESC1 Workshop 2 General Introduction to BDE
SC1 Workshop 2 General Introduction to BDEBigData_Europe
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop StoryMichael Rys
 
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...Big Data Value Association
 
Platform introduction & Summary
Platform introduction & SummaryPlatform introduction & Summary
Platform introduction & SummaryBigData_Europe
 
Big data-hadoop-training-course-content-content
Big data-hadoop-training-course-content-contentBig data-hadoop-training-course-content-content
Big data-hadoop-training-course-content-contentTraining Institute
 
Vol 10 No 1 - February 2014
Vol 10 No 1 - February 2014Vol 10 No 1 - February 2014
Vol 10 No 1 - February 2014ijcsbi
 
BDE SC3.3 Workshop - BDE Platform: Technical overview
 BDE SC3.3 Workshop -  BDE Platform: Technical overview BDE SC3.3 Workshop -  BDE Platform: Technical overview
BDE SC3.3 Workshop - BDE Platform: Technical overviewBigData_Europe
 

Similaire à Delivering Bioinformatics MapReduce Applications in the Cloud (20)

Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Rese...
Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Rese...Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Rese...
Cloudflow – A Framework for MapReduce Pipeline Development in Biomedical Rese...
 
Cloudgene - A MapReduce based Workflow Management System
Cloudgene - A MapReduce based Workflow Management SystemCloudgene - A MapReduce based Workflow Management System
Cloudgene - A MapReduce based Workflow Management System
 
Apache Big_Data Europe event: "Demonstrating the Societal Value of Big & Smar...
Apache Big_Data Europe event: "Demonstrating the Societal Value of Big & Smar...Apache Big_Data Europe event: "Demonstrating the Societal Value of Big & Smar...
Apache Big_Data Europe event: "Demonstrating the Societal Value of Big & Smar...
 
L Forer - Cloudgene: an execution platform for MapReduce programs in public a...
L Forer - Cloudgene: an execution platform for MapReduce programs in public a...L Forer - Cloudgene: an execution platform for MapReduce programs in public a...
L Forer - Cloudgene: an execution platform for MapReduce programs in public a...
 
BigDataEurope @BDVA Summit2016 2: Societal Pilots
BigDataEurope @BDVA Summit2016 2: Societal PilotsBigDataEurope @BDVA Summit2016 2: Societal Pilots
BigDataEurope @BDVA Summit2016 2: Societal Pilots
 
How to use NCI's national repository of big spatial data collections
How to use NCI's national repository of big spatial data collectionsHow to use NCI's national repository of big spatial data collections
How to use NCI's national repository of big spatial data collections
 
Bonazzi commons bd2 k ahm 2016 v2
Bonazzi commons bd2 k ahm 2016 v2Bonazzi commons bd2 k ahm 2016 v2
Bonazzi commons bd2 k ahm 2016 v2
 
Presence cloud
Presence cloudPresence cloud
Presence cloud
 
Serverless stream processing of Debezium data change events with Knative | De...
Serverless stream processing of Debezium data change events with Knative | De...Serverless stream processing of Debezium data change events with Knative | De...
Serverless stream processing of Debezium data change events with Knative | De...
 
What's New in Cytoscape
What's New in CytoscapeWhat's New in Cytoscape
What's New in Cytoscape
 
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterParalyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
 
SC1 Workshop 2 General Introduction to BDE
SC1 Workshop 2 General Introduction to BDESC1 Workshop 2 General Introduction to BDE
SC1 Workshop 2 General Introduction to BDE
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
 
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi...
 
C1803041317
C1803041317C1803041317
C1803041317
 
Platform introduction & Summary
Platform introduction & SummaryPlatform introduction & Summary
Platform introduction & Summary
 
Big data-hadoop-training-course-content-content
Big data-hadoop-training-course-content-contentBig data-hadoop-training-course-content-content
Big data-hadoop-training-course-content-content
 
Vol 10 No 1 - February 2014
Vol 10 No 1 - February 2014Vol 10 No 1 - February 2014
Vol 10 No 1 - February 2014
 
Ss eb29
Ss eb29Ss eb29
Ss eb29
 
BDE SC3.3 Workshop - BDE Platform: Technical overview
 BDE SC3.3 Workshop -  BDE Platform: Technical overview BDE SC3.3 Workshop -  BDE Platform: Technical overview
BDE SC3.3 Workshop - BDE Platform: Technical overview
 

Dernier

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Dernier (20)

Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Delivering Bioinformatics MapReduce Applications in the Cloud

  • 1. Delivering Bioinformatics MapReduce Applications in the Cloud Lukas Forer*, Tomislav Lipić**, Sebastian Schönherr*, Hansi Weißensteiner*, Davor Davidović**, Florian Kronenberg* and Enis Afgan** * Division of Genetic Epidemiology, Medical University of Innsbruck, Austria ** Center for Informatics and Computing, Ruđer Bošković Institute, Zagreb, Croatia
  • 2. Page 2 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Introduction • Cooperation between – Division of Genetic Epidemiology, Innsbruck, Austria – Ruđer Bošković Institute (RBI), Zagreb, Croatia • Project • Aim: Developing a Bioinformatics Platform for the Cloud
  • 3. Page 3 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Motivation: Bioinformatics • More and more data is produced • Next-Generation Sequencing (NGS) – Allows us to sequence the complete human genome (3.2 billion positions) – Decreasing Costs  Increasing Data Production Bottleneck is no longer the data production in the laboratory, but the analysis!
  • 4. Page 4 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Motivation: Next Generation Sequencing • NGS Data Characteristics – Data size in terabyte scale – Independent text data rows – Batch processing required for analysis files • MapReduce – One possible candidate for batch processing – scalable approach to manage and process large data efficiently “High potential for MapReduce in Genomics”
  • 5. Page 5 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr MapReduce in Bioinformatics Hadoop MapReduce libraries for Bioinformatics Hadoop BAM Manipulation of aligned next-generation sequencing data (supports BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF) SeqPig Processing NGS data with Apache Pig; Presenting UDFs for frequent tasks; using Hadoop- BAM BioPig Processing NGS data with Apache Pig; Presenting UDFs Biodoop MapReduce suite for sequence alignments / manipulation of aligned records; written in Python DNA - Alignment algorithms based on Hadoop CloudBurst Based on RMAP (seed-and-extend algorithm) Map: Extracting k-mers of reference, non- overlapping k-mers of reads (as keys) Reduce: End-to-end alignments of seeds Seal Based on BWA (version 0.5.9) Map: Alignment using BWA (on a previously created internal file format) Reduce: Remove duplicates (optional) Crossbow Based on Bowtie / SOAPsnp Map: Executing Bowtie on chunks Reduce: SNP calling using SOAPsnp RNA - Analysis based on Hadoop MyRNA Pipeline for calculating differential gene expression in RNA; including Bowtie FX RNA-Seq analysis tool Eoulsan RNA-Seq analysis tool Non-Hadoop based Approaches GATK MapReduce-like framework including a rich set of tools for quality assurance, alignment and variant calling; not based on Hadoop MapReduce
  • 6. Page 6 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Problem 1: Complex Analysis Pipelines • Bioinformatics MapReduce Applications – available only on a per-tool basis – cover one aspect of a larger data analysis pipeline – Hard to use for scientists without background in Computer Science Needed: System which enables building MapReduce workflows
  • 7. Page 7 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudgene: Overview • Cloudgene – MapReduce based Workflow System – assists scientists in executing and monitoring workflows graphically – data management (import/export) – workflow parameter track→ reproducibility • Supports Apache’s Big-Data stack S. Schönherr, L. Forer, H. Weißensteiner, F. Kronenberg, G. Specht, and A. Kloss-Brandstätter. Cloudgene: a graphical execution platform for MapReduce programs on private and public clouds. BMC Bioinformatics, 13(1):200, Jan. 2012.
  • 8. Page 8 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudgene: Workflow Composition • Reuse existing Apache Hadoop programs • No adaptations in source code needed • All meta data about tasks and workflows are defined in one single file – the manifest file MapReduce program Manifest File for Cloudgene
  • 9. Page 9 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudgene: Workflow Composition • Manifest file defines the workflow
  • 10. Page 10 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudgene: Interface Used Parameters Download links to result files
  • 11. Page 11 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Problem 2: Missing Infrastructure • Cloudgene requires a functional compatible cluster – Small/Medium sized research institutes can hardly afford own clusters • Possible Solution: Cloud Computing – Rent computer hardware from different providers (e.g. AWS, HP) – Use resources and services on demand Needed: System which enables delivering MapReduce clusters in the cloud
  • 12. Page 12 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr CloudMan: Overview • Enables launching/managing a analysis platform on a cloud infrastructure via a web browser – Delivers a scalable cluster-in-the-cloud • Preconfigure a number of applications – Workflow System: Galaxy – Batch scheduler: Sun Grid Engine (SGE) – MapReduce using Hadoop and SGE integration Afgan E, Chapman B, Taylor J. CloudMan as a platform for tool, data, and analysis distribution. BMC Bioinformatics 2012;
  • 13. Page 13 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Configure the Cluster
  • 14. Page 14 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Manage the Cluster
  • 15. Page 15 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Use the Cluster
  • 16. Page 16 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr CloudMan-as-a-Platform • CloudMan implements a service dependency management framework • Enables creation of user-specific cloud platforms Idea: Implementing Cloudgene as a CloudMan service
  • 17. Page 17 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Cloudgene meets CloudMan Cloudgene A Bioinformatics MapReduce workflow framework CloudMan A cloud manager for delivering application execution environments CloudMan platform OpenNebulaOpenNebula AWS OpenStack Eucalyptus SGE Hadoop HTCondor ... GalaxyCloudgene MapReduce tools Traditional tools Batch Cluster jobs
  • 18. Page 18 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr New Opportunities • One single data analysis cloud platform – MapReduce and SGE integrated – Additionally, more big data computational models can be supported in future CloudMan CloudMan
  • 19. Page 19 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Conclusion • MapReduce – simple but effective programming model – well suited for the Cloud – still limited to a small number of highly qualified domain experts • Cloudgene – as a possible MapReduce Workflow Manager • CloudMan – as a possible Cloud Manager • Implementing Cloudgene as a CloudMan service – Enables domain experts to incorporate and make use of additional big data models
  • 20. Page 20 21 May, 2014 DC VIS - Distributed Computing, Visualization and Biomedical Engineering www.mipro.hr Acknowledge • CloudMan – Enis Afgan – Tomislav Lipić – Davor Davidović • Cloudgene – Lukas Forer – Sebastian Schönherr – Hansi Weißensteiner – Florian Kronenberg

Notes de l'éditeur

  1. Welcome to my presentation. I am happy to be here and to have the opportunity to talk about our approach which enables to deliver bioinf mapred apps in the cloud I am happy to be here and to have the opportunity to talk about our cooperation project, which is called „Delivering Bioinformatics MapReduce Applications in the Cloud”
  2. The solution which i present is a cooperation between our intstitute at the med uni innsbruck and the university of zagreb. The aim of the project is to develop a bioinformatics platform for the cloud feasibility study
  3. I start my talk with a short introduction about the data we use. The data is produced by a technology called NGS which enables to sequence the whole genome. This is done in a massively parallel way and enables to sequnce the genome in adequate time and with adequate coses. This has the consequence that more and more data will produce. So the bottleneck is no longer the data production in the lab, but ist analysis.
  4. the sizife of such data from NGS devices are terabytes and they are independet semi-structed data rows. And to analyze this data we need batch processing. One possible candidate to anaylze this large data efficiently is mapreduce. So there is a high potential for mapreduce in genomics.
  5. This table shows that there exist already several Mapreduce apps in the field of mapreduce. For example for mapping shot reads to reference ot for rna analysis.
  6. But the problem of such pproaches is that thei are available only on a per-tool basis To analyze data we need large pipeline and the avaialble tools cover only one apsect. Moreover, for biologists without background in cs it is very hard to use them Most ov popular workflow systems such as galaxy enable this abstraction only for tradional tools and not for mappreduce. So a system which enable building such mapreduce workflows is missing.
  7. To overcome this issue we developed cloudgene, a mpreduce based workflow system Which assts scients in executing and monitoring worklfows Moreover, we provide a simple data management in order to import and exporrt datasets. Finally, all parameters are tracked which improves the reproducibility of experiments. At the moment we support the whole apache big data stack, that menans mapred, pig and hdfs
  8. To build workflow existing hadoop programs can be reused and no adaption in source code is needed All meta data are defined in a single manifest file For example, here we have an existing tool called cloudburst, and here we have the correspong manifest file which enables to integrate it into cloudgene
  9. Based on this manifest file, we create automatically a userinterface which can be used to submit the job with different parameters and datasets.
  10. When the job is complete, we can download the results files directly through the browser and all used parameters are tracked.
  11. A workflow manager like cloudgene requires a compatible cluster Especially for smale research institutes it can be hard to afford and mantain their own cluster So possible solution is cloud computing which enables to rent compoter hardware from different providers for example amazon. So they can use the rented resources on demand. But the problem here is a missing system which enables devlivering mapreduce clusters in the cloud
  12. To overcome this issues, our cooperation partners from croatia developed cloudman which solves this problem It enables to launch and manage an analysis paltofrm on a cloud infratrsture. So through the browser a scalable cluster-in-the-cloud can be delivered. Moreover, cloudman preconfigures this new cluster with a lot of applications For example galaxy, a batach scheduler, mapreduce or HTCondor.
  13. So in a first step, the user configures the cluster through a simple web wizard. For example the type of the cluster can be selected or the size of the data volume
  14. In a second step, the users can manage it cluster. That means she or he can add new nodes, remove nodes or finally terminates the clusters. Moreover, cloudman provides a very nice feature which enables an automatically scaling.
  15. After that, cloudman installs the required application on the the new cluster. For example, in this case galaxy was automatically installed and can be used by the user to execute non-hadoop workflows.
  16. Cloudman implements a service dependency management framework which spepates different modules. That means, the infrastruce layer is separted by the application execution environement and from the applications and the datasets. This features enable to switch this modules with alternative service and implementation in order to create a user pecific cloud platform. So we can implement cloudgene as a cloudman service in order to provide a mpareduce bioinformatics platform in the cloud.
  17. So the overall picture is to use cloudman as the cloudmanager to create a cluster with the need envirnoment. An then we use cloudgene on top of this cluster to execute and manage mapreduce workflows. This results in a new platform which provides several tools for bioinformatic usecases. So we have the user which uses cloudgene or galaxy on top of the clusters create by cloudman. Moreover, through a cloudabstraction this platform can be instantiated on different coud providers.
  18. So we have one single data analysis platform which integrates MapReduce and SGE. At the moment cloudgene uses only mapreduce. But hadoop 2 introduced a new layer bettwen mapreduce and hadoop. This layer is called yarn and is jobmanager which enables to integrate other parallelizing models like sparkl or giraph. This enables as to use other big data computational modesl Moreover, the platform integratesalready several bioformatics workflows.
  19. To conclude my presentation, we saw that mapred is a simple but effective programming mode. It is wellsuited for the cloud but still mimited to a small number of domain experts with knowledge in computer science. We presented cloudgene as a posssible mapred workflowmanager and cloudman as a posssible cloud manager. Moreover, the implementation of cloudgene as a cloudman service enable scientists to make use of several workflows based on big data models.
  20. Cloudman was developed by enis, tamislav and davor from croatia. And cloudgene was developed by sebm, hansi, florian and me. Now i am at the end of my presentation so thank you very much for your attention.