SlideShare une entreprise Scribd logo
1  sur  24
Cloudgene - an execution platform for
MapReduce programs in public and
private clouds

Lukas Forer, Sebastian Schönherr, Hansi Weißensteiner

University of Innsbruck, Austria
Medical University Innsbruck, Austria


                                              BOSC 2012
MapReduce
                                                                          cluster


    Serial approach              Parallel approach
                                                                                    cloud

                                                                             private        public
       How to support scientists when using (our) MapReduce
       programs?
           Simplify the execution of MapReduce programs including
           data management
           Simplify access to a working MapReduce cluster
           Maintain data sensitivity




2
                      MapReduce: Simplified Data Processing on Large Clusters - Dean & Ghemawat - 2004
MapReduce in Genetics
    CloudBurst
           highly sensitive read mapping with MapReduce; Schatz, 2009
    Crossbow
           Searching for SNPs with cloud computing; Langmead et al., 2009
    MyRNA
           Cloud-scale RNA-sequencing differential expression analysis with Myrna; Langmead et al.,
           2010
    Seal
           a Distributed Short Read Mapping and Duplicate Removal Tool; Pireddu et al., 2012
    Hadoop BAM
           directly manipulating next generation sequencing data in the cloud; Matti Niemenmaa et al.,
           2012
    CloudBioLinux
           CloudBioLinux: pre-configured and on-demand bioinformatics computing for the
           genomics community; Krampis et al., 2012

3
Difficulties with MapReduce


                    Additional steps, when setting up a
                    cluster in a public environment




                    Required steps when cluster is up and
                    running, Hadoop installed




4
Approaches
    Possible approaches
      Program specific approach
         Implement a GUI for every program
         Redundant work for the developer
         Heterogeneity

      Workflow systems
         Galaxy, Taverna, Mobyle
         Possible, but no HDFS support, blackbox

    Our approach for Hadoop MapReduce
         One GUI for different programs
         Feedback, Standardized Import/Export
         Integration of programs via a plugin interface

5
What is Cloudgene?
    Open-source platform to improve the usability of Hadoop
    MapReduce jobs
       Provides a graphical web interface for their execution
       Programs can be integrated by writing a simple configuration file
       Public cloud & private cloud
          Setting up a cluster in the cloud, installs all data on it
       History of executed jobs with defined input/output parameters


    Runs in your browser
                                           Myrna
                                         CloudBurst
                                             Seal
                                         Crossbow
                                         CloudBioLinux

                                         Cloudgene
6
Cloudgene




7
Features
    Integration of programs easily possible
       standard MapReduce programs (Java -> CloudBurst)
       streaming jobs (e.g. Mapper and Reducer using Perl-> Myrna)
       command line programs (e.g. using Pydoop -> Seal)


    Data can be imported from different sources
       S3 / HTTP / FTP
       Import of huge datasets
       Export results to S3 (public cloud)


    Connect different MapReduce programs to a pipeline
    Install additional programs via a web repository
8
Features

    Cloudgene can be used on private and public clusters


       sensitive data
       local data
                             } private cloud

       data on S3
       no in-house cluster
                             } public cloud
       available


    Open source


9
Summary




10
Cloudgene in Action




     How to integrate a new program in Cloudgene
       1. Implement the program (or use existing)
       2. Write plugin configuration file




11
Cloudgene in Action



     Step 1 - Implement a program, executable via the command line


     e.g: FastQ pre-processing with MapReduce
          base quality / sequence quality / duplication levels / length distribution


          hadoop jar exomePreprocessing.jar -input exomeData
          -step baseJob -encoding 0 -output resultsOutput




12
Cloudgene in Action



     Step 2 - Write configuration file including 3 parts


     Part 1 – General information:




13
Cloudgene in Action



     Step 2 - Write configuration file including 3 parts


     Part 2 – Public cloud information:




14
Cloudgene in Action



     Step 2 - Write configuration file including 3 parts


     Part 3 – MapReduce information:




15
Cloudgene in Action




16
Cloudgene in Action




17
Cloudgene in Action




18
Cloudgene in Action




19
Cloudgene in Action

     Different application – different GUI




20
Technologies
     Apache Hadoop
          http://hadoop.apache.org
     Apache Whirr
          http://whirr.apache.org
     Restlet
          http://www.restlet.org
     ExtJS
          http://www.sencha.com
     H2
          http://www.h2database.com



21
Evaluation

                                              4000 sec


     Amazon Elastic MapReduce (EMR)           3500 sec

                                              3000 sec
       Graphical execution for MapReduce
       programs                               2500 sec
                                                                                  Export
       Excellent solution for public clouds   2000 sec                            Calculation
                                                                                  Import
           Combination with S3                1500 sec
                                                                                  Setup
     but                                      1000 sec

           data sensitivity                    500 sec
           Reproducibility
                                                 0 sec
           Additional costs                              Cloudgene   Amazon EMR




22
Integrated programs


 Wordcount, Grep, etc.




                    http://sourceforge.net/apps/medihouse
                                                     in
                    awiki/cloudburst-
                    bio/nfs/project/c/cl/cloudburst-
                                Exome Preprocessing
                    bio/7/70/MediaWikiSidebarLogo
                    .png        Finding SNPs
23
Acknowledgements



                                                                      Project-Website:
Sebastian Schönherr       Lukas Forer         Hansi Weissensteiner    http://cloudgene.uibk.ac.at

                                                                      Source Code:
                                                                      http://github.com/genepi


                                                                     Thanks to the Open Source
Anita Kloss-Brandstätter Florian Kronenberg     Günther Specht       Community




24

Contenu connexe

Similaire à Cloudgene - Simplified Data Processing Platform for MapReduce Programs

Delivering Bioinformatics MapReduce Applications in the Cloud
Delivering Bioinformatics MapReduce Applications in the CloudDelivering Bioinformatics MapReduce Applications in the Cloud
Delivering Bioinformatics MapReduce Applications in the CloudLukas Forer
 
The evolution of data center network fabrics
The evolution of data center network fabricsThe evolution of data center network fabrics
The evolution of data center network fabricsCisco Canada
 
cncf overview and building edge computing using kubernetes
cncf overview and building edge computing using kubernetescncf overview and building edge computing using kubernetes
cncf overview and building edge computing using kubernetesKrishna-Kumar
 
Towards CloudML, a Model-Based Approach to Provision Resources in the Clouds
Towards CloudML, a Model-Based Approach  to Provision Resources in the CloudsTowards CloudML, a Model-Based Approach  to Provision Resources in the Clouds
Towards CloudML, a Model-Based Approach to Provision Resources in the CloudsSébastien Mosser
 
PEPS: CNES Sentinel Satellite Image Analysis, On-Premises and in the Cloud wi...
PEPS: CNES Sentinel Satellite Image Analysis, On-Premises and in the Cloud wi...PEPS: CNES Sentinel Satellite Image Analysis, On-Premises and in the Cloud wi...
PEPS: CNES Sentinel Satellite Image Analysis, On-Premises and in the Cloud wi...OW2
 
ClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud TestbedClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud TestbedJazz Yao-Tsung Wang
 
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...IOSR Journals
 
ZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed SystemsZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed SystemsGokhan Boranalp
 
Predicting Space Weather with Docker
Predicting Space Weather with DockerPredicting Space Weather with Docker
Predicting Space Weather with DockerDocker, Inc.
 
Access security on cloud computing implemented in hadoop system
Access security on cloud computing implemented in hadoop systemAccess security on cloud computing implemented in hadoop system
Access security on cloud computing implemented in hadoop systemJoão Gabriel Lima
 
Google Cloud Networking Deep Dive
Google Cloud Networking Deep DiveGoogle Cloud Networking Deep Dive
Google Cloud Networking Deep DiveMichelle Holley
 
Dataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice WayDataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice WayJosef Adersberger
 
云计算及其应用
云计算及其应用云计算及其应用
云计算及其应用lantianlcdx
 
FIWARE Global Summit - FogFlow, a new GE for IoT Edge Computing
FIWARE Global Summit - FogFlow, a new GE for IoT Edge ComputingFIWARE Global Summit - FogFlow, a new GE for IoT Edge Computing
FIWARE Global Summit - FogFlow, a new GE for IoT Edge ComputingFIWARE
 
"Portable Performance via the OpenVX Computer Vision Library: Case Studies," ...
"Portable Performance via the OpenVX Computer Vision Library: Case Studies," ..."Portable Performance via the OpenVX Computer Vision Library: Case Studies," ...
"Portable Performance via the OpenVX Computer Vision Library: Case Studies," ...Edge AI and Vision Alliance
 
CPaaS.io Y1 Review Meeting - Cloud & Edge Programming
CPaaS.io Y1 Review Meeting - Cloud & Edge ProgrammingCPaaS.io Y1 Review Meeting - Cloud & Edge Programming
CPaaS.io Y1 Review Meeting - Cloud & Edge ProgrammingStephan Haller
 
Deep Learning Neural Networks in the Cloud
Deep Learning Neural Networks in the CloudDeep Learning Neural Networks in the Cloud
Deep Learning Neural Networks in the CloudIJAEMSJORNAL
 

Similaire à Cloudgene - Simplified Data Processing Platform for MapReduce Programs (20)

Delivering Bioinformatics MapReduce Applications in the Cloud
Delivering Bioinformatics MapReduce Applications in the CloudDelivering Bioinformatics MapReduce Applications in the Cloud
Delivering Bioinformatics MapReduce Applications in the Cloud
 
The evolution of data center network fabrics
The evolution of data center network fabricsThe evolution of data center network fabrics
The evolution of data center network fabrics
 
cncf overview and building edge computing using kubernetes
cncf overview and building edge computing using kubernetescncf overview and building edge computing using kubernetes
cncf overview and building edge computing using kubernetes
 
FinalReport
FinalReportFinalReport
FinalReport
 
Towards CloudML, a Model-Based Approach to Provision Resources in the Clouds
Towards CloudML, a Model-Based Approach  to Provision Resources in the CloudsTowards CloudML, a Model-Based Approach  to Provision Resources in the Clouds
Towards CloudML, a Model-Based Approach to Provision Resources in the Clouds
 
Paper444012-4014
Paper444012-4014Paper444012-4014
Paper444012-4014
 
PEPS: CNES Sentinel Satellite Image Analysis, On-Premises and in the Cloud wi...
PEPS: CNES Sentinel Satellite Image Analysis, On-Premises and in the Cloud wi...PEPS: CNES Sentinel Satellite Image Analysis, On-Premises and in the Cloud wi...
PEPS: CNES Sentinel Satellite Image Analysis, On-Premises and in the Cloud wi...
 
ClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud TestbedClassCloud: switch your PC Classroom into Cloud Testbed
ClassCloud: switch your PC Classroom into Cloud Testbed
 
D017212027
D017212027D017212027
D017212027
 
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
A Novel Approach for Workload Optimization and Improving Security in Cloud Co...
 
ZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed SystemsZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed Systems
 
Predicting Space Weather with Docker
Predicting Space Weather with DockerPredicting Space Weather with Docker
Predicting Space Weather with Docker
 
Access security on cloud computing implemented in hadoop system
Access security on cloud computing implemented in hadoop systemAccess security on cloud computing implemented in hadoop system
Access security on cloud computing implemented in hadoop system
 
Google Cloud Networking Deep Dive
Google Cloud Networking Deep DiveGoogle Cloud Networking Deep Dive
Google Cloud Networking Deep Dive
 
Dataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice WayDataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice Way
 
云计算及其应用
云计算及其应用云计算及其应用
云计算及其应用
 
FIWARE Global Summit - FogFlow, a new GE for IoT Edge Computing
FIWARE Global Summit - FogFlow, a new GE for IoT Edge ComputingFIWARE Global Summit - FogFlow, a new GE for IoT Edge Computing
FIWARE Global Summit - FogFlow, a new GE for IoT Edge Computing
 
"Portable Performance via the OpenVX Computer Vision Library: Case Studies," ...
"Portable Performance via the OpenVX Computer Vision Library: Case Studies," ..."Portable Performance via the OpenVX Computer Vision Library: Case Studies," ...
"Portable Performance via the OpenVX Computer Vision Library: Case Studies," ...
 
CPaaS.io Y1 Review Meeting - Cloud & Edge Programming
CPaaS.io Y1 Review Meeting - Cloud & Edge ProgrammingCPaaS.io Y1 Review Meeting - Cloud & Edge Programming
CPaaS.io Y1 Review Meeting - Cloud & Edge Programming
 
Deep Learning Neural Networks in the Cloud
Deep Learning Neural Networks in the CloudDeep Learning Neural Networks in the Cloud
Deep Learning Neural Networks in the Cloud
 

Plus de Jan Aerts

VIZBI 2014 - Visualizing Genomic Variation
VIZBI 2014 - Visualizing Genomic VariationVIZBI 2014 - Visualizing Genomic Variation
VIZBI 2014 - Visualizing Genomic VariationJan Aerts
 
Visual Analytics in Omics - why, what, how?
Visual Analytics in Omics - why, what, how?Visual Analytics in Omics - why, what, how?
Visual Analytics in Omics - why, what, how?Jan Aerts
 
Visual Analytics in Omics: why, what, how?
Visual Analytics in Omics: why, what, how?Visual Analytics in Omics: why, what, how?
Visual Analytics in Omics: why, what, how?Jan Aerts
 
Visual Analytics talk at ISMB2013
Visual Analytics talk at ISMB2013Visual Analytics talk at ISMB2013
Visual Analytics talk at ISMB2013Jan Aerts
 
Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)Jan Aerts
 
Humanizing Data Analysis
Humanizing Data AnalysisHumanizing Data Analysis
Humanizing Data AnalysisJan Aerts
 
Intro to data visualization
Intro to data visualizationIntro to data visualization
Intro to data visualizationJan Aerts
 
L Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformaticsL Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformaticsJan Aerts
 
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...Jan Aerts
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloudJan Aerts
 
B Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing ConsortiumB Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing ConsortiumJan Aerts
 
J Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis FrameworkJ Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis FrameworkJan Aerts
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloudJan Aerts
 
B Chapman - Toolkit for variation comparison and analysis
B Chapman - Toolkit for variation comparison and analysisB Chapman - Toolkit for variation comparison and analysis
B Chapman - Toolkit for variation comparison and analysisJan Aerts
 
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...Jan Aerts
 
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...Jan Aerts
 
S Cheng - eagle-i: development and expansion of a scientific resource discove...
S Cheng - eagle-i: development and expansion of a scientific resource discove...S Cheng - eagle-i: development and expansion of a scientific resource discove...
S Cheng - eagle-i: development and expansion of a scientific resource discove...Jan Aerts
 
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...Jan Aerts
 
A Kalderimis - InterMine: Embeddable datamining components
A Kalderimis - InterMine: Embeddable datamining componentsA Kalderimis - InterMine: Embeddable datamining components
A Kalderimis - InterMine: Embeddable datamining componentsJan Aerts
 
E Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesE Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesJan Aerts
 

Plus de Jan Aerts (20)

VIZBI 2014 - Visualizing Genomic Variation
VIZBI 2014 - Visualizing Genomic VariationVIZBI 2014 - Visualizing Genomic Variation
VIZBI 2014 - Visualizing Genomic Variation
 
Visual Analytics in Omics - why, what, how?
Visual Analytics in Omics - why, what, how?Visual Analytics in Omics - why, what, how?
Visual Analytics in Omics - why, what, how?
 
Visual Analytics in Omics: why, what, how?
Visual Analytics in Omics: why, what, how?Visual Analytics in Omics: why, what, how?
Visual Analytics in Omics: why, what, how?
 
Visual Analytics talk at ISMB2013
Visual Analytics talk at ISMB2013Visual Analytics talk at ISMB2013
Visual Analytics talk at ISMB2013
 
Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)Visualizing the Structural Variome (VMLS-Eurovis 2013)
Visualizing the Structural Variome (VMLS-Eurovis 2013)
 
Humanizing Data Analysis
Humanizing Data AnalysisHumanizing Data Analysis
Humanizing Data Analysis
 
Intro to data visualization
Intro to data visualizationIntro to data visualization
Intro to data visualization
 
L Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformaticsL Fu - Dao: a novel programming language for bioinformatics
L Fu - Dao: a novel programming language for bioinformatics
 
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module...
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloud
 
B Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing ConsortiumB Temperton - The Bioinformatics Testing Consortium
B Temperton - The Bioinformatics Testing Consortium
 
J Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis FrameworkJ Goecks - The Galaxy Visual Analysis Framework
J Goecks - The Galaxy Visual Analysis Framework
 
S Cain - GMOD in the cloud
S Cain - GMOD in the cloudS Cain - GMOD in the cloud
S Cain - GMOD in the cloud
 
B Chapman - Toolkit for variation comparison and analysis
B Chapman - Toolkit for variation comparison and analysisB Chapman - Toolkit for variation comparison and analysis
B Chapman - Toolkit for variation comparison and analysis
 
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu...
 
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
J Klein - KUPKB: sharing, connecting and exposing kidney and urinary knowledg...
 
S Cheng - eagle-i: development and expansion of a scientific resource discove...
S Cheng - eagle-i: development and expansion of a scientific resource discove...S Cheng - eagle-i: development and expansion of a scientific resource discove...
S Cheng - eagle-i: development and expansion of a scientific resource discove...
 
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
A Kanterakis - PyPedia: a python crowdsourcing development environment for bi...
 
A Kalderimis - InterMine: Embeddable datamining components
A Kalderimis - InterMine: Embeddable datamining componentsA Kalderimis - InterMine: Embeddable datamining components
A Kalderimis - InterMine: Embeddable datamining components
 
E Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutesE Afgan - Zero to a bioinformatics analysis platform in four minutes
E Afgan - Zero to a bioinformatics analysis platform in four minutes
 

Dernier

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 

Dernier (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 

Cloudgene - Simplified Data Processing Platform for MapReduce Programs

  • 1. Cloudgene - an execution platform for MapReduce programs in public and private clouds Lukas Forer, Sebastian Schönherr, Hansi Weißensteiner University of Innsbruck, Austria Medical University Innsbruck, Austria BOSC 2012
  • 2. MapReduce cluster Serial approach Parallel approach cloud private public How to support scientists when using (our) MapReduce programs? Simplify the execution of MapReduce programs including data management Simplify access to a working MapReduce cluster Maintain data sensitivity 2 MapReduce: Simplified Data Processing on Large Clusters - Dean & Ghemawat - 2004
  • 3. MapReduce in Genetics CloudBurst highly sensitive read mapping with MapReduce; Schatz, 2009 Crossbow Searching for SNPs with cloud computing; Langmead et al., 2009 MyRNA Cloud-scale RNA-sequencing differential expression analysis with Myrna; Langmead et al., 2010 Seal a Distributed Short Read Mapping and Duplicate Removal Tool; Pireddu et al., 2012 Hadoop BAM directly manipulating next generation sequencing data in the cloud; Matti Niemenmaa et al., 2012 CloudBioLinux CloudBioLinux: pre-configured and on-demand bioinformatics computing for the genomics community; Krampis et al., 2012 3
  • 4. Difficulties with MapReduce Additional steps, when setting up a cluster in a public environment Required steps when cluster is up and running, Hadoop installed 4
  • 5. Approaches Possible approaches Program specific approach Implement a GUI for every program Redundant work for the developer Heterogeneity Workflow systems Galaxy, Taverna, Mobyle Possible, but no HDFS support, blackbox Our approach for Hadoop MapReduce One GUI for different programs Feedback, Standardized Import/Export Integration of programs via a plugin interface 5
  • 6. What is Cloudgene? Open-source platform to improve the usability of Hadoop MapReduce jobs Provides a graphical web interface for their execution Programs can be integrated by writing a simple configuration file Public cloud & private cloud Setting up a cluster in the cloud, installs all data on it History of executed jobs with defined input/output parameters Runs in your browser Myrna CloudBurst Seal Crossbow CloudBioLinux Cloudgene 6
  • 8. Features Integration of programs easily possible standard MapReduce programs (Java -> CloudBurst) streaming jobs (e.g. Mapper and Reducer using Perl-> Myrna) command line programs (e.g. using Pydoop -> Seal) Data can be imported from different sources S3 / HTTP / FTP Import of huge datasets Export results to S3 (public cloud) Connect different MapReduce programs to a pipeline Install additional programs via a web repository 8
  • 9. Features Cloudgene can be used on private and public clusters sensitive data local data } private cloud data on S3 no in-house cluster } public cloud available Open source 9
  • 11. Cloudgene in Action How to integrate a new program in Cloudgene 1. Implement the program (or use existing) 2. Write plugin configuration file 11
  • 12. Cloudgene in Action Step 1 - Implement a program, executable via the command line e.g: FastQ pre-processing with MapReduce base quality / sequence quality / duplication levels / length distribution hadoop jar exomePreprocessing.jar -input exomeData -step baseJob -encoding 0 -output resultsOutput 12
  • 13. Cloudgene in Action Step 2 - Write configuration file including 3 parts Part 1 – General information: 13
  • 14. Cloudgene in Action Step 2 - Write configuration file including 3 parts Part 2 – Public cloud information: 14
  • 15. Cloudgene in Action Step 2 - Write configuration file including 3 parts Part 3 – MapReduce information: 15
  • 20. Cloudgene in Action Different application – different GUI 20
  • 21. Technologies Apache Hadoop http://hadoop.apache.org Apache Whirr http://whirr.apache.org Restlet http://www.restlet.org ExtJS http://www.sencha.com H2 http://www.h2database.com 21
  • 22. Evaluation 4000 sec Amazon Elastic MapReduce (EMR) 3500 sec 3000 sec Graphical execution for MapReduce programs 2500 sec Export Excellent solution for public clouds 2000 sec Calculation Import Combination with S3 1500 sec Setup but 1000 sec data sensitivity 500 sec Reproducibility 0 sec Additional costs Cloudgene Amazon EMR 22
  • 23. Integrated programs Wordcount, Grep, etc. http://sourceforge.net/apps/medihouse in awiki/cloudburst- bio/nfs/project/c/cl/cloudburst- Exome Preprocessing bio/7/70/MediaWikiSidebarLogo .png Finding SNPs 23
  • 24. Acknowledgements Project-Website: Sebastian Schönherr Lukas Forer Hansi Weissensteiner http://cloudgene.uibk.ac.at Source Code: http://github.com/genepi Thanks to the Open Source Anita Kloss-Brandstätter Florian Kronenberg Günther Specht Community 24