Cloudgene
A MapReduce based Workflow Management System
Lukas Forer and Sebastian Schönherr
Division of Genetic Epidemiology
Medical University of Innsbruck, Austria
UPPNEX Workshop - January 2015
Page 2
Motivation: Bioinformatics
• Next Generation Sequencing (NGS)
– Sequencing the whole genome at low cost
– Gigabytes of produced data per experiment
– Allows data production at high scale
• Data generation is not the bottleneck anymore
• Data processing is the current bottleneck
– Single Workstation not sufficient
– Super-Computers too expensive
Page 3
MapReduce
• Commodity computing
– Parallel computing on a large number of low budget
components
• MapReduce
– Parallelization framework
– Enables analyzing large data
– User writes map/reduce function
– Framework takes care of
fault tolerance, data distribution, load balancing
– Apache Hadoop: Open Source implementation
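The division of labour above (the user writes map and reduce, the framework does the rest) can be sketched in plain Python. This is only an illustration, not Hadoop code: the k-mer counting task, the value of K, and the function names map_read, reduce_kmer and run_local are assumptions made for this sketch, and run_local stands in for the grouping that a real framework would perform across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

K = 3  # k-mer length, chosen only for this illustration


def map_read(read: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (k-mer, 1) for every k-mer in one sequencing read."""
    for i in range(len(read) - K + 1):
        yield read[i:i + K], 1


def reduce_kmer(kmer: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum all counts that were emitted for one k-mer."""
    return kmer, sum(counts)


def run_local(reads: List[str]) -> Dict[str, int]:
    """Local stand-in for the framework: group map output by key, then reduce.

    A real MapReduce framework would distribute the reads, shuffle the
    intermediate pairs between nodes, and rerun failed tasks.
    """
    grouped: Dict[str, List[int]] = defaultdict(list)
    for read in reads:
        for kmer, count in map_read(read):
            grouped[kmer].append(count)
    return dict(reduce_kmer(kmer, counts) for kmer, counts in grouped.items())
```

Only map_read and reduce_kmer correspond to user code; everything in run_local is what the framework would take over.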
Page 4
MapReduce in Bioinformatics (1)
• Hadoop MapReduce libraries for Bioinformatics
– Hadoop-BAM: Manipulation of aligned next-generation sequencing data (supports BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF)
– SeqPig: Processing NGS data with Apache Pig; presenting UDFs for frequent tasks; using Hadoop-BAM
– BioPig: Processing NGS data with Apache Pig; presenting UDFs
– Biodoop: MapReduce suite for sequence alignments / manipulation of aligned records; written in Python
• DNA alignment algorithms based on Hadoop
– CloudBurst: Based on RMAP (seed-and-extend algorithm). Map: extracting k-mers of reference, non-overlapping k-mers of reads (as keys). Reduce: end-to-end alignments of seeds
– Seal: Based on BWA (version 0.5.9). Map: alignment using BWA (on a previously created internal file format). Reduce: remove duplicates (optional)
– Crossbow: Based on Bowtie / SOAPsnp. Map: executing Bowtie on chunks. Reduce: SNP calling using SOAPsnp
• RNA analysis based on Hadoop
– MyRNA: Pipeline for calculating differential gene expression in RNA; including Bowtie
– FX: RNA-Seq analysis tool
– Eoulsan: RNA-Seq analysis tool
• Non-Hadoop based approaches
– GATK: MapReduce-like framework including a rich set of tools for quality assurance, alignment and variant calling; not based on Hadoop MapReduce
Page 5
MapReduce in Bioinformatics (2)
• Bioinformatics MapReduce Applications
– Available only on a per-tool basis
– Cover one aspect of a larger data analysis pipeline
– Hard to use for scientists without background in
Computer Science
• Popular workflow systems
– Enable this level of abstraction for the traditional tools
– Do not support tools based on MapReduce
Missing: System which enables building
MapReduce workflows
Page 6
Cloudgene
• System to execute MapReduce programs graphically
and combine them into workflows
• One platform – many programs
– Integration of existing MapReduce programs without
source code adaptations
– Create workflows using MapReduce, Apache Pig, R or
Unix command-line programs
• Runs in your browser
Page 7
Cloudgene: Overview
Cloudgene-MapRed
MapReduce Workflow Manager
Bioinformatics Workflows
• Requires a compatible cluster to execute workflows
– Small and medium-sized research institutes can hardly
afford their own clusters
– Cloud computing: rent computer hardware from different
providers (e.g. Amazon, HP)
Page 8
Cloudgene: Overview
Cloudgene-MapRed
MapReduce Workflow Manager
Cloudgene-Cluster
Infrastructure Manager
Bioinformatics Workflows
Page 9
Cloudgene: Advantages
Page 10
Architecture
Page 11
Workflow Composition
• New MapReduce algorithms can be integrated easily
• Integration of existing MapReduce algorithms without
adaptations in source code
• Cloudgene uses its own workflow language
• Workflow Definition Language (WDL)
– Formal description of tasks and workflow steps
– Property-based and uses the YAML syntax
– Supports heterogeneous software components
(MapReduce, R and unix command-line programs)
– Basic workflow control patterns (loops and conditions)
Page 13
Workflow Composition
• Example of a simple WDL-Manifest file
Command line
parameters
Inputs:
Are set by the user
through the web
interface
Outputs:
are created by
tasks (intermediate
or persistent)
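The manifest shown on the original slide is an image and is not part of this transcript. The relationship it annotates (inputs set by the user, outputs created by tasks, both referenced from the command line parameters) can be sketched as follows. The manifest is written as a Python dict rather than YAML, and every field name, file name and path in it is hypothetical; this is not Cloudgene's actual WDL schema.

```python
from string import Template
from typing import Dict, List

# Hypothetical manifest. All keys and values are assumptions for this sketch.
MANIFEST = {
    "name": "kmer-count",
    "steps": [
        {"name": "Count k-mers", "jar": "kmer-count.jar", "params": "$reads $counts"},
        {"name": "Report", "exec": "report.sh", "params": "$counts $report"},
    ],
    # Inputs: values come from the user (in Cloudgene, through the web interface).
    "inputs": [{"id": "reads", "type": "hdfs-folder"}],
    # Outputs: locations are assigned by the system and filled by the tasks.
    "outputs": [
        {"id": "counts", "type": "hdfs-folder", "persistent": False},  # intermediate
        {"id": "report", "type": "local-file", "persistent": True},
    ],
}


def resolve_params(manifest: dict, user_inputs: Dict[str, str], workspace: str) -> List[str]:
    """Return the command line parameters of each step with placeholders filled in."""
    values = dict(user_inputs)
    for output in manifest["outputs"]:
        # Each output gets a location inside the job's workspace.
        values[output["id"]] = f"{workspace}/{output['id']}"
    # Template.substitute raises KeyError if a step references an unknown id.
    return [Template(step["params"]).substitute(values) for step in manifest["steps"]]
```

Because the second step reads the output of the first through the shared id counts, the same description is enough to chain tasks without touching their source code.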
Page 14
Workflow Composition
• The user interface is created automatically
Page 15
Workflow Execution Engine
1. Creates a dependency graph based on the WDL file and user input
2. Optimizes the graph to minimize the execution time (i.e. caching)
3. Schedules and submits jobs to the Hadoop Cluster
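The three steps above can be sketched with the standard library's graphlib module (Python 3.9 or later). This is a minimal illustration under stated assumptions, not Cloudgene's engine: the step descriptions with reads/writes lists, the functions build_graph and plan, and the idea of passing in a set of already cached step names are all hypothetical, and submitting jobs to Hadoop is reduced to returning the execution order.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical step description: which data ids a step reads and writes.
Steps = Dict[str, Dict[str, List[str]]]


def build_graph(steps: Steps) -> Dict[str, Set[str]]:
    """Step B depends on step A when B reads something that A writes."""
    producers = {out: name for name, step in steps.items() for out in step["writes"]}
    return {
        name: {producers[i] for i in step["reads"] if i in producers}
        for name, step in steps.items()
    }


def plan(steps: Steps, cached: Set[str]) -> List[str]:
    """Order steps so dependencies run first, skipping steps with cached results.

    Simplification: a real engine would also have to check that a cached
    result is still valid for the current inputs and parameters.
    """
    order = TopologicalSorter(build_graph(steps)).static_order()
    return [name for name in order if name not in cached]


STEPS: Steps = {
    "align": {"reads": ["reads"], "writes": ["bam"]},
    "call": {"reads": ["bam"], "writes": ["vcf"]},
    "report": {"reads": ["vcf"], "writes": ["html"]},
}
```

Calling plan(STEPS, set()) yields the full order, while plan(STEPS, {"align"}) leaves out the alignment step because its result is assumed to be available already.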
Page 16
Web Interface
Page 17
Workflow Results
Used
Parameters
Download links
to result files
Page 18
Supported Technologies
• Apache Hadoop MapReduce
• Apache Pig
• RMarkdown
– Ideal to generate html files with charts, statistics, …
• Unix command line programs
– Cloudgene automatically exports all HDFS files
– No manual file staging between HDFS and POSIX filesystem
needed!
Advantage: Composition of
hybrid Workflows possible
Page 19
Other Features
• Authentication and User-Management
• Parameter Tracking
• HDFS Workspace
– Hides the HDFS filesystem from the end user
– Importing Data from Amazon S3 Buckets,
HTTP and (S)FTP Servers, File Uploads, ...
– Facilitates the management of datasets on
the cluster
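One way to picture the import step is a dispatch on the URL scheme of the data source. The sketch below is an assumption about how such a dispatch could look, not a description of Cloudgene's implementation; the IMPORTERS table and pick_importer are hypothetical names, and the actual transfer into HDFS is left out.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URL scheme to the kind of import that would run.
IMPORTERS = {
    "s3": "Amazon S3",
    "http": "HTTP",
    "https": "HTTP",
    "ftp": "FTP",
    "sftp": "SFTP",
}


def pick_importer(url: str) -> str:
    """Return the name of the import method matching the URL's scheme."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in IMPORTERS:
        raise ValueError(f"unsupported data source: {url}")
    return IMPORTERS[scheme]
```

From the user's point of view only the source URL matters; where the data lands in HDFS stays hidden behind the workspace.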
Page 20
Preview: Cloudgene 2.0
• Interface for web-services
– Same WDL file, but different interface
– User Registration
– Intelligent Queuing
– User Notification
• Examples:
– https://imputationserver.sph.umich.edu
– http://mtdna-server.uibk.ac.at
Page 21
Preview: Cloudgene 2.0
• Generic data analysis platform
– Integration of additional data processing models
Cloudgene
Hadoop 1.0
MapReduce
Cloudgene
Hadoop 2.0
YARN
MapReduce Spark Giraph …
Page 22
Conclusion
• Website
– http://cloudgene.uibk.ac.at
• Virtual Machine
– https://bioimg.org/cloudgene
• Getting started
– http://cloudgene.uibk.ac.at/getting-started
• Developer Guide
– http://cloudgene.uibk.ac.at/developer-guide
Page 23
Acks
• Cloudgene
– Lukas Forer (@lukfor) and Sebastian Schoenherr
(@seppinho)
• Imputation with Minimac
– Goncalo Abecasis, Christian Fuchsberger
• mtDNA-Server
– Hansi Weißensteiner
• Univ.-Prof. Florian Kronenberg
– Head of the Division of Genetic Epidemiology,
Medical University of Innsbruck

Cloudgene - A MapReduce based Workflow Management System

  • 1. Cloudgene A MapReduce based Workflow Management System Lukas Forer and Sebstian Schönherr Division of Genetic Epidemiology Medical University of Innsbruck, Austria UPPNEX Workshop - January 2015
  • 2. Page 2 Motivation: Bioinformatics • Next Generation Sequencing (NGS) – Sequencing the whole genome at low cost – Gigabytes of produced data per experiment – Allows data production at high scale • Data generation is not the bottleneck anymore • Data processing as the current bottleneck – Single Workstation not sufficient – Super-Computers too expensive
  • 3. Page 3 MapReduce • Commodity computing – Parallel computing on a large number of low budget components • MapReduce – Parallelization framework – Enables analyzing large data – User writes map/reduce function – Framework takes care about fault-tolerance, data distribution, load balancing – Apache Hadoop: Open Source implementation
  • 4. Page 4 MapReduce in Bioinformatics (1) Hadoop MapReduce libraries for Bioinformatics Hadoop BAM Manipulation of aligned next-generation sequencing data (supports BAM, SAM, FASTQ, FASTA, QSEQ, BCF, and VCF) SeqPig Processing NGS data with Apache Pig; Presenting UDFs for frequent tasks; using Hadoop- BAM BioPig Processing NGS data with Apache Pig; Presenting UDFs Biodoop MapReduce suite for sequence alignments / manipulation of aligned records; written in Python DNA - Alignment algorithms based on Hadoop CloudBurst Based on RMAP (seed-and-extend algorithm) Map: Extracting k-mers of reference, non- overlapping k-mers of reads (as keys) Reduce: End-to-end alignments of seeds Seal Based on BWA (version 0.5.9) Map: Alignment using BWA (on a previously created internal file format) Reduce: Remove duplicates (optional) Crossbow Based on Bowtie / SOAPsnp Map: Executing Bowtie on chunks Reduce: SNP calling using SOAPsnp RNA - Analysis based on Hadoop MyRNA Pipeline for calculating differential gene expression in RNA; including Bowtie FX RNA-Seq analysis tool Eoulsan RNA-Seq analysis tool Non-Hadoop based Approaches GATK MapReduce-like framework including a rich set of tools for quality assurance, alignment and variant calling; not based on Hadoop MapReduce
  • 5. Page 5 MapReduce in Bioinformatics (2) • Bioinformatics MapReduce Applications – Available only on a per-tool basis – Cover one aspect of a larger data analysis pipeline – Hard to use for scientists without background in Computer Science • Popular workflow systems – Enable this level of abstraction for the traditional tools – Do not support tools based on MapReduce Missing: System which enables building MapReduce workflows
  • 6. Page 6 Cloudgene • System to execute MapReduce programs graphically and combine them to workflows • One platform – many programs – Integration of existing MapReduce programs without source code adaptations – Create workflows using MapReduce, Apache Pig, R or Unix command-line programs • Runs in your browser
  • 7. Page 7 Cloudgene: Overview [Diagram: Bioinformatics Workflows; Cloudgene-MapRed (MapReduce Workflow Manager)] • Requires a compatible cluster to execute workflows – Small/Medium sized research institutes can hardly afford own clusters – Cloud computing: rent computer hardware from different providers (e.g. Amazon, HP)
  • 8. Page 8 Cloudgene: Overview [Diagram: Bioinformatics Workflows; Cloudgene-MapRed (MapReduce Workflow Manager); Cloudgene-Cluster (Infrastructure Manager)]
  • 11. Page 11 Workflow Composition • New MapReduce algorithms can be integrated easily • Integration of existing MapReduce algorithms without adaptations in source code • Cloudgene uses its own workflow language • Workflow Definition Language (WDL) – Formal description of tasks and workflow steps – Property-based and uses the YAML syntax – Supports heterogeneous software components (MapReduce, R and unix command-line programs) – Basic workflow control patterns (loops and conditions)
  • 12. Page 13 Workflow Composition • Example of a simple WDL manifest file [Screenshot annotations: command-line parameters; inputs are set by the user through the web interface; outputs are created by tasks (intermediate or persistent)]
  • 13. Page 14 Workflow Composition • The user interface is created automatically
  • 14. Page 15 Workflow Execution Engine 1. Creates a dependency graph based on the WDL file and user input 2. Optimizes the graph to minimize the execution time (i.e. caching) 3. Schedules and submits jobs to the Hadoop Cluster
  • 17. Page 18 Supported Technologies • Apache Hadoop MapReduce • Apache Pig • RMarkdown – Ideal to generate HTML files with charts, statistics, … • Unix command-line programs – Cloudgene automatically exports all HDFS files – No manual file staging between HDFS and POSIX filesystem needed! Advantage: Composition of hybrid workflows possible
  • 18. Page 19 Other Features • Authentication and User-Management • Parameter Tracking • HDFS Workspace – Hides the HDFS filesystem from the end user – Importing data from Amazon S3 buckets, HTTP and (S)FTP servers, file uploads, ... – Facilitates the management of datasets on the cluster
  • 19. Page 20 Preview: Cloudgene 2.0 • Interface for web-services – Same WDL file, but different interface – User Registration – Intelligent Queuing – User Notification • Examples: – https://imputationserver.sph.umich.edu – http://mtdna-server.uibk.ac.at
  • 20. Page 21 Preview: Cloudgene 2.0 • Generic data analysis platform – Integration of additional data processing models [Diagram: Cloudgene on Hadoop 1.0 (MapReduce) compared with Cloudgene on Hadoop 2.0 (YARN: MapReduce, Spark, Giraph, …)]
  • 21. Page 22 Conclusion • Website – http://cloudgene.uibk.ac.at • Virtual Machine – https://bioimg.org/cloudgene • Getting started – http://cloudgene.uibk.ac.at/getting-started • Developer Guide – http://cloudgene.uibk.ac.at/developer-guide
  • 22. Page 23 Acks • Cloudgene – Lukas Forer (@lukfor) and Sebastian Schoenherr (@seppinho) • Imputation with Minimac – Goncalo Abecasis, Christian Fuchsberger • mtDNA-Server – Hansi Weißensteiner • Univ.-Prof. Florian Kronenberg – Head of the Division of Genetic Epidemiology, Medical University of Innsbruck
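
As an illustration of the "user writes map/reduce function" point on slide 3, here is a minimal single-process sketch in plain Python. It is not Hadoop code and not taken from the slides: the function names (`map_kmers`, `shuffle`, `reduce_counts`, `run_job`) and the k-mer counting task are assumptions chosen for illustration. In a real framework the grouping step and the distribution across machines are handled for the user.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k=3):
    """Map: emit a (k-mer, 1) pair for every overlapping k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Group values by key; a real framework does this across machines."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: sum the counts emitted for one k-mer."""
    return key, sum(values)


def run_job(reads, k=3):
    """Run map, shuffle and reduce in sequence on a list of reads."""
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical toy input: two short reads.
counts = run_job(["ACGTAC", "GTACGT"])
```

Because `map_kmers` and `reduce_counts` keep no state between calls, each call could run on a different machine, which is the property that lets the approach scale.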

Editor's notes

  1. Welcome everybody to the defense of my PhD thesis. In the next 20 minutes I will give you an overview of the results and outcomes of my thesis. The main topic is the efficient analysis of data in the field of bioinformatics.
  2. NGS makes it possible to sequence the whole genome. This is done in an extremely parallel way, which allows sequencing the genome at low cost and high scale. As a consequence, more and more data will be produced. So the bottleneck is no longer data production in the lab but its analysis, because one experiment produces gigabytes of data. Therefore, a single workstation is not sufficient for the data analysis, and supercomputers are often too expensive!
  3. One solution to that problem is commodity computing. That means we use a large number of ordinary, cheap computing components to perform our analysis in parallel. One approach that was developed specifically for that kind of infrastructure is MapReduce. It is a parallelization framework developed by Google in 2004 that makes it possible to analyze large data efficiently in parallel. The user writes only the map and the reduce function, and the framework takes care of fault tolerance, data distribution and load balancing: everything we need in a parallel computing environment. The map and reduce functions are stateless and can be executed in parallel, and therefore this approach scales very well! Apache Hadoop is an open-source implementation of MapReduce.
  4. As this table shows, several MapReduce applications already exist in the field of bioinformatics, and the approach has high potential. For example, there are algorithms available for mapping short reads to a reference or for RNA analysis.
  5. But the problem with such approaches is that they are available only on a per-tool basis. In genetics we often need large workflows that consist of several steps to analyze data, but those tools cover only one aspect of such a pipeline. Moreover, for biologists without a background in computer science they are very hard to use. Most popular workflow systems, such as Galaxy, enable this abstraction only for traditional tools and not for MapReduce. So a system that enables building such MapReduce workflows is missing.
  6. So the aims of my thesis can be divided into two parts. First, developing a system to compose complex workflows of multiple MapReduce tools. This is done by abstracting all the technical details. Second, evaluating this system by applying it to bioinformatics. For that reason I adapted three different workflows to MapReduce: the first is for genotype imputation, the second for genome-wide association studies, and the last one detects copy number variations.
  7. The first aim was achieved by implementing a workflow execution engine called Cloudgene-MapRed, and on top of this I have integrated the three workflows. Cloudgene-MapRed requires a compatible cluster to execute the pipeline. Especially for small research institutes it can be hard to afford and maintain their own cluster. A possible solution is cloud computing, which makes it possible to rent computer hardware from different providers, for example Amazon, and to use the rented resources on demand.
  8. To overcome these issues, Sebastian developed in his thesis an infrastructure manager that makes it possible to launch and manage a Hadoop cluster through the browser. So it is possible to run the same workflows on a local cluster, on a private cloud or on a public cloud. This whole system is called Cloudgene, and in my presentation today I talk about the workflow execution engine and one of the three workflows, which is called Imputation Server.
  9. On this slide you can see the advantages of Cloudgene compared with the manual approach.
  10. This workflow manager assists scientists in executing and monitoring workflows. The core of the architecture is the execution engine. As you can see in this picture, the workflow execution engine operates on a Hadoop cluster; therefore data reliability and fault tolerance are provided. The workflow engine contains an optimizer that tries to minimize the execution time by using caching mechanisms. Moreover, it contains a data manager for importing and exporting datasets. The system has a REST API in order to communicate with clients. In our case the client is a web application.
  11. The workflow composition in Cloudgene was developed with two aims in mind: first, it should be possible to implement new algorithms easily; second, it should be possible to integrate existing algorithms without source code adaptations. For that reason, I developed a new workflow language called WDL, which is used by Cloudgene. It enables a formal description of workflows and their tasks, is property-based and uses a human-readable syntax, supports different software components as tasks, and supports some basic control patterns like conditions and loops.
  12. Here is a very simple example of such a workflow written in WDL. I don't want to go into too much detail, but you can define inputs and outputs and then reuse them in your tasks.
  13. Based on this manifest file, we automatically create a user interface that can be used to submit the job with different parameters and datasets. When the user clicks the submit button, the workflow engine comes into play.
  14. We have the WDL manifest file with the workflow structure, and the user input that is used to execute it. Based on this information, a graph is created that contains all tasks and their dependencies. Then the optimizer tries to minimize the graph by using caching. Finally, based on this graph, a task execution plan is created, which is used to submit the jobs to the cluster.
  15. Once the job is submitted, we can monitor the progress.
  16. When the job is complete, we can download the result files directly through the browser, and all parameters used are tracked.
  17. Besides the Hadoop technologies, we also support other useful technologies, for example RMarkdown to create HTML reports, or any other Unix command-line program. In this case Cloudgene automatically exports files from HDFS to the local filesystem. So an intuitive combination of these technologies is possible.
  18. The next step of this project is to turn Cloudgene into a more generic data analysis cloud platform. Therefore we plan to integrate additional big data computation models so that Cloudgene is not limited to MapReduce. One possibility is to integrate YARN, the resource-management layer introduced with Hadoop 2.0, which sits between the cluster and processing models such as MapReduce. So we can also support other models, for example for graph data processing and in-memory calculations.
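
The three stages described in note 14 (build a dependency graph, prune it with caching, derive an execution plan) can be sketched roughly as follows. This is a simplified illustration and not Cloudgene's actual implementation: the `steps` dictionary, the dataset names and the functions `build_graph` and `plan` are hypothetical, and the caching rule (skip a step when all of its outputs already exist) is an assumption that ignores whether upstream inputs have changed.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each step names the datasets it reads and writes.
steps = {
    "align": {"reads": ["fastq"], "writes": ["bam"]},
    "call": {"reads": ["bam"], "writes": ["vcf"]},
    "report": {"reads": ["vcf", "bam"], "writes": ["html"]},
}


def build_graph(steps):
    """Map each step to the set of steps that produce its inputs."""
    producer = {out: name for name, step in steps.items() for out in step["writes"]}
    return {
        name: {producer[data] for data in step["reads"] if data in producer}
        for name, step in steps.items()
    }


def plan(steps, cached=frozenset()):
    """Return steps in dependency order, skipping steps whose outputs are all cached."""
    order = TopologicalSorter(build_graph(steps)).static_order()
    return [name for name in order if not set(steps[name]["writes"]) <= set(cached)]


full_plan = plan(steps)                   # nothing cached: every step runs
cached_plan = plan(steps, cached={"bam"})  # alignment output already available
```

With an empty cache all three steps are scheduled in dependency order; once the `bam` dataset is marked as cached, the `align` step drops out of the plan and only the downstream steps are submitted.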