SlideShare une entreprise Scribd logo
1  sur  39
Télécharger pour lire hors ligne
Workflow Engines for
Hadoop
Joe Crobak
@joecrobak
NYC Data Engineering Meetup
September 5, 2013
1
Intro
2
Background
• Devops/Infra for Hadoop
• ~4 years with Hadoop
• Have done two migrations from EMR to the colo.
• Formerly Data/Analytics Infrastructure @
• worked with Apache Oozie and Luigi
• Before that, Hadoop @
• worked with Azkaban 1.0
Disclosure: I’ve contributed to Luigi and Azkaban 1.0
3
What is Apache Hadoop?
4
What is a workflow?
5
What is a workflow engine?
6
Two Example Use-Cases
7
Analytics / Data Warehousing
• logs -> fact table(s).
• database backups -> dimension tables.
• Compute rollups/cubes.
• Load data into a low-latency store (e.g.
Redshift,Vertica, HBase).
• Dashboarding & BI tools hit database.
8
Analytics / Data Warehousing
9
Analytics / Data Warehousing
• What happens if there’s a failure?
• rebuild the failed day
• ... and any downstream datasets
10
Hadoop-Driven Features
• PeopleYou May Know
• Amazon-style “People
that buy this often by
that”
• SPAM detection
• logs, databases ->
machine learning /
collaborative filtering
• derivative datasets ->
production database
(often k/v store)
11
Hadoop-Driven Features
12
Hadoop-Driven Features
• What happens if there’s a failure?
• possibly OK to skip a day.
• Workflow tends to be self-contained, so
you don’t need to rerun downstream.
• Sanity check your data before pushing to
production.
13
Workflow Engine Evolution
• Usually start with cron
• at 01:00 import data
• at 02:00 run really expensive query A
• at 03:00 run query B, C, D
• ...
• This goes on until you have ~10 jobs or so.
• It’s hard to debug and rerun.
• Doesn’t scale to many people.
14
Workflow Engine Evolution
• Two possibilities:
1. “a workflow engine can’t be too hard,
let’s write our own”
2. spend weeks evaluating all the options
out there.Try to shoehorn your
workflow into each one.
15
Workflow Engine
Considerations
How do I...
• Deploy and Upgrade
• workflows and the workflow engine
• Test
• Detect Failure
• Debug/find logs
• Rebuild/backfill datasets
• Load data to/from a RDBMS
• Manage a set of similar tasks
16
Apache
http://oozie.apache.org/
17
Oozie - architecture
18
Oozie - the good
• Great community support
• Integrated with HUE, Cloudera Manager,Apache
Ambari
• HCatalog integration
• SLA alerts (new in Oozie 4)
• Ecosystem support: Pig, Hive, Sqoop, etc.
• Very detailed documentation
• Launcher jobs as map tasks
19
Oozie - the bad
• Launcher jobs as map tasks.
• UI - but HUE, oozie-web (and
good API)
• Confusing object model (bundles,
coordinators, workflows) - high
barrier to entry.
• Setup - extjs, hadoop proxy user,
RDBMS.
• XML!
20
Oozie - the bad
• Hello World in Oozie
21
http://azkaban.github.io/azkaban2/
22
Azkaban - architecture
Source: http://azkaban.github.io/azkaban2/overview.html
23
Azkaban - the good
• Great UI
• DAG visualization
• Task history
• Easy access to log files
• Plugin architecture
• Pig, Hive, etc. Also, voldemort “build and push” integration
• SLA Alerting
• HDFS Browser
• User Authentication/Authorization and auditing.
• Reportal: https://github.com/azkaban/azkaban-plugins/pull/6
24
25
Azkaban - the bad
• Representing data dependencies
• i.e. run job X when datasetY is available.
• Executors run on separate workers, can be
under-utilized (YARN anyone?).
• Community - mostly just LinkedIn, and they
rewrote it in isolation.
• mailing list responsiveness is good.
26
Azkaban - good and
bad
• Job definitions as java properties
• Web uploads/deploy
• Running jobs, scheduling jobs.
• nearly impossible to integrate with
configuration management
27
https://github.com/spotify/luigi
28
Luigi - architecture
29
Luigi - the good
• Task definitions are code.
• Tasks are idempotent.
• Workflow defines data (and task) dependencies.
• Growing community.
• Easy to hack on the codebase (<6k LoC).
• Postgres integration
• Foursquare got this working with Redshift and
Vertica.
30
Luigi - the bad
• Missing some key features, e.g. Pig support
• but this is easy to add
• Deploy situation is confusing (but easy to
automate)
• visualizer scaling
• no persistent backing
• JVM overhead
31
Comparison matrix -
part 1
Lang
Code
Complexity
Frameworks Logs Community Docs
oozie java high - 105k
pig, hive, sqoop,
mapreduce
decentralized,
map tasks
Good - ASF in
many distros
excelle
nt
azkaban java moderate - 26k
pig, hive,
mapreduce
UI-accessible
few users,
responsive on
MLs
good
luigi python simple - 5.9k
hive, postgres,
scalding, python
streaming
decentral-ized
on workers
few users,
responsive on
github and MLs
good
32
Comparison matrix -
part 2
property
configuration
Reruns
Customizat
ion (new
job type)
Testing User Auth
oozie
command-line,
properties file, xml
defaults
oozie job -
rerun
difficult MiniOozie
Kerberos, simple,
custom
azkaban
bundled inside
workflow zip, system
defaults
partial
reruns in UI
plugin
architecture
?
xml-based,
custom
luigi
command-line,
python ini file
remove
output,
idempotency
subclass
luigi.Task
python
unittests
linux-based
33
Other workflow
engines
• Chronos
• EMR
• Mortar
• Qubole
• general purpose:
• kettle, spring batch
34
Qualities I like in a
workflow engine
• scripting language
• you end up writing scripts to run your job anyway
• custom logic, e.g. representing a dep on 7-days of data or run
only every week
• Less property propagation
• Idempotency
• WYSIWYG
• It shouldn't be hard to take my existing job and move it to the
workflow engine (it should just work).
• Easy to hack on
35
Less important
• High availability (cold failover with manual
intervention is OK)
• Multiple cluster support
• Security
36
Best Practices
• Version datasets
• Backfilling datasets
• Monitor the absence of a job running
• Continuous deploy?
37
Resources
• Azkaban talk at Hadoop User Group:
http://www.youtube.com/watch?
v=rIUlh33uKMU
• PyData talk on Luigi: http://vimeo.com/
63435580
• Oozie talk at Hadoop user Group: http://
www.slideshare.net/mislam77/oozie-hug-
may12
38
Thanks!
• Questions?
• shameless plug: Subscribe to my
newsletter: http://hadoopweekly.com
39

Contenu connexe

Tendances

Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Bolke de Bruin
 
Managing data workflows with Luigi
Managing data workflows with LuigiManaging data workflows with Luigi
Managing data workflows with LuigiTeemu Kurppa
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engineWalter Liu
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowSid Anand
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowLaura Lorenz
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in ProductionRobert Sanders
 
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementIntro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementBurasakorn Sabyeying
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data ScienceErik Bernhardsson
 
Getting to Know Airflow
Getting to Know AirflowGetting to Know Airflow
Getting to Know AirflowRosanne Hoyem
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesDataWorks Summit
 

Tendances (20)

What is Spark
What is SparkWhat is Spark
What is Spark
 
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19Apache Airflow (incubating) NL HUG Meetup 2016-07-19
Apache Airflow (incubating) NL HUG Meetup 2016-07-19
 
Apache Airflow overview
Apache Airflow overviewApache Airflow overview
Apache Airflow overview
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 
Managing data workflows with Luigi
Managing data workflows with LuigiManaging data workflows with Luigi
Managing data workflows with Luigi
 
Apache airflow
Apache airflowApache airflow
Apache airflow
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engine
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
AIRflow at Scale
AIRflow at ScaleAIRflow at Scale
AIRflow at Scale
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
 
Airflow introduction
Airflow introductionAirflow introduction
Airflow introduction
 
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow managementIntro to Airflow: Goodbye Cron, Welcome scheduled workflow management
Intro to Airflow: Goodbye Cron, Welcome scheduled workflow management
 
Airflow for Beginners
Airflow for BeginnersAirflow for Beginners
Airflow for Beginners
 
Luigi presentation NYC Data Science
Luigi presentation NYC Data ScienceLuigi presentation NYC Data Science
Luigi presentation NYC Data Science
 
Getting to Know Airflow
Getting to Know AirflowGetting to Know Airflow
Getting to Know Airflow
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
 

Similaire à Workflow Engines for Hadoop

Five Years of EC2 Distilled
Five Years of EC2 DistilledFive Years of EC2 Distilled
Five Years of EC2 DistilledGrig Gheorghiu
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentationCyanny LIANG
 
Deploying Grid Services Using Apache Hadoop
Deploying Grid Services Using Apache HadoopDeploying Grid Services Using Apache Hadoop
Deploying Grid Services Using Apache HadoopAllen Wittenauer
 
Overview of PaaS: Java experience
Overview of PaaS: Java experienceOverview of PaaS: Java experience
Overview of PaaS: Java experienceIgor Anishchenko
 
Overview of PaaS: Java experience
Overview of PaaS: Java experienceOverview of PaaS: Java experience
Overview of PaaS: Java experienceAlex Tumanoff
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted CloudColin Charles
 
DrupalCampLA 2014 - Drupal backend performance and scalability
DrupalCampLA 2014 - Drupal backend performance and scalabilityDrupalCampLA 2014 - Drupal backend performance and scalability
DrupalCampLA 2014 - Drupal backend performance and scalabilitycherryhillco
 
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
Leonid Vasilyev  "Building, deploying and running production code at Dropbox"Leonid Vasilyev  "Building, deploying and running production code at Dropbox"
Leonid Vasilyev "Building, deploying and running production code at Dropbox"IT Event
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopEvans Ye
 
What we talk about when we talk about DevOps
What we talk about when we talk about DevOpsWhat we talk about when we talk about DevOps
What we talk about when we talk about DevOpsRicard Clau
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Cloudera, Inc.
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache ApexMaking sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache ApexApache Apex
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Using LuaJIT in mid-load web-projects
Using LuaJIT in mid-load web-projectsUsing LuaJIT in mid-load web-projects
Using LuaJIT in mid-load web-projectsAlexander Gladysh
 
Databases in the hosted cloud
Databases in the hosted cloudDatabases in the hosted cloud
Databases in the hosted cloudColin Charles
 

Similaire à Workflow Engines for Hadoop (20)

Five Years of EC2 Distilled
Five Years of EC2 DistilledFive Years of EC2 Distilled
Five Years of EC2 Distilled
 
Facebook Presto presentation
Facebook Presto presentationFacebook Presto presentation
Facebook Presto presentation
 
Deploying Grid Services Using Apache Hadoop
Deploying Grid Services Using Apache HadoopDeploying Grid Services Using Apache Hadoop
Deploying Grid Services Using Apache Hadoop
 
Overview of PaaS: Java experience
Overview of PaaS: Java experienceOverview of PaaS: Java experience
Overview of PaaS: Java experience
 
Overview of PaaS: Java experience
Overview of PaaS: Java experienceOverview of PaaS: Java experience
Overview of PaaS: Java experience
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted Cloud
 
DrupalCampLA 2014 - Drupal backend performance and scalability
DrupalCampLA 2014 - Drupal backend performance and scalabilityDrupalCampLA 2014 - Drupal backend performance and scalability
DrupalCampLA 2014 - Drupal backend performance and scalability
 
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
Leonid Vasilyev  "Building, deploying and running production code at Dropbox"Leonid Vasilyev  "Building, deploying and running production code at Dropbox"
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Oracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_databaseOracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_database
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
What we talk about when we talk about DevOps
What we talk about when we talk about DevOpsWhat we talk about when we talk about DevOps
What we talk about when we talk about DevOps
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache ApexMaking sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex
Making sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Using LuaJIT in mid-load web-projects
Using LuaJIT in mid-load web-projectsUsing LuaJIT in mid-load web-projects
Using LuaJIT in mid-load web-projects
 
Stackato v2
Stackato v2Stackato v2
Stackato v2
 
Databases in the hosted cloud
Databases in the hosted cloudDatabases in the hosted cloud
Databases in the hosted cloud
 

Dernier

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 

Dernier (20)

Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 

Workflow Engines for Hadoop

  • 1. Workflow Engines for Hadoop Joe Crobak @joecrobak NYC Data Engineering Meetup September 5, 2013 1
  • 3. Background • Devops/Infra for Hadoop • ~4 years with Hadoop • Have done two migrations from EMR to the colo. • Formerly Data/Analytics Infrastructure @ • worked with Apache Oozie and Luigi • Before that, Hadoop @ • worked with Azkaban 1.0 Disclosure: I’ve contributed to Luigi and Azkaban 1.0 3
  • 4. What is Apache Hadoop? 4
  • 5. What is a workflow? 5
  • 6. What is a workflow engine? 6
  • 8. Analytics / Data Warehousing • logs -> fact table(s). • database backups -> dimension tables. • Compute rollups/cubes. • Load data into a low-latency store (e.g. Redshift,Vertica, HBase). • Dashboarding & BI tools hit database. 8
  • 9. Analytics / Data Warehousing 9
  • 10. Analytics / Data Warehousing • What happens if there’s a failure? • rebuild the failed day • ... and any downstream datasets 10
  • 11. Hadoop-Driven Features • PeopleYou May Know • Amazon-style “People that buy this often by that” • SPAM detection • logs, databases -> machine learning / collaborative filtering • derivative datasets -> production database (often k/v store) 11
  • 13. Hadoop-Driven Features • What happens if there’s a failure? • possibly OK to skip a day. • Workflow tends to be self-contained, so you don’t need to rerun downstream. • Sanity check your data before pushing to production. 13
  • 14. Workflow Engine Evolution • Usually start with cron • at 01:00 import data • at 02:00 run really expensive query A • at 03:00 run query B, C, D • ... • This goes on until you have ~10 jobs or so. • It’s hard to debug and rerun. • Doesn’t scale to many people. 14
  • 15. Workflow Engine Evolution • Two possibilities: 1. “a workflow engine can’t be too hard, let’s write our own” 2. spend weeks evaluating all the options out there.Try to shoehorn your workflow into each one. 15
  • 16. Workflow Engine Considerations How do I... • Deploy and Upgrade • workflows and the workflow engine • Test • Detect Failure • Debug/find logs • Rebuild/backfill datasets • Load data to/from a RDBMS • Manage a set of similar tasks 16
  • 19. Oozie - the good • Great community support • Integrated with HUE, Cloudera Manager,Apache Ambari • HCatalog integration • SLA alerts (new in Oozie 4) • Ecosystem support: Pig, Hive, Sqoop, etc. • Very detailed documentation • Launcher jobs as map tasks 19
  • 20. Oozie - the bad • Launcher jobs as map tasks. • UI - but HUE, oozie-web (and good API) • Confusing object model (bundles, coordinators, workflows) - high barrier to entry. • Setup - extjs, hadoop proxy user, RDBMS. • XML! 20
  • 21. Oozie - the bad • Hello World in Oozie 21
  • 23. Azkaban - architecture Source: http://azkaban.github.io/azkaban2/overview.html 23
  • 24. Azkaban - the good • Great UI • DAG visualization • Task history • Easy access to log files • Plugin architecture • Pig, Hive, etc. Also, voldemort “build and push” integration • SLA Alerting • HDFS Browser • User Authentication/Authorization and auditing. • Reportal: https://github.com/azkaban/azkaban-plugins/pull/6 24
  • 25. 25
  • 26. Azkaban - the bad • Representing data dependencies • i.e. run job X when datasetY is available. • Executors run on separate workers, can be under-utilized (YARN anyone?). • Community - mostly just LinkedIn, and they rewrote it in isolation. • mailing list responsiveness is good. 26
  • 27. Azkaban - good and bad • Job definitions as java properties • Web uploads/deploy • Running jobs, scheduling jobs. • nearly impossible to integrate with configuration management 27
  • 30. Luigi - the good • Task definitions are code. • Tasks are idempotent. • Workflow defines data (and task) dependencies. • Growing community. • Easy to hack on the codebase (<6k LoC). • Postgres integration • Foursquare got this working with Redshift and Vertica. 30
  • 31. Luigi - the bad • Missing some key features, e.g. Pig support • but this is easy to add • Deploy situation is confusing (but easy to automate) • visualizer scaling • no persistent backing • JVM overhead 31
  • 32. Comparison matrix - part 1 Lang Code Complexity Frameworks Logs Community Docs oozie java high - 105k pig, hive, sqoop, mapreduce decentralized, map tasks Good - ASF in many distros excelle nt azkaban java moderate - 26k pig, hive, mapreduce UI-accessible few users, responsive on MLs good luigi python simple - 5.9k hive, postgres, scalding, python streaming decentral-ized on workers few users, responsive on github and MLs good 32
  • 33. Comparison matrix - part 2 property configuration Reruns Customizat ion (new job type) Testing User Auth oozie command-line, properties file, xml defaults oozie job - rerun difficult MiniOozie Kerberos, simple, custom azkaban bundled inside workflow zip, system defaults partial reruns in UI plugin architecture ? xml-based, custom luigi command-line, python ini file remove output, idempotency subclass luigi.Task python unittests linux-based 33
  • 34. Other workflow engines • Chronos • EMR • Mortar • Qubole • general purpose: • kettle, spring batch 34
  • 35. Qualities I like in a workflow engine • scripting language • you end up writing scripts to run your job anyway • custom logic, e.g. representing a dep on 7-days of data or run only every week • Less property propagation • Idempotency • WYSIWYG • It shouldn't be hard to take my existing job and move it to the workflow engine (it should just work). • Easy to hack on 35
  • 36. Less important • High availability (cold failover with manual intervention is OK) • Multiple cluster support • Security 36
  • 37. Best Practices • Version datasets • Backfilling datasets • Monitor the absence of a job running • Continuous deploy? 37
  • 38. Resources • Azkaban talk at Hadoop User Group: http://www.youtube.com/watch? v=rIUlh33uKMU • PyData talk on Luigi: http://vimeo.com/ 63435580 • Oozie talk at Hadoop user Group: http:// www.slideshare.net/mislam77/oozie-hug- may12 38
  • 39. Thanks! • Questions? • shameless plug: Subscribe to my newsletter: http://hadoopweekly.com 39