Oozie Evolution
Gateway to Hadoop Eco-System


             Mohammad Islam
Agenda

•    What is Oozie?
•    What is in the Next Release?
•    Challenges
•    Future Works
•    Q&A
Oozie in Hadoop Eco-System

[Figure: layered stack. Oozie sits above Pig, Sqoop, Hive, and HCatalog, which run on Map-Reduce and HDFS.]
Oozie : The Conductor
A Workflow Engine
•  Oozie executes workflows defined as a DAG of jobs
•  Job types include Map-Reduce, Pig, Hive, any script, custom Java code, etc.

[Figure: example workflow DAG with the nodes start, M/R job, fork, M/R streaming job, Pig job, join, decision (branches MORE and ENOUGH), M/R job, Java job, FS job, and end.]
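
To make the "DAG of jobs" idea concrete, here is a minimal Python sketch that groups the nodes of a small workflow into waves that could run in parallel. It is not Oozie code: the node names are hypothetical, loosely following the figure, and the decision branch is left out to keep the graph small.

```python
from graphlib import TopologicalSorter

# Hypothetical node names, loosely based on the example figure.
# Each key maps a node to the set of nodes it depends on.
workflow = {
    "mr_job": {"start"},
    "fork": {"mr_job"},
    "mr_streaming_job": {"fork"},
    "pig_job": {"fork"},
    "join": {"mr_streaming_job", "pig_job"},
    "end": {"join"},
}

def execution_waves(dag):
    """Return lists of nodes that can run in parallel, in dependency order."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # nodes whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)                 # mark the whole wave as finished
    return waves

waves = execution_waves(workflow)
for wave in waves:
    print(wave)
```

The two jobs between fork and join land in the same wave, which is the parallelism the fork/join pair expresses.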
A Scheduler
•  Oozie executes workflows based on:
   –  Time Dependency (Frequency)
   –  Data Dependency

[Figure: the Oozie Client talks to the Oozie Server through the WS API. Inside the server, the Oozie Coordinator checks data availability and the Oozie Workflow runs jobs on Hadoop.]
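
The two triggers can be sketched together in a few lines of Python: generate one nominal run time per frequency interval, then keep only the runs whose input data is reported as available. The function names and the availability check are assumptions made for illustration, not Oozie's Coordinator implementation.

```python
from datetime import datetime, timedelta

def nominal_times(start, end, frequency):
    """Yield the nominal time of each scheduled run in [start, end)."""
    current = start
    while current < end:
        yield current
        current += frequency

def ready_runs(start, end, frequency, data_available):
    """Return the nominal times whose input data is reported available."""
    return [t for t in nominal_times(start, end, frequency) if data_available(t)]

# Hypothetical availability check: pretend data exists only for the first two hours.
available = {datetime(2011, 11, 1, 0), datetime(2011, 11, 1, 1)}
runs = ready_runs(
    datetime(2011, 11, 1, 0),
    datetime(2011, 11, 1, 4),
    timedelta(hours=1),
    lambda t: t in available,
)
print(runs)
```

Four runs are scheduled by time, but only the two with available data are ready to start.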
REST-API for Hadoop Components

•  Direct access to Hadoop components
  –  Emulates the command line through a REST API.
•  Supported Products:
  –  Pig
  –  Map Reduce
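
As a rough sketch of what such a submission could look like, the Python below builds (but does not send) a request for a Pig job. It assumes the proxy-submission endpoint has the form `/v1/jobs?jobtype=pig` and accepts a Hadoop-style configuration XML body. The server address, helper name, and property names are illustrative and should be checked against the Web Services API documentation of the release in use.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import escape

def proxy_submit_request(base_url, job_type, properties):
    """Build (url, body) for a proxy job submission; nothing is sent."""
    url = f"{base_url}/v1/jobs?{urlencode({'jobtype': job_type})}"
    props = "".join(
        f"<property><name>{escape(k)}</name><value>{escape(v)}</value></property>"
        for k, v in properties.items()
    )
    body = f'<?xml version="1.0" encoding="UTF-8"?><configuration>{props}</configuration>'
    return url, body

url, body = proxy_submit_request(
    "http://localhost:11000/oozie",  # assumed server address
    "pig",
    {
        "user.name": "alice",                             # illustrative value
        "oozie.pig.script": "A = load 'input'; dump A;",  # assumed property name
    },
)
print(url)
print(body)
```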
Three Questions …
 Do you need Oozie?


Q1 : Do you have multiple jobs with
     dependency?
Q2 : Does your job start based on time or data
     availability?
Q3 : Do you need monitoring and operational
     support for your jobs?
   If any one of your answers is YES,
   then you should consider Oozie!
What Oozie is NOT

•  Oozie is not a resource scheduler

•  Oozie is not for off-grid scheduling
   o  Note: Off-grid execution is possible through the SSH action.

•  If you want to submit your job occasionally,
   Oozie is an option.
    o  Oozie provides REST API based submission.
Oozie in Apache
Main Contributors
Oozie in Apache

•  Y! internal usages:
  –  Total number of users: 375
  –  Total number of processed jobs ≈ 750K/month
•  External downloads:
  –  2500+ in last year from GitHub
  –  A large number of downloads maintained by
     3rd party packaging.
Oozie Usages Contd.

•  User Community:
  –  Membership
    •  Y! internal – 286
    •  External – 163
  –  Messages (approximate)
    •  Y! internal – 7/day
    •  External – 8/day
Next Release …

•  Integration with Hadoop 0.23

•  HCatalog integration
  –  Non-polling approach
Usability

•    Script Action
•    Distcp Action
•    Suspend Action
•    Mini-Oozie for CI
     –  Like Mini-cluster
•  Support multiple versions
     –  Pig, Distcp, Hive etc.
Reliability

•  Auto-retry at the workflow (WF) action level

•  High-Availability
  –  Hot-Warm through ZooKeeper
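
Action-level auto-retry can be pictured as a retry loop around a single action. The Python sketch below is a generic illustration of that idea; the function name and parameters are assumptions, not Oozie's actual settings.

```python
import time

def run_with_retry(action, max_retries=3, retry_interval=0.0):
    """Run `action`; on an exception, retry up to `max_retries` more times."""
    attempt = 0
    while True:
        try:
            return action()
        except Exception:
            if attempt >= max_retries:
                raise                   # retries exhausted: surface the failure
            attempt += 1
            time.sleep(retry_interval)  # wait before the next attempt

# Hypothetical action that fails twice with a transient error, then succeeds.
calls = {"count": 0}

def flaky_action():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "OK"

result = run_with_retry(flaky_action, max_retries=3)
print(result, calls["count"])
```

Only the failing action is re-run, so earlier nodes of the workflow do not have to be repeated.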
Manageability

•  Email action

•  Query Pig Stats/Hadoop Counters
  –  Runtime control of Workflow based on stats
  –  Application-level control using the stats
Challenges : Queue Starvation

•  Which Queue?
  –  Not a Hadoop queue issue.
  –  Oozie internal queue to process the Oozie
     sub-tasks.
  –  Oozie’s main execution engine.
•  User Problem :
  –  Killing or suspending a job takes a very long time.
Challenges : Queue Starvation
Technical Problem:
•  Before execution, every task acquires a lock on the job id.
•  Special high-priority tasks (such as Kill or Suspend) cannot get the lock and therefore starve.


[Figure: tasks J1, J1, J2, J1(H), J2 wait in the queue while a J1 task is executing.]

Starvation for the high-priority task!
Challenges : Queue Starvation
Resolution:
•  Add the high-priority task to both the interrupt list and the normal queue.
•  Before de-queuing, check whether the interrupt list holds a task for the same job id. If it does, execute that task first.



[Figure: the queue holds J1, J1, J2, J1(H), J2 and the interrupt list holds J1(H). The de-queue step finds the J1(H) task in the interrupt list.]
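
The resolution can be sketched in Python as a FIFO queue paired with a per-job interrupt list. This is an illustrative model with hypothetical class and method names, not Oozie's internal queue. It also assumes that each (job id, task name) pair is unique, so the normal-queue copy of an already executed interrupt can be recognized and dropped.

```python
from collections import deque

class TaskQueue:
    def __init__(self):
        self.queue = deque()   # normal FIFO queue of (job_id, name)
        self.interrupts = {}   # job_id -> pending high-priority tasks
        self.done = set()      # high-priority tasks already executed

    def add(self, job_id, name, high_priority=False):
        task = (job_id, name)
        self.queue.append(task)  # every task goes into the normal queue
        if high_priority:
            self.interrupts.setdefault(job_id, deque()).append(task)

    def next_task(self):
        """Return the next task to execute, or None if the queue is empty."""
        while self.queue:
            job_id, _ = self.queue[0]
            pending = self.interrupts.get(job_id)
            if pending:
                task = pending.popleft()
                self.done.add(task)
                return task              # interrupt runs before the normal head
            task = self.queue.popleft()
            if task in self.done:
                self.done.discard(task)  # normal-queue copy already ran: skip it
                continue
            return task
        return None

q = TaskQueue()
q.add("J1", "a")
q.add("J1", "b")
q.add("J2", "c")
q.add("J1", "kill", high_priority=True)
q.add("J2", "d")

order = []
while (task := q.next_task()) is not None:
    order.append(task)
print(order)
```

In this run the kill for J1 executes first even though it was enqueued fourth, which is the behavior the slide asks for.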
Oozie Futures

•  Easy adoption
  –  Modeling tool
  –  IDE integration
  –  Modular Configurations
•  Allow job notification through JMS
•  Event-based data processing
•  Prioritization
  –  By user, system level.
Take Away ..

•  Oozie is
  –  In Apache!
  –  Reliable and feature-rich.
  –  Growing fast.
Q&A




                  Mohammad K Islam
               kamrul@yahoo-inc.com
      http://incubator.apache.org/oozie/
Who needs Oozie?

•  Multiple jobs that have sequential/conditional/parallel dependencies
•  Need to run a job/workflow periodically.
•  Need to launch a job when data is available.
•  Operational requirements:
  –  Easy monitoring
  –  Reprocessing
  –  Catch-up
Challenges : Queue Starvation
Problem:
•  Consider a queue with tasks of types T1 and T2. Max concurrency = 2.
•  An over-provisioned task (marked in red) is pushed back to the queue.
•  At high load, it is penalized in favor of later-arriving tasks of the same type.


[Figure: the queue holds T1, T2, T1, T1, T1, T2, T1. The concurrency counters show C(T1) stepping through 0, 1, 2 and C(T2) through 0, 1.]

Starvation! T1 cannot execute and is pushed to head of queue
Challenges : Queue Starvation
Resolution:
            •  Before de-queuing any task, check its concurrency.
            •  If violated, skip and get the next task.


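[Figure: the queue holds T1, T2, T1, T1, T1, T2, T1. The concurrency counters show C(T1) stepping through 0, 1, 2 and C(T2) through 0, 1, 2. Captions: "Enqueue T2 now"; "T1 cannot execute, so skip by one node to front"; "T1 now executes normally".]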
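
A minimal Python sketch of the "check concurrency, skip, take the next task" step follows. The function name, the plain list used as the queue, and the per-type running counters are assumptions made for illustration.

```python
def dequeue_runnable(queue, running, max_concurrency):
    """Pop the first task whose type is below the concurrency limit."""
    for index, task_type in enumerate(queue):
        if running.get(task_type, 0) < max_concurrency:
            queue.pop(index)  # skipped tasks keep their queue position
            running[task_type] = running.get(task_type, 0) + 1
            return task_type
    return None               # every queued type is at its limit

queue = ["T1", "T2", "T1"]
running = {"T1": 2, "T2": 1}  # T1 is already at the limit of 2

first = dequeue_runnable(queue, running, max_concurrency=2)   # skips T1, takes T2
second = dequeue_runnable(queue, running, max_concurrency=2)  # nothing can run
running["T1"] -= 1                                            # a T1 task finishes
third = dequeue_runnable(queue, running, max_concurrency=2)   # T1 runs now
print(first, second, third)
```

Because a skipped task is never pushed back behind later arrivals, it keeps its place and runs as soon as its type drops below the limit.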
