Everything that you ever wanted to know about Oozie, but were afraid to ask

B. Lublinsky, A. Yakubovich
Apache Oozie
• Oozie is a workflow/coordination system to
  manage Apache Hadoop jobs.
• A single Oozie server implements all four
  functional Oozie components:
  – Oozie workflow
  – Oozie coordinator
  – Oozie bundle
  – Oozie SLA.
Main components

[Architecture diagram] The Oozie server hosts three nested components: bundles contain coordinators, and coordinators (driven by time-condition and data-condition monitoring) launch workflows, whose wf logic is a graph of actions. Clients reach the server through the WS API, the Oozie command-line interface, or 3rd-party applications. Workflow definitions and run-time states, together with the Oozie shared libraries, are kept in HDFS; job submission and monitoring go against Hadoop MapReduce.
Oozie workflow
Workflow Language

Flow-control node   XML element type         Description
Decision            workflow:decision        expresses "switch-case" logic
Fork                workflow:fork            splits one path of execution into multiple concurrent paths
Join                workflow:join            waits until every concurrent execution path of the preceding fork node arrives at it
Kill                workflow:kill            forces the workflow job to abort itself

Action node         XML element type         Description
java                workflow:java            invokes the main() method of the specified Java class
fs                  workflow:fs              manipulates files and directories in HDFS; supports the commands move, delete, mkdir
MapReduce           workflow:map-reduce      starts a Hadoop MapReduce job: a Java MR job, a streaming job, or a pipes job
Pig                 workflow:pig             runs a Pig job
Sub-workflow        workflow:sub-workflow    runs a child workflow job
Hive *              workflow:hive            runs a Hive job
Shell *             workflow:shell           runs a shell command
ssh *               workflow:ssh             starts a shell command on a remote machine as a remote secure shell
Sqoop *             workflow:sqoop           runs a Sqoop job
Email *             workflow:email           sends emails from an Oozie workflow application
DistCp              (under development at Yahoo)
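To make the control nodes concrete, here is a minimal hPDL sketch of a fork/join around two actions; element names follow the workflow schema, while the app name, branch names, and omitted action bodies are placeholders:

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.2">
  <start to="forking"/>
  <fork name="forking">
    <path start="mr-branch"/>
    <path start="pig-branch"/>
  </fork>
  <action name="mr-branch">
    <!-- map-reduce action body elided -->
    <ok to="joining"/>
    <error to="fail"/>
  </action>
  <action name="pig-branch">
    <!-- pig action body elided -->
    <ok to="joining"/>
    <error to="fail"/>
  </action>
  <join name="joining" to="end"/>
  <kill name="fail">
    <message>Workflow failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```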
Workflow actions
• Oozie workflow supports two types of actions:
    – Synchronous, executed inside the Oozie runtime
    – Asynchronous, executed as a MapReduce job
[Sequence diagram: starting an asynchronous (java) action]
1. ActionStartCommand obtains the workflow from the WorkflowStore (workflow := getWorkflow())
2. …and the action to execute (action := getAction())
3. It initializes an ActionExecutorContext (context := init())
4. …and gets the matching executor, here JavaActionExecutor, from Services (executor := get())
5. It calls start() on the executor
6. The executor submits a launcher job (submitLauncher())
7. …obtaining a JobClient (jobClient := get())
8. …and submitting the job (runningJob := submit())
9. Finally, setStartData() records the start on the context
Workflow lifecycle

[State diagram] PREP → RUNNING or KILLED; RUNNING → SUSPENDED, SUCCEEDED, FAILED, or KILLED
Oozie execution console
Extending Oozie workflow
• Oozie provides a "minimal" workflow language, which
  contains only a handful of control and action nodes.
• Oozie supports a very elegant extensibility mechanism:
  custom action nodes. Custom action nodes allow extending
  Oozie's language with additional actions (verbs).
• Creating a custom action requires:
   – a Java action implementation, which extends the
     ActionExecutor class;
   – an XML schema defining the action's configuration
     parameters;
   – packaging the Java implementation and configuration schema
     into an action JAR, which has to be added to the Oozie WAR;
   – extending oozie-site.xml to register the custom
     executor with the Oozie runtime.
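The oozie-site.xml registration looks roughly like this; the executor class name and schema file are hypothetical, and the property names are the ones Oozie uses for executor and workflow-schema extensions:

```xml
<property>
  <name>oozie.service.ActionService.executor.ext.classes</name>
  <value>com.example.oozie.MyCustomActionExecutor</value>
</property>
<property>
  <name>oozie.service.SchemaService.wf.ext.schemas</name>
  <value>my-custom-action-0.1.xsd</value>
</property>
```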
Oozie Workflow Client
• Oozie provides an easy way to integrate with enterprise
  applications through its client APIs. It provides two
  types of APIs:
• REST HTTP API
   A number of HTTP requests:
   • info requests (job status, job configuration)
   • job management (submit, start, suspend, resume, kill)
   Example: job-definition info request
       GET /oozie/v0/job/job-ID?show=definition
• Java API: package org.apache.oozie.client
   – OozieClient
       start(), submit(), run(), reRunXXX(), resume(), kill(), suspend()
   – WorkflowJob, WorkflowAction
   – CoordinatorJob, CoordinatorAction
   – SLAEvent
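As an illustration of the REST call above, a small sketch that builds the job-definition request URL; the base URL and job id are placeholders, and actually sending the GET is only indicated in comments:

```java
public class OozieRestSketch {
    // Builds the v0 job-definition info request path shown above.
    static String jobDefinitionUrl(String baseUrl, String jobId) {
        return baseUrl + "/oozie/v0/job/" + jobId + "?show=definition";
    }

    public static void main(String[] args) {
        String url = jobDefinitionUrl("http://oozie-host:11000", "0000001-wf");
        System.out.println(url);
        // To actually issue the request (sketch):
        // HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
        // c.setRequestMethod("GET");
        // ... read the workflow definition XML from c.getInputStream()
    }
}
```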
Oozie workflow: the good, the bad and the ugly
• Good
   – Nice integration with the Hadoop ecosystem, making it easy to build
     processes with synchronized execution of multiple MapReduce,
     Hive, Pig, etc. jobs
   – Nice UI for tracking execution progress
   – Simple APIs for integration with other applications
   – Simple extensibility APIs
• Bad
   – Processes have to be expressed directly in hPDL with no visual support
   – No support for uber JARs (but we added our own)
• Ugly
   – Static forking (but you can regenerate the workflow and invoke it on the fly)
   – No support for loops
Oozie Coordinator
Coordinator language

Element type     Description                                          Attributes and sub-elements
coordinator-app  top-level element in a coordinator instance          frequency, start, end
controls         specifies the execution policy for the coordinator   timeout (actions), concurrency (actions),
                 and its elements (workflow actions)                  execution order (workflow instances)
action           required singular element specifying the             workflow name
                 associated workflow; the jobs specified in the
                 workflow consume and produce dataset instances
datasets         collection of data referred to by a logical name;
                 datasets serve to specify data dependencies
                 between workflow instances
input event      specifies the input conditions (in the form of
                 present data sets) required to execute a
                 coordinator action
output event     specifies the dataset that a coordinator action
                 should produce
Coordinator lifecycle
Oozie Bundle
Bundle lifecycle

[State diagram] PREP → PREPSUSPENDED, PREPPAUSED, RUNNING, or KILLED; RUNNING → SUSPENDED, PAUSED, SUCCEEDED, FAILED, or KILLED
Oozie SLA
SLA Navigation

[Table diagram] SLA_EVENT (event_id, alert_contact, alert_frequency, sla_id, …) joins through sla_id to COORD_JOBS (id, app_name, app_path, …) and COORD_ACTIONS (id, action_number, action_xml, external_id, …), which link via external_id to WF_JOBS (id, app_name, app_path, …) and WF_ACTIONS (id, conf, console_url, …)
Using Probes to analyze/monitor Places

• Select probe data for specified time/location
• Validate – Filter - Transform probe data
• Calculate statistics on available probe data
• Distribute data per geo-tiles
• Calculate place statistics (e.g. attendance index)
-------------------------------------------------------------
If an exception condition happens, report failure
If all steps succeed, report success
Workflow as acyclic graph
Workflow – fragment 1
Workflow – fragment 2
Oozie tips and tricks
Configuring workflow
• Oozie provides 3 overlapping mechanisms to configure a workflow:
  config-default.xml, the job properties file, and job arguments that can
  be passed to Oozie as part of a command-line invocation.
• Oozie processes these three parameter sets as follows:
    – all parameters from the command-line invocation are used first
    – for remaining unresolved parameters, the job config is used
    – config-default.xml is used for everything else
• Although the documentation does not describe clearly when to use
  which, the overall recommendation is as follows:
    – use config-default.xml for parameters that never change for a
      given workflow
    – use job properties for parameters that are common for a given
      deployment of a workflow
    – use command-line arguments for parameters that are specific to
      a given workflow invocation
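The precedence above can be sketched with chained java.util.Properties defaults; the names here are illustrative, not Oozie API. Command-line values shadow job properties, which in turn shadow config-default values:

```java
import java.util.Properties;

public class ConfigPrecedence {
    // Resolution order: command line > job properties > config-default.xml
    static Properties resolve(Properties configDefault, Properties jobProps,
                              Properties cmdLine) {
        Properties jobLayer = new Properties(configDefault); // falls back to defaults
        jobLayer.putAll(jobProps);
        Properties top = new Properties(jobLayer);           // falls back to job layer
        top.putAll(cmdLine);
        return top;
    }
}
```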
Accessing and storing process variables
• Accessing
  – through the arguments in the Java action's main()
• Storing

     String ooziePropFileName = System.getProperty("oozie.action.output.properties");
     OutputStream os = new FileOutputStream(new File(ooziePropFileName));
     Properties props = new Properties();
     props.setProperty(key, value);
     props.store(os, "");
     os.close();
Validating data presence
• Oozie provides two possible approaches to validating
  resource file(s) presence:
   – the Oozie coordinator's input events based on a data set:
     technically the simplest approach, but it does not provide the
     more complex decision support that might be required; it
     either runs the corresponding workflow or it does not
   – a custom java node inside the Oozie workflow: allows extending
     the decision logic, e.g. sending notifications about data absence,
     running execution on partial data under certain timing conditions, etc.
• Additional configuration parameters for the Oozie coordinator,
  for example the ability to wait for file arrival, can expand
  the coordinator's usage.
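The first approach, a dataset-driven input event, looks roughly like this; element names follow the coordinator schema, while the app name, dates, and HDFS paths are placeholders:

```xml
<coordinator-app name="probe-coord" frequency="${coord:days(1)}"
                 start="2012-01-01T00:00Z" end="2012-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <datasets>
    <dataset name="probes" frequency="${coord:days(1)}"
             initial-instance="2012-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs:///data/probes/${YEAR}/${MONTH}/${DAY}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <!-- the workflow runs only when today's probe data is present -->
    <data-in name="input" dataset="probes">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/probe-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```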
Invoking MapReduce jobs
• Oozie provides two different ways of invoking a MapReduce
  job: the MapReduce action and the java action.
• Invoking a MapReduce job with a java action is somewhat
  similar to invoking the job with the Hadoop command line
  from an edge node: you specify a driver as the class for the
  java action, and Oozie invokes the driver. This approach
  has two main advantages:
   – the same driver class can be used both for running the
     MapReduce job from an edge node and as a java action in an
     Oozie process
   – a driver provides a convenient place for executing additional
     code, for example clean-up required for the MapReduce execution
• The driver requires a proper shutdown hook to ensure that
  no MapReduce jobs are left lingering.
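A sketch of such a hook; the actual kill call on the running Hadoop job is only indicated in a comment, since it depends on the Hadoop client API, and the names here are illustrative:

```java
public class DriverShutdown {
    // Registers a JVM shutdown hook that would kill a still-running MR job
    // if the driver exits (e.g. the Oozie launcher is killed).
    static Thread installKillHook(Runnable killJob) {
        Thread hook = new Thread(killJob, "mr-kill-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        Thread hook = installKillHook(() -> {
            // in a real driver: runningJob.killJob();
            System.out.println("killing lingering MR job");
        });
        // ... submit and wait for the MR job here ...
        // On normal completion, deregister the hook so nothing gets killed:
        Runtime.getRuntime().removeShutdownHook(hook);
    }
}
```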
Implementing predefined looping and forking
• hPDL is an XML document with a well-defined schema.
• This means the actual workflow can easily be
  manipulated using JAXB objects, which can be
  generated from the hPDL schema using the xjc compiler.
• This means we can create the complete workflow
  programmatically, based on a calculated number
  of fork branches, or implement loops as
  repeated actions.
• The other option is creating a template process
  and modifying it based on calculated parameters.
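A minimal sketch of generating a fork with a computed number of branches; plain DOM is used here instead of the xjc-generated JAXB classes, and the element and branch names are illustrative:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ForkGenerator {
    // Emits a <fork> element with n <path> children, as hPDL would need.
    static String forkXml(int n) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element fork = doc.createElement("fork");
            fork.setAttribute("name", "forking");
            doc.appendChild(fork);
            for (int i = 0; i < n; i++) {
                Element path = doc.createElement("path");
                path.setAttribute("start", "branch-" + i);
                fork.appendChild(path);
            }
            StringWriter out = new StringWriter();
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```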
Oozie client security (or lack thereof)
• By default the Oozie client reads the client's identity from the
  local machine's OS and passes it to the Oozie server,
  which uses this identity for MR job invocation.
• Impersonation can be implemented by overriding the
  OozieClient class's createConfiguration method, where
  client variables can be set through a new constructor.

         public Properties createConfiguration() {
             Properties conf = new Properties();
             if (user == null)
                 conf.setProperty(USER_NAME, System.getProperty("user.name"));
             else
                 conf.setProperty(USER_NAME, user);
             return conf;
         }
Uber JARs with Oozie
• An uber JAR contains resources: other JARs, .so libraries, ZIP files.

[Diagram] The Oozie server starts a launcher java action from the uber JAR's launcher classes. The launcher: unpacks the resources into the current uber-JAR directory, sets an inverse classloader, invokes the MR driver passing the arguments, and sets a 'wait for complete' shutdown hook while the mappers run.

<java>
   …
  <main-class>${wfUberLauncher}</main-class>
  <arg>-appStart=${wfAppMain}</arg>
   …
</java>

Advantages of Hiring UIUX Design Service Providers for Your Business
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Everything you wanted to know, but were afraid to ask about Oozie

  • 1. Everything that you ever wanted to know about Oozie, but were afraid to ask. B Lublinsky, A Yakubovich
  • 2. Apache Oozie • Oozie is a workflow/coordination system to manage Apache Hadoop jobs. • A single Oozie server implements all four functional Oozie components: – Oozie workflow – Oozie coordinator – Oozie bundle – Oozie SLA.
  • 3. Main components [architecture diagram] The Oozie server hosts the functional components: bundles contain coordinators, which in turn run workflows. Third-party applications reach the server through the WS API; the Oozie command line interface talks to the same server. Coordinators monitor time and data conditions; workflow logic handles job submission and monitoring against Hadoop (MapReduce over data). Workflow definitions and states, along with the Oozie shared libraries, are stored in HDFS.
  • 5. Workflow Language
    Flow-control nodes:
      – Decision (workflow:DECISION): expresses “switch-case” logic
      – Fork (workflow:FORK): splits one path of execution into multiple concurrent paths
      – Join (workflow:JOIN): waits until every concurrent execution path of a previous fork node arrives at it
      – Kill (workflow:KILL): forces a workflow job to kill (abort) itself
    Action nodes:
      – java (workflow:JAVA): invokes the main() method of the specified Java class
      – fs (workflow:FS): manipulates files and directories in HDFS; supports the commands move, delete, mkdir
      – MapReduce (workflow:MAP-REDUCE): starts a Hadoop map/reduce job; can be a Java MR job, a streaming job or a pipes job
      – Pig (workflow:PIG): runs a Pig job
      – Sub workflow (workflow:SUB-WORKFLOW): runs a child workflow job
      – Hive * (workflow:HIVE): runs a Hive job
      – Shell * (workflow:SHELL): runs a shell command
      – ssh * (workflow:SSH): starts a shell command on a remote machine as a remote secure shell
      – Sqoop * (workflow:SQOOP): runs a Sqoop job
      – Email * (workflow:EMAIL): sends emails from an Oozie workflow application
      – Distcp: under development (Yahoo)
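To illustrate how the flow-control nodes fit together, here is a minimal hPDL sketch of a fork/join with a kill node. The names and the fs action bodies are illustrative placeholders, not taken from the deck:

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="forking"/>
  <!-- fork splits execution into two concurrent paths -->
  <fork name="forking">
    <path start="first-branch"/>
    <path start="second-branch"/>
  </fork>
  <action name="first-branch">
    <fs><mkdir path="${nameNode}/tmp/branch1"/></fs>
    <ok to="joining"/>
    <error to="fail"/>
  </action>
  <action name="second-branch">
    <fs><mkdir path="${nameNode}/tmp/branch2"/></fs>
    <ok to="joining"/>
    <error to="fail"/>
  </action>
  <!-- join waits for every path of the fork before continuing -->
  <join name="joining" to="end"/>
  <kill name="fail"><message>Workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
```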
  • 6. Workflow actions
    – Oozie workflow supports two types of actions:
      – Synchronous, executed inside the Oozie runtime
      – Asynchronous, executed as a Map Reduce job
    [sequence diagram: ActionStartCommand obtains the workflow and action from WorkflowStore, initializes an ActionExecutorContext, gets the JavaActionExecutor from Services and calls start(); the executor submits the launcher via a JobClient obtained from Services, submit() returns the running job, and setStartData() records the start]
  • 7. Workflow lifecycle [state diagram] A workflow moves through the states PREP, RUNNING and SUSPENDED, ending in one of SUCCEEDED, KILLED or FAILED.
  • 9. Extending Oozie workflow
    – Oozie provides a “minimal” workflow language, which contains only a handful of control and action nodes.
    – Oozie supports a very elegant extensibility mechanism: custom action nodes. Custom action nodes make it possible to extend Oozie’s language with additional actions (verbs).
    – Creating a custom action requires implementing the following:
      – a Java action implementation, which extends the ActionExecutor class
      – an XML schema for the action, defining its configuration parameters
      – packaging of the Java implementation and configuration schema into an action jar, which has to be added to the Oozie war
      – extending oozie-site.xml to register the custom executor with the Oozie runtime
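The second item in the list above, the action’s XML schema, can be sketched as follows. The action name `myaction`, its namespace and its two parameters are hypothetical examples, chosen only to show the shape such a schema takes:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:my="uri:custom:myaction:0.1"
           targetNamespace="uri:custom:myaction:0.1"
           elementFormDefault="qualified">
  <!-- the element a workflow author writes inside <action> -->
  <xs:element name="myaction" type="my:ACTION"/>
  <!-- configuration parameters of the custom action -->
  <xs:complexType name="ACTION">
    <xs:sequence>
      <xs:element name="host" type="xs:string"/>
      <xs:element name="command" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```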
  • 10. Oozie Workflow Client
    – Oozie provides an easy way to integrate with enterprise applications through its client APIs. It provides two types of APIs:
    – REST HTTP API: a set of HTTP requests
      – info requests (job status, job configuration)
      – job management (submit, start, suspend, resume, kill)
      – example, job definition info request: GET /oozie/v0/job/job-ID?show=definition
    – Java API, package org.apache.oozie.client
      – OozieClient: start(), submit(), run(), reRunXXX(), resume(), kill(), suspend()
      – WorkflowJob, WorkflowAction
      – CoordinatorJob, CoordinatorAction
      – SLAEvent
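As a small sketch of the REST API shape, the following self-contained Java snippet builds the job-definition request URL shown above. The host name and job id are placeholders; port 11000 is assumed to be the usual Oozie default:

```java
import java.net.URL;

public class OozieRestExample {
    // Compose the REST call that fetches a workflow job's definition.
    static String definitionUrl(String host, int port, String jobId) {
        return String.format("http://%s:%d/oozie/v0/job/%s?show=definition",
                             host, port, jobId);
    }

    public static void main(String[] args) throws Exception {
        // "oozie-host" and the job id below are illustrative placeholders.
        URL url = new URL(definitionUrl("oozie-host", 11000,
                                        "0000001-130606115200591-oozie-W"));
        // Issuing an HTTP GET against this URL returns the workflow XML.
        System.out.println(url);
    }
}
```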
  • 11. Oozie workflow: the good, the bad and the ugly
    – Good
      – Nice integration with the Hadoop ecosystem, making it easy to build processes encompassing synchronized execution of multiple Map Reduce, Hive, Pig, etc. jobs
      – Nice UI for tracking execution progress
      – Simple APIs for integration with other applications
      – Simple extensibility APIs
    – Bad
      – The process has to be expressed directly in hPDL, with no visual support
      – No support for uber jars (but we added our own)
    – Ugly
      – Static forking (but you can regenerate the workflow and invoke it on the fly)
      – No support for loops
  • 13. Coordinator language
    – coordinator-app: top-level element in a coordinator instance (attributes: frequency, start, end)
    – controls: specify the execution policy for the coordinator and its elements (workflow actions): timeout (actions), concurrency (actions), execution order (workflow instances)
    – action: required singular element specifying the associated workflow (sub-element: workflow name); the jobs specified in the workflow consume and produce dataset instances
    – datasets: collection of data referred to by a logical name; datasets serve to specify data dependencies between workflow instances
    – input event: specifies the input conditions (in the form of present data sets) that are required in order to execute a coordinator action
    – output event: specifies the dataset that should be produced by a coordinator action
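The elements above combine as in the following coordinator sketch. Names, paths and dates are illustrative placeholders; the element names follow the Oozie coordinator schema:

```xml
<coordinator-app name="demo-coord" frequency="${coord:days(1)}"
                 start="2013-01-01T00:00Z" end="2013-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <controls>
    <timeout>120</timeout>
    <concurrency>1</concurrency>
  </controls>
  <datasets>
    <!-- logical name for daily input data -->
    <dataset name="probes" frequency="${coord:days(1)}"
             initial-instance="2013-01-01T00:00Z" timezone="UTC">
      <uri-template>${nameNode}/data/probes/${YEAR}${MONTH}${DAY}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <!-- the action fires only when today's dataset instance is present -->
    <data-in name="input" dataset="probes">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${nameNode}/apps/demo-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```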
  • 16. Bundle lifecycle [state diagram] A bundle moves through the states PREP, PREPSUSPENDED, PREPPAUSED, RUNNING, SUSPENDED and PAUSED, ending in one of SUCCEEDED, KILLED or FAILED.
  • 18. SLA Navigation [table diagram] SLA data is navigated through the Oozie database tables: COORD_JOBS (id, app_name, app_path, …), WF_JOBS (id, app_name, app_path, …), SLA_EVENT (event_id, alert_contact, alert_frequency, …, sla_id), COORD_ACTIONS (id, action_number, action_xml, external_id, …) and WF_ACTIONS (id, conf, console_url, …).
  • 20. Using Probes to analyze/monitor Places
    – Select probe data for a specified time/location
    – Validate, filter and transform probe data
    – Calculate statistics on available probe data
    – Distribute data per geo-tile
    – Calculate place statistics (e.g. attendance index)
    – If an exception condition happens, report failure; if all steps succeed, report success
  • 24. Oozie tips and tricks
  • 25. Configuring workflow
    – Oozie provides 3 overlapping mechanisms to configure a workflow: config-default.xml, the job properties file, and job arguments that can be passed to Oozie as part of a command-line invocation.
    – Oozie resolves these three sets of parameters as follows:
      – use all of the parameters from the command-line invocation
      – for the remaining unresolved parameters, use the job properties
      – use config-default.xml for everything else
    – Although the documentation does not clearly describe when to use which, the overall recommendation is:
      – use config-default.xml for parameters that never change for a given workflow
      – use job properties for parameters that are common for a given deployment of a workflow
      – use command-line arguments for parameters that are specific to a given workflow invocation
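The three-layer resolution above can be sketched with plain java.util.Properties. This is a model of the precedence rules, not Oozie’s actual implementation; the property names are illustrative:

```java
import java.util.Properties;

public class ConfigResolution {
    // Merge the three layers: command line wins, then job properties,
    // then config-default.xml values.
    static Properties resolve(Properties defaults, Properties jobProps,
                              Properties cmdLine) {
        Properties merged = new Properties();
        merged.putAll(defaults);  // config-default.xml: lowest priority
        merged.putAll(jobProps);  // job properties override defaults
        merged.putAll(cmdLine);   // command line overrides everything
        return merged;
    }

    public static void main(String[] args) {
        Properties d = new Properties(), j = new Properties(), c = new Properties();
        d.setProperty("queueName", "default");
        j.setProperty("queueName", "analytics");
        c.setProperty("oozie.wf.application.path", "hdfs:///apps/demo-wf");
        System.out.println(resolve(d, j, c).getProperty("queueName")); // analytics
    }
}
```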
  • 26. Accessing and storing process variables
    – Accessing: through the arguments of the java action’s main()
    – Storing: write a properties file to the location Oozie passes in via the oozie.action.output.properties system property:
      String ooziePropFileName = System.getProperty("oozie.action.output.properties");
      OutputStream os = new FileOutputStream(new File(ooziePropFileName));
      Properties props = new Properties();
      props.setProperty(key, value);
      props.store(os, "");
      os.close();
  • 27. Validating data presence
    – Oozie provides two possible approaches to validating the presence of resource file(s):
      – Oozie coordinator’s input events based on a data set: technically the simplest implementation approach, but it does not provide the more complex decision support that might be required; it either runs the corresponding workflow or not
      – a custom java node inside the Oozie workflow: allows extending the decision logic, for example sending notifications about data absence, or running on partial data under certain timing conditions
    – Additional configuration parameters for the Oozie coordinator, for example the ability to wait for files to arrive, could expand its usage.
  • 28. Invoking Map Reduce jobs
    – Oozie provides two different ways of invoking a Map Reduce job: the MapReduce action and the java action.
    – Invoking a Map Reduce job with the java action is similar to invoking the job with the Hadoop command line from the edge node: you specify a driver as the class for the java action, and Oozie invokes the driver. This approach has two main advantages:
      – the same driver class can be used both for running the Map Reduce job from an edge node and as a java action in an Oozie process
      – a driver provides a convenient place for executing additional code, for example clean-up required for the Map Reduce execution
    – The driver requires a proper shutdown hook to ensure that there are no lingering Map Reduce jobs.
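The shutdown-hook pattern mentioned above can be sketched without any Hadoop dependency. In a real driver the placeholder method would call something like the running job’s kill method; everything here is an illustrative stand-in:

```java
public class DriverShutdownHook {
    private static volatile boolean jobRunning = true;

    static boolean isJobRunning() { return jobRunning; }

    // Placeholder for killing the submitted Map Reduce job in a real driver.
    static void killRunningJob() {
        jobRunning = false;
        System.out.println("lingering job killed");
    }

    public static void main(String[] args) {
        // If the launcher JVM is torn down while the job is still running,
        // the hook fires and kills the job instead of leaving it lingering.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            if (jobRunning) killRunningJob();
        }));
        System.out.println("job submitted");
        // ... wait for completion; on normal completion set jobRunning = false
    }
}
```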
  • 29. Implementing predefined looping and forking
    – hPDL is an XML document with a well-defined schema.
    – This means the actual workflow can easily be manipulated using JAXB objects, which can be generated from the hPDL schema using the xjc compiler.
    – As a result, we can create the complete workflow programmatically, based on a calculated number of fork branches, or implement loops as repeated actions.
    – The other option is creating a template process and modifying it based on calculated parameters.
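As a dependency-free sketch of generating a workflow programmatically (plain DOM here instead of the JAXB approach described above; element names follow the hPDL schema, branch names are placeholders):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ForkGenerator {
    // Build a <fork> element with one <path> per calculated branch name.
    static Element buildFork(Document doc, String name, String... branches) {
        Element fork = doc.createElement("fork");
        fork.setAttribute("name", name);
        for (String branch : branches) {
            Element path = doc.createElement("path");
            path.setAttribute("start", branch);
            fork.appendChild(path);
        }
        return fork;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element fork = buildFork(doc, "forking", "branch-0", "branch-1", "branch-2");
        System.out.println(fork.getChildNodes().getLength()); // 3
    }
}
```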
  • 30. Oozie client security (or lack thereof)
    – By default the Oozie client reads the client’s identity from the local machine’s OS and passes it to the Oozie server, which uses this identity for MR job invocation.
    – Impersonation can be implemented by overriding the OozieClient class’ createConfiguration() method, where the client variables can be set through a new constructor:
      public Properties createConfiguration() {
          Properties conf = new Properties();
          if (user == null)
              conf.setProperty(USER_NAME, System.getProperty("user.name"));
          else
              conf.setProperty(USER_NAME, user);
          return conf;
      }
  • 31. Uber jars with Oozie [diagram] An uber jar contains resources: other jars, .so libraries, zip files. The Oozie server runs the uber jar’s launcher class as a java action; the launcher unpacks the resources to the current directory, sets an inverse classloader, invokes the MR driver passing it the arguments, sets a shutdown hook and waits for completion of the mappers. The corresponding workflow fragment:
      <java>
        …
        <main-class>${wfUberLauncher}</main-class>
        <arg>-appStart=${wfAppMain}</arg>
        …
      </java>