4. THIS PRESENTATION IS ABOUT… (INTRO)
• Previous workflow architecture
• What Oozie is
• How we incorporated Oozie
  – Relational Data Pipeline
  – Non-relational Data Pipeline
• Lessons learned
• Where we're headed
5. WHO IS RIOT GAMES? (INTRO)
• Developer and publisher of League of Legends
• Founded in 2006 by gamers, for gamers
• Focused on the player experience
  – Needless to say, data is pretty important to understanding the player experience!
10. WHY WORKFLOWS? (Architecture)
• Retry a series of jobs in the event of failure
• Execute jobs at a specific time or when data is available
• Correctly order job execution based on resolved dependencies
• Provide a common framework for communication and execution of production processes
• Use the workflow to couple resources instead of maintaining a monolithic code base
12. ISSUES WITH PREVIOUS PROCESS (Architecture)
• All of the ETL processes ran on one node, which limited concurrency
• If the main runner execution died, the whole ETL for that day had to be restarted
• No reporting of what was run or of the ETL configuration without log diving on the actual machine
• No retries (outside of native MapReduce tasks) and no good way to rerun a previous config if the underlying code had changed
16. WHAT IS OOZIE? (Oozie)
• Oozie is a workflow scheduler system to manage Apache Hadoop jobs
• Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box as well as system-specific jobs
• Oozie is a scalable, reliable, and extensible system
17. WHY OOZIE? (Oozie)
• NATIVE HADOOP INTEGRATION: no need to create custom hooks for job submission
• HORIZONTALLY SCALABLE: jobs are spread across available mappers
• OPEN SOURCE: the project has strong community backing and has committers from several companies
• VERBOSE REPORTING: logging and debugging are extremely quick with the web console and SQL
30. WORKFLOW ACTION: MAPREDUCE (Oozie)
<action name="myfirstHadoopJob">
    <map-reduce>
        <job-tracker>foo:9001</job-tracker>
        <name-node>bar:9000</name-node>
        <prepare>
            <delete path="hdfs://bar:9000/usr/foo/output-data"/>
        </prepare>
        <job-xml>/myfirstjob.xml</job-xml>
        <configuration>
            <property>
                <name>mapred.input.dir</name>
                <value>/usr/foo/input-data</value>
            </property>
            <property>
                <name>mapred.output.dir</name>
                <value>/usr/foo/output-data</value>
            </property>
            <property>
                <name>mapred.reduce.tasks</name>
                <value>${firstJobReducers}</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="myNextAction"/>
    <error to="errorCleanup"/>
</action>
• Each action has a type, and each type has a defined set of key:value pairs that can be used to configure it
• The action must also specify which actions to transition to on success or failure
33. THE WORKFLOW ENGINE (Oozie)
[Diagram: a DAG running from Start through fork, join, and decision nodes and action nodes (MapReduce, Java, Sqoop, Hive, HDFS, Shell) to End]
• Oozie runs workflows in the form of DAGs (directed acyclic graphs)
• Each element in this workflow is an action or control node
• Some node types are processed internally by Oozie, while others are farmed out to the cluster
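To make the fork/join control nodes concrete, a minimal fragment might look like this (node names are illustrative, not from our workflows):
<fork name="fork-node">
    <path start="mr-node"/>
    <path start="hive-node"/>
</fork>
<!-- mr-node and hive-node run in parallel; both transition to the join -->
<join name="join-node" to="next-node"/>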
34. WORKFLOW EXAMPLE (Oozie)
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1">
    <start to="java-node"/>
    <action name="java-node">
    ...
    </action>
    <end name="end"/>
    <kill name="fail">
        <message>Workflow failed</message>
    </kill>
</workflow-app>
• This workflow will run the action defined as java-node
• Execution flows from start to java-node and then to end; on error, java-node transitions to the fail kill node
[Diagram: start -> java-node -> end, with an error edge from java-node to fail]
39. COORDINATOR (Oozie)
• Oozie coordinators can execute workflows based on time and data dependencies
• Each coordinator specifies a workflow to execute upon meeting its trigger criteria
• Coordinators can pass variables to the workflow layer, allowing for dynamic resolution
[Diagram: a Client submits to the Oozie Server, where an Oozie Coordinator triggers an Oozie Workflow that executes on Hadoop]
40. EXAMPLE COORDINATOR (Oozie)
<?xml version="1.0"?>
<coordinator-app name="test_job_coord" frequency="${coord:hours(1)}"
                 start="${COORD_START}" end="${COORD_END}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
    <action>
        <workflow>
            <app-path>hdfs://bar:9000/user/hadoop/oozie/app/test_job</app-path>
        </workflow>
    </action>
</coordinator-app>
• This coordinator will run every hour and invoke the workflow found in the test_job folder
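As a sketch of the variable passing mentioned on the coordinator slide (the runDate property name is illustrative), the coordinator action can hand values down to the workflow:
<action>
    <workflow>
        <app-path>hdfs://bar:9000/user/hadoop/oozie/app/test_job</app-path>
        <configuration>
            <property>
                <name>runDate</name>
                <!-- resolves to the coordinator's nominal execution time -->
                <value>${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}</value>
            </property>
        </configuration>
    </workflow>
</action>
The workflow can then reference ${runDate} anywhere in its own XML.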
44. BUNDLE (Oozie)
[Diagram: a Client submits an Oozie Bundle to the Oozie Server; the bundle contains multiple Oozie Coordinators, each driving an Oozie Workflow on Hadoop]
• Bundles are higher-level abstractions that batch a set of coordinators together
• There is no explicit dependency between coordinators within a bundle, but it can be used to more formally define a data pipeline
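A minimal bundle sketch batching two coordinators (names and paths are illustrative):
<bundle-app name="etl-bundle" xmlns="uri:oozie:bundle:0.1">
    <coordinator name="schema1-coord">
        <app-path>hdfs://bar:9000/user/hadoop/oozie/app/schema1</app-path>
    </coordinator>
    <coordinator name="schema2-coord">
        <app-path>hdfs://bar:9000/user/hadoop/oozie/app/schema2</app-path>
    </coordinator>
</bundle-app>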
45. THE INTERFACE (Oozie)
Multiple ways to interact with Oozie:
• Web console (read-only)
• CLI
• Java client
• Web service endpoints
• Directly with the DB using SQL
The Java client and CLI are just abstractions over the web service endpoints, and it is easy to extend this functionality in your own apps.
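For example, the same coordinator listing can be fetched through the CLI or straight from the web service it wraps (host and port are illustrative):
$ oozie jobs -oozie http://localhost:11000/oozie -jobtype coordinator
$ curl "http://localhost:11000/oozie/v1/jobs?jobtype=coordinator"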
46. PIECES OF A DEPLOYABLE (Oozie)
The components needed for a scheduled workflow:
• coordinator.xml: contains the scheduler definition and the path to workflow.xml
• workflow.xml: contains the job definition
• Libraries: optional jar files
• Properties file (also possible through a WS call): initial parameterization and the mandatory specification of the coordinator path
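A minimal job.properties sketch for the layout above (hosts, paths, and dates are illustrative):
oozie.coord.application.path=hdfs://bar:9000/user/hadoop/oozie/app/test_job
nameNode=hdfs://bar:9000
jobTracker=foo:9001
COORD_START=2013-10-01T00:00Z
COORD_END=2014-10-01T00:00Z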
48. COORDINATOR SUBMISSION (Oozie)
• Deploy the workflow and coordinator to HDFS
  $ hadoop fs -put test_job oozie/app/
• Submit and run the coordinator job
  $ oozie job -run -config job.properties
• Check the coordinator status on the web console
57. A USE CASE: HOURLY JOBS (Oozie)
Replace a current CRON job that runs a bash script once a day (6):
• The shell will execute a Java main which pulls data from a filestream (1), dumps it to HDFS, and then runs a MapReduce job on the files (2). It will then email a person when the report is done (3).
• It should start within X amount of time (4)
• It should complete within Y amount of time (5)
• It should retry Z times on failure (automatic)
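A hedged sketch of how pieces of this map onto Oozie (node names are illustrative): the retries come from the user-retry attributes on an action, and the email in (3) can use the native email action; the start/complete windows in (4) and (5) map to coordinator controls and Oozie SLA settings, not shown here.
<action name="report-node" retry-max="3" retry-interval="10">
    <!-- shell or map-reduce body as on the earlier slides -->
    <ok to="email-node"/>
    <error to="fail"/>
</action>
<action name="email-node">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>analyst@example.com</to>
        <subject>Hourly report done</subject>
        <body>The report job completed successfully.</body>
    </email>
    <ok to="end"/>
    <error to="fail"/>
</action>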
67. WORKFLOWS: RELATIONAL (Use Case 1)
[Diagram: per-region source DBs (Audit, Plat, LoL for North America, Europe, and Korea) are extracted by Oozie into the Hive Data Warehouse; MySQL, Pentaho, and Tableau sit downstream for analysts and business analysts]
69. WORKFLOWS: RELATIONAL (Use Case 1)
[Diagram: Oozie actions Extract the Region X DB tables (Audit, Plat, LoL) into Hive staging, then Transform them into final Hive tables]
• Temp tables map 1:1 with the source DB table metadata
• Final tables provide more descriptive column naming and native type conversions
70. WORKFLOWS: RELATIONAL (Use Case 1)
The Oozie actions, step by step:
1. [Java] Check the partitions for the table and pull the latest date found. Write the key:value pair for the latest date back out to a properties file so that it can be referenced by the rest of the workflow (a sketch follows).
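A hedged sketch of step 1 as an Oozie Java action (the class name is illustrative); <capture-output/> is what lets later actions read the properties the main() writes:
<action name="initialize-node">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <main-class>com.example.etl.LatestPartitionCheck</main-class>
        <arg>${tableName}</arg>
        <!-- main() writes latestDate=... to the file named by the
             oozie.action.output.properties system property -->
        <capture-output/>
    </java>
    <ok to="sqoop-node"/>
    <error to="fail"/>
</action>
Downstream actions can then reference ${wf:actionData('initialize-node')['latestDate']}.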
71. WORKFLOWS: RELATIONAL (Use Case 1)
2. [Sqoop] If the table is flagged as dynamically partitioned, pull data from the table from the latest partition (referencing the output of the Java node) through today's date. If not, pull data just for the current date (a sketch follows).
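Step 2 could look roughly like this Sqoop action (the connection string and column name are illustrative); <arg> elements are used instead of <command> so the --where clause survives Oozie's whitespace splitting:
<action name="sqoop-node">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <arg>import</arg>
        <arg>--connect</arg>
        <arg>jdbc:mysql://region-db/platform</arg>
        <arg>--table</arg>
        <arg>${tableName}</arg>
        <arg>--where</arg>
        <arg>partition_date >= '${wf:actionData('initialize-node')['latestDate']}'</arg>
        <arg>--target-dir</arg>
        <arg>/staging/${tableName}</arg>
    </sqoop>
    <ok to="hive-node"/>
    <error to="fail"/>
</action>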
72. WORKFLOWS: RELATIONAL (Use Case 1)
3. [Hive] Copy the table from the updated partitions from the staging DB to the prod DB, while also performing column name and type conversions (a sketch follows).
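Step 3 as a Hive action might be sketched like this (the script name and parameters are illustrative):
<action name="hive-node">
    <hive xmlns="uri:oozie:hive-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>staging_to_prod.q</script>
        <param>TABLE=${tableName}</param>
        <param>LATEST_DATE=${wf:actionData('initialize-node')['latestDate']}</param>
    </hive>
    <ok to="audit-node"/>
    <error to="fail"/>
</action>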
73. WORKFLOWS: RELATIONAL (Use Case 1)
4. [Java] Grab row counts for both the source and Hive across the dates pulled. Write this, as well as some other metadata, out to an audit DB for reporting (validation).
74. AUDITING (Use Case 1)
• We have a Tableau report pointing at the output audit data for a rapid, high-level view of the health of our ETLs
75. SINGLE TABLE ACTION FLOW (Use Case 1)
[Diagram: Start -> Initialize-node -> Sqoop-node (the extraction actions) -> Oozie-node -> End; the Oozie-node kicks off the transform workflow (Hive-node -> Audit-node)]
• This action flow is done once per table
• The Oozie action allows us to asynchronously run the Hive staging-to-prod action and the auditing action. It is a Java action which uses the Oozie Java client and submits key:value pairs to another workflow (see the sketch below).
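A minimal sketch of what that Java action's main() might do with the Oozie Java client (the URL, paths, and property names are illustrative):
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitTransform {
    public static void main(String[] args) throws Exception {
        // Connect to the Oozie server's web service endpoint
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");
        Properties conf = client.createConfiguration();
        // Point at the transform workflow and pass key:value pairs down to it
        conf.setProperty(OozieClient.APP_PATH, "hdfs://bar:9000/user/hadoop/oozie/app/transform");
        conf.setProperty("tableName", args[0]);
        conf.setProperty("latestDate", args[1]);
        // Submit and start the workflow without waiting for it to finish
        String jobId = client.run(conf);
        System.out.println("Submitted transform workflow: " + jobId);
    }
}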
80. FULL SCHEMA WORKFLOW (Use Case 1)
[Diagram: Start -> Table 1 extraction actions -> Table 1 transform workflow -> Table 2 extraction actions -> Table 2 transform workflow -> ... -> Table N extraction actions -> Table N transform workflow -> End]
• We have one of these workflows per schema
• Different schemas have a different number of tables (e.g. ranging from 5 to 20)
• We could fork and do each of these table extractions in parallel, but we are trying to limit the I/O load we create on the sources
82. COORDINATORS (Use Case 1)
[Diagram: Schema 1 Coordinator -> Schema 1 Workflow; Schema 2 Coordinator -> Schema 2 Workflow; ... Schema N Coordinator -> Schema N Workflow]
• We have one coordinator per schema workflow
• Currently coordinators are staged in groups based on schema type
83. IMPORTANT NUMBERS (Use Case 1)
• 20+ regions
• 5+ DBs per region
• 5-20 tables per DB
20 * 5 * 12 (avg) = ~1200 tables!
85. TOO UNWIELDY? (Use Case 1)
• Not if you have a good deployment pipeline!
87. DEPLOYMENT STACK: JAVA (Use Case 1)
• The Java project compiles into the library that is used by the workflows
• It also contains some custom functionality for interacting with the Oozie WS endpoints and Oozie DB tables
88. DEPLOYMENT STACK: PYTHON (Use Case 1)
• The Python project dynamically generates all of our workflow/coordinator XML files. It has multiple YML configs which hold the metadata associated with all of the tables. It also interacts with a DB table for the various DB connection metadata.
89. DEPLOYMENT STACK: GITHUB (Use Case 1)
• GitHub houses all of the Big Data group's code bases, no matter the language.
90. DEPLOYMENT STACK: JENKINS (Use Case 1)
• Jenkins polls GitHub and builds either set of artifacts (the Java lib or the tar containing the workflows/coordinators) whenever it detects changes. It deploys the build artifacts to a simple mount point.
91. DEPLOYMENT STACK: CHEF (Use Case 1)
• The Chef cookbook checks for the version declared for both sets of artifacts and grabs them from the mount point. It runs a shell script which deploys the unpacked workflows/coordinators and mounts the jar lib file.
95. IMPORTANT NUMBERS (Use Case 1)
• 20+ regions
• 5+ DBs per region
• 5-20 tables per DB
20 * 5 * 12 (avg) = ~1200 tables per day!
1 person < 5 hours a week!
96. USE CASE 2 – DASHBOARDING CLOUD DATA (Use Case 2)
97. WORKFLOWS: NON-RELATIONAL (Use Case 2)
[Diagram: Client, Mobile, and WWW data flow through Honu into the Hive Data Warehouse; a self-service app (workflow and meta) drives the dashboard consumed by analysts and business analysts]
99. WORKFLOWS: NON-RELATIONAL (Use Case 2)
[Diagram: Honu source tables (Audit, Plat, LoL) are Transformed by Oozie actions into Derived tables, with a Message sent to an external queue]
• Amazon SQS is a message queue we use for asynchronous communication
• Derived tables are filtered datasets joined from one or more sources
100. WORKFLOWS: NON-RELATIONAL (Use Case 2)
The Oozie actions:
1. [Java] Check that the required partitions for the derived query exist and contain data. Send a message to an SNS endpoint if a partition exists but contains no rows.
101. WORKFLOWS: NON-RELATIONAL (Use Case 2)
2. [Hive] Perform the table transformation query on the selected partition(s). This query can filter any subset of source columns and join any number of source tables.
102. WORKFLOWS: NON-RELATIONAL (Use Case 2)
3. [Java] Send an SQS message to an external queue based on the consumer type. Consumers will pull from these queues regularly and update the various dashboard artifacts (a sketch follows).
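A hedged sketch of the SQS send in step 3, using the AWS SDK for Java (the queue URL and message format are illustrative):
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.SendMessageRequest;

public class NotifyConsumers {
    public static void main(String[] args) {
        // Credentials are picked up from the default provider chain
        AmazonSQSClient sqs = new AmazonSQSClient();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/dashboard-updates";
        // Tell downstream consumers which derived table/partition is ready
        sqs.sendMessage(new SendMessageRequest(queueUrl, "table=" + args[0] + ";partition=" + args[1]));
    }
}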
105. LESSON #1: Distros and Versioning (LESSONS)
• If you choose to go with a distro for your Hadoop stack, be extremely vigilant about upgrading to the latest versions whenever possible. You will receive a lot more community support and far fewer headaches if you are not running into bugs that were patched in trunk over a year ago!
106. LESSON #2: Solidify Deployment (LESSONS)
• The usefulness of Oozie can degrade as complexity creeps into your pipeline. If you do not work towards an automated deployment pipeline in the early stages of your development, you will quickly find maintenance costs rising significantly over time.
107. LESSON #3: Extend Capabilities (LESSONS)
• Don't feel limited to using tools based on the supplied APIs. Feel free to implement harnesses that extend capabilities and submit them back to the community – we will welcome them with open arms :)
108. LESSON #4: Ask for Help! (LESSONS)
• Oozie is an open source project and is gaining new members and organizations every day. Don't spend multiple hours trying to solve an issue that many of us have already worked through.
• There is also a large amount of documentation, both in the wikis AND in archived listserv responses – leverage them both!
112. CHALLENGE: MAKE IT GLOBAL (THE FUTURE)
• Data centers across the globe, since latency has a huge effect on gameplay -> log data scattered around the world
• Large presence in Asia -- some areas (e.g., PH) have bandwidth challenges or bandwidth is expensive
113. CHALLENGE: WE HAVE BIG DATA (THE FUTURE)
• 500G of structured data daily
• > 7 PB of game event data
• 3MM subscribers
• 448+ MM views on the Riot YouTube channel
• + chat logs, detailed gameplay event tracking, and so on…
114. OUR AUDACIOUS GOALS (THE FUTURE)
• Have a deep, real-time understanding of our systems from player-experience and operational standpoints
• Have the ability to identify, understand, and react to meaningful trends in real time
• Build a world-class data and analytics organization
  – Deeply understand players across the globe
  – Apply that understanding to improve games for players
  – Deeply understand our entire ecosystem, including social media
115. SHAMELESS HIRING PLUG (THE FUTURE)
Like most everybody else at this conference… we're hiring!
THE RIOT MANIFESTO
• PLAYER EXPERIENCE FIRST
• CHALLENGE CONVENTION
• FOCUS ON TALENT AND TEAM
• TAKE PLAY SERIOUSLY
• STAY HUNGRY, STAY HUMBLE