Building a Self-Service Hadoop Platform at LinkedIn with Azkaban
Hadoop at LinkedIn
3
Hadoop at LinkedIn
4
Home Page / Profile Page
Hadoop at LinkedIn
5
Evolution of Workflows
6
2009, 2010, 2011, 2012, 2013
Azkaban 1.0
7
Azkaban 1.0
- Run workflows
- Schedule jobs
- Job History
- Failure notification
- Easy to use web UI and visualizations
8
Azkaban 2.0
9
Azkaban 2.0
- Major re-architecting
- Separate executor and web servers
- User authentication
- Pluggable database drivers
  – H2
  – MySQL
- Brand new UI
10
Azkaban 2.0
- Jobtype plugins
  – Built-in type: command
  – Pluggable jobtypes:
    - Java
    - Pig
    - Hive
  – Non-Hadoop jobtypes:
    - Teradata
    - Voldemort
- Viewer plugins – extending the Azkaban UI for other tools
  – HDFS browser
  – Reportal
- LinkedIn-specific code as plugins
11
Azkaban 2.5
12
Azkaban 2.5
- UI overhauled using Bootstrap
- Embedded flows
- New self-service tools
  – Job Summary
  – Flow Summary
  – Pig Visualizer
- Jobtype-specific plugins
- HDFS viewer improvements
  – Display file schema in addition to content
  – Parquet file viewer
- And more
13
Who’s using Azkaban?
- Software Engineers
- Data Scientists
- Analysts
- Product Managers
14
Azkaban Today
- Workflow manager and scheduler
- Integrated runtime environment
- Unified front-end for Hadoop tools
15
Good News! Success!
 1000+ users
 Several clusters
 2,500 flows executing per day
 30,000 jobs executing per day
16
Bad News! Success
 1000+ users
 Several clusters
 2,500 flows executing per day
 30,000 jobs executing per day
17
Creating and Running Workflows
18
Creating Workflows
- Add job “type” plugins
  – hadoopJava
  – Command
  – Pig
  – Hive
- Dependencies
  – Determine the dependency graph
- Parameter passing
  – Parameters can be passed to the job
19
peanutbutter.job:
type=pig
creamy.level=4
chunky.level=4
...

jelly.job:
type=hadoopJava
jelly.type=grape
sugar=HFCS
...

bread.job:
type=command
bread.type=wheat
dependencies=peanutbutter,jelly
...
Embedded Flows
- Embed a flow as a node in another flow
  – “flow” job type
  – Set flow.name to the name of the embedded flow
  – Parameters can be passed to the flow
20
(Diagram: the embedded “bread” flow – peanutbutter and jelly feeding into bread.)

sandwich.job:
type=flow
flow.name=bread
dependencies=coffee,fruit

coffee.job:
type=hive
coffee.decaf=false
coffee.cream=true
...

fruit.job:
type=hadoopJava
fruit.type=apple
...
Project Management
Project Page
21
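Once the .job files are zipped, the archive can be uploaded from this project page or scripted against Azkaban's AJAX API. Below is a minimal sketch in Python, assuming the documented login and project-upload endpoints; the host, credentials, project name, and zip file are placeholders.

import requests

AZKABAN_URL = "https://azkaban.example.com:8443"   # placeholder host
PROJECT = "breakfast"                              # placeholder project; it must already exist
ZIP_PATH = "breakfast.zip"                         # archive containing the .job files

# Log in to obtain a session id (Azkaban AJAX API).
login = requests.post(
    AZKABAN_URL,
    data={"action": "login", "username": "azkaban", "password": "azkaban"},
    verify=False,  # many internal instances use self-signed certificates
).json()
session_id = login["session.id"]

# Upload the zipped project to the existing project.
with open(ZIP_PATH, "rb") as zip_file:
    result = requests.post(
        AZKABAN_URL + "/manager",
        data={"session.id": session_id, "ajax": "upload", "project": PROJECT},
        files={"file": (ZIP_PATH, zip_file, "application/zip")},
        verify=False,
    ).json()
print(result)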
Running Workflows
Flow Execution Panel
22
Running Workflows
Notification Options
23
Running Workflows
Failure Options
24
- Finish Current
  – Finishes currently running jobs, then stops
- Cancel All
  – Kills all running jobs and finishes immediately
- Finish Possible
  – Finishes all possible jobs whose dependencies have been met, then fails the flow
Running Workflows
Failure Options
25
Running Workflows
Flow Parameters
26
Running Workflows
Concurrent Execution Options
27
- Skip Executions
  – Prevent concurrent executions
- Run Concurrently
  – Run the flow concurrently
- Pipeline
  – Distance 1: jobA waits until the already-running execution’s jobA finishes
  – Distance 2: jobA waits until the already-running execution’s jobA’s children finish
Running Workflows
Concurrent Execution Options
28
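The failure, concurrency, pipelining, and flow-parameter options shown in these panels can also be supplied when an execution is triggered programmatically. Below is a hedged sketch, assuming Azkaban's documented executeFlow AJAX call and reusing a session id obtained as in the upload sketch earlier; the project, flow, and override values are placeholders.

import requests

AZKABAN_URL = "https://azkaban.example.com:8443"   # placeholder host
session_id = "..."                                 # from the login call shown earlier

params = {
    "session.id": session_id,
    "ajax": "executeFlow",
    "project": "breakfast",               # placeholder project
    "flow": "sandwich",                   # placeholder flow
    "failureAction": "finishPossible",    # finishCurrent | cancelImmediately | finishPossible
    "concurrentOption": "pipeline",       # ignore (run concurrently) | pipeline | skip
    "pipelineLevel": "1",                 # pipeline distance 1 or 2
    "flowOverride[bread.type]": "rye",    # flow parameter override, like the Flow Parameters tab
}
response = requests.get(AZKABAN_URL + "/executor", params=params, verify=False)
print(response.json())                    # an exec id is returned on success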
Running Workflows
Executing Flow Page
29
Running Workflows
Flow Job List
30
Scheduling Workflows
31
Scheduling Workflows
Schedule Flow Panel
32
Scheduling Workflows
Scheduled Flows
33
Scheduling Flows
Setting SLAs
34
Debugging and Tuning
35
Hadoop at LinkedIn
 1000+ users
 Several clusters
 2,500 flows executing per day
 30,000 jobs executing per day
36
Job Execution History
37
Flow Execution History
38
Running Workflows
Job Logs
39
Job Summary
40
Pig Visualizer
41
Pig Visualizer
42
Pig Visualizer
43
Pig Visualizer
44
Pig Visualizer
45
Flow Summary
46
Flow Summary
47
Browsing HDFS
48
HDFS Viewer
Browsing Files
49
HDFS Viewer
Viewing Files
50
HDFS Viewer
File Schema
51
- Avro
- Parquet
- Binary JSON
- Sequence File
- Image
- Text
HDFS Viewer
Supported File Types
52
Reportal
53
Reportal
Dashboard
54
Reportal
New Report
55
Reportal
Viewing Results
56
- Pig
- Hive
- Teradata
Reportal
Supported Query Types
57
Upcoming Features
58
Azkaban Gradle Plugin and DSL
- Describe Azkaban flows and deploy them with Gradle
- Single file (more if you want) to describe all your workflows
  – Compiles to .job files
- Static checker
- Valid Groovy code
  – Add conditionals for deployment to different clusters
59
azkaban {
  jobConfDir = './jobs'

  workflow('workflow2') {
    pigJob('job2') {
      script = 'src/main/pig/count-by-country.job'
      parameter 'inputFile', '/user/foo/sample'
      reads '/data/databases/foo', [as: 'input']
      writes '/data/databases/bar', [as: 'output']
    }
    hiveJob('job3') {
      query = 'show tables'
    }
    workflowDepends 'job2', 'job3'
  }
}
Future Roadmap
- New visualizers (Hive, Tez, etc.)
- Support DSL from other tools
- Operationalization tooling
- Scalability improvements
- Improved plugin interfaces
60
Future Discussions
- Conditional branching
- Hive Metastore browser
- Pluggable executors (e.g., YARN)
- Persistent storage server
- Launching and monitoring long-running YARN applications (Samza, Storm, etc.)
61
Main Contributors
- David Chen (LinkedIn)
- Hien Luu (LinkedIn)
- Anthony Hsu (LinkedIn)
- Alex Bain (LinkedIn)
- Richard Park (RelateIQ)
- Chenjie Yu (Tango)
- Shida Li (University of Waterloo)
62
How to Contribute
Website: azkaban.github.io
GitHub: github.com/azkaban
LinkedIn’s Data Website: data.linkedin.com
63
Editor's Notes
  1. Hello everyone. Welcome to my talk on Building a Self-Service Hadoop platform at LinkedIn with Azkaban.
  2. A little bit about who I am. I am a Software Engineer on the Hadoop Development Team at LinkedIn. I am one of the main contributors to Azkaban. Previously, I was at Microsoft, working on the Windows Kernel.
  3. I want to start by talking a little about how LinkedIn uses Hadoop. About half of our Hadoop usage is in ad hoc queries, queries for general analytics. The other half is used on the vast majority of our data products
  4. These are two pages from my LinkedIn experience: <click> - On the left is my homepage - On the right is my profile page. The features on this page that are directly powered by Hadoop are everything in blue. <click> - People You May Know - Recommendations - Many other features - Our incredibly robust AB test platform, and we test almost every change, every placement, every impression. The sample set, the test analysis… they are all done on Hadoop.
  5. Let’s take a closer look at the LinkedIn home page. <click> People You May Know and Ad Recommendations are two of the many data products powered by Hadoop at LinkedIn. <click> Here are visual representations of the dependency graphs of Hadoop jobs that create these two features. - Dozens to hundreds of Hadoop jobs per data product. - People You May Know is one of our premier data products. It was at the core of LinkedIn’s membership growth. These run reliably every day, and sometimes multiple times a day, to provide fresh data to our users. And it’s critical that they work.
  6. We’re constantly testing and improving the workflows for all of our data products. We’ve had a very high velocity of changes since the beginning of data products at LinkedIn. This is the graph representing the workflow for one of LinkedIn’s major data products back in 2009. It’s small, only a few Hadoop jobs. Not difficult to understand or maintain. But we were in a phase of rapid growth in 2009… and still are. As we grew, these workflows changed… a lot. <Click> Each image represents a month of change to the workflow. - Every month, you can see that there are small to huge modifications to the workflow. - It quickly got very complex - And this was just one of our data products. Imagine having one of these for every product. It became clear that without being able to visualize what your workflow looked like and how it ran, it was very difficult to understand what the whole workflow did and keep up with the rapid development. And it was with this in mind that we created <click>
  7. Azkaban, our workflow scheduler, back in 2009 This is what it looked like back then, pretty gothic looking, but had the basic features we needed.
  8. With Azkaban, we could run workflows, schedule jobs, view job history, and use other features, like failure notification. But, most importantly, it had an easy to use Web UI with visualizations that showed you how your flow looked and how it ran. Well, it wasn’t perfect: it was one server that stored everything in flat files on disk and ran every job as a forked local process. It didn’t really have user management, and the UI was engineer-designed and a bit dark and depressing. Nonetheless, the tool was heavily used, because above all, it was simple and easy to use.
  9. About a year and half ago, in order to keep up with our rapid growth, we did a massive re-architecting of Azkaban. We got some help from a Web Dev to create a completely new and easier to use UI.
  10. Azkaban was now two separate servers: the executor server handled job scheduling and execution, while the web server managed projects and users and provided the UI. User management meant users could no longer overwrite each other’s files. Data was stored in a database with pluggable drivers (H2, MySQL). And of course, a brand new UI and more.
  11. New plugin system: Azkaban itself has no dependency on Hadoop. We modularized many components of Azkaban and turned them into plugins. The only built-in job type is command; all other job types became plugins: Java, Pig, Hive, etc. We also supported non-Hadoop jobtypes such as Teradata and Voldemort Build-and-Push. Viewer plugins extend the Azkaban UI to integrate new tools, such as the HDFS browser and Reportal. This means that we do not run a fork of Azkaban internally. We run the same code that lives in the open source repository on GitHub. All LinkedIn-specific code is implemented as plugins. Over time, we found that our users used Azkaban as more than just a workflow scheduler. They used Azkaban not only as a major tool for developing Hadoop workflows, but also for debugging them with the integrated job and flow logs, only going to the Job Tracker logs when absolutely necessary. They used Azkaban to browse HDFS. Many of our users don’t even know the namenode HDFS browser exists. Azkaban had become not just a workflow manager, but the main front-end to Hadoop at LinkedIn. With this in mind, at the beginning of this year, we released <click>
  12. Azkaban 2.5, focusing on making Azkaban easier and more powerful to use and more productive to develop and extend. We rebuilt the UI to be both familiar to our users and more beautiful and more intuitive. By using Bootstrap, it is also more future-proof and easier to extend.
  13. We also added a number of new features and improvements. We added more powerful workflow features such as embedded flows, which let you embed and reuse flows as nodes within other flows. There are a number of new self-service tools to help users better understand how their flows and jobs ran and make it easier for them to debug and tune them. Viewer plugins can now be specific to jobtypes, so that you can build tools that sit seamlessly on the job details page; we used this to build the new self-service tools. The HDFS viewer was improved to display the schema in addition to content and gained a Parquet file viewer. And there are a number of under-the-hood improvements and more.
  14. One major reason we pushed the ease-of-use and simplicity so much was due to our users. Our users not only include software engineers and data scientists, but also analysts and product managers who are creating, modifying and scheduling Hadoop workflows
  15. So, Azkaban started off as a workflow scheduler for Hadoop. Today, it is also: an integrated runtime environment, where users develop and run complex workflows consisting of Pig, Hive, Java MapReduce and other types of jobs; and a unified front-end for Hadoop tools.
  16. In the first year, we only had a dozen different workflows to manage and a handful of people that used them. Over the last 4 years, we’ve grown to over 1000 different users, Azkaban instances on over six different clusters, and 2,500 flows accounting for 30,000 Hadoop jobs executing per day. We have jobs that run from just a few minutes all the way to 8 days.
  17. The bad news is that we have over 1000 users, 2,500 flows and 30,000 jobs executing per day. And this is only going to keep increasing. It surprised us how much it is being used. And this is only from one Azkaban instance and we have about 6 of them. Our Hadoop development team is fairly small, only about 8 people. To keep up with our users, we had to make our Hadoop infrastructure as self-service as possible. And the primary front-end to our Hadoop infrastructure is, of course, Azkaban.
  18. So, how do you use Azkaban?
  19. I am going to show you how easy it is to create an Azkaban workflow. A workflow is simply a collection of job files. These are key value property files. I use the ‘type’ property to define what kind of job I want to run. I can run a variety of jobs, like Pig, Hive, Java or just plain old command line jobs. The dependencies parameter is self explanatory. It specifies which jobs must complete successfully before this job can be run. The rest of the parameters are passed to the executing job itself. All Azkaban does with these parameters is to construct a process that is run locally. The reason we can get away with having a lot of processes locally is that most of them don’t do much more than spawn Hadoop jobs on the cluster. So in this case, bread waits on peanut butter and jelly, making this the most delicious workflow ever.
  20. We found that many users reuse what is effectively a sub-flow several times with different parameters As a result, we added support for embedded flows, making it possible to embed a workflow as a node in other workflows. I just set the type to “flow” and set “flow.name” to the name of the workflow I want to embed. In this case, my embedded flow is “sandwich”, consisting of peanutbutter, jelly, and bread, waits on coffee and fruit, making this workflow a complete and healthy breakfast.
  21. Afterwards, I just package my jobs into a zip archive and upload it to Azkaban via the web UI or a REST interface. Here is the project page, where I can run my workflows, set permissions, and customize my jobs. If I want a birds-eye view of which jobs make up the flows in my project, I simply click on the drop-down for one of my flows and I can see the hierarchy of its jobs in an outline form. When I mouse over one of the jobs, Azkaban automatically highlights the jobs it depends on and the ones that depend on it. When I am ready to run my workflow, I just click the “Execute Flow” button, which brings up <click>
  22. …the Flow Execution Panel. Here, I can do a lot of things to customize my flow before I run it. I have this beautiful visualization of my flow. I can enable or disable any part of my flow. For example, here I have 3 embedded flows that process data and a final flow that pushes data back out to the front end. If I want to test my workflow and not push any data, I can right click and disable the last flow.
  23. Because the flow will take a long time to run, I want to be notified by email if the execution fails. I just click into the Notification tab, select to be notified when the first job fails, and specify the email address for the notification. Here, I can also set the failure notification to be sent after the failed flow finishes running. I can also set a notification to be sent if the flow finishes successfully.
  24. By default, if a job in a flow fails, Azkaban will let the currently running jobs finish before killing the flow. When I am testing my flow, I might want to change what happens when a failure is caught. I just click into the Failure Options tab and select which behavior I want.
  25. Aside from letting the currently running jobs finish, I can also have Azkaban kill all jobs immediately when a failure is caught or try to continue to run as many of the remaining jobs that it can run. And I can do all this through the UI.
  26. I can also customize the execution of my flow by setting custom parameters. I just click into the Flow Parameters tab and set any parameters I like. The parameters that I set here will override the parameters set in my job files. This is particularly useful when I am testing my flow.
  27. I might want to run multiple instances of my flow concurrently. This is often useful when I am testing my flow and I want to kick off a few instances of it, each with different parameters. If I am concerned about the different instances of the flow stepping on each other and touching the same files when running in parallel, I can go into the Concurrent tab and specify how I want the workflow to run if one instance is already running.
  28. I can either disallow concurrent executions of my flow altogether. I can let all instances of the flow run concurrently. Or I can pipeline them in a way that ensures that new executions will not overrun the current execution. I can block the execution of Job A until the previous flow’s Job A has finished running. Or, I can block Job A until the children of the previous flow’s Job A have completed. This is useful if I want my flows to be a few steps behind an already executing flow.
  29. Now, when I run my flow, I am presented with this interactive visualization. At one point, Hadoop users had to constantly refresh the Hadoop JobTracker page to see the status of their jobs. We try to do better than that. Instead of having to sit and click refresh every 3 seconds, Azkaban visualizes the execution of your flow, automatically updating as the jobs run. Here, I can also expand into the embedded flows and view the progress of the inner jobs as well. If I want to look at which jobs are run at which point in the flow’s execution, I can click into the Job List tab and view the flow as a Gantt chart. <click>
  30. This is also updated automatically as my workflow runs. I can also expand into the embedded flows here as well. I can also click on the details link for any of the jobs and drill down into the job logs and other details specific to the job. Oftentimes, at LinkedIn, Hadoop users will have Azkaban open on this page on one screen while they do work on the other. This is how we make it easier for users to understand their workflow executions, and it is critical to allowing them to understand their workflows.
  31. Many of our Hadoop workflows are scheduled to be run repeatedly, either monthly, weekly, or even daily, as is the case for some of our ETL workflows. Azkaban makes it easy to schedule your workflow to be run at a specific time, date, and recurrence. If I want my workflow to run at 5:30 PM every Friday, I just click “Schedule” and <click>
  32. …and bring up the Schedule Flow Panel. Here, I can specify the time, date, and the recurrence for when I want my flow to be run.
  33. Once my workflow is scheduled, I can view it on the Scheduled Flows page. I can also manage other flows that I have scheduled from this page. But what about those times when one of my jobs may be running abnormally long? We have a feature for that too. I can click Set SLA for any of my scheduled flows and <click>
  34. …and bring up the SLA Options panel. Here, I can set SLA settings for my entire flow, or individual jobs in the flow. I can set notifications, or even kill my job or flow, when the SLA is exceeded.
  35. The other side of developing Hadoop workflows is debugging and tuning their performance. As we all know, Hadoop, Pig, and Hive are all very complex systems, with many different knobs and logs that provide a ton of information but are not always easy to find or read.
  36. Again, with over 1000 users, several clusters, so many jobs executing per day, as well as a small Hadoop team, we needed to build tools that make it as easy as possible to help our users debug problems and tune performance of their workflows and jobs on their own.
  37. As my jobs change over time, one thing I would want to know is how the new changes affect the performance of the job. On the job history page, Azkaban provides a time graph, visualizing the history of a job’s runtime. I can click on any data point to view the details and logs of that particular job execution.
  38. With Azkaban 2.5, we added a similar time graph visualizing the history of the runtimes of the workflow as a whole. This way, as I make modifications to my workflow over time, I can easily see how the performance of my workflow changes with each of the new modifications.
  39. Another important feature of Azkaban is that it provides all the job execution logs so that you don’t have to hunt them down. Job logs contain very rich information. For example, Pig and Hive job logs provide tables describing numbers of mappers and reducers and task runtimes for the MapReduce jobs they fire off. However, job logs are very verbose and are not always easy to read. One complaint we often get is that the headers on the tables of job stats never line up with the actual columns.
  40. As a result, one of the new self-service features we added in Azkaban 2.5 is the Job Summary. The Job Summary parses the job logs and extracts the information most useful to the users such as - The command used to run the job - Classpath - JVM options - Memory settings - Parameters passed to the job For Pig and Hive jobs, - Displays the table of mappers and reducers clearly in an actual table.
  41. At LinkedIn, the majority of our Hadoop jobs are Pig jobs. As we all know, Pig scripts are ultimately compiled into a DAG of MapReduce jobs. When developing and tuning Pig jobs, it is very useful to be able to visualize the DAG. As a result, we built a Pig Visualizer, which is a plugin specific to the Pig jobtype. It uses the Pig listener interface to collect stats while my Pig job runs, and it visualizes the plan DAG and provides detailed information about each node that previously required going to the job tracker. This is similar to Lipstick and Ambrose but is completely integrated with Azkaban. You do not need to modify your Pig jobs at all, and you do not have to leave Azkaban to go to a separate tool. As you can see, it is integrated seamlessly as a new tab next to the job logs. I’ll show you some of the things you can do with the Pig Visualizer.
  42. Here, I can select one of the nodes, which will display some summary information for that job in the sidebar, including the types of operations, aliases, whether the job succeeded, a Job Tracker URL, and some stats. Clicking on More Details brings up a modal dialog
  43. …which displays more detailed information about the job. The first tab displays the Pig Job Stats, which include mapper and reducer stats, I/O stats and spill count, and which files the job read or wrote to and how many records.
  44. The second tab contains the Hadoop job counters. These are the counters that you would find on the Job Tracker page but again, you can view them right here in Azkaban rather than having to go to a completely different tool.
  45. Finally, the Pig Visualizer also displays part of the job configuration. The values we picked to display on this page are the ones that our users most commonly look at when tuning their Pig jobs, such as split size, io.sort.mb, compression options, and the Pig map and reduce plans converted from base64 to text. Of course, there is a convenient link to view the full details on the Job Tracker page.
  46. Often, users want to better understand how their workflow is performing as a whole to find out information such as which jobs run the longest or use the most tasks. We built the Flow Summary to provide a dashboard of details and stats for a given workflow. At the top, we have the project name and a list of job types used. Then, we have scheduling information. I can remove the schedule or set an SLA right from this page. Then, the flow summary can analyze the last successful run of my flow and display an aggregate view of stats from all the jobs. It displays a histogram of the runtimes of each of the jobs in the flow.
  47. Below, it shows aggregate views of resource consumption, such as which job used the highest number of map and reduce slots and the total slots used. It also shows which jobs set the highest values for different parameters, such as maximum Xmx, Xms, and distributed cache usage, and which jobs read and wrote the most bytes. These tools, especially the Pig Visualizer, have been heavily used at LinkedIn and have been really helpful for our users to understand and tune their workflows.
  48. Browsing HDFS is something we all do extensively when developing Hadoop jobs. We make that really easy to do with the Azkaban HDFS viewer plugin.
  49. Browsing files is really easy. The HDFS browser works just like any other web-based file browser. You can jump back to any parent folder in the path and easily go to your home directory. Often times, you will want to view files in HDFS, but the files may be in a binary format. At LinkedIn, we store most of our data in Avro and a few in other formats such as binary JSON. While tools like the Namenode HDFS browser and the Hadoop command line dfs tool also let you browse HDFS, they do not have good support for binary formats and will simply dump the raw contents of the files. And so, we did better than that.
  50. The Azkaban HDFS browser has pluggable file type viewers, which will parse files in different binary formats and display up to 1000 records from the file in human-readable text Sometimes, you would want to see the schema of the file. We made that really easy to do too. Just click the Schema tab and <click>
  51. …and the file viewer will extract the schema from the file and display it as readable text. Currently, this is supported by the Avro and Parquet file viewers.
  52. The HDFS browser supports a number of file formats, including …. You can even view images in the HDFS browser as well. And of course, text.
  53. As I mentioned before, Hadoop is also heavily used by analysts at LinkedIn. Often, analysts want to simply write a query, schedule it, and get the result without having to jump through the hoops of uploading job files. We made that really easy to do with Reportal, which built on top of Azkaban.
  54. This is the main dashboard of Reportal. From here, I can manage my existing reports or schedule them. I can also create a new report.
  55. Creating a new report on Reportal is easy. For this report, I want to find the 10 most common words in Alice in Wonderland. As you can see, Reportal provides a nice editor with syntax highlighting. Once I have my query, I can run the report and <click>
  56. …view the results. I can download the results. I can also easily visualize the results using a line graph, bar graph, or pie chart.
  57. Reportal currently supports Pig, Hive, and Teradata queries. Reportal is built completely on top of Azkaban and uses Azkaban to execute and schedule jobs. It is heavily used at LinkedIn for simple reporting.
  58. I want to give a sneak peek of some of the upcoming features that we are currently working on.
  59. <go through page> This plugin will be open sourced so stay tuned.
  60. These are other features that are on our roadmap. <go through list>
  61. Here are some possible future ideas that we have been discussing. <go through list> If you see something you would really want, please let us know.
  62. I would like to thank some of our main contributors. Aside from myself: Hien Luu – Azkaban core Anthony Hsu – Reportal, Job Summary Alex Bain – Azkaban Gradle Plugin and DSL Richard Park – Azkaban 2.0 rewrite, embedded flows, many more changes to core Azkaban Chenjie Yu – Azkaban core, trigger manager, job types Shida Li – Reportal on Azkaban
  63. Azkaban is developed on GitHub. We do not use a forked version. We run the same code internally on our clusters, except for a few LinkedIn-specific plugins. Please check out our website and our source code on GitHub. We look forward to you giving Azkaban a spin. We always welcome your feedback, bug reports, and pull requests, and we would love to get new contributors as well.
  64. Thank you very much for coming.