Building a Self-Service Hadoop Platform at LinkedIn with Azkaban
Hadoop at LinkedIn
3
Hadoop at LinkedIn
4
Home Page | Profile Page
Hadoop at LinkedIn
5
Evolution of Workflows
6
2009 2010 2011 2012 2013
Azkaban 1.0
7
Azkaban 1.0
 Run workflows
 Schedule jobs
 Job History
 Failure notification
Easy-to-use web UI and visualizations
8
Azkaban 2.0
9
Azkaban 2.0
 Major re-architecting
 Separate executor and web servers
 User authentication
Pluggable database drivers (sample configuration after this slide)
– H2
– MySQL
 Brand new UI
10
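As a rough illustration of the pluggable database drivers mentioned above, here is a minimal azkaban.properties sketch for a MySQL-backed setup, based on the Azkaban 2.x setup documentation; the property names and values are illustrative and may differ between versions:

# azkaban.properties – point Azkaban at a MySQL backing store (illustrative values)
database.type=mysql
mysql.host=localhost
mysql.port=3306
mysql.database=azkaban
mysql.user=azkaban
mysql.password=azkaban
mysql.numconnections=100

The solo-server packaging defaults to an embedded H2 store instead, so no external database is required to try Azkaban out.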
Azkaban 2.0
 Jobtype plugins
– Built-in type: command (minimal job examples after this slide)
– Pluggable jobtypes:
 Java
 Pig
 Hive
– Non-Hadoop jobtypes:
 Teradata
 Voldemort
Viewer plugins – extending the Azkaban UI for other tools
– HDFS browser
– Reportal
 LinkedIn-specific code as plugins
11
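To make the jobtype model concrete, here is a minimal sketch of two .job files: one using the built-in command type and one using the Pig jobtype plugin. The file names and values are made up for illustration, and the pig.script property follows the open-source Pig jobtype plugin, so the exact key may vary with the plugin version deployed:

# hello.job – built-in "command" jobtype (illustrative)
type=command
command=echo "Hello Azkaban"

# wordcount.job – Pig jobtype plugin (hypothetical script path)
type=pig
pig.script=src/main/pig/wordcount.pig
dependencies=hello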
Azkaban 2.5
12
Azkaban 2.5
 UI overhauled using Bootstrap
 Embedded flows
 New self-service tools
– Job Summary
– Flow Summary
– Pig Visualizer
 Jobtype-specific plugins
 HDFS viewer improvements
– Display file schema in addition to content
– Parquet file viewer
 And more
13
Who’s using Azkaban?
 Software Engineers
 Data Scientists
 Analysts
 Product Managers
14
Azkaban Today
 Workflow manager and scheduler
 Integrated runtime environment
 Unified front-end for Hadoop tools
15
Good News! Success!
 1000+ users
 Several clusters
 2,500 flows executing per day
 30,000 jobs executing per day
16
Bad News! Success!
 1000+ users
 Several clusters
 2,500 flows executing per day
 30,000 jobs executing per day
17
Creating and Running Workflows
18
Creating Workflows
 Add job “type” plugins
– hadoopJava
– Command
– Pig
– Hive
 Dependencies
– Determine the dependency graph
 Parameter passing
– Parameters can be passed to job
19
peanutbutter.job:
type=pig
creamy.level=4
chunky.level=4
...

jelly.job:
type=hadoopJava
jelly.type=grape
sugar=HFCS
...

bread.job:
type=command
bread.type=wheat
dependencies=peanutbutter,jelly
...
Embedded Flows
Embed a flow as a node in another flow
– "flow" job type
– Set flow.name to the name of the embedded flow
– Parameters can be passed to flow
20
[Diagram: the embedded "bread" flow – peanutbutter and jelly feeding into bread]

coffee.job:
type=hive
coffee.decaf=false
coffee.cream=true
...

fruit.job:
type=hadoopJava
fruit.type=apple
...

sandwich.job:
type=flow
flow.name=bread
dependencies=coffee,fruit
Project Management
Project Page
21
Running Workflows
Flow Execution Panel
22
Running Workflows
Notification Options
23
Running Workflows
Failure Options
24
 Finish Current
– Finishes the currently running jobs, then stops
 Cancel All
– Kills all running jobs and finishes immediately
 Finish Possible
– Finishes all possible jobs whose dependencies have been met, then fails the flow
Running Workflows
Failure Options
25
Running Workflows
Flow Parameters
26
Running Workflows
Concurrent Execution Options
27
 Skip Executions
– Prevent concurrent executions
 Run Concurrently
– Concurrently run the flow
 Pipeline
– Distance 1: jobA waits until the previous execution's jobA finishes
– Distance 2: jobA waits until the previous execution's jobA's children finish
Running Workflows
Concurrent Execution Options
28
Running Workflows
Executing Flow Page
29
Running Workflows
Flow Job List
30
Scheduling Workflows
31
Scheduling Workflows
Schedule Flow Panel
32
Scheduling Workflows
Scheduled Flows
33
Scheduling Flows
Setting SLAs
34
Debugging and Tuning
35
Hadoop at LinkedIn
 1000+ users
 Several clusters
 2,500 flows executing per day
 30,000 jobs executing per day
36
Job Execution History
37
Flow Execution History
38
Running Workflows
Job Logs
39
Job Summary
40
Pig Visualizer
41
Pig Visualizer
42
Pig Visualizer
43
Pig Visualizer
44
Pig Visualizer
45
Flow Summary
46
Flow Summary
47
Browsing HDFS
48
HDFS Viewer
Browsing Files
49
HDFS Viewer
Viewing Files
50
HDFS Viewer
File Schema
51
 Avro
 Parquet
 Binary JSON
 Sequence File
 Image
 Text
HDFS Viewer
Supported File Types
52
Reportal
53
Reportal
Dashboard
54
Reportal
New Report
55
Reportal
Viewing Results
56
 Pig
 Hive
 Teradata
Reportal
Supported Query Types
57
Upcoming Features
58
Azkaban Gradle Plugin and DSL
Describe Azkaban flows and deploy with Gradle
Single file (more if you want) to describe all your workflows
– Compiles to .job files (hypothetical example after the DSL snippet below)
Static checker
Valid Groovy code
– Add conditionals for deployment to different clusters
59
azkaban {
  jobConfDir = './jobs'
  workflow('workflow2') {
    pigJob('job2') {
      script = 'src/main/pig/count-by-country.job'
      parameter 'inputFile', '/user/foo/sample'
      reads '/data/databases/foo', [as: 'input']
      writes '/data/databases/bar', [as: 'output']
    }
    hiveJob('job3') {
      query = 'show tables'
    }
    workflowDepends 'job2', 'job3'
  }
}
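For intuition about the "Compiles to .job files" step, this is the kind of job file the pigJob('job2') block above might compile down to. The output below is a hypothetical sketch (including the param.* convention used by the Pig jobtype plugin), not the plugin's actual generated format:

# job2.job – hypothetical compiled output for pigJob('job2')
type=pig
pig.script=src/main/pig/count-by-country.job
param.inputFile=/user/foo/sample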
Future Roadmap
 New visualizers (Hive, Tez, etc.)
 Support DSL from other tools
 Operationalization tooling
 Scalability improvements
 Improved plugin interfaces
60
Future Discussions
 Conditional branching
 Hive Metastore browser
 Pluggable executors (e.g. YARN)
 Persistence storage server
Launching and monitoring long-running YARN applications (Samza, Storm, etc.)
61
Main Contributors
 David Chen (LinkedIn)
 Hien Luu (LinkedIn)
 Anthony Hsu (LinkedIn)
 Alex Bain (LinkedIn)
 Richard Park (RelateIQ)
 Chenjie Yu (Tango)
 Shida Li (University of Waterloo)
62
How to Contribute
Website: azkaban.github.io
GitHub: github.com/azkaban
LinkedIn’s Data Website: data.linkedin.com
63
Speaker notes

  1. Hello everyone. Welcome to my talk on Building a Self-Service Hadoop platform at LinkedIn with Azkaban.
  2. A little bit about who I am. I am a Software Engineer on the Hadoop Development Team at LinkedIn and one of the main contributors to Azkaban. Previously, I was at Microsoft, working on the Windows kernel.
  3. I want to start by talking a little about how LinkedIn uses Hadoop. About half of our Hadoop usage is ad hoc queries for general analytics. The other half powers the vast majority of our data products.
  4. These are two pages from my LinkedIn experience: <click> - On the left is my homepage - On the right is my profile page. The features on this page that are directly powered by Hadoop are everything in blue. <click> - People You May Know - Recommendations - Many other features - Our incredibly robust AB test platform, and we test almost every change, every placement, every impression. The sample set, the test analysis… they are all done on Hadoop.
  5. Let’s take a closer look at the LinkedIn home page. <click> People You May Know and Ad Recommendations are two of the many data products powered by Hadoop at LinkedIn. <click> Here are visual representation of dependency graphs of Hadoop jobs that create these two features. - Dozens to hundreds of Hadoop jobs per data product. - People you may know, is one of our premier data products. It was at the core of LinkedIn’s membership growth. These run reliably every day, and sometimes multiple times a day, to provide fresh data to our users. And it’s critical that they work.
  6. We’re constantly testing and improving the workflows for all of our data products. We’ve had a very high velocity of changes since the beginning of data products at LinkedIn. This is the graph representing the workflow for one of LinkedIn’s major data products back in 2009. Its small, only a few Hadoop jobs. Not difficult to understand or maintain. But we were in a phase of rapid growth in 2009… and still are. As we grew, these workflows changed… a lot. <Click> Each image represents a month of change to the workflow. - Every month, you can see that there are small to huge modification to the workflow. - It quickly got very complex - And this was just one of our data products. Imagine having one of these for every product. It became clear that without being able to visualize what your workflow looked like and how it ran, it was very difficult to understand what the whole workflow did and keep up with the rapid development. And it was with this in mind that we created <click>
  7. Azkaban, our workflow scheduler, back in 2009 This is what it looked like back then, pretty gothic looking, but had the basic features we needed.
  8. With Azkaban, we could Run workflows Scheduling jobs View Job History Other features, like failure notification But, most importantly, an easy to use Web UI with visualizations that showed you how your flow looked and how it ran Well, it wasn’t perfect It was one server stored everything in flat files on disk ran every job as forked local processes. Didn’t really have user management UI was engineer-designed and a bit dark and depressing Nonetheless, the tool was heavily used, because above all, it was simple and easy to use.
  9. About a year and half ago, in order to keep up with our rapid growth, we did a massive re-architecting of Azkaban. We got some help from a Web Dev to create a completely new and easier to use UI.
  10. Azkaban was now two separate servers Executor server handled job scheduling and execution Web server managed projects, users, and provided the UI User management so users can no longer overwrite each other’s files Stored data in a database, which is pluggable H2 MySQL Of course, brand new UI and more
  11. New plugin system Azkaban itself has no dependency on Hadoop. Modularized many components of Azkaban and turned them into plugins The only built-in job type is command All other job types became plugins: Java, Pig, Hive, etc. Supported non-Hadoop jobtypes such as Teradata and Voldemort Build-and-Push Viewer plugins, extend the Azkaban UI to integrate new tools HDFS browser Reportal This means that we do not run a fork of Azkaban internally. We run the same code that lives in the open source repository on GitHub. All LinkedIn-specific code are implemented as plugins Over time, we had found that our users used Azkaban as more than just a workflow scheduler. They used Azkaban not only as a major tool for developing Hadoop workflows Also debugging them with the integrated job and flow logs Only going to the Job Tracker logs when absolutely necessary. They used Azkaban to browse HDFS. Many of our users don’t even know the namenode HDFS browser even exists. Azkaban had become not just a workflow manager, but the main front-end to Hadoop at LinkedIn. With this in mind, at the beginning of this year, we released <click>
  12. Azkaban 2.5, focusing on making Azkaban easier and more powerful to use and more productive to develop and extend. We rebuilt the UI to be both familiar to our users and more beautiful and more intuitive. By using Bootstrap, it is also more future-proof and easier to extend.
  13. We also added a number of new features and improvements. We added more powerful workflow features such as embedded flows, which let you embed and reuse flows as nodes within other flows. We added a number of new self-service tools to help users better understand how their flows and jobs ran and make it easier for them to debug and tune them, plus viewer plugins that are specific to jobtypes, so that you can build tools that sit seamlessly on the job details page; we used this to build the new self-service tools. There are also improvements to the HDFS viewer, which now displays the schema in addition to content and has a Parquet file viewer, along with a number of under-the-hood improvements and more.
  14. One major reason we pushed the ease-of-use and simplicity so much was due to our users. Our users not only include software engineers and data scientists, but also analysts and product managers who are creating, modifying and scheduling Hadoop workflows
  15. So, Azkaban started off as a workflow scheduler for Hadoop. Today, it is also an integrated runtime environment, where users develop and run complex workflows consisting of Pig, Hive, Java MapReduce, and other types of jobs, and a unified front-end for Hadoop tools.
  16. In the first year, we only had a dozen different workflows to manage and a handful of people that used them. Over the last 4 years, we’ve grown to have over 1000 different users Azkaban instances on over six different clusters 2,500 flows accounting for 30,000 Hadoop jobs executing per day. We have jobs that run from just a few minutes all the way to 8 days.
  17. The bad news is that we have over 1000 users, 2,500 flows and 30,000 jobs executing per day. And this is only going to keep increasing; it surprised us how much it is being used. And this is only from one Azkaban instance, and we have about 6 of them. Our Hadoop development team is fairly small, only about 8 people. To keep up with our users, we had to make our Hadoop infrastructure as self-service as possible. And the primary front-end to our Hadoop infrastructure is, of course, Azkaban.
  18. So, how do you use Azkaban?
  19. I am going to show you how easy it is to create an Azkaban workflow. A workflow is simply a collection of job files. These are key-value property files. I use the 'type' property to define what kind of job I want to run; I can run a variety of jobs, like Pig, Hive, Java, or just plain old command-line jobs. The dependencies parameter is self-explanatory: it specifies which jobs must complete successfully before this job can be run. The rest of the parameters are passed to the executing job itself. All Azkaban does with these parameters is construct a process that is run locally. The reason we can get away with having a lot of processes locally is that most of them don't do much more than spawn Hadoop jobs on the cluster. So in this case, bread waits on peanut butter and jelly, making this the most delicious workflow ever.
  20. We found that many users reuse what is effectively a sub-flow several times with different parameters. As a result, we added support for embedded flows, making it possible to embed a workflow as a node in other workflows. I just set the type to "flow" and set "flow.name" to the name of the workflow I want to embed. In this case, my embedded flow, "sandwich", consists of peanutbutter, jelly, and bread, and waits on coffee and fruit, making this workflow a complete and healthy breakfast.
  21. Afterwards, I just package my jobs into a zip archive and upload it to Azkaban via the web UI or a REST interface. Here is the project page, where I can run my workflows, set permissions, and customize my jobs. If I want a bird's-eye view of which jobs make up the flows in my project, I simply click on the drop-down for one of my flows and I can see the hierarchy of its jobs in an outline form. When I mouse over one of the jobs, Azkaban automatically highlights the jobs it depends on and the ones that depend on it. When I am ready to run my workflow, I just click the "Execute Flow" button, which brings up <click>
  22. …the Flow Execution Panel. Here, I can do a lot of things to customize my flow before I run it. I have this beautiful visualization of my flow. I can enable or disable any part of my flow. For example, here I have 3 embedded flows that process data and a final flow that pushes data back out to the front end. If I want to test my workflow and not push any data, I can right-click and disable the last flow.
  23. Because the flow will take a long time to run, I want to be notified by email if the execution fails. I just click into the Notification tab, select to be notified when the first job fails, and specify the email address for the notification. Here, I can also set the failure notification to be sent after the failed flow finishes running. I can also set a notification to be sent if the flow finishes successfully.
  24. By default, if a job in a flow fails, Azkaban will let the currently running jobs finish before killing the flow. When I am testing my flow, I might want to change what happens when a failure is caught. I just click into the Failure Options tab and select which behavior I want.
  25. Aside from letting the currently running jobs finish, I can also have Azkaban kill all jobs immediately when a failure is caught or try to continue running as many of the remaining jobs as it can. And I can do all this through the UI.
  26. I can also customize the execution of my flow by setting custom parameters. I just click into the Flow Parameters tab and set any parameters I like. The parameters that I set here will override the parameters set in my job files. This is particularly useful when I am testing my flow.
  27. I might want to run multiple instances of my flow concurrently. This is often useful when I am testing my flow and I want to kick off a few instances of it, each with different parameters. If I am concerned about the different instances of the flow stepping on each other and touching the same files when running in parallel, I can go into the Concurrent tab and specify how I want the workflow to run if one instance is already running.
  28. I can either disallow concurrent executions of my flow altogether. I can let all instances of the flow run concurrently. Or I can pipeline them in a way that ensures that new executions will not overrun the current execution. I can block the execution of Job A until the previous flow's Job A has finished running, or I can block Job A until the children of the previous flow's Job A have completed. This is useful if I want my flows to be a few steps behind an already executing flow.
  29. Now, when I run my flow, I am presented with this interactive visualization. At one point, Hadoop users had to constantly refresh the Hadoop JobTracker page to see the status of their jobs. We try to do better than that. Instead of having to sit and click refresh every 3 seconds, Azkaban visualizes the execution of your flow, automatically updating as the jobs run. Here, I can also expand into the embedded flows and view the progress of the inner jobs as well. If I want to look at which jobs are run at which point in the flow's execution, I can click into the Job List tab and view the flow as a Gantt chart. <click>
  30. This is also updated automatically as my workflow runs. I can also expand into the embedded flows here as well. I can also click on the details link for any of the jobs and drill down into the job logs and other details specific to the job. Oftentimes at LinkedIn, Hadoop users will have Azkaban open on this page on one screen while they do work on the other. This is how we make it easier for users to understand their workflow executions, and it is critical to allowing them to understand their workflows.
  31. Many of our Hadoop workflows are scheduled to be run repeatedly, either monthly, weekly, or even daily, as is the case for some of our ETL workflows. Azkaban makes it easy to schedule your workflow to be run at a specific time, date, and recurrence. If I want my workflow to run at 5:30 PM every Friday, I just click “Schedule” and <click>
  32. …and bring up the Schedule Flow Panel. Here, I can specify the time, date, and the recurrence for when I want my flow to be run.
  33. Once my workflow is scheduled, I can view it on the Scheduled Flows page. I can also manage other flows that I have scheduled from this page. But what about those times when one of my jobs may be running abnormally long? We have a feature for that too. I can click Set SLA for any of my scheduled flows and <click>
  34. …and bring up the SLA Options panel. Here, I can set SLA settings for my entire flow or for individual jobs in the flow, and I can send notifications or even kill my job or flow when the SLA is exceeded.
  35. The other side of developing Hadoop workflows is debugging and tuning their performance. As we all know, Hadoop, Pig, and Hive are all very complex systems, with many different knobs and logs that provide a ton of information but are not always easy to find or read.
  36. Again, with over 1000 users, several clusters, so many jobs executing per day, as well as a small Hadoop team, we needed to build tools that make it as easy as possible to help our users debug problems and tune the performance of their workflows and jobs on their own.
  37. As my jobs change over time, one thing I would want to know is how the new changes affect the performance of the job. On the job history page, Azkaban provides a time graph visualizing the history of a job's runtimes. I can click on any data point to view the details and logs of that particular job execution.
  38. With Azkaban 2.5, we added a similar time graph visualizing the history of the runtimes of the workflow as a whole. This way, as I make modifications to my workflow over time, I can easily see how the performance of my workflow changes with each of the new modifications.
  39. Another important feature of Azkaban is that it provides all the job execution logs so that you don't have to hunt them down. Job logs contain very rich information. For example, Pig and Hive job logs provide tables describing the numbers of mappers and reducers and task runtimes for the MapReduce jobs they fire off. However, job logs are very verbose and are not always easy to read. One complaint we often get is that the headers on the tables of job stats never line up with the actual columns.
  40. As a result, one of the new self-service features we added in Azkaban 2.5 is the Job Summary. The Job Summary parses the job logs and extracts the information most useful to the users such as - The command used to run the job - Classpath - JVM options - Memory settings - Parameters passed to the job For Pig and Hive jobs, - Displays the table of mappers and reducers clearly in an actual table.
  41. At LinkedIn, the majority of our Hadoop jobs are Pig jobs. As we know, Pig scripts are ultimately compiled into DAGs of MapReduce jobs. When developing and tuning Pig jobs, it is very useful to be able to visualize the DAG. As a result, we built a Pig Visualizer, a plugin specific to the Pig jobtype, which uses the Pig listener interface to collect stats while my Pig job runs, and which visualizes the plan DAG and provides detailed information about each node that previously required going to the Job Tracker. This is similar to Lipstick and Ambrose but is completely integrated with Azkaban: you do not need to modify your Pig jobs at all, and you do not have to leave Azkaban to go to a separate tool. As you can see, it is integrated seamlessly in the UI as a new tab next to the job logs. I'll show you some of the things you can do with the Pig Visualizer.
  42. Here, I can select one of the nodes, which will display some summary information for that job in the sidebar, including the types of operations, aliases, whether the job succeeded, a Job Tracker URL, and some stats. Clicking on More Details brings up a modal dialog…
  43. …which displays more detailed information about the job. The first tab displays the Pig Job Stats, which include stats about the mappers and reducers, I/O stats and spill count, and which files the job read or wrote to and how many records.
  44. The second tab contains the Hadoop job counters. These are the counters that you would find on the Job Tracker page but again, you can view them right here in Azkaban rather than having to go to a completely different tool.
  45. Finally, the Pig Visualizer also displays part of the job configuration. The values we picked to display on this page are the ones that our users most commonly look at when tuning their Pig jobs, such as split size, io.sort.mb, compression options, and the Pig map and reduce plans converted from base64 to text. Of course, there is a convenient link to view the full details on the Job Tracker page.
  46. Often, users want to better understand how their workflow is performing as a whole, to find out which jobs run the longest or use the most tasks. We built the Flow Summary to provide a dashboard of details and stats for a given workflow. At the top, we have the project name and a list of job types used. Then, we have scheduling information; I can remove the schedule or set an SLA right from this page. Then, the flow summary analyzes the last successful run of my flow and displays an aggregate view of stats from all the jobs, along with a histogram of the runtimes of each of the jobs in the flow.
  47. Below, it shows aggregate views of resource consumption, such as which jobs used the highest number of map and reduce slots and the total slots used. It also shows which jobs set the highest values for different parameters, such as maximum Xmx, Xms, and distributed cache usage, and which jobs read and wrote the most bytes. These tools, especially the Pig Visualizer, have been heavily used at LinkedIn and have been really helpful for our users to understand and tune their workflows.
  48. Browsing HDFS is something we all do extensively when developing Hadoop jobs. We make that really easy to do with the Azkaban HDFS viewer plugin.
  49. Browsing files is really easy. The HDFS browser works just like any other web-based file browser. You can jump back to any parent folder in the path and easily go to your home directory. Often times, you will want to view files in HDFS, but the files may be in a binary format. At LinkedIn, we store most of our data in Avro and a few in other formats such as binary JSON. While tools like the Namenode HDFS browser and the Hadoop command line dfs tool also let you browse HDFS, they do not have good support for binary formats and will simply dump the raw contents of the files. And so, we did better than that.
  50. The Azkaban HDFS browser has pluggable file type viewers, which will parse files in different binary formats and display up to 1000 records from the file in human-readable text Sometimes, you would want to see the schema of the file. We made that really easy to do too. Just click the Schema tab and <click>
  51. …and the file viewer will extract the schema from the file and display it as readable text. Currently, this is supported by the Avro and Parquet file viewers.
  52. The HDFS browser supports a number of file formats, including …. You can even view images in the HDFS browser as well. And, of course, text.
  53. As I mentioned before, Hadoop is also heavily used by analysts at LinkedIn. Often, analysts want to simply write a query, schedule it, and get the result without having to jump through the hoops of uploading job files. We made that really easy to do with Reportal, an easy reporting tool built on top of Azkaban.
  54. This is the main dashboard of Reportal. From here, I can manage my existing reports or schedule them. I can also create a new report.
  55. Creating a new report on Reportal is easy. For this report, I want to find the 10 most common words in Alice in Wonderland. As you can see, Reportal provides a nice editor with syntax highlighting. Once I have my query, I can run the report and <click>
  56. …view the results. I can download the results. I can also easily visualize the results using a line graph, bar graph, or pie chart.
  57. Reportal currently supports Pig, Hive, and Teradata queries. Reportal is built completely on top of Azkaban and uses Azkaban to execute and schedule jobs. Heavily used at LinkedIn for simple reporting.
  58. I want to give a sneak peek of some of the upcoming features that we are currently working on.
  59. <go through page> This plugin will be open-sourced, so stay tuned.
  60. These are other features that are on our roadmap. <go through list>
  61. Here are some possible future ideas that we have been discussing. <go through list> If you see something you would really want, please let us know.
  62. I would like to thank some of our main contributors. Aside from myself: Hien Luu – Azkaban core Anthony Hsu – Reportal, Job Summary Alex Bain – Azkaban Gradle Plugin and DSL Richard Park – Azkaban 2.0 rewrite, embedded flows, many more changes to core Azkaban Chenjie Yu – Azkaban core, trigger manager, job types Shida Li – Reportal on Azkaban
  63. Azkaban is developed on GitHub. We do not use a forked version. We run the same code internally on our clusters, except for a few LinkedIn-specific plugins. Please check out our website and our source code on GitHub. We look forward to you giving Azkaban a spin. We always welcome your feedback, bug reports, and pull requests, and we would love to get new contributors as well.
  64. Thank you very much for coming.