Building a Self-Service Hadoop Platform at LinkedIn with Azkaban
Hadoop at LinkedIn
3
Hadoop at LinkedIn
4
Home Page / Profile Page
Hadoop at LinkedIn
5
Evolution of Workflows
6
2009, 2010, 2011, 2012, 2013
Azkaban 1.0
7
Azkaban 1.0
- Run workflows
- Schedule jobs
- Job History
- Failure notification
- Easy to use web UI and visualizations
8
Azkaban 2.0
9
Azkaban 2.0
- Major re-architecting
- Separate executor and web servers
- User authentication
- Pluggable database drivers
  – H2
  – MySQL
- Brand new UI
10
Azkaban 2.0
- Jobtype plugins
  – Built-in type: command
  – Pluggable jobtypes:
    - Java
    - Pig
    - Hive
  – Non-Hadoop jobtypes:
    - Teradata
    - Voldemort
- Viewer plugins – extending the Azkaban UI for other tools
  – HDFS browser
  – Reportal
- LinkedIn-specific code as plugins
11
Azkaban 2.5
12
Azkaban 2.5
- UI overhauled using Bootstrap
- Embedded flows
- New self-service tools
  – Job Summary
  – Flow Summary
  – Pig Visualizer
- Jobtype-specific plugins
- HDFS viewer improvements
  – Display file schema in addition to content
  – Parquet file viewer
- And more
13
Who’s using Azkaban?
- Software Engineers
- Data Scientists
- Analysts
- Product Managers
14
Azkaban Today
- Workflow manager and scheduler
- Integrated runtime environment
- Unified front-end for Hadoop tools
15
Good News! Success!
 1000+ users
 Several clusters
 2,500 flows executing per day
 30,000 jobs executing per day
16
Bad News! Success
 1000+ users
 Several clusters
 2,500 flows executing per day
 30,000 jobs executing per day
17
Creating and Running Workflows
18
Creating Workflows
- Add job “type” plugins
  – hadoopJava
  – Command
  – Pig
  – Hive
- Dependencies
  – Determine the dependency graph
- Parameter passing
  – Parameters can be passed to the job
19
peanutbutter.job:
type=pig
creamy.level=4
chunky.level=4
...

jelly.job:
type=hadoopJava
jelly.type=grape
sugar=HFCS
...

bread.job:
type=command
bread.type=wheat
dependencies=peanutbutter,jelly
...
Embedded Flows
- Embed a flow as a node in another flow
  – “flow” job type
  – Set flow.name to the name of the embedded flow
  – Parameters can be passed to the flow
20
(Diagram: the embedded “bread” flow – peanutbutter and jelly feeding into bread.)

sandwich.job:
type=flow
flow.name=bread
dependencies=coffee,fruit

coffee.job:
type=hive
coffee.decaf=false
coffee.cream=true
...

fruit.job:
type=hadoopJava
fruit.type=apple
...
Project Management
Project Page
21
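Once the .job files are zipped, the archive can be uploaded from this project page or scripted against Azkaban's AJAX API. Below is a minimal sketch in Python, assuming the documented login and project-upload endpoints; the host, credentials, project name, and zip file are placeholders.

import requests

AZKABAN_URL = "https://azkaban.example.com:8443"   # placeholder host
PROJECT = "breakfast"                              # placeholder project; it must already exist
ZIP_PATH = "breakfast.zip"                         # archive containing the .job files

# Log in to obtain a session id (Azkaban AJAX API).
login = requests.post(
    AZKABAN_URL,
    data={"action": "login", "username": "azkaban", "password": "azkaban"},
    verify=False,  # many internal instances use self-signed certificates
).json()
session_id = login["session.id"]

# Upload the zipped project to the existing project.
with open(ZIP_PATH, "rb") as zip_file:
    result = requests.post(
        AZKABAN_URL + "/manager",
        data={"session.id": session_id, "ajax": "upload", "project": PROJECT},
        files={"file": (ZIP_PATH, zip_file, "application/zip")},
        verify=False,
    ).json()
print(result)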
Running Workflows
Flow Execution Panel
22
Running Workflows
Notification Options
23
Running Workflows
Failure Options
24
- Finish Current
  – Finishes currently running jobs, then stops
- Cancel All
  – Kills all running jobs and finishes immediately
- Finish Possible
  – Finishes all possible jobs whose dependencies have been met, then fails the flow
Running Workflows
Failure Options
25
Running Workflows
Flow Parameters
26
Running Workflows
Concurrent Execution Options
27
- Skip Executions
  – Prevent concurrent executions
- Run Concurrently
  – Run the flow concurrently
- Pipeline
  – Distance 1: jobA waits until the already-running execution’s jobA finishes
  – Distance 2: jobA waits until the already-running execution’s jobA’s children finish
Running Workflows
Concurrent Execution Options
28
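The failure, concurrency, pipelining, and flow-parameter options shown in these panels can also be supplied when an execution is triggered programmatically. Below is a hedged sketch, assuming Azkaban's documented executeFlow AJAX call and reusing a session id obtained as in the upload sketch earlier; the project, flow, and override values are placeholders.

import requests

AZKABAN_URL = "https://azkaban.example.com:8443"   # placeholder host
session_id = "..."                                 # from the login call shown earlier

params = {
    "session.id": session_id,
    "ajax": "executeFlow",
    "project": "breakfast",               # placeholder project
    "flow": "sandwich",                   # placeholder flow
    "failureAction": "finishPossible",    # finishCurrent | cancelImmediately | finishPossible
    "concurrentOption": "pipeline",       # ignore (run concurrently) | pipeline | skip
    "pipelineLevel": "1",                 # pipeline distance 1 or 2
    "flowOverride[bread.type]": "rye",    # flow parameter override, like the Flow Parameters tab
}
response = requests.get(AZKABAN_URL + "/executor", params=params, verify=False)
print(response.json())                    # an exec id is returned on success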
Running Workflows
Executing Flow Page
29
Running Workflows
Flow Job List
30
Scheduling Workflows
31
Scheduling Workflows
Schedule Flow Panel
32
Scheduling Workflows
Scheduled Flows
33
Scheduling Flows
Setting SLAs
34
Debugging and Tuning
35
Hadoop at LinkedIn
 1000+ users
 Several clusters
 2,500 flows executing per day
 30,000 jobs executing per day
36
Job Execution History
37
Flow Execution History
38
Running Workflows
Job Logs
39
Job Summary
40
Pig Visualizer
41
Pig Visualizer
42
Pig Visualizer
43
Pig Visualizer
44
Pig Visualizer
45
Flow Summary
46
Flow Summary
47
Browsing HDFS
48
HDFS Viewer
Browsing Files
49
HDFS Viewer
Viewing Files
50
HDFS Viewer
File Schema
51
- Avro
- Parquet
- Binary JSON
- Sequence File
- Image
- Text
HDFS Viewer
Supported File Types
52
Reportal
53
Reportal
Dashboard
54
Reportal
New Report
55
Reportal
Viewing Results
56
- Pig
- Hive
- Teradata
Reportal
Supported Query Types
57
Upcoming Features
58
Azkaban Gradle Plugin and DSL
- Describe Azkaban flows and deploy them with Gradle
- Single file (more if you want) to describe all your workflows
  – Compiles to .job files
- Static checker
- Valid Groovy code
  – Add conditionals for deployment to different clusters
59
azkaban {
  jobConfDir = './jobs'

  workflow('workflow2') {
    pigJob('job2') {
      script = 'src/main/pig/count-by-country.job'
      parameter 'inputFile', '/user/foo/sample'
      reads '/data/databases/foo', [as: 'input']
      writes '/data/databases/bar', [as: 'output']
    }
    hiveJob('job3') {
      query = 'show tables'
    }
    workflowDepends 'job2', 'job3'
  }
}
Future Roadmap
- New visualizers (Hive, Tez, etc.)
- Support DSL from other tools
- Operationalization tooling
- Scalability improvements
- Improved plugin interfaces
60
Future Discussions
- Conditional branching
- Hive Metastore browser
- Pluggable executors (e.g., YARN)
- Persistent storage server
- Launching and monitoring long-running YARN applications (Samza, Storm, etc.)
61
Main Contributors
- David Chen (LinkedIn)
- Hien Luu (LinkedIn)
- Anthony Hsu (LinkedIn)
- Alex Bain (LinkedIn)
- Richard Park (RelateIQ)
- Chenjie Yu (Tango)
- Shida Li (University of Waterloo)
62
How to Contribute
Website: azkaban.github.io
GitHub: github.com/azkaban
LinkedIn’s Data Website: data.linkedin.com
63
Editor's Notes
  1. Hello everyone. Welcome to my talk on Building a Self-Service Hadoop platform at LinkedIn with Azkaban.
  2. A little bit about who I am. I am a Software Engineer on the Hadoop Development Team at LinkedIn. I am one of the main contributors to Azkaban. Previously, I was at Microsoft, working on the Windows Kernel.
  3. I want to start by talking a little about how LinkedIn uses Hadoop. About half of our Hadoop usage is in ad hoc queries, queries for general analytics. The other half is used on the vast majority of our data products
  4. These are two pages from my LinkedIn experience: <click> - On the left is my homepage - On the right is my profile page. The features on this page that are directly powered by Hadoop are everything in blue. <click> - People You May Know - Recommendations - Many other features - Our incredibly robust AB test platform, and we test almost every change, every placement, every impression. The sample set, the test analysis… they are all done on Hadoop.
  5. Let’s take a closer look at the LinkedIn home page. <click> People You May Know and Ad Recommendations are two of the many data products powered by Hadoop at LinkedIn. <click> Here are visual representations of the dependency graphs of Hadoop jobs that create these two features. - Dozens to hundreds of Hadoop jobs per data product. - People You May Know is one of our premier data products. It was at the core of LinkedIn’s membership growth. These run reliably every day, and sometimes multiple times a day, to provide fresh data to our users. And it’s critical that they work.
  6. We’re constantly testing and improving the workflows for all of our data products. We’ve had a very high velocity of changes since the beginning of data products at LinkedIn. This is the graph representing the workflow for one of LinkedIn’s major data products back in 2009. It’s small, only a few Hadoop jobs. Not difficult to understand or maintain. But we were in a phase of rapid growth in 2009… and still are. As we grew, these workflows changed… a lot. <Click> Each image represents a month of change to the workflow. - Every month, you can see that there are small to huge modifications to the workflow. - It quickly got very complex - And this was just one of our data products. Imagine having one of these for every product. It became clear that without being able to visualize what your workflow looked like and how it ran, it was very difficult to understand what the whole workflow did and keep up with the rapid development. And it was with this in mind that we created <click>
  7. Azkaban, our workflow scheduler, back in 2009 This is what it looked like back then, pretty gothic looking, but had the basic features we needed.
  8. With Azkaban, we could run workflows, schedule jobs, view job history, and use other features, like failure notification. But, most importantly, it had an easy to use Web UI with visualizations that showed you how your flow looked and how it ran. Well, it wasn’t perfect: it was one server that stored everything in flat files on disk and ran every job as a forked local process. It didn’t really have user management, and the UI was engineer-designed and a bit dark and depressing. Nonetheless, the tool was heavily used, because above all, it was simple and easy to use.
  9. About a year and half ago, in order to keep up with our rapid growth, we did a massive re-architecting of Azkaban. We got some help from a Web Dev to create a completely new and easier to use UI.
  10. Azkaban was now two separate servers: the executor server handled job scheduling and execution, while the web server managed projects and users and provided the UI. User management meant users could no longer overwrite each other’s files. Data was stored in a database with pluggable drivers (H2, MySQL). And of course, a brand new UI and more.
  11. New plugin system: Azkaban itself has no dependency on Hadoop. We modularized many components of Azkaban and turned them into plugins. The only built-in job type is command; all other job types became plugins: Java, Pig, Hive, etc. We also supported non-Hadoop jobtypes such as Teradata and Voldemort Build-and-Push. Viewer plugins extend the Azkaban UI to integrate new tools, such as the HDFS browser and Reportal. This means that we do not run a fork of Azkaban internally. We run the same code that lives in the open source repository on GitHub. All LinkedIn-specific code is implemented as plugins. Over time, we found that our users used Azkaban as more than just a workflow scheduler. They used Azkaban not only as a major tool for developing Hadoop workflows, but also for debugging them with the integrated job and flow logs, only going to the Job Tracker logs when absolutely necessary. They used Azkaban to browse HDFS. Many of our users don’t even know the namenode HDFS browser exists. Azkaban had become not just a workflow manager, but the main front-end to Hadoop at LinkedIn. With this in mind, at the beginning of this year, we released <click>
  12. Azkaban 2.5, focusing on making Azkaban easier and more powerful to use and more productive to develop and extend. We rebuilt the UI to be both familiar to our users and more beautiful and more intuitive. By using Bootstrap, it is also more future-proof and easier to extend.
  13. We also added a number of new features and improvements. We added more powerful workflow features such as embedded flows, which let you embed and reuse flows as nodes within other flows. There are a number of new self-service tools to help users better understand how their flows and jobs ran and make it easier for them to debug and tune them. Viewer plugins can now be specific to jobtypes, so that you can build tools that sit seamlessly on the job details page; we used this to build the new self-service tools. The HDFS viewer was improved to display the schema in addition to content and gained a Parquet file viewer. And there are a number of under-the-hood improvements and more.
  14. One major reason we pushed the ease-of-use and simplicity so much was due to our users. Our users not only include software engineers and data scientists, but also analysts and product managers who are creating, modifying and scheduling Hadoop workflows
  15. So, Azkaban started off as a workflow scheduler for Hadoop. Today, it is also: an integrated runtime environment, where users develop and run complex workflows consisting of Pig, Hive, Java MapReduce and other types of jobs; and a unified front-end for Hadoop tools.
  16. In the first year, we only had a dozen different workflows to manage and a handful of people that used them. Over the last 4 years, we’ve grown to over 1000 different users, Azkaban instances on over six different clusters, and 2,500 flows accounting for 30,000 Hadoop jobs executing per day. We have jobs that run from just a few minutes all the way to 8 days.
  17. The bad news is that we have over 1000 users, 2,500 flows and 30,000 jobs executing per day. And this is only going to keep increasing. It surprised us how much it is being used. And this is only from one Azkaban instance and we have about 6 of them. Our Hadoop development team is fairly small, only about 8 people. To keep up with our users, we had to make our Hadoop infrastructure as self-service as possible. And the primary front-end to our Hadoop infrastructure is, of course, Azkaban.
  18. So, how do you use Azkaban?
  19. I am going to show you how easy it is to create an Azkaban workflow. A workflow is simply a collection of job files. These are key value property files. I use the ‘type’ property to define what kind of job I want to run. I can run a variety of jobs, like Pig, Hive, Java or just plain old command line jobs. The dependencies parameter is self explanatory. It specifies which jobs must complete successfully before this job can be run. The rest of the parameters are passed to the executing job itself. All Azkaban does with these parameters is to construct a process that is run locally. The reason we can get away with having a lot of processes locally is that most of them don’t do much more than spawn Hadoop jobs on the cluster. So in this case, bread waits on peanut butter and jelly, making this the most delicious workflow ever.
  20. We found that many users reuse what is effectively a sub-flow several times with different parameters As a result, we added support for embedded flows, making it possible to embed a workflow as a node in other workflows. I just set the type to “flow” and set “flow.name” to the name of the workflow I want to embed. In this case, my embedded flow is “sandwich”, consisting of peanutbutter, jelly, and bread, waits on coffee and fruit, making this workflow a complete and healthy breakfast.
  21. Afterwards, I just package my jobs into a zip archive and upload it to Azkaban via the web UI or a REST interface. Here is the project page, where I can run my workflows, set permissions, and customize my jobs. If I want a birds-eye view of which jobs make up the flows in my project, I simply click on the drop-down for one of my flows and I can see the hierarchy of its jobs in an outline form. When I mouse over one of the jobs, Azkaban automatically highlights the jobs it depends on and the ones that depend on it. When I am ready to run my workflow, I just click the “Execute Flow” button, which brings up <click>
  22. …the Flow Execution Panel. Here, I can do a lot of things to customize my flow before I run it. I have this beautiful visualization of my flow. I can enable or disable any part of my flow. For example, here I have 3 embedded flows that process data and a final flow that pushes data back out to the front end. If I want to test my workflow and not push any data, I can right click and disable the last flow.
  23. Because the flow will take a long time to run, I want to be notified by email if the execution fails. I just click into the Notification tab, select to be notified when the first job fails, and specify the email address for the notification. Here, I can also set the failure notification to be sent after the failed flow finishes running. I can also set a notification to be sent if the flow finishes successfully.
  24. By default, if a job in a flow fails, Azkaban will let the currently running jobs finish before killing the flow. When I am testing my flow, I might want to change what happens when a failure is caught. I just click into the Failure Options tab and select which behavior I want.
  25. Aside from letting the currently running jobs finish, I can also have Azkaban kill all jobs immediately when a failure is caught or try to continue to run as many of the remaining jobs that it can run. And I can do all this through the UI.
  26. I can also customize the execution of my flow by setting custom parameters. I just click into the Flow Parameters tab and set any parameters I like. The parameters that I set here will override the parameters set in my job files. This is particularly useful when I am testing my flow.
  27. I might want to run multiple instances of my flow concurrently. This is often useful when I am testing my flow and I want to kick off a few instances of it, each with different parameters. If I am concerned about the different instances of the flow stepping on each other and touching the same files when running in parallel, I can go into the Concurrent tab and specify how I want the workflow to run if one instance is already running.
  28. I can either disallow concurrent executions of my flow altogether. I can let all instances of the flow run concurrently. Or I can pipeline them in a way that ensures that new executions will not overrun the current execution. I can block the execution of Job A until the previous flow’s Job A has finished running. Or, I can block Job A until the children of the previous flow’s Job A have completed. This is useful if I want my flows to be a few steps behind an already executing flow.
  29. Now, when I run my flow, I am presented with this interactive visualization. At one point, Hadoop users had to constantly refresh the Hadoop JobTracker page to see the status of their jobs. We try to do better than that. Instead of having to sit and click refresh every 3 seconds, Azkaban visualizes the execution of your flow, automatically updating as the jobs run. Here, I can also expand into the embedded flows and view the progress of the inner jobs as well. If I want to look at which jobs are run at which point in the flow’s execution, I can click into the Job List tab and view the flow as a Gantt chart. <click>
  30. This is also updated automatically as my workflow runs. I can also expand into the embedded flows here as well. I can also click on the details link for any of the jobs and drill down into the job logs and other details specific to the job. Oftentimes, at LinkedIn, Hadoop users will have Azkaban open on this page on one screen while they do work on the other. This is how we make it easier for users to understand their workflow executions, and it is critical to allowing them to understand their workflows.
  31. Many of our Hadoop workflows are scheduled to be run repeatedly, either monthly, weekly, or even daily, as is the case for some of our ETL workflows. Azkaban makes it easy to schedule your workflow to be run at a specific time, date, and recurrence. If I want my workflow to run at 5:30 PM every Friday, I just click “Schedule” and <click>
  32. …and bring up the Schedule Flow Panel. Here, I can specify the time, date, and the recurrence for when I want my flow to be run.
  33. Once my workflow is scheduled, I can view it on the Scheduled Flows page. I can also manage other flows that I have scheduled from this page. But what about those times when one of my jobs may be running abnormally long? We have a feature for that too. I can click Set SLA for any of my scheduled flows and <click>
  34. …and bring up the SLA Options panel. Here, I can set SLA settings for my entire flow, or individual jobs in the flow. I can set notifications, or even kill my job or flow, when the SLA is exceeded.
  35. The other side of developing Hadoop workflows is debugging and tuning their performance. As we all know, Hadoop, Pig, and Hive are all very complex systems, with many different knobs and logs that provide a ton of information but are not always easy to find or read.
  36. Again, with over 1000 users, several clusters, so many jobs executing per day, as well as a small Hadoop team, we needed to build tools that make it as easy as possible to help our users debug problems and tune performance of their workflows and jobs on their own.
  37. As my jobs change over time, one thing I would want to know is how the new changes affect the performance of the job. On the job history page, Azkaban provides a time graph, visualizing the history of a job’s runtime. I can click on any data point to view the details and logs of that particular job execution.
  38. With Azkaban 2.5, we added a similar time graph visualizing the history of the runtimes of the workflow as a whole. This way, as I make modifications to my workflow over time, I can easily see how the performance of my workflow changes with each of the new modifications.
  39. Another important feature of Azkaban is that it provides all the job execution logs so that you don’t have to hunt them down. Job logs contain very rich information. For example, Pig and Hive job logs provide tables describing numbers of mappers and reducers and task runtimes for the MapReduce jobs they fire off. However, job logs are very verbose and are not always easy to read. One complaint we often get is that the headers on the tables of job stats never line up with the actual columns.
  40. As a result, one of the new self-service features we added in Azkaban 2.5 is the Job Summary. The Job Summary parses the job logs and extracts the information most useful to the users such as - The command used to run the job - Classpath - JVM options - Memory settings - Parameters passed to the job For Pig and Hive jobs, - Displays the table of mappers and reducers clearly in an actual table.
  41. At LinkedIn, the majority of our Hadoop jobs are Pig jobs. As we all know, Pig scripts are ultimately compiled into a DAG of MapReduce jobs. When developing and tuning Pig jobs, it is very useful to be able to visualize the DAG. As a result, we built a Pig Visualizer, which is a plugin specific to the Pig jobtype. It uses the Pig listener interface to collect stats while my Pig job runs, and it visualizes the plan DAG and provides detailed information about each node that previously required going to the job tracker. This is similar to Lipstick and Ambrose but is completely integrated with Azkaban. You do not need to modify your Pig jobs at all, and you do not have to leave Azkaban to go to a separate tool. As you can see, it is integrated seamlessly as a new tab next to the job logs. I’ll show you some of the things you can do with the Pig Visualizer.
  42. Here, I can select one of the nodes, which will display some summary information for that job in the sidebar, including the types of operations, aliases, whether the job succeeded, a Job Tracker URL, and some stats. Clicking on More Details brings up a modal dialog
  43. …which displays more detailed information about the job. The first tab displays the Pig Job Stats, which include mapper and reducer stats, I/O stats and spill count, and which files the job read or wrote to and how many records.
  44. The second tab contains the Hadoop job counters. These are the counters that you would find on the Job Tracker page but again, you can view them right here in Azkaban rather than having to go to a completely different tool.
  45. Finally, the Pig Visualizer also displays part of the job configuration. The values we picked to display on this page are the ones that our users most commonly look at when tuning their Pig jobs, such as split size, io.sort.mb, compression options, and the Pig map and reduce plans converted from base64 to text. Of course, there is a convenient link to view the full details on the Job Tracker page.
  46. Often, users want to better understand how their workflow is performing as a whole to find out information such as which jobs run the longest or use the most tasks. We built the Flow Summary to provide a dashboard of details and stats for a given workflow. At the top, we have the project name and a list of job types used. Then, we have scheduling information. I can remove the schedule or set an SLA right from this page. Then, the flow summary can analyze the last successful run of my flow and display an aggregate view of stats from all the jobs. It displays a histogram of the runtimes of each of the jobs in the flow.
  47. Below, it shows aggregate views of resource consumption, such as which job used the highest number of map and reduce slots and the total slots used. It also shows which jobs set the highest values for different parameters, such as maximum Xmx, Xms, and distributed cache usage, and which jobs read and wrote the most bytes. These tools, especially the Pig Visualizer, have been heavily used at LinkedIn and have been really helpful for our users to understand and tune their workflows.
  48. Browsing HDFS is something we all do extensively when developing Hadoop jobs. We make that really easy to do with the Azkaban HDFS viewer plugin.
  49. Browsing files is really easy. The HDFS browser works just like any other web-based file browser. You can jump back to any parent folder in the path and easily go to your home directory. Often times, you will want to view files in HDFS, but the files may be in a binary format. At LinkedIn, we store most of our data in Avro and a few in other formats such as binary JSON. While tools like the Namenode HDFS browser and the Hadoop command line dfs tool also let you browse HDFS, they do not have good support for binary formats and will simply dump the raw contents of the files. And so, we did better than that.
  50. The Azkaban HDFS browser has pluggable file type viewers, which will parse files in different binary formats and display up to 1000 records from the file in human-readable text Sometimes, you would want to see the schema of the file. We made that really easy to do too. Just click the Schema tab and <click>
  51. …and the file viewer will extract the schema from the file and display it as readable text. Currently, this is supported by the Avro and Parquet file viewers.
  52. The HDFS browser supports a number of file formats, including …. You can even view images in the HDFS browser as well. And of course, text.
  53. As I mentioned before, Hadoop is also heavily used by analysts at LinkedIn. Often, analysts want to simply write a query, schedule it, and get the result without having to jump through the hoops of uploading job files. We made that really easy to do with Reportal, which built on top of Azkaban.
  54. This is the main dashboard of Reportal. From here, I can manage my existing reports or schedule them. I can also create a new report.
  55. Creating a new report on Reportal is easy. For this report, I want to find the 10 most common words in Alice in Wonderland. As you can see, Reportal provides a nice editor with syntax highlighting. Once I have my query, I can run the report and <click>
  56. …view the results. I can download the results. I can also easily visualize the results using a line graph, bar graph, or pie chart.
  57. Reportal currently supports Pig, Hive, and Teradata queries. Reportal is built completely on top of Azkaban and uses Azkaban to execute and schedule jobs. It is heavily used at LinkedIn for simple reporting.
  58. I want to give a sneak peek of some of the upcoming features that we are currently working on.
  59. <go through page> This plugin will be open sourced so stay tuned.
  60. These are other features that are on our roadmap. <go through list>
  61. Here are some possible future ideas that we have been discussing. <go through list> If you see something you would really want, please let us know.
  62. I would like to thank some of our main contributors. Aside from myself: Hien Luu – Azkaban core Anthony Hsu – Reportal, Job Summary Alex Bain – Azkaban Gradle Plugin and DSL Richard Park – Azkaban 2.0 rewrite, embedded flows, many more changes to core Azkaban Chenjie Yu – Azkaban core, trigger manager, job types Shida Li – Reportal on Azkaban
  63. Azkaban is developed on GitHub. We do not use a forked version. We run the same code internally on our clusters, except for a few LinkedIn-specific plugins. Please check out our website and our source code on GitHub. We look forward to you giving Azkaban a spin. We always welcome your feedback, bug reports, and pull requests, and we would love to get new contributors as well.
  64. Thank you very much for coming.