SlideShare une entreprise Scribd logo
1  sur  32
Airflow Clustering
and High Availability
By: Robert Sanders
2Page:
Agenda
• Airflow Daemons
• Single Node Deployment
• Cluster Deployment
• Scaling
• Worker Nodes
• Master Nodes
• Limitations
• Airflow Scheduler Failover Controller
• Failover Controller Procedure
3Page:
Airflow Daemons
• Web Server
• Daemon that runs the Airflow Webserver
• 1 to many gunicorn processes to accept and process requests in
parallel.
• Allows you to track jobs progress, run jobs and more
• Scheduler
• Periodically runs (every X seconds) to determine if a DAG or Task
needs to be ran based off the DAG schedule
• Pushes messages to the Queuing Service to be executed
• Worker
• Daemon runs if you’re using the CeleryExecutors (as opposed to
SequentialExecutor and LocalExecutor)
• 1 to many dedicated celeryd processes which execute functions
• Pulls messages from a Queuing services to determine what
functions to execute
4Page:
Single Node Deployment
5Page:
Cluster Deployment
6Page:
Why setup a Cluster Deployment?
• Distributes heavy processes onto many machines for better
use of resources
• More Highly Available Airflow environment
• If you have many Workflows with many Tasks your executors
would not be able to get to all the messages in the queue.
Adding more executors would fix this issue.
7Page:
Scaling Workers
• Horizontally
• Add more machines to the cluster
• No need to register the machines with the master. You
just need to start up the Airflow Worker task on the new
Machine.
• Vertically
• Increase the number of executors (celeryd processes)
per node and restart the workers
8Page:
Scaling Master
9Page:
Limitations
• There can only be one scheduler running at a time
• If you have multiple Scheduler processes running, there's
a possibility that multiple instances of a single task that
will be scheduled to run.
• If the Scheduler Daemon or Machine with the process goes
down then no jobs will get scheduled
10Page:
Airflow Scheduler Failover Controller
• Dedicated Daemon that runs with Airflow on the Master
Nodes
• Ensures that there is always one and only one Scheduler
running on the Master nodes at a time
• Developed Internally and Open Sourced
• https://github.com/teamclairvoyant/airflow-scheduler-
failover-controller
• High Level Steps
• Polls (every x seconds) to check if the scheduler is
running
• If scheduler isn’t running, restart the scheduler
• If it still doesn’t start up, then try starting it up on the
other master nodes
11Page:
Failover Controller Diagram
12Page:
Start Up Scenario
13Page:
Failover Controller Process (Start Up)
Master Node 1
Failover
Controller
(standby)
Master Node 2
Failover
Controller
(standby)
On startup, the processes start out in STANDBY
14Page:
Failover Controller Process (Start Up)
Master Node 1
Failover
Controller
(active)
Master Node 2
Failover
Controller
(standby)
The first one to enter data into the Metastore is elected as the active
controller.
15Page:
Failover Controller Process (Start Up)
Scheduler
Master Node 1
Failover
Controller
(active)
Master Node 2
Failover
Controller
(standby)
The Failover controller checks to see if the Scheduler is running, but it
isn’t.
16Page:
Failover Controller Process (Start Up)
Scheduler
Master Node 1
Failover
Controller
(active)
Master Node 2
Failover
Controller
(standby)
Failover Controller starts up the Scheduler
17Page:
Scheduler Failure
Scenario
18Page:
Failover Controller Process (Process Failure)
Scheduler
Master Node 1
Failover
Controller
(active)
Master Node 2
Failover
Controller
(standby)
Scheduler process has died
19Page:
Failover Controller Process (Process Failure)
Scheduler
Master Node 1
Failover
Controller
(active)
Master Node 2
Failover
Controller
(standby)
Failover Controller restarts the Scheduler
20Page:
Scheduler Failure and
Failed Restart Scenario
21Page:
Failover Controller Process (Process Failure 2)
Scheduler
Master Node 1
Failover
Controller
(active)
Master Node 2
Failover
Controller
(standby)
Scheduler process has died
22Page:
Failover Controller Process (Process Failure 2)
Scheduler
Master Node 1
Failover
Controller
(active)
Master Node 2
Failover
Controller
(standby)
Failover Controller tries to restart the Scheduler, but its still not running
23Page:
Failover Controller Process (Process Failure 2)
Scheduler
Master Node 1
Failover
Controller
(active)
Master Node 2
Failover
Controller
(standby)
Failover Controller tries to restart the Scheduler on a different node
24Page:
Failover Controller Process (Process Failure 2)
Scheduler
Master Node 1
Failover
Controller
(active)
Master Node 2
Failover
Controller
(standby)
Failover Controller succeeds to restart the scheduler and the cluster is
back to normal
25Page:
Node Failure Scenario
26Page:
Failover Controller Process (Node Failure)
Scheduler
Master Node 1
Failover
Controller
(active)
Master Node 2
Failover
Controller
(standby)
Everything is running as expected
27Page:
Failover Controller Process (Node Failure)
Scheduler
Master Node 1
Failover
Controller
(dead)
Master Node 2
Failover
Controller
(standby)
Master Node 1 dies and all the processes running on it are gone
28Page:
Failover Controller Process (Node Failure)
Scheduler
Master Node 1
Failover
Controller
(dead)
Master Node 2
Failover
Controller
(active)
Failover Controller on Master 2 becomes active because the one running
on Master Node 1 has stopped sending a heart beat
29Page:
Failover Controller Process (Node Failure)
Scheduler
Master Node 1
Failover
Controller
(dead)
Master Node 2
Failover
Controller
(active)
The newly active Failover Controller tries to check-in with and restart the
Scheduler on the daemon the Metadata says its running on and fails.
30Page:
Failover Controller Process (Node Failure)
Scheduler
Master Node 1
Failover
Controller
(dead)
Master Node 2
Failover
Controller
(active)
The Failover Controller then starts it on another node and it succeeds
Scheduler
31Page:
Failover Controller Process (Node Failure)
Master Node 1
Failover
Controller
(standby)
Master Node 2
Failover
Controller
(active)
When Master Node 1 is brought back, the old Failover Controller goes
into STANDBY state
Scheduler
32Page:
Q&A

Contenu connexe

Tendances

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
Multi Master PostgreSQL Cluster on Kubernetes
Multi Master PostgreSQL Cluster on KubernetesMulti Master PostgreSQL Cluster on Kubernetes
Multi Master PostgreSQL Cluster on KubernetesOhyama Masanori
 
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Kaxil Naik
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan BlueDatabricks
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringTaro L. Saito
 
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLTop 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLJim Mlodgenski
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Databricks
 
Webinar: PostgreSQL continuous backup and PITR with Barman
Webinar: PostgreSQL continuous backup and PITR with BarmanWebinar: PostgreSQL continuous backup and PITR with Barman
Webinar: PostgreSQL continuous backup and PITR with BarmanGabriele Bartolini
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in RustAndrew Lamb
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 

Tendances (20)

A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
Multi Master PostgreSQL Cluster on Kubernetes
Multi Master PostgreSQL Cluster on KubernetesMulti Master PostgreSQL Cluster on Kubernetes
Multi Master PostgreSQL Cluster on Kubernetes
 
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
 
Airflow and supervisor
Airflow and supervisorAirflow and supervisor
Airflow and supervisor
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoring
 
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQLTop 10 Mistakes When Migrating From Oracle to PostgreSQL
Top 10 Mistakes When Migrating From Oracle to PostgreSQL
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
 
Webinar: PostgreSQL continuous backup and PITR with Barman
Webinar: PostgreSQL continuous backup and PITR with BarmanWebinar: PostgreSQL continuous backup and PITR with Barman
Webinar: PostgreSQL continuous backup and PITR with Barman
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 

Similaire à Airflow Clustering and High Availability

Docker Swarm for Beginner
Docker Swarm for BeginnerDocker Swarm for Beginner
Docker Swarm for BeginnerShahzad Masud
 
Oracle real application clusters system tests with demo
Oracle real application clusters system tests with demoOracle real application clusters system tests with demo
Oracle real application clusters system tests with demoAjith Narayanan
 
Fyber - airflow best practices in production
Fyber - airflow best practices in productionFyber - airflow best practices in production
Fyber - airflow best practices in productionItai Yaffe
 
Docker–Grid (A On demand and Scalable dockerized selenium grid architecture)
Docker–Grid (A On demand and Scalable dockerized selenium grid architecture)Docker–Grid (A On demand and Scalable dockerized selenium grid architecture)
Docker–Grid (A On demand and Scalable dockerized selenium grid architecture)STePINForum
 
Heart of the SwarmKit: Store, Topology & Object Model
Heart of the SwarmKit: Store, Topology & Object ModelHeart of the SwarmKit: Store, Topology & Object Model
Heart of the SwarmKit: Store, Topology & Object ModelDocker, Inc.
 
Bots on guard of sdlc
Bots on guard of sdlcBots on guard of sdlc
Bots on guard of sdlcAlexey Tokar
 
M|18 Choosing the Right High Availability Strategy for You
M|18 Choosing the Right High Availability Strategy for YouM|18 Choosing the Right High Availability Strategy for You
M|18 Choosing the Right High Availability Strategy for YouMariaDB plc
 
An introduction to_rac_system_test_planning_methods
An introduction to_rac_system_test_planning_methodsAn introduction to_rac_system_test_planning_methods
An introduction to_rac_system_test_planning_methodsAjith Narayanan
 
Container Orchestration from Theory to Practice
Container Orchestration from Theory to PracticeContainer Orchestration from Theory to Practice
Container Orchestration from Theory to PracticeDocker, Inc.
 
Server(less) Swift at SwiftCloudWorkshop 3
Server(less) Swift at SwiftCloudWorkshop 3Server(less) Swift at SwiftCloudWorkshop 3
Server(less) Swift at SwiftCloudWorkshop 3kognate
 
Container orchestration from theory to practice
Container orchestration from theory to practiceContainer orchestration from theory to practice
Container orchestration from theory to practiceDocker, Inc.
 
Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Jimmy Lai
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACKristofferson A
 
ScalaUA - distage: Staged Dependency Injection
ScalaUA - distage: Staged Dependency InjectionScalaUA - distage: Staged Dependency Injection
ScalaUA - distage: Staged Dependency Injection7mind
 
Weblogic 101 for dba
Weblogic  101 for dbaWeblogic  101 for dba
Weblogic 101 for dbaOsama Mustafa
 
Orchestrating Linux Containers while tolerating failures
Orchestrating Linux Containers while tolerating failuresOrchestrating Linux Containers while tolerating failures
Orchestrating Linux Containers while tolerating failuresDocker, Inc.
 

Similaire à Airflow Clustering and High Availability (20)

Docker Swarm for Beginner
Docker Swarm for BeginnerDocker Swarm for Beginner
Docker Swarm for Beginner
 
Oracle real application clusters system tests with demo
Oracle real application clusters system tests with demoOracle real application clusters system tests with demo
Oracle real application clusters system tests with demo
 
Fyber - airflow best practices in production
Fyber - airflow best practices in productionFyber - airflow best practices in production
Fyber - airflow best practices in production
 
Docker–Grid (A On demand and Scalable dockerized selenium grid architecture)
Docker–Grid (A On demand and Scalable dockerized selenium grid architecture)Docker–Grid (A On demand and Scalable dockerized selenium grid architecture)
Docker–Grid (A On demand and Scalable dockerized selenium grid architecture)
 
Heart of the SwarmKit: Store, Topology & Object Model
Heart of the SwarmKit: Store, Topology & Object ModelHeart of the SwarmKit: Store, Topology & Object Model
Heart of the SwarmKit: Store, Topology & Object Model
 
Bots on guard of sdlc
Bots on guard of sdlcBots on guard of sdlc
Bots on guard of sdlc
 
M|18 Choosing the Right High Availability Strategy for You
M|18 Choosing the Right High Availability Strategy for YouM|18 Choosing the Right High Availability Strategy for You
M|18 Choosing the Right High Availability Strategy for You
 
An introduction to_rac_system_test_planning_methods
An introduction to_rac_system_test_planning_methodsAn introduction to_rac_system_test_planning_methods
An introduction to_rac_system_test_planning_methods
 
Fail over fail_back
Fail over fail_backFail over fail_back
Fail over fail_back
 
Container Orchestration from Theory to Practice
Container Orchestration from Theory to PracticeContainer Orchestration from Theory to Practice
Container Orchestration from Theory to Practice
 
Server(less) Swift at SwiftCloudWorkshop 3
Server(less) Swift at SwiftCloudWorkshop 3Server(less) Swift at SwiftCloudWorkshop 3
Server(less) Swift at SwiftCloudWorkshop 3
 
Container orchestration from theory to practice
Container orchestration from theory to practiceContainer orchestration from theory to practice
Container orchestration from theory to practice
 
Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
RubyKaigi 2014: ServerEngine
RubyKaigi 2014: ServerEngineRubyKaigi 2014: ServerEngine
RubyKaigi 2014: ServerEngine
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
ScalaUA - distage: Staged Dependency Injection
ScalaUA - distage: Staged Dependency InjectionScalaUA - distage: Staged Dependency Injection
ScalaUA - distage: Staged Dependency Injection
 
Weblogic 101 for dba
Weblogic  101 for dbaWeblogic  101 for dba
Weblogic 101 for dba
 
Orchestrating Linux Containers while tolerating failures
Orchestrating Linux Containers while tolerating failuresOrchestrating Linux Containers while tolerating failures
Orchestrating Linux Containers while tolerating failures
 

Dernier

5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)Intelisync
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 

Dernier (20)

5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 

Airflow Clustering and High Availability

  • 1. Airflow Clustering and High Availability By: Robert Sanders
  • 2. 2Page: Agenda • Airflow Daemons • Single Node Deployment • Cluster Deployment • Scaling • Worker Nodes • Master Nodes • Limitations • Airflow Scheduler Failover Controller • Failover Controller Procedure
  • 3. 3Page: Airflow Daemons • Web Server • Daemon that runs the Airflow Webserver • 1 to many gunicorn processes to accept and process requests in parallel. • Allows you to track jobs progress, run jobs and more • Scheduler • Periodically runs (every X seconds) to determine if a DAG or Task needs to be ran based off the DAG schedule • Pushes messages to the Queuing Service to be executed • Worker • Daemon runs if you’re using the CeleryExecutors (as opposed to SequentialExecutor and LocalExecutor) • 1 to many dedicated celeryd processes which execute functions • Pulls messages from a Queuing services to determine what functions to execute
  • 6. 6Page: Why setup a Cluster Deployment? • Distributes heavy processes onto many machines for better use of resources • More Highly Available Airflow environment • If you have many Workflows with many Tasks your executors would not be able to get to all the messages in the queue. Adding more executors would fix this issue.
  • 7. 7Page: Scaling Workers • Horizontally • Add more machines to the cluster • No need to register the machines with the master. You just need to start up the Airflow Worker task on the new Machine. • Vertically • Increase the number of executors (celeryd processes) per node and restart the workers
  • 9. 9Page: Limitations • There can only be one scheduler running at a time • If you have multiple Scheduler processes running, there's a possibility that multiple instances of a single task that will be scheduled to run. • If the Scheduler Daemon or Machine with the process goes down then no jobs will get scheduled
  • 10. 10Page: Airflow Scheduler Failover Controller • Dedicated Daemon that runs with Airflow on the Master Nodes • Ensures that there is always one and only one Scheduler running on the Master nodes at a time • Developed Internally and Open Sourced • https://github.com/teamclairvoyant/airflow-scheduler- failover-controller • High Level Steps • Polls (every x seconds) to check if the scheduler is running • If scheduler isn’t running, restart the scheduler • If it still doesn’t start up, then try starting it up on the other master nodes
  • 13. 13Page: Failover Controller Process (Start Up) Master Node 1 Failover Controller (standby) Master Node 2 Failover Controller (standby) On startup, the processes start out in STANDBY
  • 14. 14Page: Failover Controller Process (Start Up) Master Node 1 Failover Controller (active) Master Node 2 Failover Controller (standby) The first one to enter data into the Metastore is elected as the active controller.
  • 15. 15Page: Failover Controller Process (Start Up) Scheduler Master Node 1 Failover Controller (active) Master Node 2 Failover Controller (standby) The Failover controller checks to see if the Scheduler is running, but it isn’t.
  • 16. 16Page: Failover Controller Process (Start Up) Scheduler Master Node 1 Failover Controller (active) Master Node 2 Failover Controller (standby) Failover Controller starts up the Scheduler
  • 18. 18Page: Failover Controller Process (Process Failure) Scheduler Master Node 1 Failover Controller (active) Master Node 2 Failover Controller (standby) Scheduler process has died
  • 19. 19Page: Failover Controller Process (Process Failure) Scheduler Master Node 1 Failover Controller (active) Master Node 2 Failover Controller (standby) Failover Controller restarts the Scheduler
  • 21. 21Page: Failover Controller Process (Process Failure 2) Scheduler Master Node 1 Failover Controller (active) Master Node 2 Failover Controller (standby) Scheduler process has died
  • 22. 22Page: Failover Controller Process (Process Failure 2) Scheduler Master Node 1 Failover Controller (active) Master Node 2 Failover Controller (standby) Failover Controller tries to restart the Scheduler, but its still not running
  • 23. 23Page: Failover Controller Process (Process Failure 2) Scheduler Master Node 1 Failover Controller (active) Master Node 2 Failover Controller (standby) Failover Controller tries to restart the Scheduler on a different node
  • 24. 24Page: Failover Controller Process (Process Failure 2) Scheduler Master Node 1 Failover Controller (active) Master Node 2 Failover Controller (standby) Failover Controller succeeds to restart the scheduler and the cluster is back to normal
  • 26. 26Page: Failover Controller Process (Node Failure) Scheduler Master Node 1 Failover Controller (active) Master Node 2 Failover Controller (standby) Everything is running as expected
  • 27. 27Page: Failover Controller Process (Node Failure) Scheduler Master Node 1 Failover Controller (dead) Master Node 2 Failover Controller (standby) Master Node 1 dies and all the processes running on it are gone
  • 28. 28Page: Failover Controller Process (Node Failure) Scheduler Master Node 1 Failover Controller (dead) Master Node 2 Failover Controller (active) Failover Controller on Master 2 becomes active because the one running on Master Node 1 has stopped sending a heart beat
  • 29. 29Page: Failover Controller Process (Node Failure) Scheduler Master Node 1 Failover Controller (dead) Master Node 2 Failover Controller (active) The newly active Failover Controller tries to check-in with and restart the Scheduler on the daemon the Metadata says its running on and fails.
  • 30. 30Page: Failover Controller Process (Node Failure) Scheduler Master Node 1 Failover Controller (dead) Master Node 2 Failover Controller (active) The Failover Controller then starts it on another node and it succeeds Scheduler
  • 31. 31Page: Failover Controller Process (Node Failure) Master Node 1 Failover Controller (standby) Master Node 2 Failover Controller (active) When Master Node 1 is brought back, the old Failover Controller goes into STANDBY state Scheduler