Airflow Clustering and High Availability

By: Robert Sanders

Page 2:
Agenda
• Airflow Daemons
• Single Node Deployment
• Cluster Deployment
• Scaling
• Worker Nodes
• Master Nodes
• Limitations
• Airflow Scheduler Failover Controller
• Failover Controller Procedure

Page 3:
Airflow Daemons
• Web Server
  • Daemon that runs the Airflow web server
  • One or more gunicorn processes accept and process requests in parallel
  • Lets you track job progress, run jobs, and more
• Scheduler
  • Runs periodically (every X seconds) to determine whether a DAG or task needs to be run, based on the DAG schedule
  • Pushes messages to the Queuing Service to be executed
• Worker
  • Daemon that runs only if you’re using the CeleryExecutor (as opposed to the SequentialExecutor or LocalExecutor)
  • One or more dedicated celeryd processes that execute functions
  • Pulls messages from the Queuing Service to determine which functions to execute
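
The Scheduler-to-Worker hand-off above can be sketched as a toy model. This is only an illustration using Python's standard library: `queue.Queue` stands in for the Queuing Service, and the names `scheduler_tick`, `worker_loop`, and `run_once` are hypothetical, not Airflow or Celery APIs.

```python
import queue
import threading


def scheduler_tick(due_tasks, task_queue):
    """One scheduler pass: push every task that is due onto the queue."""
    for task_id in due_tasks:
        task_queue.put(task_id)


def worker_loop(task_queue, results, lock):
    """Pull task ids until the queue is empty, recording each execution."""
    while True:
        try:
            task_id = task_queue.get_nowait()
        except queue.Empty:
            return
        with lock:
            results.append(task_id)  # stand-in for actually running the task
        task_queue.task_done()


def run_once(due_tasks, num_workers=2):
    """Run one scheduler pass, then drain the queue with several workers."""
    task_queue = queue.Queue()  # stand-in for the Queuing Service
    results, lock = [], threading.Lock()
    scheduler_tick(due_tasks, task_queue)
    workers = [
        threading.Thread(target=worker_loop, args=(task_queue, results, lock))
        for _ in range(num_workers)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return sorted(results)
```

Each task id pushed by the scheduler pass is picked up by exactly one worker, which is the property the queue provides in the real deployment.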

Page 6:
Why Set Up a Cluster Deployment?
• Distributes heavy processes across many machines for better use of resources
• Makes the Airflow environment more highly available
• If you have many workflows with many tasks, a limited number of executors may not be able to keep up with the messages in the queue. Adding more executors addresses this.

Page 7:
Scaling Workers
• Horizontally
  • Add more machines to the cluster
  • No need to register the machines with the master. You just need to start the Airflow Worker daemon on the new machine.
• Vertically
  • Increase the number of executors (celeryd processes) per node and restart the workers
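
As a rough back-of-the-envelope illustration of the two scaling directions, the sketch below counts executor slots across nodes and estimates how many rounds are needed to drain a queue. It assumes every task takes about the same time; the node names, slot counts, and function names are made up for the example.

```python
import math


def total_slots(processes_per_node):
    """Total executor (celeryd) processes across all worker nodes."""
    return sum(processes_per_node.values())


def rounds_to_drain(queued_tasks, processes_per_node):
    """Rounds needed to clear the queue if each slot runs one task per round."""
    return math.ceil(queued_tasks / total_slots(processes_per_node))


cluster = {"worker-1": 4, "worker-2": 4}      # hypothetical starting point
scaled_out = {**cluster, "worker-3": 4}       # horizontal: add a machine
scaled_up = {node: 8 for node in cluster}     # vertical: more processes per node

print(rounds_to_drain(48, cluster))     # 6
print(rounds_to_drain(48, scaled_out))  # 4
print(rounds_to_drain(48, scaled_up))   # 3
```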

Page 9:
Limitations
• There can only be one Scheduler running at a time
  • If multiple Scheduler processes are running, multiple instances of a single task may be scheduled to run
• If the Scheduler daemon, or the machine it runs on, goes down, no jobs will get scheduled
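
The duplicate-scheduling risk can be shown with a deliberately simplified model: two schedulers that do not coordinate each push the same due task onto a shared queue. This is a toy illustration, not how Airflow tracks task state internally; the task id and function names are hypothetical.

```python
from collections import Counter


def schedule_due_tasks(due_tasks, task_queue):
    """One uncoordinated scheduler pass: enqueue every due task."""
    for task_id in due_tasks:
        task_queue.append(task_id)


def duplicated_tasks(task_queue):
    """Return the tasks that were enqueued more than once, with their counts."""
    return {task: n for task, n in Counter(task_queue).items() if n > 1}


shared_queue = []
due = ["etl.load"]
schedule_due_tasks(due, shared_queue)  # scheduler A
schedule_due_tasks(due, shared_queue)  # scheduler B, unaware of A
print(duplicated_tasks(shared_queue))  # {'etl.load': 2}
```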

Page 10:
Airflow Scheduler Failover Controller
• Dedicated daemon that runs alongside Airflow on the Master Nodes
• Ensures that there is always one, and only one, Scheduler running on the Master Nodes at a time
• Developed internally and open sourced
  • https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
• High-level steps
  • Polls (every X seconds) to check whether the Scheduler is running
  • If the Scheduler isn’t running, restarts it
  • If it still doesn’t start, tries starting it on the other Master Nodes
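
The high-level steps above can be sketched as a single poll cycle. This is a minimal sketch, not the controller's actual code: `is_running` and `start_on` are hypothetical callables standing in for however the controller checks and starts the Scheduler on a node, and `FakeCluster` exists only to make the example runnable.

```python
def ensure_scheduler_running(nodes, current_node, is_running, start_on):
    """One poll cycle. Returns the node now running the Scheduler, or None."""
    if is_running(current_node):
        return current_node
    # Try a restart on the current node first, then each other master node.
    candidates = [current_node] + [n for n in nodes if n != current_node]
    for node in candidates:
        start_on(node)
        if is_running(node):
            return node
    return None


class FakeCluster:
    """Test double: the Scheduler only starts on nodes listed as healthy."""

    def __init__(self, healthy):
        self.healthy = set(healthy)
        self.running = set()

    def is_running(self, node):
        return node in self.running

    def start_on(self, node):
        if node in self.healthy:
            self.running.add(node)


cluster = FakeCluster(healthy={"master2"})
print(ensure_scheduler_running(
    ["master1", "master2"], "master1", cluster.is_running, cluster.start_on
))  # master2
```

In a real daemon this function would be called in a loop with a sleep of X seconds between cycles.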

Page 11:
Failover Controller Diagram

Page 13:
Failover Controller Process (Start Up)
[Diagram: Master Node 1, Failover Controller (standby) | Master Node 2, Failover Controller (standby)]
On startup, the processes start out in STANDBY

Page 14:
[Diagram: Master Node 1, Failover Controller (active) | Master Node 2, Failover Controller (standby)]
The first one to enter data into the Metastore is elected as the active controller.
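
One way to picture "first to enter data wins" is a table that can only ever hold one row, so the second insert fails. The sketch below uses an in-memory SQLite database as a stand-in for the Metastore; the `active_controller` table and its columns are assumptions made for this example, not the controller's real schema.

```python
import sqlite3


def make_metastore():
    """In-memory stand-in for the Metastore with a single-row election table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE active_controller ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), node TEXT NOT NULL)"
    )
    return conn


def try_become_active(conn, node):
    """Return True if `node` wins the election by inserting the row first."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO active_controller (id, node) VALUES (1, ?)",
                (node,),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # another controller already holds the row


metastore = make_metastore()
print(try_become_active(metastore, "master1"))  # True
print(try_become_active(metastore, "master2"))  # False
```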

Page 15:
[Diagram: Scheduler | Master Node 1, Failover Controller (active) | Master Node 2, Failover Controller (standby)]
The Failover Controller checks whether the Scheduler is running, but it isn’t.

Page 16:
[Diagram: Scheduler | Master Node 1, Failover Controller (active) | Master Node 2, Failover Controller (standby)]
Failover Controller starts up the Scheduler

Page 17:
Scheduler Failure Scenario

Page 18:
Failover Controller Process (Process Failure)
[Diagram: Scheduler | Master Node 1, Failover Controller (active) | Master Node 2, Failover Controller (standby)]
Scheduler process has died

Page 19:
Failover Controller Process (Process Failure)
[Diagram: Scheduler | Master Node 1, Failover Controller (active) | Master Node 2, Failover Controller (standby)]
Failover Controller restarts the Scheduler

Page 20:
Scheduler Failure and Failed Restart Scenario

Page 21:
Failover Controller Process (Process Failure 2)
[Diagram: Scheduler | Master Node 1, Failover Controller (active) | Master Node 2, Failover Controller (standby)]
Scheduler process has died

Page 22:
[Diagram: Scheduler | Master Node 1, Failover Controller (active) | Master Node 2, Failover Controller (standby)]
Failover Controller tries to restart the Scheduler, but it’s still not running

Page 23:
[Diagram: Scheduler | Master Node 1, Failover Controller (active) | Master Node 2, Failover Controller (standby)]
Failover Controller tries to restart the Scheduler on a different node

Page 24:
[Diagram: Scheduler | Master Node 1, Failover Controller (active) | Master Node 2, Failover Controller (standby)]
Failover Controller succeeds in restarting the Scheduler and the cluster is back to normal

Page 26:
Failover Controller Process (Node Failure)
[Diagram: Scheduler | Master Node 1, Failover Controller (active) | Master Node 2, Failover Controller (standby)]
Everything is running as expected

Page 27:
[Diagram: Scheduler | Master Node 1, Failover Controller (dead) | Master Node 2, Failover Controller (standby)]
Master Node 1 dies and all the processes running on it are gone

Page 28:
[Diagram: Scheduler | Master Node 1, Failover Controller (dead) | Master Node 2, Failover Controller (active)]
Failover Controller on Master Node 2 becomes active because the one running on Master Node 1 has stopped sending a heartbeat
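
The takeover decision can be sketched as a staleness check on the active controller's last heartbeat. The `Heartbeat` record, the timeout value, and the function name are assumptions for illustration; the slides do not specify how the heartbeat is stored or how long the standby waits.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    node: str         # node whose Failover Controller is currently active
    timestamp: float  # time of its last heartbeat, in seconds


def next_active(heartbeat, standby_node, now, timeout_seconds):
    """Return the node that should be active after this standby poll."""
    if now - heartbeat.timestamp > timeout_seconds:
        return standby_node  # active controller went silent: take over
    return heartbeat.node    # heartbeat is fresh: stay in STANDBY


last = Heartbeat(node="master1", timestamp=100.0)
print(next_active(last, "master2", now=110.0, timeout_seconds=30.0))  # master1
print(next_active(last, "master2", now=140.0, timeout_seconds=30.0))  # master2
```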

Page 29:
[Diagram: Scheduler | Master Node 1, Failover Controller (dead) | Master Node 2, Failover Controller (active)]
The newly active Failover Controller tries to check in with, and restart, the Scheduler on the node the Metastore says it’s running on, and fails.

Page 30:
[Diagram: Scheduler | Master Node 1, Failover Controller (dead) | Master Node 2, Failover Controller (active) | Scheduler]
The Failover Controller then starts the Scheduler on another node, and it succeeds

Page 31:
[Diagram: Master Node 1, Failover Controller (standby) | Master Node 2, Failover Controller (active) | Scheduler]
When Master Node 1 is brought back, the old Failover Controller goes into STANDBY state
