An introduction to Apache Airflow, its main concepts and features, and an example of a DAG. Afterwards, some lessons and best practices learned from the three years I have been using Airflow to power workflows in production.
2. Juan Martín Pampliega
Information Engineering @ ITBA
Co-Founder @ Mutt Data
Professor @ ITBA
● Working in data projects since 2010.
● Globant (Google), Despegar, Socialmetrix, Jampp, Claro, Clarín and other companies.
● Co-Founder @ Mutt Data, a company specialized in developing projects using Big Data and Data Science.
● Using Airflow in production since 2015 to manage data workflows at several companies.
3. Apache Airflow
Airflow is a platform to programmatically author, schedule and
monitor workflows.
Started in late 2014 @ Airbnb by Maxime Beauchemin.
Open sourced in mid 2015.
In Apache incubation since March 2016; first Apache release in March 2017.
Used by HBO, Twitter, ING, Paypal, Reddit, Yahoo, Jampp and more!
Author workflows as directed acyclic graphs (DAGs) of tasks.
The UI makes it easy to visualize workflows’ status, monitor progress,
and troubleshoot issues when needed.
Workflows are defined as Python code which makes them more
maintainable, versionable, testable, collaborative and fosters
abstraction and code reuse.
4. Problems with CRON & similar options
● CRON does a poor job of handling and visualizing task dependencies.
● Poor or no strategy for retrying tasks or backfills.
● Limited data about task times, execution durations and failures.
● Need to ssh into server to check logs and interact.
● No easy way to scale beyond one machine.
● People mostly write jobs in Bash or XML.
● Some questions that are hard to answer:
○ Do you know when your CRON jobs fail?
○ Can you spot when your tasks become 3x slower?
○ Can you visualize what's currently running? What's queued?
○ Do you have reusable components you can use across workflows?
5. Terminology
● DAG: a Directed Acyclic Graph of the tasks you want to run (a workflow).
● Operator: defines what should be executed. Examples: run a Bash command, insert data into a table, etc.
● Task: an instance of an operator that defines a node in a DAG.
● DAG run: an instance of a DAG. When a DAG is triggered, Airflow orchestrates the execution of its operators while respecting dependencies and allocated resources.
● Task instance: a specific run of a task for a particular DAG run at a particular point in time.
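A minimal sketch tying these terms together; the DAG id, schedule and commands are made up for illustration, and the imports follow the Airflow 1.x module layout used throughout this talk:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# The DAG (workflow): a graph of tasks with a schedule.
dag = DAG(
    dag_id='example_terminology',      # hypothetical name
    start_date=datetime(2018, 1, 1),
    schedule_interval='@daily',
)

# Two tasks: instances of BashOperator, i.e. nodes in the DAG.
extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
load = BashOperator(task_id='load', bash_command='echo load', dag=dag)

# Dependency: 'load' runs only after 'extract' succeeds.
extract >> load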
8. DAG Definition Files
DAGs are just configuration files that define the DAG structure as code using Python.
DAG definition files don't do any data processing as such; only the actual execution of a DAG run will.
The tasks defined will run in different contexts, on different workers, at different points in time, and mostly don't communicate with each other.
Definition files should execute quickly (hundreds of milliseconds) because they are evaluated often by the Airflow scheduler.
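A sketch of the parse-time pitfall this implies (the query helper and SQL are hypothetical): anything at module level runs every time the scheduler parses the file, so expensive work belongs inside a task's callable, not at the top of the definition file.

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def run_expensive_query(sql):
    """Hypothetical stand-in for a slow database call."""
    return []

# BAD: this would run on every scheduler parse, not just at execution time.
# rows = run_expensive_query('SELECT * FROM events')

def process_events(**kwargs):
    # GOOD: the expensive work happens only when the task instance runs.
    rows = run_expensive_query('SELECT * FROM events')
    print(len(rows), 'rows processed')

dag = DAG('example_definition_file', start_date=datetime(2018, 1, 1),
          schedule_interval='@daily')

process = PythonOperator(task_id='process_events',
                         python_callable=process_events, dag=dag)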
10. Local Installation
Run a DAG with:
airflow backfill example_bash_operator -s 2015-01-01 -e 2015-01-02
Or enable the DAG in the Airflow UI and the scheduler will trigger it when it is due.
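For reference, the typical Airflow 1.x quickstart to get a local instance running before the backfill above (defaults to SQLite and the SequentialExecutor):

pip install apache-airflow
airflow initdb              # create the metadata database
airflow webserver -p 8080   # start the UI
airflow scheduler           # start the scheduler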
12. Executors
● SequentialExecutor: the default; can only run one task at a time.
● LocalExecutor: can run multiple tasks locally; needs a different DB than SQLite.
● CeleryExecutor: uses Celery to execute tasks remotely; needs Celery workers.
● DaskExecutor: similar to Celery but with Dask; lower latency.
● MesosExecutor: runs tasks as containers on a Mesos cluster.
● KubernetesExecutor: tasks are executed as Kubernetes pods.
CeleryExecutor is the most widely adopted option when scaling to multiple machines.
Tip: use Redis as the broker and Celery Flower to monitor it. Results should be stored in Postgres. (See the configuration sketch below.)
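A sketch of the relevant airflow.cfg settings for that tip, assuming Airflow 1.x section and key names; the hosts and credentials are made up:

[core]
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@pg-host/airflow

[celery]
broker_url = redis://redis-host:6379/0
result_backend = db+postgresql://airflow:airflow@pg-host/airflow

Celery Flower can then be started with 'airflow flower' to monitor the workers.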
14. The Scheduler
The scheduler process continually iterates over all DAGs and triggers DAG runs.
● execution_date: identifies the period of time for which the data will be processed.
● start_date: the execution_date of the first DAG run.
● end_date: the last execution_date that will have a DAG run.
● execution_timeout: the maximum time a task may run before it is failed.
● retries: the number of times a task will be retried before failing.
● retry_delay: the minimum time between a failed task execution and the next retry.
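A sketch showing where these parameters live, with illustrative values (the DAG id and command are made up):

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'retries': 2,                             # retry each task twice
    'retry_delay': timedelta(minutes=5),      # wait 5 minutes between retries
    'execution_timeout': timedelta(hours=1),  # fail tasks running longer than 1h
}

dag = DAG(
    dag_id='example_scheduling',
    start_date=datetime(2018, 1, 1),   # execution_date of the first run
    end_date=datetime(2018, 12, 31),   # last execution_date scheduled
    schedule_interval='@daily',
    default_args=default_args,
)

task = BashOperator(task_id='daily_job', bash_command='echo run', dag=dag)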
18. Additional Features
● Hooks: interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, etc.
● Pools: help limit the execution parallelism on arbitrary sets of tasks. Tasks can be assigned to pools and have a priority weight.
● Queues: when using the CeleryExecutor, tasks can be assigned to a queue and a worker can listen for and execute tasks on one or many queues.
● XComs: enable tasks to exchange messages consisting of any object that can be pickled. (See the sketch below.)
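A minimal XCom sketch under Airflow 1.x conventions (provide_context=True passes the task instance into the callable; the ids and values are made up):

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def push_value(**context):
    # Push any picklable object for downstream tasks.
    context['ti'].xcom_push(key='row_count', value=42)

def pull_value(**context):
    count = context['ti'].xcom_pull(task_ids='push', key='row_count')
    print('row_count =', count)

dag = DAG('example_xcom', start_date=datetime(2018, 1, 1),
          schedule_interval='@daily')

push = PythonOperator(task_id='push', python_callable=push_value,
                      provide_context=True, dag=dag)
pull = PythonOperator(task_id='pull', python_callable=pull_value,
                      provide_context=True, dag=dag)

push >> pull

XComs are stored in the metadata database, so keep the messages small; they are for coordination, not for moving data.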
19. Additional Features
● Sensors: operators that wait for a certain condition to be met and then succeed (e.g. wait for a certain file to appear in a directory).
● Authentication: there are plugins to enable authentication and authorization through LDAP, Kerberos and other methods.
● Ad Hoc Queries: enable charting and querying of configured data sources.
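A sensor sketch for the file-in-a-directory example, assuming the FileSensor shipped in Airflow 1.x contrib; the path and ids are illustrative:

from datetime import datetime
from airflow import DAG
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.bash_operator import BashOperator

dag = DAG('example_sensor', start_date=datetime(2018, 1, 1),
          schedule_interval='@daily')

# Succeeds once /data/incoming/ready.csv exists; polls every 60 seconds.
wait_for_file = FileSensor(
    task_id='wait_for_file',
    filepath='/data/incoming/ready.csv',  # hypothetical path
    fs_conn_id='fs_default',
    poke_interval=60,
    dag=dag,
)

process = BashOperator(task_id='process', bash_command='echo process', dag=dag)

wait_for_file >> process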
20. Jinja Templating
Jinja templating makes multiple helpful variables and macros available to aid in date manipulation.
The {{ }} brackets tell Airflow that this is a Jinja template, and ds is a variable made available by Airflow that is replaced by the execution date in the format YYYY-MM-DD. Thus, in the DAG run stamped 2018-06-04, a templated command like './run.sh {{ ds }}' would render to './run.sh 2018-06-04'.
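A sketch of how that looks in a definition file; bash_command is a templated field, so the {{ ds }} placeholder is rendered per DAG run (the script name is made up):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('example_templating', start_date=datetime(2018, 1, 1),
          schedule_interval='@daily')

# For the run stamped 2018-06-04 this renders to: ./run.sh 2018-06-04
run = BashOperator(task_id='run', bash_command='./run.sh {{ ds }}', dag=dag)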
21. Jinja Templating
Another useful variable is ds_nodash: for the same run, './run.sh {{ ds_nodash }}' renders to './run.sh 20180604'.
The execution_date variable is also useful, as it is a Python datetime object and not a string like ds.
22. Plugins
Plugins enable defining custom hooks, operators, sensors, macros, executors and web views.
Used at many companies to generate DAGs automatically for ETLs, ML, A/B testing, etc.
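A minimal plugin skeleton following the Airflow 1.x plugin interface; the plugin and operator names are made up, and the file would go in the plugins/ folder:

from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin

class MyCompanyOperator(BaseOperator):
    """Hypothetical custom operator shared across workflows."""
    def execute(self, context):
        self.log.info('running custom logic')

class MyCompanyPlugin(AirflowPlugin):
    name = 'my_company_plugin'
    operators = [MyCompanyOperator]
    hooks = []
    macros = []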
23. Testing
DAGs are code, so we have different options to test them:
● Test DAG imports: iterate through the DagBag and check that each DAG can be imported, or run the .py file from the command line (as sketched below).
● Test DAG parameters: make sure all DAGs have required parameters like emails, catchup, etc.
● Unit test Python logic: since the code executed by a PythonOperator is a Python function, you can use normal unit tests for it.
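A sketch of the import and parameter checks using the DagBag API (pytest-style; the required-parameter choices are examples of a team policy, not Airflow requirements):

from airflow.models import DagBag

def test_dags_import_without_errors():
    dag_bag = DagBag()                # parses every file in the DAGs folder
    assert not dag_bag.import_errors  # maps file path -> import traceback

def test_dags_have_required_parameters():
    dag_bag = DagBag()
    for dag_id, dag in dag_bag.dags.items():
        assert dag.default_args.get('email'), dag_id  # alerts go somewhere
        assert dag.catchup is False, dag_id           # example policy: no catchup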
24. Installation Best Practices
● Install the apache-airflow package.
● The LocalExecutor is fine to start with.
● Use the CeleryExecutor or Dask/Kubernetes to scale.
● Use https://github.com/puckel/docker-airflow if you want to use Docker.
● Use PostgreSQL or MySQL for the metadata database.
● Tune scheduler properties to reduce CPU consumption.
● Remember to copy all config and DAG files to the workers'/executors' locations.
25. Best Practices
● Try to balance DAG readability against abstracting code.
● Use depends_on_past and wait_for_downstream for safety (see the sketch below).
● Change the name of the DAG when you change its start_date.
● Tasks are processes that run on workers; limit the size of the data they process locally.
● Remember to erase task logs after a certain time.
● Generate custom views for non-technical people.
● Abstract duplicated logic!
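A sketch of those two safety flags with illustrative values; depends_on_past keeps a task from running until its own previous run succeeded, and wait_for_downstream additionally waits for the previous run's immediate downstream tasks:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'depends_on_past': True,      # don't run until yesterday's instance succeeded
    'wait_for_downstream': True,  # ...and its direct downstream tasks finished
}

dag = DAG('example_safety', start_date=datetime(2018, 1, 1),
          schedule_interval='@daily', default_args=default_args)

load = BashOperator(task_id='load', bash_command='echo load', dag=dag)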
30. Operator Trigger Rules
● Operators have a trigger_rule argument which defines the rule by which the generated task gets triggered. The default value for trigger_rule is all_success.
● Other options (see the sketch below):
○ all_failed: all parents are in a failed or upstream_failed state.
○ all_done: all parents are done with their execution.
○ one_failed: fires as soon as at least one parent has failed; it does not wait for all parents to be done.
○ one_success: fires as soon as at least one parent succeeds; it does not wait for all parents to be done.
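A sketch of a common use: a cleanup task that runs no matter how its parents ended, using trigger_rule='all_done' (ids and commands are made up):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('example_trigger_rules', start_date=datetime(2018, 1, 1),
          schedule_interval='@daily')

extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
transform = BashOperator(task_id='transform', bash_command='echo transform', dag=dag)

# Runs once both parents are done, whether they succeeded or failed.
cleanup = BashOperator(task_id='cleanup', bash_command='echo cleanup',
                       trigger_rule='all_done', dag=dag)

extract >> cleanup
transform >> cleanup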