Apache Airflow is a platform to programmatically author, schedule and monitor workflows. Airflow is not a data streaming solution: tasks do not move data from one to the other (though tasks can exchange metadata). Airflow is not in the Spark Streaming or Storm space; it is more comparable to Oozie or Azkaban. Its primary goal is to solve the problem nicely described in this XKCD comic: https://xkcd.com/2054/

What's unique about Airflow is that it brings the "infrastructure as code" concept to building scalable, manageable and elegant workflows. Workflows are defined as Python code, which makes dynamic workflows possible. Airflow provides hundreds of out-of-the-box Operators that let your pipeline tap into practically any resource, from multiple cloud providers to your own on-premises systems. It is also easy to write your own operators and leverage the data pipeline infrastructure that Airflow provides.

This talk covers the general concepts behind Airflow: how to author a workflow, write your own operators, and run and monitor your pipelines. It also explains how you can leverage Kubernetes (in recent releases of Airflow) to use your on-premises or in-the-cloud infrastructure efficiently. You will leave the talk with enough knowledge to evaluate whether Airflow is a good fit for your data pipeline problems, and with some insight from Airflow contributors in case you are already an Airflow user.
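The point about Python-defined workflows enabling dynamic pipelines can be sketched with a toy example. This is not the Airflow API; the function name and structure are made up purely to illustrate generating tasks in a loop, the way Airflow DAG files often do.

```python
# Toy sketch (not the Airflow API): because workflows are plain Python,
# the set of tasks can be generated dynamically, e.g. one task per table.

def make_pipeline(table_names):
    """Build a task list dynamically from data, as Airflow DAG files often do."""
    tasks = ["start"]
    for name in table_names:
        tasks.append(f"load_{name}")  # one load task per input table
    tasks.append("done")
    return tasks

pipeline = make_pipeline(["users", "orders"])
```

In a real DAG file, each generated name would become an Operator instance and the loop would also wire up the dependencies between them.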
@higrys, @sprzedwojski - GDG DevFest Warsaw 2018
Operator and Sensor

from airflow.models import BaseOperator

class ExampleOperator(BaseOperator):
    def execute(self, context):
        # Do the actual work of the task
        pass
from airflow.operators.sensors import BaseSensorOperator

class ExampleSensorOperator(BaseSensorOperator):
    def poke(self, context):
        # Check whether the awaited condition has occurred
        return True
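The contract behind a Sensor is that the framework calls poke() repeatedly until it returns True or a timeout expires. The sketch below mimics that loop in plain Python to keep it self-contained; run_sensor is a made-up name, not part of Airflow, whose real retry loop (with poke_interval and timeout) lives in BaseSensorOperator.

```python
import time

# Simplified stand-in for what BaseSensorOperator does under the hood:
# call poke() until it returns True, sleeping between attempts,
# and give up after a timeout. (run_sensor is hypothetical.)
def run_sensor(poke, poke_interval=0.01, timeout=1.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if poke(context={}):
            return True
        time.sleep(poke_interval)
    raise TimeoutError("sensor timed out waiting for the condition")
```

A real sensor subclass only implements poke(); the retry machinery is inherited, which is why the example on the slide is so short.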
Operator good practices
● Idempotent
● Atomic
● No direct data sharing between tasks
○ Small portions of data between tasks: XComs
○ Large amounts of data: external storage such as S3, GCS, etc.
Single node: Local Executor

[Diagram: single-node deployment. The Web server, Scheduler and DAGs folder run on one machine, backed by an RDBMS; the Scheduler fans work out to several local executor worker processes via Python's multiprocessing.]
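The fan-out shown in the diagram can be sketched as a pool of workers on a single machine. Airflow's LocalExecutor actually uses Python's multiprocessing; the thread pool and function name below are stand-ins chosen only to keep the example self-contained and easy to run.

```python
from concurrent.futures import ThreadPoolExecutor

# Rough sketch of the LocalExecutor idea: the scheduler hands task
# instances to a pool of parallel workers on the same machine.
# (run_tasks_locally is hypothetical; Airflow uses multiprocessing.)
def run_tasks_locally(tasks, parallelism=4):
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map preserves submission order in its results
        return list(pool.map(lambda fn: fn(), tasks))

results = run_tasks_locally([lambda: "t1 done", lambda: "t2 done"])
```

The parallelism setting here plays the role of Airflow's parallelism configuration: it caps how many task instances one node runs at once.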