2. Presenter Profile
● Yohei Onishi
● Twitter: legoboku, GitHub: yohei1126
● Data Engineer at a Japanese retail company
● Based in Singapore since Oct. 2018
● Apache Airflow Contributor
3. Session overview
● Expected audience: data engineers
○ who are working on building a data pipeline
○ who are looking for a better workflow solution
● Goal: provide the following so they can start using Airflow
○ Airflow overview and how to author workflows
○ Server configuration and CI/CD in my use case
○ Recommendations for new users (GCP Cloud Composer)
4. Data pipeline
[diagram] data sources (microservices, enterprise systems, IoT devices) → collect (object storage, message queue) → ETL → analytics → data consumers (microservices, enterprise systems, BI tool)
5. Our requirements for an ETL workflow
● Already built a data lake on AWS S3 to store structured / unstructured data
● Want to build a batch-based analytics platform
● Requirements
○ Workflow generation by code (Python) rather than GUI
○ OSS: avoid vendor lock-in
○ Scalable: batch data processing and workflow
○ Simple and easily extensible
○ Workflow visualization
7. Airflow overview
● Brief history
○ Open sourced by Airbnb; now an Apache top-level project
○ Cloud Composer: managed Airflow on GCP
● Characteristics
○ Dynamic workflow generation by Python code (see the sketch below)
○ Easily extensible so you can fit it to your use case
○ Scalable: a message queue orchestrates an arbitrary number of workers
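A minimal sketch (not from the talk) of what "workflow as Python code" looks like, assuming Airflow 1.x import paths and the stock DummyOperator:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# a DAG is an ordinary Python object, so workflows can also be generated dynamically
dag = DAG('hello_airflow', start_date=datetime(2019, 1, 1), schedule_interval='@daily')
start = DummyOperator(task_id='start', dag=dag)
end = DummyOperator(task_id='end', dag=dag)
start >> end  # start must finish before end runs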
8. Example: Copy a file from one S3 bucket to another
[diagram] export records as CSV in the local region, then transfer the file to a regional bucket (Singapore / US / EU regions)
9. DEMO: UI and source code
sample code: https://github.com/yohei1126/pycon-apac-2019-airflow-sample
10. Concepts: directed acyclic graph (DAG), operator, task, etc.
custom_param_per_dag = {'sg': { ... }, 'eu': { ... }, 'us': { ... }}
for region, v in custom_param_per_dag.items():
    # one DAG per region, generated dynamically from the dict above
    dag = DAG('shipment_{}'.format(region), ...)
    t1 = PostgresToS3Operator(task_id='db_to_s3', ...)
    t2 = S3CopyObjectOperator(task_id='s3_to_s3', ...)
    t1 >> t2  # t1 must run before t2
    # register each DAG under a unique global name so the scheduler can discover it
    globals()[dag.dag_id] = dag
11. Templating
t1 = PostgresToS3Operator(
    task_id='db_to_s3',
    sql="SELECT * FROM shipment WHERE region = '{{ params.region }}' "
        "AND ship_date = '{{ execution_date.strftime('%Y-%m-%d') }}'",
    bucket=default_args['source_bucket'],
    object_key='{{ params.region }}/{{ execution_date.strftime("%Y%m%d%H%M%S") }}.csv',
    params={'region': region},
    dag=dag)
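PostgresToS3Operator is not a built-in Airflow operator; it presumably lives in the sample repository above. As a rough sketch only (my assumption, not the author's actual code), a custom operator along these lines could back the example, assuming Airflow 1.x import paths and the built-in PostgresHook and S3Hook:

import csv
import io

from airflow.hooks.postgres_hook import PostgresHook
from airflow.hooks.S3_hook import S3Hook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class PostgresToS3Operator(BaseOperator):
    # fields rendered by Jinja before execute() runs
    template_fields = ('sql', 'object_key')

    @apply_defaults
    def __init__(self, sql, bucket, object_key,
                 postgres_conn_id='postgres_default', aws_conn_id='aws_default',
                 *args, **kwargs):
        super(PostgresToS3Operator, self).__init__(*args, **kwargs)
        self.sql = sql
        self.bucket = bucket
        self.object_key = object_key
        self.postgres_conn_id = postgres_conn_id
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        # run the query, dump the result to CSV, and upload it to S3
        records = PostgresHook(postgres_conn_id=self.postgres_conn_id).get_records(self.sql)
        buf = io.StringIO()
        csv.writer(buf).writerows(records)
        S3Hook(aws_conn_id=self.aws_conn_id).load_string(
            buf.getvalue(), key=self.object_key, bucket_name=self.bucket, replace=True)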
16. GCP Cloud Composer
● Fully managed Airflow cluster provided by GCP
○ Fully managed
○ Built-in integration with other GCP services
● To focus on business logic, build your Airflow cluster with Cloud Composer
17. Create a cluster using CLI
$ gcloud composer environments create ENVIRONMENT_NAME \
    --location LOCATION \
    OTHER_ARGUMENTS
● The new Airflow cluster is deployed as a Kubernetes cluster on GKE
● We usually specify the following options as OTHER_ARGUMENTS
○ infrastructure: instance type, disk size, VPC network, etc.
○ software configuration: Python version, Airflow version, etc.
18. Deploy your source code to the cluster
$ gcloud composer environments storage dags import \
    --environment my-environment --location us-central1 \
    --source test-dags/quickstart.py
● This uploads your source code to the cluster-specific GCS bucket
○ You can also upload the file directly to the bucket
● The file is then deployed to the Airflow cluster automatically
20. Demo: GCP Cloud Composer
● Create an environment
● Stackdriver logging
● GKE as backend
21. Summary
● Data engineers have to build reliable and scalable data pipelines to accelerate data analytics activities
● Airflow is a great tool to author and monitor workflows
● An HA Airflow cluster is required for high availability
● GCP Cloud Composer enables us to build a cluster easily and focus on business logic
22. References
● Apache Airflow
● GCP Cloud Composer
● Airflow: a workflow management platform
● ETL best practices in Airflow 1.8
● Data Science for Startups: Data Pipelines
● Airflow: Tips, Tricks, and Pitfalls