Building Data Pipelines Using
Apache Airflow
PURNA CHANDER RAO . KATHULA
Agenda
1. What is a Data Pipeline?
2. Components of a Data Pipeline
3. Traditional Data Flows and Issues
4. Introduction to Apache Airflow
5. Features
6. Core Components
7. Key Concepts
8. Demo
What is a Data Pipeline
A data pipeline is a set of data processing elements connected in series, where the
output of one element is the input of the next one. The elements of the pipeline are
often executed in parallel or in time-sliced fashion. The name 'pipeline' comes from
a rough analogy with physical plumbing.
● Modern data pipelines are used to ingest and process vast volumes of data in
real time.
● They process data in real time, as opposed to traditional ETL / batch modes.
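The series-of-elements idea above can be sketched with plain Python generators, where each stage consumes the previous stage's output. This is only an illustration, not Airflow code; the stage names (`ingest`, `keep_valid`, `to_upper`) and the sample records are hypothetical.

```python
from typing import Iterable, Iterator, List


def ingest(lines: Iterable[str]) -> Iterator[str]:
    """Ingestion stage: strip surrounding whitespace from each raw record."""
    for line in lines:
        yield line.strip()


def keep_valid(records: Iterable[str]) -> Iterator[str]:
    """Filtering stage: drop empty records."""
    for record in records:
        if record:
            yield record


def to_upper(records: Iterable[str]) -> Iterator[str]:
    """Processing stage: normalise each record to upper case."""
    for record in records:
        yield record.upper()


def run_pipeline(raw: Iterable[str]) -> List[str]:
    # Each stage reads the output of the stage before it, like pipe segments.
    return list(to_upper(keep_valid(ingest(raw))))


if __name__ == "__main__":
    print(run_pipeline(["  alpha ", "", "beta\n"]))
```

Because generators are lazy, a record can flow through all three stages before the next one is read, which loosely mirrors the time-sliced execution mentioned above.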
Common Components of a Data Pipeline
Typical parts of a data pipeline
● Data ingestion
● Filtering
● Processing
● Querying of the data
● Data warehousing
● Reprocessing capabilities
Typical requirements
● Scalability
○ Billions of messages and terabytes of data, 24/7
● Availability and redundancy
○ Across physical locations
● Latency
○ Real time / batch
● Platform support
Traditional data flow model
[Diagram. Components shown: web clients, apps, public REST API, microservices, billing system, reporting, analytics, OLTP DB, report DB, metrics DB.]
$ curl api.example.com | filter.py | psql
Messy data flow model (6 / 12 months later)
[Diagram. Components shown: web clients, apps, public REST API, microservices, billing system, reporting, analytics, OLTP DB, report DB, metrics DB, plus external cloud, doc store, and DWH.]
Apache Airflow Introduction
● Apache Airflow is a platform to programmatically author, schedule and monitor
workflows.
● It is developed in Python and is open source.
● Workflows are configured as Python code.
● Because pipelines are written in Python, we can enrich them using Python's
built-in libraries.
● It has multiple hooks and operators for big data ecosystem components (Hive,
Sqoop, etc.) and DB hooks for RDBMS and NoSQL databases.
Features
● Cron replacement
● Fault tolerant
● Dependency rules
● Beautiful UI
● Handles task failures
● Python code
● Reports / alerts on failures
● Monitor your pipelines from the web UI
● And more
Core Components
● Webserver - serves the Apache Airflow web UI.
● Scheduler - responsible for scheduling your jobs.
● Executor - bound to the scheduler; determines the worker process that
executes the scheduled task (SequentialExecutor, LocalExecutor, CeleryExecutor).
● Worker - the process that executes the task, as determined by the executor.
● Metadata database - the database where all the metadata related to your jobs is stored.
Key Concepts
● DAG - Directed Acyclic Graph: the graph representation of your data
pipeline.
● Operator - describes a single task in your data pipeline.
● Task - an instance of an operator.
● Workflow - DAG + Operator + Task
Overview
● What is a DAG?
● What is an Operator?
● Operator relationships and bitshift composition
● How does the scheduler work?
● What is a Workflow?
DAG (Directed Acyclic Graph)
A simple DAG, where we could imagine that:
Task 1 - downloads the data.
Task 2 - sends the data for processing.
Task 3 - monitors the data processing.
Task 4 - generates the report.
Task 5 - sends the email to the DAG owner or intended recipients.
[Diagram: Task 1, Task 2, Task 3, Task 4, Task 5]
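As a rough sketch of what "directed acyclic" buys us, the five tasks can be written as a dependency mapping and ordered with the standard library's `graphlib` (Python 3.9+). The linear chain and the task names below are assumptions made for illustration; the slide's arrows did not survive extraction, and this is not how Airflow itself stores a DAG.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
# Assumption: the five tasks form a simple linear chain.
dependencies = {
    "task_1_download": set(),
    "task_2_send_for_processing": {"task_1_download"},
    "task_3_monitor_processing": {"task_2_send_for_processing"},
    "task_4_generate_report": {"task_3_monitor_processing"},
    "task_5_send_email": {"task_4_generate_report"},
}


def execution_order(deps):
    """Return a run order in which every task comes after its upstream tasks."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for name in execution_order(dependencies):
        print(name)
```

A valid order exists only because the graph has no cycles, which is why a workflow must be a DAG before it can be scheduled.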
Not a DAG
[Diagram: Task 1, Task 2, Task 3, Task 4, Task 5]
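One common way a graph stops being a DAG is a dependency that loops back to an earlier task. The sketch below, again using the standard library's `graphlib` rather than anything from Airflow, checks a dependency mapping for such a cycle; the task names and the back edge are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def is_dag(deps):
    """Return True if the dependency mapping contains no cycle."""
    try:
        # prepare() raises CycleError when the graph has a cycle.
        TopologicalSorter(deps).prepare()
    except CycleError:
        return False
    return True


# Each key depends on the tasks in its value set.
acyclic = {"task_1": set(), "task_2": {"task_1"}, "task_3": {"task_2"}}

# Hypothetical back edge: task_1 now also waits on task_3, closing a loop.
cyclic = {"task_1": {"task_3"}, "task_2": {"task_1"}, "task_3": {"task_2"}}

if __name__ == "__main__":
    print(is_dag(acyclic), is_dag(cyclic))
```

With a cycle, no task in the loop can ever be the first to run, so no valid execution order exists.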
Operators
While a DAG describes how to run a workflow, an Operator defines what actually gets
done.
● An Operator describes a single task in a workflow.
● Operators should be idempotent (an operator should produce the same result
irrespective of how many times it is executed).
● Tasks can be retried automatically in case of failure.
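Idempotency matters precisely because of automatic retries: a task that partly succeeded may run again. The plain-Python sketch below contrasts an overwrite-style load with an append-style load under a simulated retry. All names (`load_partition`, `append_rows`, `run_with_retries`, `demo`) and the sample date are hypothetical, and the retry helper is a stand-in, not Airflow's retry mechanism.

```python
def load_partition(table, run_date, rows):
    """Idempotent load: overwrite the partition for run_date."""
    table[run_date] = list(rows)


def append_rows(table, run_date, rows):
    """Non-idempotent load: each call adds the rows again."""
    table.setdefault(run_date, []).extend(rows)


def run_with_retries(task, retries):
    """Run task(); if it raises, try again up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return task()
        except RuntimeError:
            if attempt == retries:
                raise


def demo(load):
    """Run a load that fails once after writing, then succeeds on retry."""
    table = {}
    attempts = []

    def task():
        attempts.append(1)
        load(table, "2024-01-01", ["a", "b"])
        if len(attempts) == 1:
            raise RuntimeError("simulated failure after the write")
        return table

    return run_with_retries(task, retries=1)


if __name__ == "__main__":
    print(demo(load_partition))
    print(demo(append_rows))
```

The overwrite version ends with one copy of the rows after the retry, while the append version ends with duplicates.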
Different Operators
● BashOperator
○ Executes a bash command
● PythonOperator
○ Calls an arbitrary Python function
● EmailOperator
○ Sends an email
● MySqlOperator, SqliteOperator, PostgresOperator
○ Execute SQL commands
● Custom operators, inheriting from BaseOperator
Types of Operators
There are 3 types of operators:
● Action operators
○ Perform an action (BashOperator, PythonOperator, EmailOperator)
● Transfer operators
○ Move data from one system to another (PrestoToMySQL operator, SFTP operator)
● Sensor operators
○ Wait for the data to arrive at the expected location
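The waiting behaviour of a sensor can be pictured as a polling loop. The function below is a plain-Python sketch of that idea, not Airflow's sensor API; the parameter names `poke_interval` and `timeout` echo the arguments Airflow sensors accept, while `wait_for_file` itself is a hypothetical helper.

```python
import os
import time


def wait_for_file(path, poke_interval=1.0, timeout=10.0):
    """Poll until `path` exists; return False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep between checks so the loop does not spin the CPU.
        time.sleep(poke_interval)


if __name__ == "__main__":
    # The current directory always exists, so this returns immediately.
    print(wait_for_file(os.curdir))
```

Downstream tasks would only start once the check succeeds, so a late-arriving file delays the pipeline instead of breaking it.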
Important Properties
● DAGs are defined in Python files placed into Airflow's DAG_FOLDER.
● dag_id - serves as a unique identifier for your DAG.
● description - the description of your DAG.
● start_date - tells when your DAG should start.
● schedule_interval - defines how often your DAG runs.
● depends_on_past - run a task only if its instance in the previous DAG run
completed successfully.
● default_args - a dictionary of variables to be used as constructor keyword
parameters when initializing operators.
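To show how these properties fit together, here is a minimal DAG file sketch. It assumes Airflow 2.x import paths and keeps the `schedule_interval` argument named on this slide; both differ between Airflow versions. The DAG id, task ids, commands, and the `clean_data` function are hypothetical placeholders, not the code from the demo.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def clean_data():
    # Hypothetical placeholder for real processing logic.
    print("cleaning data")


# Keyword arguments passed to every operator created in this DAG.
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_pipeline",  # hypothetical name; must be unique
    description="Illustrative two-task pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    download = BashOperator(task_id="download", bash_command="echo downloading")
    clean = PythonOperator(task_id="clean", python_callable=clean_data)

    # Bitshift composition: download must finish before clean starts.
    download >> clean
```

Saving a file like this in the DAG folder is what makes the scheduler pick it up; the `>>` operator is the bitshift composition used to declare that one task runs before another.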
Airflow Web UI
DAG Code
PythonOperator tasks (fetching_tweet.py)
PythonOperator tasks (cleansing_tweet.py)
Start the DAG (toggle the ON/OFF button)
Graph View of the DAG
Tree View of the DAG
Executing the DAG and checking the Hive tables
Check Hive table count after the DAG
Questions
THANK YOU
