SlideShare une entreprise Scribd logo
1  sur  18
Télécharger pour lire hors ligne
Learning and Development
2019
Apache Airflow
For Data Engineering
- Sampath Kumar, Principle Engineer
Agenda
Goal of this session is to provide an overview of Airflow capabilities & how to use it
for Data Engineering
● Introduction (DE, Airflow)
● Key Concepts (Key Words, Demo)
● Architecture (System Design)
● Features (Challenges, Loggins, Analytics)
● Challenges & Recommendations
● Q & A
DATA ENGINEERING
Data Engineering is the aspect of data science that focuses on practical applications of data
collection and analysis. Data Engineers are tasked with
● Designing, building, testing, integrating, managing, and optimizing data from a variety of
sources
● Build the infrastructure and architecture that enable data generation
● Primary focus is to build free-flowing data pipelines
by combining a variety of big data technologies
that enable real-time analytics
● Data engineers also write complex queries
to ensure that data is easily accessible
AIRFLOW
● Open-source workflow automation and scheduling system that can be used to author and manage
your data pipelines.
● It started at Airbnb in October 2014 as a solution to manage the company's increasing complex
workflows.
● License: Apache License 2.0
● Written in: Python
● Operating system: Microsoft Windows, macOS, Linux
● Stable release: 1.10.5 / August 30, 2019; 3 months ago
CRON
CRON: Derived from work CRONOS(means time), is a software for unix-like
systems to schedule jobs based on time.
Airflow is cron on steroids: it allows you to schedule tasks to run, run them in a
particular order, and monitor / manage all of your tasks.
KEY CONCEPTS
● DAGS
○ Directed acyclic graphs that represent tasks workflow
● Task
○ Operators - Bash, Python, SSH, Http, MySql, SparkSubmit, Sensors(s3), Docker, Hive, Slack,..
● Hooks
○ To store credentials for services - AWS, GCP, DataBase, Email,..
● Vars & XCom
○ For sharing any global values or inter-task communication
Sample - DAG
Start End
S3Sensor
Build VM
Goal: As and when user files come into AWS S3 bucket, start a high-spec docker
VM and process the job. Prior to start and post process, job status to be notified.
Solutions required to be cost efficient and need to auto-scale if required.
DEMO - UI
Sample - DAG
A data engineering pipeline, where an S3 sensor is used to identify the arrival of
input file, following by several validation checks and then load into ElasticSearch
which will be used for serving clients.
Start
(S3Sensor)
Schema
Validation
Inputs
Validation
Data
Validation
Load to
ElasticSearch
End
Failures
ARCHITECTURE
● MetaDB
● Message Broker
● Airflow Webserver
● Airflow Scheduler
● Airflow Workers
FEATURES
● Airflow Workers are Horizontally scalable
● Airflow Messaging Broker - Celery
● Airflow Integrations - GCP, Azure, AWS, Qubole & Databricks
● Hooks, Connections & Pools - Environment(dev/test/prod) friendly
● DAG - Dynamic sub dags & Branching
ANALYTICS
● Part of being productive with data
is having the right weapons to
profile the data you are working
with.
● Airflow provides a simple query
interface to write SQL and get
results quickly, and a charting
application letting you visualize
data.
LOGGING
● Logging is visible from UI
CHALLENGES
● Airflow is not Apache NIFI or Apache Spark
○ is not - data routing or data transformation system.
● Airflow Workers
○ requires identical access(network, authentication & authorisation)
○ requires similar hardware capabilities
● Airflow Webserver & Scheduler
○ not scalable (Work-around - supervisor, docker health checks)
RECOMMENDATIONS
Airflow is a best suited where
● Agility is important
● Portability
● Segregation b/w compute and workflow mgmt
● Scalability on demand
● Pool/Connection management
● Job Analytics
● Logging visibility
Q&A
Open for discussions,..
REFERENCES
● Roles of Data Engineer - https://inlovewithcode.wordpress.com/2019/05/15/roles-of-data-engineer-required-skill/
● System Design - https://inlovewithcode.wordpress.com/system-design/
● Airflow Tutorials - https://airflow-tutorial.readthedocs.io/en/latest/airflow-intro.html
● Airflow Documentation - https://airflow.apache.org/docs/stable/
● Airflow Dockers - https://towardsdatascience.com/getting-started-with-airflow-using-docker-cd8b44dbff98
Thank you

Contenu connexe

Tendances

Designing and Using Cached Map
Designing and Using Cached Map Designing and Using Cached Map
Designing and Using Cached Map
M.Muneeb Ashraf
 

Tendances (20)

An Introduction to the Heatmap / Histogram Plugin
An Introduction to the Heatmap / Histogram PluginAn Introduction to the Heatmap / Histogram Plugin
An Introduction to the Heatmap / Histogram Plugin
 
Cacique presentation (english)
Cacique presentation (english)Cacique presentation (english)
Cacique presentation (english)
 
Utilizing Esri Out of the Box Tools for Field Data Verification
Utilizing Esri Out of the Box Tools for Field Data VerificationUtilizing Esri Out of the Box Tools for Field Data Verification
Utilizing Esri Out of the Box Tools for Field Data Verification
 
Our works
Our worksOur works
Our works
 
ADSL ppt
ADSL pptADSL ppt
ADSL ppt
 
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedInGrokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn
 
Tour of Dapr
Tour of DaprTour of Dapr
Tour of Dapr
 
Deploy your machine learning models to production with Kubernetes
Deploy your machine learning models to production with KubernetesDeploy your machine learning models to production with Kubernetes
Deploy your machine learning models to production with Kubernetes
 
INTERFACE, by apidays - Apache Cassandra now speaks developer with Stargate ...
INTERFACE, by apidays  - Apache Cassandra now speaks developer with Stargate ...INTERFACE, by apidays  - Apache Cassandra now speaks developer with Stargate ...
INTERFACE, by apidays - Apache Cassandra now speaks developer with Stargate ...
 
Azure Functions VS AWS Lambda: overview and comparison
Azure Functions VS AWS Lambda: overview and comparisonAzure Functions VS AWS Lambda: overview and comparison
Azure Functions VS AWS Lambda: overview and comparison
 
Deploying GraphQL Services as Managed APIs
Deploying GraphQL Services as Managed APIsDeploying GraphQL Services as Managed APIs
Deploying GraphQL Services as Managed APIs
 
Webinar kubernetes and-spark
Webinar  kubernetes and-sparkWebinar  kubernetes and-spark
Webinar kubernetes and-spark
 
Proposed bench test for gis servers
Proposed bench test for gis serversProposed bench test for gis servers
Proposed bench test for gis servers
 
Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901Tordatasci meetup-precima-retail-analytics-201901
Tordatasci meetup-precima-retail-analytics-201901
 
Building Mobile Dashboards With D3 and Google Charts
Building Mobile Dashboards With D3 and Google ChartsBuilding Mobile Dashboards With D3 and Google Charts
Building Mobile Dashboards With D3 and Google Charts
 
NDC London 2021 - Application Autoscaling Made Easy With Kubernetes Event-Dri...
NDC London 2021 - Application Autoscaling Made Easy With Kubernetes Event-Dri...NDC London 2021 - Application Autoscaling Made Easy With Kubernetes Event-Dri...
NDC London 2021 - Application Autoscaling Made Easy With Kubernetes Event-Dri...
 
Facilitez votre transition DevOps grâce à l'automatisation de votre infras...
 Facilitez votre transition DevOps grâce à l'automatisation de votre infras... Facilitez votre transition DevOps grâce à l'automatisation de votre infras...
Facilitez votre transition DevOps grâce à l'automatisation de votre infras...
 
Shubhangi Prasad
Shubhangi PrasadShubhangi Prasad
Shubhangi Prasad
 
Designing and Using Cached Map
Designing and Using Cached Map Designing and Using Cached Map
Designing and Using Cached Map
 
Microsoft Partners - Application Autoscaling Made Easy With Kubernetes Event-...
Microsoft Partners - Application Autoscaling Made Easy With Kubernetes Event-...Microsoft Partners - Application Autoscaling Made Easy With Kubernetes Event-...
Microsoft Partners - Application Autoscaling Made Easy With Kubernetes Event-...
 

Similaire à Airflow techtonic template

Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
Databricks
 

Similaire à Airflow techtonic template (20)

Cloud APIs Overview Tucker
Cloud APIs Overview   TuckerCloud APIs Overview   Tucker
Cloud APIs Overview Tucker
 
Big Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft AzureBig Data Adavnced Analytics on Microsoft Azure
Big Data Adavnced Analytics on Microsoft Azure
 
Microsoft Azure For Solutions Architects
Microsoft Azure For Solutions ArchitectsMicrosoft Azure For Solutions Architects
Microsoft Azure For Solutions Architects
 
AWS Data Pipeline Tutorial | AWS Tutorial For Beginners | AWS Certification T...
AWS Data Pipeline Tutorial | AWS Tutorial For Beginners | AWS Certification T...AWS Data Pipeline Tutorial | AWS Tutorial For Beginners | AWS Certification T...
AWS Data Pipeline Tutorial | AWS Tutorial For Beginners | AWS Certification T...
 
Accenture 2014 AWS re:Invent Enterprise Migration Breakout Session
Accenture 2014 AWS re:Invent Enterprise Migration Breakout SessionAccenture 2014 AWS re:Invent Enterprise Migration Breakout Session
Accenture 2014 AWS re:Invent Enterprise Migration Breakout Session
 
How Hudl and Cloud Cruiser Leverage Sumo Logic's Unified Logs and Metrics
How Hudl and Cloud Cruiser Leverage Sumo Logic's Unified Logs and MetricsHow Hudl and Cloud Cruiser Leverage Sumo Logic's Unified Logs and Metrics
How Hudl and Cloud Cruiser Leverage Sumo Logic's Unified Logs and Metrics
 
Name_Surname_Your primary skill resume.pptx
Name_Surname_Your primary skill resume.pptxName_Surname_Your primary skill resume.pptx
Name_Surname_Your primary skill resume.pptx
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
 
Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21Dsdt meetup 2017 11-21
Dsdt meetup 2017 11-21
 
DSDT Meetup Nov 2017
DSDT Meetup Nov 2017DSDT Meetup Nov 2017
DSDT Meetup Nov 2017
 
Migrating Thousands of Workloads to AWS at Enterprise Scale – Chris Wegmann, ...
Migrating Thousands of Workloads to AWS at Enterprise Scale – Chris Wegmann, ...Migrating Thousands of Workloads to AWS at Enterprise Scale – Chris Wegmann, ...
Migrating Thousands of Workloads to AWS at Enterprise Scale – Chris Wegmann, ...
 
Vimala_Gadegi
Vimala_GadegiVimala_Gadegi
Vimala_Gadegi
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
 
Protecting from transient failures in cloud microsoft azure deployments
Protecting from transient failures in cloud microsoft azure deploymentsProtecting from transient failures in cloud microsoft azure deployments
Protecting from transient failures in cloud microsoft azure deployments
 
Introduction to Google Cloud & GCCP Campaign
Introduction to Google Cloud & GCCP CampaignIntroduction to Google Cloud & GCCP Campaign
Introduction to Google Cloud & GCCP Campaign
 
Copy of Hari Intern Presentation.pdf
Copy of Hari Intern Presentation.pdfCopy of Hari Intern Presentation.pdf
Copy of Hari Intern Presentation.pdf
 
(ENT206) Migrating Thousands of Workloads to AWS at Enterprise Scale | AWS re...
(ENT206) Migrating Thousands of Workloads to AWS at Enterprise Scale | AWS re...(ENT206) Migrating Thousands of Workloads to AWS at Enterprise Scale | AWS re...
(ENT206) Migrating Thousands of Workloads to AWS at Enterprise Scale | AWS re...
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
ArchitectNow - Migrating Legacy .NET Apps to Azure
ArchitectNow - Migrating Legacy .NET Apps to AzureArchitectNow - Migrating Legacy .NET Apps to Azure
ArchitectNow - Migrating Legacy .NET Apps to Azure
 
Presto: Query Anything - Data Engineer’s perspective
Presto: Query Anything - Data Engineer’s perspectivePresto: Query Anything - Data Engineer’s perspective
Presto: Query Anything - Data Engineer’s perspective
 

Dernier

Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
David Celestin
 
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven CuriosityUnlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Hung Le
 
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
ZurliaSoop
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
amilabibi1
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
Kayode Fayemi
 

Dernier (17)

ICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdfICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdf
 
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
 
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfAWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
 
Dreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video TreatmentDreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video Treatment
 
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdfSOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
 
Digital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalDigital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of Drupal
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar Training
 
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven CuriosityUnlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
 
lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.
 
My Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle BaileyMy Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle Bailey
 
Dreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIIDreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio III
 
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
 
Zone Chairperson Role and Responsibilities New updated.pptx
Zone Chairperson Role and Responsibilities New updated.pptxZone Chairperson Role and Responsibilities New updated.pptx
Zone Chairperson Role and Responsibilities New updated.pptx
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
 
Introduction to Artificial intelligence.
Introduction to Artificial intelligence.Introduction to Artificial intelligence.
Introduction to Artificial intelligence.
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
 
in kuwait௹+918133066128....) @abortion pills for sale in Kuwait City
in kuwait௹+918133066128....) @abortion pills for sale in Kuwait Cityin kuwait௹+918133066128....) @abortion pills for sale in Kuwait City
in kuwait௹+918133066128....) @abortion pills for sale in Kuwait City
 

Airflow techtonic template

  • 1. Learning and Development 2019 Apache Airflow For Data Engineering - Sampath Kumar, Principle Engineer
  • 2. Agenda Goal of this session is to provide an overview of Airflow capabilities & how to use it for Data Engineering ● Introduction (DE, Airflow) ● Key Concepts (Key Words, Demo) ● Architecture (System Design) ● Features (Challenges, Loggins, Analytics) ● Challenges & Recommendations ● Q & A
  • 3. DATA ENGINEERING Data Engineering is the aspect of data science that focuses on practical applications of data collection and analysis. Data Engineers are tasked with ● Designing, building, testing, integrating, managing, and optimizing data from a variety of sources ● Build the infrastructure and architecture that enable data generation ● Primary focus is to build free-flowing data pipelines by combining a variety of big data technologies that enable real-time analytics ● Data engineers also write complex queries to ensure that data is easily accessible
  • 4. AIRFLOW ● Open-source workflow automation and scheduling system that can be used to author and manage your data pipelines. ● It started at Airbnb in October 2014 as a solution to manage the company's increasing complex workflows. ● License: Apache License 2.0 ● Written in: Python ● Operating system: Microsoft Windows, macOS, Linux ● Stable release: 1.10.5 / August 30, 2019; 3 months ago
  • 5. CRON CRON: Derived from work CRONOS(means time), is a software for unix-like systems to schedule jobs based on time. Airflow is cron on steroids: it allows you to schedule tasks to run, run them in a particular order, and monitor / manage all of your tasks.
  • 6. KEY CONCEPTS ● DAGS ○ Directed acyclic graphs that represent tasks workflow ● Task ○ Operators - Bash, Python, SSH, Http, MySql, SparkSubmit, Sensors(s3), Docker, Hive, Slack,.. ● Hooks ○ To store credentials for services - AWS, GCP, DataBase, Email,.. ● Vars & XCom ○ For sharing any global values or inter-task communication
  • 7. Sample - DAG Start End S3Sensor Build VM Goal: As and when user files come into AWS S3 bucket, start a high-spec docker VM and process the job. Prior to start and post process, job status to be notified. Solutions required to be cost efficient and need to auto-scale if required.
  • 9. Sample - DAG A data engineering pipeline, where an S3 sensor is used to identify the arrival of input file, following by several validation checks and then load into ElasticSearch which will be used for serving clients. Start (S3Sensor) Schema Validation Inputs Validation Data Validation Load to ElasticSearch End Failures
  • 10. ARCHITECTURE ● MetaDB ● Message Broker ● Airflow Webserver ● Airflow Scheduler ● Airflow Workers
  • 11. FEATURES ● Airflow Workers are Horizontally scalable ● Airflow Messaging Broker - Celery ● Airflow Integrations - GCP, Azure, AWS, Qubole & Databricks ● Hooks, Connections & Pools - Environment(dev/test/prod) friendly ● DAG - Dynamic sub dags & Branching
  • 12. ANALYTICS ● Part of being productive with data is having the right weapons to profile the data you are working with. ● Airflow provides a simple query interface to write SQL and get results quickly, and a charting application letting you visualize data.
  • 13. LOGGING ● Logging is visible from UI
  • 14. CHALLENGES ● Airflow is not Apache NIFI or Apache Spark ○ is not - data routing or data transformation system. ● Airflow Workers ○ requires identical access(network, authentication & authorisation) ○ requires similar hardware capabilities ● Airflow Webserver & Scheduler ○ not scalable (Work-around - supervisor, docker health checks)
  • 15. RECOMMENDATIONS Airflow is a best suited where ● Agility is important ● Portability ● Segregation b/w compute and workflow mgmt ● Scalability on demand ● Pool/Connection management ● Job Analytics ● Logging visibility
  • 17. REFERENCES ● Roles of Data Engineer - https://inlovewithcode.wordpress.com/2019/05/15/roles-of-data-engineer-required-skill/ ● System Design - https://inlovewithcode.wordpress.com/system-design/ ● Airflow Tutorials - https://airflow-tutorial.readthedocs.io/en/latest/airflow-intro.html ● Airflow Documentation - https://airflow.apache.org/docs/stable/ ● Airflow Dockers - https://towardsdatascience.com/getting-started-with-airflow-using-docker-cd8b44dbff98