GDG DevFest Warsaw 2018 @higrys, @sprzedwojski
Manageable data pipelines with
Airflow
(and Kubernetes)
Airflow
Airflow is a platform to programmatically author,
schedule and monitor workflows.
Dynamic/Elegant
Extensible
Scalable
Workflows
Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
Companies using Airflow
(>200 officially)
Data Pipeline
https://xkcd.com/2054/
Airflow vs. other workflow platforms
● Programming workflows
○ writing code not XML
○ versioning as usual
○ automated testing as usual
○ complex dependencies between tasks
● Managing workflows
○ aggregate logs in one UI
○ tracking execution
○ re-running, backfilling (run all missed runs)
Airflow use cases
● ETL jobs
● ML pipelines
● Regular operations:
○ Delivering data
○ Performing backups
● ...
Core concepts - Directed Acyclic Graph (DAG)
Source: https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/example_dags/example_twitter_README.md
Core concepts - Operators
Source: https://blog.usejournal.com/testing-in-airflow-part-1-dag-validation-tests-dag-definition-tests-and-unit-tests-2aa94970570c
Operator types
● Action Operators
○ Python, Bash, Docker, GCEInstanceStart, ...
● Sensor Operators
○ S3KeySensor, HivePartitionSensor, BigtableTableWaitForReplicationOperator, ...
● Transfer Operators
○ MsSqlToHiveTransfer, RedshiftToS3Transfer, …
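Operator and Sensor
# Import paths as in Airflow 1.10 (assumed; the slide omits imports)
from airflow.models import BaseOperator
from airflow.sensors.base_sensor_operator import BaseSensorOperator

class ExampleOperator(BaseOperator):
    def execute(self, context):
        # Do something
        pass

class ExampleSensorOperator(BaseSensorOperator):
    def poke(self, context):
        # Check if the condition occurred
        return True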
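A sensor differs from a plain operator in how it is run: its poke method is called repeatedly until it returns True or a timeout expires. The sketch below imitates that loop in plain Python without importing Airflow. The names `run_sensor`, `poke_interval` and `timeout` are chosen for this illustration only and are not the Airflow implementation.

```python
import time


def run_sensor(poke, poke_interval=0.01, timeout=1.0):
    """Call poke() until it returns True; raise if timeout (seconds) passes."""
    started = time.monotonic()
    attempts = 0
    while True:
        attempts += 1
        if poke():
            return attempts
        if time.monotonic() - started > timeout:
            raise TimeoutError("condition did not occur in time")
        time.sleep(poke_interval)


# Hypothetical condition: becomes true on the third check.
checks = iter([False, False, True])
attempts_needed = run_sensor(lambda: next(checks))
print(attempts_needed)
```

The condition check stays small and side-effect free, and the waiting logic lives in one place.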
Operator good practices
● Idempotent
● Atomic
● No direct data sharing
○ Small portions of data between tasks: XCOMs
○ Large amounts of data: S3, GCS, etc.
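As a rough illustration of the "idempotent" and "atomic" practices, the sketch below writes a task's output to a file named after the run date and swaps it into place in one step, so re-running the same date leaves exactly one complete file. The function `export_report` and the file naming are hypothetical and are not taken from the talk.

```python
import os
import tempfile


def export_report(output_dir, execution_date, rows):
    """Write rows to a file named after the run date, replacing any old copy."""
    path = os.path.join(output_dir, "report_{}.csv".format(execution_date))
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        handle.write("\n".join(rows))
    # os.replace swaps the finished file into place, overwriting a previous run.
    os.replace(tmp_path, path)
    return path


output_dir = tempfile.mkdtemp()
first = export_report(output_dir, "2018-11-27", ["a", "b"])
second = export_report(output_dir, "2018-11-27", ["a", "b"])  # simulated re-run
files = os.listdir(output_dir)
```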
Core concepts - Tasks, TaskInstances, DagRuns
Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
Show me the code!
Source: https://www.logolynx.com/images/logolynx/0b/0b42e766caee6dcd7355c1c95ddaaa1c.png
Source: http://www.faicoach.com/wp-content/uploads/2017/10/cash-burn.jpg
The solution
Sources:
https://services.garmin.cn/appsLibraryBusinessServices_v0/rest/apps/9b5dabf3-925b
https://malloc.fi/static/images/slack-memory-management.png
https://i.gifer.com/9GXs.gif
Solution components
● Generic
○ BashOperator
○ PythonOperator
● Specific
○ EmailOperator
The DAG
Initialize DAG
dag = DAG(
    dag_id='gcp_spy',
    # Defaults applied to every task in this DAG
    default_args={
        'start_date': utils.dates.days_ago(1),
        'retries': 1
    },
    # Cron expression: run every day at 16:00
    schedule_interval='0 16 * * *'
)
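The schedule_interval string is a standard five-field cron expression. As a small aid for reading it, the sketch below labels each field. The helper `describe_cron` is hypothetical and only splits the string; it does not validate or evaluate the schedule.

```python
FIELD_NAMES = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Map the five space-separated cron fields to their names."""
    fields = expression.split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError("expected 5 fields, got {}".format(len(fields)))
    return dict(zip(FIELD_NAMES, fields))


schedule = describe_cron("0 16 * * *")
for name in FIELD_NAMES:
    print(name, "=", schedule[name])
```

Read this way, '0 16 * * *' means minute 0 of hour 16 on every day of every month, that is, daily at 16:00.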
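List of instances
bash_task = BashOperator(
    task_id="gcp_service_list_instances_sql",
    # Drop the header row, keep the first column (instance names),
    # and join the names into one space-separated line.
    bash_command=
        "gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' "
        "| tr '\\n' ' '",
    # Push the last line of stdout to XCom for downstream tasks
    xcom_push=True,
    dag=dag
)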
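To make the shell pipeline easier to follow, here is a plain-Python sketch of the same three steps: skip the header row, keep the first whitespace-delimited column, and join the names with spaces. The sample listing is made up for illustration and only approximates what the gcloud command prints.

```python
# Hypothetical `gcloud sql instances list` output: a header row, then one
# row per instance with the instance name in the first column.
SAMPLE_OUTPUT = (
    "NAME        DATABASE_VERSION  LOCATION\n"
    "orders-db   POSTGRES_9_6      europe-west1-b\n"
    "users-db    MYSQL_5_7         europe-west1-c\n"
)


def instance_names(listing):
    """Mimic: tail -n +2 | grep -oE '^[^ ]+' | tr '\\n' ' '."""
    rows = listing.splitlines()[1:]               # tail -n +2: skip the header
    names = [row.split(" ")[0] for row in rows]   # grep: leading non-space run
    names = [name for name in names if name]      # grep drops non-matching rows
    return "".join(name + " " for name in names)  # tr: each newline -> space


names_line = instance_names(SAMPLE_OUTPUT)
print(repr(names_line))
```

The result is a single line, which matters because the BashOperator pushes only the last line of stdout to XCom.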
All services
GCP_SERVICES = [
    ('sql', 'Cloud SQL'),
    ('spanner', 'Spanner'),
    ('bigtable', 'BigTable'),
    ('compute', 'Compute Engine'),
]
List of instances - all services
# One listing task per service, generated in a loop
for gcp_service in GCP_SERVICES:
    bash_task = BashOperator(
        task_id="gcp_service_list_instances_{}".format(gcp_service[0]),
        bash_command=
            "gcloud {} instances list | tail -n +2 | grep -oE '^[^ ]+' "
            "| tr '\\n' ' '".format(gcp_service[0]),
        xcom_push=True,
        dag=dag
    )
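Because the DAG file is ordinary Python, the loop expands into one task per service. The sketch below reproduces only the string formatting, without Airflow, to show which task ids and commands the loop would generate. `COMMAND_TEMPLATE` and `task_specs` are names introduced for this illustration.

```python
GCP_SERVICES = [
    ('sql', 'Cloud SQL'),
    ('spanner', 'Spanner'),
    ('bigtable', 'BigTable'),
    ('compute', 'Compute Engine'),
]

COMMAND_TEMPLATE = (
    "gcloud {} instances list | tail -n +2 | grep -oE '^[^ ]+' "
    "| tr '\\n' ' '"
)

# One (task_id, bash_command) pair per service, as the loop would produce.
task_specs = [
    ("gcp_service_list_instances_{}".format(name), COMMAND_TEMPLATE.format(name))
    for name, _label in GCP_SERVICES
]

for task_id, command in task_specs:
    print(task_id, "->", command)
```

Adding a service then means adding one tuple to the list rather than copying a task definition.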
Send Slack message
send_slack_msg_task = PythonOperator(
    python_callable=send_slack_msg,
    provide_context=True,
    task_id='send_slack_msg_task',
    dag=dag
)
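import json
import requests

def send_slack_msg(**context):
    for gcp_service in GCP_SERVICES:
        # Read the instance list pushed to XCom by the matching listing task
        result = context['task_instance'].xcom_pull(
            task_ids='gcp_service_list_instances_{}'.format(gcp_service[0]))
        data = ...
        # Post the message body as JSON to the Slack incoming webhook
        requests.post(
            url=SLACK_WEBHOOK,
            data=json.dumps(data),
            headers={'Content-type': 'application/json'}
        )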
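The slide elides how the message body is built. Purely as an illustration, the sketch below assembles a body in the {"text": ...} shape that Slack incoming webhooks accept. The helper `build_slack_payload`, its input dictionary and the message wording are all hypothetical and are not the authors' code.

```python
import json


def build_slack_payload(instances_by_service):
    """Build a Slack incoming-webhook body with one line per service."""
    lines = []
    for label, names in instances_by_service.items():
        found = names.split()
        if found:
            lines.append("{}: {}".format(label, ", ".join(found)))
    text = "\n".join(lines) if lines else "No instances found."
    return {"text": text}


# Hypothetical XCom values: space-separated instance names per service.
payload = build_slack_payload({"Cloud SQL": "orders-db users-db ", "Spanner": ""})
body = json.dumps(payload)
print(body)
```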
Prepare email
prepare_email_task = PythonOperator(
    python_callable=prepare_email,
    provide_context=True,
    task_id='prepare_email_task',
    dag=dag
)
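def prepare_email(**context):
    for gcp_service in GCP_SERVICES:
        # Read the instance list pushed to XCom by the matching listing task
        result = context['task_instance'].xcom_pull(
            task_ids='gcp_service_list_instances_{}'.format(gcp_service[0]))
        ...
    html_content = ...
    # Share the rendered HTML with the email task under the key 'email'
    context['task_instance'].xcom_push(key='email', value=html_content)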
Send email
send_email_task = EmailOperator(
    task_id='send_email',
    to='szymon.przedwojski@polidea.com',
    subject=INSTANCES_IN_PROJECT_TITLE,
    html_content=
        "{{ task_instance.xcom_pull(task_ids='prepare_email_task', key='email') }}",
    dag=dag
)
Dependencies
for gcp_service in GCP_SERVICES:
    bash_task = BashOperator(
        ...
    )
    bash_task >> send_slack_msg_task
    bash_task >> prepare_email_task
prepare_email_task >> send_email_task
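The >> operators above describe a small directed acyclic graph. As a sketch of what that graph implies for execution order, the snippet below models the same edges with the standard-library graphlib module (Python 3.9+, independent of Airflow) and computes one valid order. The `upstream` dictionary is an illustration, not how Airflow stores dependencies.

```python
from graphlib import TopologicalSorter  # Python 3.9+

SERVICES = ["sql", "spanner", "bigtable", "compute"]
list_tasks = ["gcp_service_list_instances_{}".format(s) for s in SERVICES]

# Map each task to the set of tasks that must finish before it starts.
upstream = {
    "send_slack_msg_task": set(list_tasks),
    "prepare_email_task": set(list_tasks),
    "send_email": {"prepare_email_task"},
}

order = list(TopologicalSorter(upstream).static_order())
print(order)
```

Any order that respects the edges is valid: the four listing tasks come first, and send_email always follows prepare_email_task.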
Demo
https://github.com/PolideaInternal/airflow-gcp-spy
Complex DAGs
Source: https://speakerdeck.com/pybay/2016-matt-davis-a-practical-introduction-to-airflow?slide=13
Complex, Manageable, DAGs
Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
Local Executor
[Diagram: Single node: Web server, Scheduler, RDBMS, DAGs, and four Local executors (multiprocessing)]
Celery Executor
[Diagram: Controller: Web server, Scheduler, RDBMS, DAGs. Celery Broker: RabbitMQ/Redis/AmazonSQS. Node 1 and Node 2: a Worker and DAGs on each. Sync files: Chef/Puppet/Ansible/NFS.]
(Beta): Kubernetes Executor
[Diagram: Controller: Web server, Scheduler, RDBMS, DAGs. Kubernetes Cluster: Kubernetes Master, Node 1 and Node 2 with Pods and DAGs. Label: Package as pods.]
Sync files
● Git Init
● Persistent Volume
● Baked-in (future)
GCP - Composer
https://github.com/GoogleCloudPlatform/airflow-operator
Thank You!
Follow us @
polidea.com/blog
Source: https://techflourish.com/images/superman-animated-clipart.gif
https://airflow.apache.org/_images/pin_large.png
Questions? :)

Manageable Data Pipelines With Airflow (and kubernetes) - GDG DevFest

  • 1. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Manageable data pipelines with Airflow (and Kubernetes)
  • 3. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Airflow Airflow is a platform to programmatically author, schedule and monitor workflows. Dynamic/Elegant Extensible Scalable
  • 4. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Workflows Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
  • 5. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Companies using Airflow (>200 officially)
  • 6. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Data Pipeline https://xkcd.com/2054/
  • 7. GDG DevFest Warsaw 2018 @higrys, @sprzedwojski Airflow vs. other workflow platforms ● Programming workflows ○ writing code not XML ○ versioning as usual ○ automated testing as usual ○ complex dependencies between tasks ● Managing workflows ○ aggregate logs in one UI ○ tracking execution ○ re-running, backfilling (run all missed runs)
  • 8. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Airflow use cases ● ETL jobs ● ML pipelines ● Regular operations: ○ Delivering data ○ Performing backups ● ...
  • 9. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Core concepts - Directed Acyclic Graph (DAG) Source: https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/example_dags/example_twitter_README.md
  • 10. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Core concepts - Operators Source: https://blog.usejournal.com/testing-in-airflow-part-1-dag-validation-tests-dag-definition-tests-and-unit-tests-2aa94970570c
  • 11. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Operator types ● Action Operators ○ Python, Bash, Docker, GCEInstanceStart, ... ● Sensor Operators ○ S3KeySensor, HivePartitionSensor, BigtableTableWaitForReplicationOperator , ... ● Transfer Operators ○ MsSqlToHiveTransfer, RedshiftToS3Transfer, …
  • 12. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 class ExampleOperator(BaseOperator): def execute(self, context): # Do something pass Operator and Sensor
  • 13. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 class ExampleOperator(BaseOperator): def execute(self, context): # Do something pass class ExampleSensorOperator(BaseSensorOperator): def poke(self, context): # Check if the condition occurred return True Operator and Sensor
  • 14. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Operator good practices ● Idempotent ● Atomic ● No direct data sharing ○ Small portions of data between tasks: XCOMs ○ Large amounts of data: S3, GCS, etc.
  • 15. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Core concepts - Tasks, TaskInstances, DagRuns Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
  • 16. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Show me the code!
  • 17. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Source: https://www.logolynx.com/images/logolynx/0b/0b42e766caee6dcd7355c1c95ddaaa1c.png
  • 18. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Source: http://www.faicoach.com/wp-content/uploads/2017/10/cash-burn.jpg
  • 19. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 The solution Sources: https://services.garmin.cn/appsLibraryBusinessServices_v0/rest/apps/9b5dabf3-925b https://malloc.fi/static/images/slack-memory-management.png https://i.gifer.com/9GXs.gif
  • 20. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Solution components ● Generic ○ BashOperator ○ PythonOperator ● Specific ○ EmailOperator
  • 21. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 The DAG
  • 22. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Initialize DAG dag = DAG(dag_id='gcp_spy', ... )
  • 23. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Initialize DAG dag = DAG(dag_id='gcp_spy', default_args={ 'start_date': utils.dates.days_ago(1), 'retries': 1 }, ... )
  • 24. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Initialize DAG dag = DAG(dag_id='gcp_spy', default_args={ 'start_date': utils.dates.days_ago(1), 'retries': 1 }, schedule_interval='0 16 * * *' )
  • 25. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 List of instances bash_task = BashOperator( task_id="gcp_service_list_instances_sql", ... )
  • 26. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 List of instances bash_task = BashOperator( task_id="gcp_service_list_instances_sql", bash_command= "gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' " "| tr 'n' ' '", ... )
  • 27. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 List of instances bash_task = BashOperator( task_id="gcp_service_list_instances_sql", bash_command= "gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' " "| tr 'n' ' '", xcom_push=True, ... )
  • 28. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 List of instances bash_task = BashOperator( task_id="gcp_service_list_instances_sql", bash_command= "gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' " "| tr 'n' ' '", xcom_push=True, dag=dag )
  • 29. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 All services GCP_SERVICES = [ ('sql', 'Cloud SQL'), ('spanner', 'Spanner'), ('bigtable', 'BigTable'), ('compute', 'Compute Engine'), ]
  • 30. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 List of instances - all services ???? bash_task = BashOperator( task_id="gcp_service_list_instances_sql", bash_command= "gcloud sql instances list | tail -n +2 | grep -oE '^[^ ]+' " "| tr 'n' ' '", xcom_push=True, dag=dag )
  • 31. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 List of instances - all services for gcp_service in GCP_SERVICES: bash_task = BashOperator( task_id="gcp_service_list_instances_{}".format(gcp_service[0]), bash_command= "gcloud {} instances list | tail -n +2 | grep -oE '^[^ ]+' " "| tr 'n' ' '".format(gcp_service[0]), xcom_push=True, dag=dag )
  • 32. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Send Slack message send_slack_msg_task = PythonOperator( python_callable=send_slack_msg, provide_context=True, task_id='send_slack_msg_task', dag=dag )
  • 33. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 Send Slack message send_slack_msg_task = PythonOperator( python_callable=send_slack_msg, provide_context=True, task_id='send_slack_msg_task', dag=dag )
  • 34. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 def send_slack_msg(**context): for gcp_service in GCP_SERVICES: result = context['task_instance']. xcom_pull(task_ids='gcp_service_list_instances_{}' .format(gcp_service[0])) ...
  • 35. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 def send_slack_msg(**context): for gcp_service in GCP_SERVICES: result = context['task_instance']. xcom_pull(task_ids='gcp_service_list_instances_{}' .format(gcp_service[0])) ...
  • 36. @higrys, @sprzedwojskiGDG DevFest Warsaw 2018 def send_slack_msg(**context): for gcp_service in GCP_SERVICES: result = context['task_instance']. xcom_pull(task_ids='gcp_service_list_instances_{}' .format(gcp_service[0])) data = ... ...
  • 37.
        def send_slack_msg(**context):
            for gcp_service in GCP_SERVICES:
                result = context['task_instance'].xcom_pull(
                    task_ids='gcp_service_list_instances_{}'
                    .format(gcp_service[0]))
            data = ...
            requests.post(
                url=SLACK_WEBHOOK,
                data=json.dumps(data),
                headers={'Content-type': 'application/json'}
            )
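The slides leave the `data = ...` step elided, and nothing here claims to recover it. As one possible shape only: assuming each XCom value is the space-separated string of instance names produced by the bash tasks, a hypothetical `build_slack_payload` helper (not from the deck) could assemble a body with the `text` field that Slack incoming webhooks accept.

```python
import json


def parse_instances(xcom_value):
    """Split space-separated bash output into names; None or '' gives []."""
    if not xcom_value:
        return []
    return xcom_value.split()


def build_slack_payload(results_by_label):
    """Build a webhook body from {service label: raw XCom string}.

    Hypothetical helper for illustration; the message layout is an
    assumption, not the format used in the original demo.
    """
    lines = []
    for label, xcom_value in results_by_label.items():
        names = parse_instances(xcom_value)
        listing = ", ".join(names) if names else "none"
        lines.append("{}: {}".format(label, listing))
    return {"text": "\n".join(lines)}
```

The returned dictionary can be serialized with `json.dumps` and posted the same way the slide posts `data`.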
  • 38. Prepare email
        prepare_email_task = PythonOperator(
            python_callable=prepare_email,
            provide_context=True,
            task_id='prepare_email_task',
            dag=dag
        )
  • 40.
        def prepare_email(**context):
            for gcp_service in GCP_SERVICES:
                result = context['task_instance'].xcom_pull(
                    task_ids='gcp_service_list_instances_{}'
                    .format(gcp_service[0]))
                ...
            html_content = ...
            context['task_instance'].xcom_push(key='email', value=html_content)
  • 42. Send email
        send_email_task = EmailOperator(
            task_id='send_email',
            to='szymon.przedwojski@polidea.com',
            subject=INSTANCES_IN_PROJECT_TITLE,
            html_content=...,
            dag=dag
        )
  • 43. Send email
        send_email_task = EmailOperator(
            task_id='send_email',
            to='szymon.przedwojski@polidea.com',
            subject=INSTANCES_IN_PROJECT_TITLE,
            html_content=
                "{{ task_instance.xcom_pull(task_ids='prepare_email_task', key='email') }}",
            dag=dag
        )
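The email slides depend on how XCom keys pair up: `prepare_email` pushes under the explicit key `'email'`, while the bash tasks' output is stored under Airflow's default key, `return_value`. The toy class below is a sketch of that lookup only. `ToyXComStore` is a hypothetical in-memory stand-in; real Airflow persists XComs in its metadata database and scopes them by DAG run as well.

```python
XCOM_RETURN_KEY = "return_value"  # Airflow's default XCom key


class ToyXComStore:
    """In-memory stand-in for XCom, keyed by (task_id, key)."""

    def __init__(self):
        self._values = {}

    def push(self, task_id, value, key=XCOM_RETURN_KEY):
        """Store a value published by task_id under the given key."""
        self._values[(task_id, key)] = value

    def pull(self, task_ids, key=XCOM_RETURN_KEY):
        """Return the value pushed by task_ids under key, or None."""
        return self._values.get((task_ids, key))


store = ToyXComStore()
# A bash task's output lands under the default key.
store.push("gcp_service_list_instances_sql", "db-1 db-2 ")
# prepare_email publishes its HTML under an explicit key.
store.push("prepare_email_task", "<p>db-1, db-2</p>", key="email")
```

Pulling from `prepare_email_task` without `key='email'` finds nothing in this sketch, which is why the template on the slide names the key explicitly.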
  • 44. Dependencies
        for gcp_service in GCP_SERVICES:
            bash_task = BashOperator(
                ...
            )
            bash_task >> send_slack_msg_task
            bash_task >> prepare_email_task
  • 45. Dependencies
        for gcp_service in GCP_SERVICES:
            bash_task = BashOperator(
                ...
            )
            bash_task >> send_slack_msg_task
            bash_task >> prepare_email_task

        prepare_email_task >> send_email_task
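To make the resulting graph shape concrete, here is a toy model of the `>>` operator. It is a sketch under stated assumptions, not Airflow's implementation: `ToyTask` and `build_toy_graph` are hypothetical names, and the only behaviour borrowed from Airflow is that `a >> b` makes `b` downstream of `a` and returns `b`.

```python
class ToyTask:
    """Minimal task node that records upstream and downstream links."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []
        self.downstream = []

    def __rshift__(self, other):
        """a >> b: run b after a; return b so calls can be chained."""
        self.downstream.append(other)
        other.upstream.append(self)
        return other


def build_toy_graph(service_keys=("sql", "spanner", "bigtable", "compute")):
    """Wire the same dependencies as the slide, using ToyTask nodes."""
    send_slack = ToyTask("send_slack_msg_task")
    prepare_email = ToyTask("prepare_email_task")
    send_email = ToyTask("send_email")
    for key in service_keys:
        bash_task = ToyTask("gcp_service_list_instances_{}".format(key))
        # Each listing task fans in to both consumers.
        bash_task >> send_slack
        bash_task >> prepare_email
    # The email is sent only after its content has been prepared.
    prepare_email >> send_email
    return {
        "send_slack": send_slack,
        "prepare_email": prepare_email,
        "send_email": send_email,
    }
```

In this model both `send_slack_msg_task` and `prepare_email_task` end up with four upstream tasks, one per service, and `send_email` has `prepare_email_task` as its single upstream task.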
  • 47. Demo https://github.com/PolideaInternal/airflow-gcp-spy
  • 48. Complex DAGs Source: https://speakerdeck.com/pybay/2016-matt-davis-a-practical-introduction-to-airflow?slide=13
  • 49. Complex, Manageable, DAGs
  • 50. Source: https://medium.com/@dustinstansbury/understanding-apache-airflows-key-concepts-a96efed52b1a
  • 51. Local Executor [Diagram: single node with web server, RDBMS, DAGs, scheduler, and several local executors (multiprocessing)]
  • 52. Celery Executor [Diagram: controller with web server, RDBMS, DAGs and scheduler; Celery broker (RabbitMQ/Redis/Amazon SQS); Node 1 and Node 2, each with a worker and DAGs; files synced via Chef/Puppet/Ansible/NFS]
  • 53. (Beta): Kubernetes Executor [Diagram: controller with web server, RDBMS, DAGs and scheduler; Kubernetes master; Kubernetes cluster with Node 1 and Node 2 running pods with DAGs; tasks packaged as pods; file sync options: Git init, Persistent Volume, Baked-in (future)]
  • 54. GCP - Composer https://github.com/GoogleCloudPlatform/airflow-operator
  • 55. Thank You! Follow us @ polidea.com/blog Source: https://techflourish.com/images/superman-animated-clipart.gif https://airflow.apache.org/_images/pin_large.png
  • 57. Questions? :) Follow us @ polidea.com/blog Source: https://techflourish.com/images/superman-animated-clipart.gif https://airflow.apache.org/_images/pin_large.png