Airflow
Insane power in a tiny box
A walkthrough of what Airflow is and isn't, and how to use Airflow to construct dynamic tasks and automate your entire ETL process. Presentation can be seen here: http://dovy.io/airflow/airflow-strength-and-weaknesses-and-dynamic-tasks
  1. Airflow: Insane power in a tiny box
  2. A BRIEF HISTORY OF DATA PIPELINES. For real, this is how we used to do it…
  3. The dev's answer to EVERYTHING: Cron / crontab. This works great for some use cases, but falls short in many others. Works great, provided the computer is on. Runs at the time you set, whenever it can. No recovery, logs are self-managed, and you're never sure when it last ran. Can only execute on one computer.
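  For reference, a classic crontab entry for a nightly pipeline looks something like this (the script path and schedule are illustrative):

    # Run a nightly ETL script at 2:00 AM. Logging and retries are
    # entirely up to the script; a powered-off machine means a skipped run.
    0 2 * * * /usr/bin/python /opt/etl/run_pipeline.py >> /var/log/etl.log 2>&1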
  4. It keeps tasks alive. Supervisor / supervisord. Fantastic utility: works as expected, with an optional embedded UI and CLI util. Keeps everything up and lets you see what's going on. Even rotates logs and allows process groups. Still executes on the one computer. Isn't more than it advertises to be. Limited scope.
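  For comparison, a minimal supervisord program section (program name, command, and paths are illustrative) is all it takes to keep a process alive:

    ; Keep a worker process running, restart it if it dies, rotate its logs.
    [program:etl_worker]
    command=/usr/bin/python /opt/etl/worker.py
    autostart=true
    autorestart=true
    stdout_logfile=/var/log/etl_worker.log
    stdout_logfile_maxbytes=10MB
    stdout_logfile_backups=5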
  5. Someone said… we can do better.
  6. Airflow is a "workflow management system" created by airbnb.com. "Today, we are proud to announce that we are open sourcing and sharing Airflow, our workflow management platform." June 2, 2015. https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8 And it's all written in Python!
  7. What IS Airflow? BUT REALLY… Dependency control. Task management. Task recovery. Charting. Logging. Alerting. History. Folder watching. Trending. Dynamic tasks. ANYTHING your pipeline may need…
  8. Airflow is NOT… …perfect. https://airflow.apache.org/ So contribute, and help it get better!
  9. The Airflow Architecture: Webserver / UI, Scheduler, Worker.
  10. WITH VERY LITTLE WORK… Airflow can be run locally, or be run in much more complex configurations.
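  A local run is roughly this, using the Airflow 1.x-era CLI (the port is illustrative; the worker is only needed with a Celery executor):

    pip install apache-airflow
    airflow initdb                 # initialize the metadata database
    airflow webserver --port 8080  # the UI
    airflow scheduler
    airflow worker                 # only with CeleryExecutor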
  11. Master / Slave / UI configuration, with logs being fed to GCS.
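  Feeding logs to GCS is configuration, not code; in an Airflow 1.x-era airflow.cfg it looked roughly like this (bucket name and connection id are illustrative):

    [core]
    # Ship task logs to a GCS bucket instead of keeping them on the worker
    remote_base_log_folder = gs://my-airflow-logs/logs
    remote_log_conn_id = google_cloud_default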
  12. How we provision Airflow.
  13. We place it all on a single Google Compute Engine VM. No bull! Excuse me? CPU: n1-standard-2 (2 vCPUs, 7.5 GB memory). HD: 30 GB standard persistent disk (non-SSD).
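  Provisioning that box is a single gcloud call (instance name and zone are illustrative):

    gcloud compute instances create airflow-box \
        --machine-type=n1-standard-2 \
        --zone=us-central1-a \
        --boot-disk-size=30GB \
        --boot-disk-type=pd-standard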
  14. LET'S TALK ABOUT AIRFLOW DAGs
  15. A few key Airflow concepts.

  01 DAGs: A Directed Acyclic Graph is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. Written in Python.

  02 Operators: An operator describes how a single task performs in a workflow (DAG). There are many types of operators: BashOperator, PythonOperator, EmailOperator, HTTPOperator, MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, Sensor, DockerOperator.

  03 Tasks: Once an operator is instantiated, it is referred to as a task.

    import time
    from datetime import datetime
    from pprint import pprint

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG(
        dag_id='example_python_operator',
        start_date=datetime(2018, 1, 1),
        schedule_interval=None
    )

    def my_sleeping_function(random_base):
        """This is a function that will run within the DAG execution"""
        time.sleep(random_base)

    def print_context(ds, **kwargs):
        pprint(kwargs)
        print(ds)
        return 'Whatever you return gets printed in the logs'

    run_this = PythonOperator(
        task_id='print_the_context',
        provide_context=True,
        python_callable=print_context,
        dag=dag)

    # Generate 10 sleeping tasks, sleeping from 0 to 0.9 seconds respectively
    for i in range(10):
        task = PythonOperator(
            task_id='sleep_for_' + str(i),
            python_callable=my_sleeping_function,
            op_kwargs={'random_base': float(i) / 10},
            dag=dag
        )
        task.set_upstream(run_this)
  16. Stop doing things the way you always have; think dynamically. You can automate your tasks by reading source code or listing files in a directory. You don't have to worry about execution order; you only need to present Airflow with relationships. Think in terms of how you can remove human error. Let Airflow work for you.
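  As a sketch of that idea, assuming a dag object like the one on the previous slide and a hypothetical directory of SQL job files:

    import os

    from airflow.operators.python_operator import PythonOperator

    SQL_DIR = '/opt/etl/sql'  # hypothetical directory of job files

    def run_sql_file(path):
        # Placeholder: hand the file's SQL to your database hook of choice.
        with open(path) as f:
            sql = f.read()

    # One task per file: the DAG reshapes itself as files come and go.
    for filename in sorted(os.listdir(SQL_DIR)):
        if filename.endswith('.sql'):
            PythonOperator(
                task_id=filename[:-4],
                python_callable=run_sql_file,
                op_kwargs={'path': os.path.join(SQL_DIR, filename)},
                dag=dag,
            )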
  17. Airflow really shines with dynamic tasks. Dictionary of dependencies.

  What if you made a script that parsed all your jobs and detected all dependencies automatically?

  Now what if you took that dictionary and fed it into Airflow?

  How would that simplify your pipeline?

    dependencies = {
        'topic_billing_frequency': [
            'dim_billing_frequency',
            'dim_account'
        ],
        'topic_payment_method': [
            'dim_credit_card_type',
            'dim_payment_accounts'
        ]
    }

  Let's take a look… Let me show you…
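  Building that dictionary automatically might look like this naive sketch: scan each job's SQL and treat every referenced table as a dependency (the directory layout and the regex are assumptions):

    import os
    import re

    SQL_DIR = '/opt/etl/sql'  # hypothetical directory of job files

    def detect_dependencies(sql_dir):
        # Map each job name to the tables its SQL reads from.
        dependencies = {}
        for filename in os.listdir(sql_dir):
            if not filename.endswith('.sql'):
                continue
            job = filename[:-4]
            with open(os.path.join(sql_dir, filename)) as f:
                sql = f.read()
            # Naive parse: any "FROM x" or "JOIN x" counts as a dependency.
            tables = re.findall(r'\b(?:FROM|JOIN)\s+(\w+)', sql, re.IGNORECASE)
            dependencies[job] = sorted(set(t for t in tables if t != job))
        return dependencies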
  18. Airflow really shines with dynamic tasks. The code to run it all.

  01 Top Level Dependencies: Top level dependencies are created. Each of these tasks depends on creating and deleting the cluster.

  02 Child Dependencies: Now each child dependency is iterated over, and a task is created for each. Each is given the delete task as a "downstream", so delete cluster will never run until the tasks are complete.

  03 Connect children to parents: Now wire each child as an upstream of the task that depends on it.

    all_tasks = {}

    # Create all parent tasks, top level
    for key, value in dependencies.all_dependencies.iteritems():
        if key not in all_tasks:
            all_tasks[key] = PythonOperator(
                task_id=key,
                python_callable=process,
                op_kwargs={},
                provide_context=True,
                dag=dag,
                retries=30,
                retry_delay=timedelta(minutes=10),
                on_retry_callback=airflow_retry_function,
                on_failure_callback=airflow_error_function,
                on_success_callback=airflow_success_function,
            )
            all_tasks[key].set_upstream(task_create_cluster)
            all_tasks[key].set_downstream(task_delete_cluster)

    # Create all nested dependency tasks
    for key, value in dependencies.all_dependencies.iteritems():
        for item in value:
            if item not in all_tasks:
                all_tasks[item] = PythonOperator(
                    task_id=item,
                    python_callable=process,
                    op_kwargs={},
                    provide_context=True,
                    dag=dag,
                    retries=30,
                    retry_delay=timedelta(minutes=10),
                    on_retry_callback=airflow_retry_function,
                    on_failure_callback=airflow_error_function,
                    on_success_callback=airflow_success_function,
                )
                all_tasks[item].set_downstream(task_delete_cluster)
            # Each child runs upstream of the task that depends on it.
            all_tasks[item].set_downstream(all_tasks[key])
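  The code above references task_create_cluster, task_delete_cluster, process, and the three callbacks, none of which are shown in the deck; a minimal sketch of what they might look like (every name and body here is a placeholder, not the deck's actual code):

    from datetime import datetime, timedelta  # timedelta is used by the snippet above

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG(
        dag_id='dynamic_pipeline',  # hypothetical DAG id
        start_date=datetime(2018, 1, 1),
        schedule_interval='@daily',
    )

    def process(**kwargs):
        # Placeholder: look up and run the job named by the current task.
        print('processing %s' % kwargs['task'].task_id)

    def airflow_retry_function(context):
        print('retrying %s' % context['task_instance'].task_id)

    def airflow_error_function(context):
        print('failed %s' % context['task_instance'].task_id)

    def airflow_success_function(context):
        print('succeeded %s' % context['task_instance'].task_id)

    # Cluster lifecycle tasks that bracket the whole run.
    task_create_cluster = PythonOperator(
        task_id='create_cluster', python_callable=process,
        provide_context=True, dag=dag)
    task_delete_cluster = PythonOperator(
        task_id='delete_cluster', python_callable=process,
        provide_context=True, dag=dag)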
  19. What does that code do? This is real code being used today.
  20. Dovy Paukstys, Consultant at Caserta. #geek #bigdata #redux How can I help? http://dovy.io http://twitter.com/simplerain dovy.paukstys@caserta.com http://reduxframework.com https://github.com/dovy/ http://linkedin.com/in/dovyp