April 10th 2019
Tao Feng | @feng-tao | Software Engineer, Lyft
Airflow @ Lyft
Who
● Tao Feng
● Engineer at Lyft Data Platform
● Apache Airflow PMC and Committer
● Working on different data products (Airflow, Amundsen, etc.)
Agenda
• Airflow in general
• Airflow @ Lyft
• Upstream @ Lyft
• Next Step
• Summary
Airflow in general
• Airflow recently became an Apache top-level project (TLP).
‒ 20 PMC members / committers in total
• Most recent releases: 1.10, 1.10.1, 1.10.2 (1.10.3 is coming).
‒ New features: Airflow RBAC, Airflow K8S integration, etc.
• New process in Airflow for proposing architecture changes.
‒ Airflow Improvement Proposals (currently 19+ proposals)
• The community recently conducted an Airflow user survey (link).
11k+ GitHub stars
740+ contributors
250+ companies using Airflow
Airflow @ Lyft
Core Infra high level architecture @ Lyft
Airflow Architecture @ Lyft
• Web UI: the portal where users view the status of their DAGs.
• Metadata DB: the Airflow metastore, which stores the status of jobs.
• Scheduler: a multi-process service that parses the DAG bag, creates DAG objects, and triggers the executor to run tasks whose dependencies are met.
• Executor: a message-queuing process that orchestrates worker processes to execute tasks. We use CeleryExecutor at Lyft.
• TARS: the Airflow development / backfill environment, which provides access to production data.
Airflow Architecture @ Lyft
• Main cluster config: Apache Airflow 1.8.2 with cherry-picks and numerous in-house Lyft patches.
• Scale: three sets of ASGs (Auto Scaling groups) for workers.
‒ ASG #1: 15 worker nodes, each of the r5.4xlarge type (16 vCPUs, 128 GB memory). This fleet of workers processes low-priority, memory-intensive tasks.
‒ ASG #2: 3 worker nodes, each of the m4.4xlarge type (16 vCPUs, 64 GB memory). This fleet of workers is dedicated to DAGs with a strict SLA.
‒ ASG #3: 1 worker node of the m4.10xlarge type (40 vCPUs, 160 GB memory). This single node processes the compute-intensive workloads from a critical team’s DAGs.
‒ Backfill box (TARS): 1 node of the m4.16xlarge type (64 vCPUs, 256 GB memory). This box is used for fast DAG prototyping and backfill.
Airflow daily stats @ Lyft
600+ DAGs
800+ DAG runs
25k+ task instances (TIs)
Airflow Monitoring @ Lyft
Airflow Availability
• Scheduler and worker health check
‒ Use a canary monitoring DAG.
‒ If no task has been scheduled for 10 minutes, it is considered downtime.
• UI health check
‒ Leverage Envoy membership health check.
• Total system uptime percentage
‒ Airflow is considered down if any of the scheduler, the workers, or the web server is down.
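The canary rule above can be sketched roughly as follows. This is only an illustration, not Lyft's implementation: the function names, the input shape (a list of timestamps at which canary tasks were scheduled), and the choice to count the whole gap as downtime are all assumptions.

```python
from datetime import datetime, timedelta

# Threshold from the slide: no task scheduled for 10 minutes counts as downtime.
DOWNTIME_THRESHOLD = timedelta(minutes=10)


def downtime_windows(scheduled_at, window_start, window_end,
                     threshold=DOWNTIME_THRESHOLD):
    """Return (start, end) gaps longer than `threshold` with no scheduled task.

    `scheduled_at` is a hypothetical list of datetimes at which canary tasks
    were scheduled; how they are collected is outside this sketch.
    """
    inside = sorted(t for t in scheduled_at if window_start <= t <= window_end)
    points = [window_start] + inside + [window_end]
    gaps = []
    for previous, current in zip(points, points[1:]):
        if current - previous > threshold:
            gaps.append((previous, current))
    return gaps


def uptime_pct(scheduled_at, window_start, window_end):
    """Percentage of the window not covered by a downtime gap.

    Assumption: the whole gap counts as downtime, not only the part
    beyond the threshold.
    """
    total = (window_end - window_start).total_seconds()
    down = sum(
        (end - start).total_seconds()
        for start, end in downtime_windows(scheduled_at, window_start, window_end)
    )
    return 100.0 * (total - down) / total


start = datetime(2019, 4, 10, 0, 0)
end = start + timedelta(hours=1)
# Canary tasks every 5 minutes, except for a 30-minute hole from 00:20 to 00:50.
seen = [start + timedelta(minutes=m) for m in (0, 5, 10, 15, 20, 50, 55, 60)]
print(uptime_pct(seen, start, end))  # 50.0
```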
Schedule Delay
• scheduler delay = TI.start_time - TI.execution_date
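As a small worked example of the formula above, the delay can be computed from two timestamps. The function name and the sample values are made up for illustration; the formula itself is used as written on the slide.

```python
from datetime import datetime


def schedule_delay_seconds(start_time, execution_date):
    """scheduler delay = TI.start_time - TI.execution_date, in seconds."""
    return (start_time - execution_date).total_seconds()


# Hypothetical task instance: execution_date 00:00:00, actual start 00:03:30.
delay = schedule_delay_seconds(
    start_time=datetime(2019, 4, 10, 0, 3, 30),
    execution_date=datetime(2019, 4, 10, 0, 0, 0),
)
print(delay)  # 210.0
```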
DAG last run time
• The time that has elapsed since the DAG file was last processed.
• If the time becomes too long, the scheduler has issues processing the DAG files.
‒ E.g., this could be due to parser threads being occupied by malicious DAG files.
Executor parallelism
• Parallelism: controls the number of concurrently running tasks.
‒ Monitor your worker nodes’ CPU utilization before increasing the value.
Airflow monitoring @ Lyft
Stats Name | Meaning
dag_processing.last_run.seconds_ago.<dag_file> | Seconds since <dag_file> was last processed
executor.open_slots | Number of open slots on the executor (parallelism - number of running tasks)
executor.queued_tasks | Number of queued tasks on the executor
executor.running_tasks | Number of running tasks on the executor
pool.starving_tasks.<pool_name> | Number of starving tasks in the pool, i.e. tasks that cannot run because of the pool's slot limit
...
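Two of the stats above are simple enough to sketch in a few lines. The helper names and the alert threshold below are hypothetical; only the metric definitions come from the table.

```python
def executor_open_slots(parallelism, running_tasks):
    """executor.open_slots = parallelism - number of running tasks.

    Clamped at zero so a transient overshoot never reports negative slots
    (the clamp is an assumption of this sketch).
    """
    return max(parallelism - running_tasks, 0)


def stale_dag_files(seconds_since_processed, threshold_seconds):
    """Return DAG files whose last processing is older than the threshold.

    `seconds_since_processed` maps a DAG file name to the value of
    dag_processing.last_run.seconds_ago.<dag_file>.
    """
    return sorted(
        dag_file
        for dag_file, seconds_ago in seconds_since_processed.items()
        if seconds_ago > threshold_seconds
    )


print(executor_open_slots(parallelism=32, running_tasks=20))  # 12
print(stale_dag_files({"a.py": 30, "b.py": 900}, threshold_seconds=300))  # ['b.py']
```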
Airflow Customization @ Lyft
• UI auditing
• Extra link for task instance UI panel (AIRFLOW-161)
• DAG dependency graph
Improve Airflow Reliability @ Lyft
Improving Airflow Performance @ Lyft
• Reduce Airflow UI page load time
‒ Change default_dag_run_display_number to 5.
• Tunables that affect task execution parallelism
‒ Parallelism
‒ Concurrency
‒ max_active_runs
‒ Pool
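One way to reason about how these tunables interact is to treat each as an independent cap, with the tightest one winning. The sketch below is a deliberately simplified model with made-up values, not a description of the scheduler's exact logic.

```python
def max_concurrent_tasks(parallelism, dag_concurrency, pool_slots):
    """Upper bound on tasks of one DAG that can run at once in one pool.

    Simplified model: each tunable is an independent cap and the tightest
    one wins. max_active_runs is not included because it limits DAG runs,
    not task instances.
    """
    return min(parallelism, dag_concurrency, pool_slots)


# Hypothetical values: a large executor, the DAG capped at 16 tasks,
# and a small pool of 4 slots. The pool is the bottleneck here.
print(max_concurrent_tasks(parallelism=128, dag_concurrency=16, pool_slots=4))  # 4
```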
Improving Airflow Reliability at Lyft
• Source Control For Pools
‒ All Airflow pools are defined in a source-controlled file on GitHub.
‒ Airflow pools are configured from that file at runtime.
• Integration tests for DAGs to enforce best practices and improve reliability
‒ All DAGs should be loadable within a time threshold.
‒ All DAGs should have valid pools associated.
‒ External task sensors should be valid (the (dag_id, task_id) pair exists).
‒ Each pool is used by at least one DAG.
‒ Each sensor has a reasonable timeout.
‒ Each DAG has a non-dynamic start date.
• Secure UI access
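Several of the integration checks listed above can be sketched against plain dictionaries. This is a minimal illustration, not Lyft's test suite: the data layout, the function name `validate_dags`, and the 12-hour timeout limit are assumptions, and a real test would load DAGs through Airflow instead.

```python
from datetime import timedelta


def validate_dags(dags, pools, max_sensor_timeout=timedelta(hours=12)):
    """Return a list of human-readable violations (empty if all checks pass).

    `dags` maps dag_id -> {task_id: task}, where task is a dict with the
    optional keys "pool", "external" ((dag_id, task_id) of an external task
    sensor target) and "sensor_timeout". `pools` is the set of pool names
    defined in the source-controlled pool file.
    """
    errors = []
    used_pools = set()
    for dag_id, tasks in dags.items():
        for task_id, task in tasks.items():
            name = f"{dag_id}.{task_id}"
            pool = task.get("pool")
            if pool is not None:
                used_pools.add(pool)
                if pool not in pools:
                    errors.append(f"{name}: unknown pool {pool!r}")
            target = task.get("external")
            if target is not None:
                ext_dag, ext_task = target
                if ext_task not in dags.get(ext_dag, {}):
                    errors.append(
                        f"{name}: external task {ext_dag}.{ext_task} does not exist"
                    )
            timeout = task.get("sensor_timeout")
            if timeout is not None and timeout > max_sensor_timeout:
                errors.append(f"{name}: sensor timeout {timeout} is too long")
    for pool in sorted(pools - used_pools):
        errors.append(f"pool {pool!r} is not used by any DAG")
    return errors


example_dags = {
    "a": {
        "load": {"pool": "etl"},
        "wait": {"external": ("b", "missing"), "sensor_timeout": timedelta(days=2)},
    },
    "b": {"run": {"pool": "ghost"}},
}
for problem in validate_dags(example_dags, pools={"etl", "idle"}):
    print(problem)
```

Returning a list of violations rather than raising on the first one lets a single test run report every broken DAG at once.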
Production Debug @ Lyft
• We document every production issue investigation in a doc.
• A couple of methodologies:
‒ View the centralized Airflow dashboard.
‒ Identify whether it is a UI issue or an Airflow scheduler (backend) issue.
‒ View the webserver log or the scheduler log.
∙ If the log is not available on the machine, check the log in Kibana.
∙ To further identify issues, we sometimes even look at logs in S3.
‒ Use different tools for further investigation.
∙ If an exception is thrown, identify which part of the Airflow code throws it.
∙ If a CPU / memory alarm fires, use top to identify which DAG causes the issue.
∙ If the failure is related to Celery, log in to the Celery Flower UI to investigate further.
∙ ...
Airflow Gotchas @ Lyft
• Daylight saving time (DST)
‒ The UI doesn’t have time zone support, even upstream.
‒ Our internal version of the scheduler has no time zone support.
• DAGs with a dynamic start date
‒ Hard to predict when the DAG is scheduled.
• Long-running external task sensors that don’t have valid external tasks.
• HivePartitionSensor doesn’t work for partial partitions.
‒ It only checks whether the data exists, not whether the data is fully loaded.
• Backfill experience
‒ We use the local executor to backfill.
• Long-running sensors occupy task slots in the pool.
• Users confuse DAG-level arguments with task-level arguments.
‒ E.g., putting max_active_runs in the default task arguments.
• Legacy high-abstraction framework over Airflow
‒ Hard to debug for the user and for us.
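The DAG-level vs. task-level confusion lends itself to a small lint check. The helper below is hypothetical, and the set of DAG-level argument names is a partial list chosen for illustration; the idea is that default task arguments are handed to tasks, so DAG constructor arguments placed there have no effect on the DAG.

```python
# DAG constructor arguments that are easy to misplace in the default task
# arguments. Partial list, chosen for illustration.
DAG_LEVEL_ARGS = {"max_active_runs", "concurrency", "schedule_interval", "catchup"}


def misplaced_dag_args(default_args):
    """Return DAG-level argument names found in a default task argument dict."""
    return sorted(DAG_LEVEL_ARGS & set(default_args))


# Hypothetical default task arguments containing the mistake from the slide.
suspicious = {"owner": "data-platform", "retries": 2, "max_active_runs": 1}
print(misplaced_dag_args(suspicious))  # ['max_active_runs']
```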
Upstream @ Lyft
Improve backfill experience
• New options for backfill
‒ --reset_dagruns: if used, Airflow first checks whether any existing dag_runs / task_instances are associated with the backfill date range. If so, it asks the user whether to clear those task_instances first. (AIRFLOW-2718)
‒ --rerun_failed_tasks: if used, Airflow automatically reruns failed tasks without requiring any user intervention. (AIRFLOW-2566)
• Backfill respects pool for isolation (AIRFLOW-1557)
Support batch backfill
• Use {{ prev_ds }} and {{ ds }} in SQL
‒ prev_ds equals ds - schedule_interval.
‒ Users can change the schedule_interval in the DAG file during backfill.
• Users can override DAG params with the -c option during backfill.
INSERT OVERWRITE TABLE {{ dest_db(default.superhero_data) }}
SELECT supe.superhero_name AS superhero_name,
       pop.popularity AS popularity
FROM {{ source_table(events.superheroes) }} supe
WHERE ds >= '{{ prev_ds }}' AND ds < '{{ ds }}'

airflow backfill superheroes -s 2018-05-01 -e 2018-05-08 -c '{"hive_cluster": "backfill_cluster"}'
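To see why a larger schedule_interval turns a backfill into fewer, bigger batches, the (prev_ds, ds) windows can be enumerated by hand. This is a sketch under the slide's definition prev_ds = ds - schedule_interval; the function names are made up, and the way Airflow actually aligns execution dates is not modeled.

```python
from datetime import date, timedelta


def backfill_windows(start, end, schedule_interval):
    """Yield (prev_ds, ds) pairs, where prev_ds = ds - schedule_interval."""
    ds = start
    while ds <= end:
        yield (ds - schedule_interval).isoformat(), ds.isoformat()
        ds += schedule_interval


def render_where(prev_ds, ds):
    """Render the half-open partition filter [prev_ds, ds)."""
    return f"WHERE ds >= '{prev_ds}' AND ds < '{ds}'"


# Same date range as the backfill command, with a weekly interval:
# two runs, each covering seven daily partitions.
for prev_ds, ds in backfill_windows(date(2018, 5, 1), date(2018, 5, 8),
                                    timedelta(days=7)):
    print(render_where(prev_ds, ds))
```

With a daily interval the same range would yield eight single-day windows instead.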
Airflow DAG level access @ Lyft
• DAG access control has always been a real need at Lyft.
‒ HR data, financial data, etc.
‒ The workaround is to build an isolated, dedicated cluster for each use case.
• Airflow introduced the RBAC feature in 1.10.
‒ The new Airflow webserver is based on Flask-AppBuilder.
‒ Ships with 5 static roles (Admin, User, Op, Viewer, Public).
‒ ...
• Airflow DAG level access (AIRFLOW-2267)
‒ Provides more granular access control at the DAG level.
• The new Airflow UI migrates from Flask-Admin to Flask-AppBuilder (FAB).
• FAB’s security model.
• Which Airflow version includes the change?
‒ 1.10.2 includes the initial implementation.
‒ 1.10.3 (upcoming) includes the enhancements.
• How it works
‒ Two new perms: can_dag_read (read), can_dag_edit (write).
‒ DAG-level roles can be created through the CLI / UI by an Admin (doc).
‒ A DAG-level role can see only the DAGs it is allowed to view.
‒ Users can declare permissions in the DAG file (AIRFLOW-2694).
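The permission model above can be illustrated with a small lookup. This is a toy model, not the FAB or Airflow implementation: the mapping layout, the role names, and the assumption that edit permission also makes a DAG viewable are all hypothetical; only the two permission names come from the slide.

```python
# Permission names from the slide.
CAN_DAG_READ = "can_dag_read"
CAN_DAG_EDIT = "can_dag_edit"


def viewable_dags(role, access_control):
    """Return the DAG ids a role may see.

    `access_control` maps dag_id -> {role_name: set_of_permissions},
    mirroring the idea of declaring permissions next to each DAG.
    Assumption: holding either permission makes the DAG viewable.
    """
    return sorted(
        dag_id
        for dag_id, roles in access_control.items()
        if roles.get(role, set()) & {CAN_DAG_READ, CAN_DAG_EDIT}
    )


def can_edit(role, dag_id, access_control):
    """True if the role holds can_dag_edit on the given DAG."""
    return CAN_DAG_EDIT in access_control.get(dag_id, {}).get(role, set())


# Hypothetical DAGs and roles.
acl = {
    "hr_payroll": {"hr_team": {CAN_DAG_READ, CAN_DAG_EDIT}},
    "finance_daily": {"finance_team": {CAN_DAG_READ}},
}
print(viewable_dags("hr_team", acl))  # ['hr_payroll']
print(can_edit("finance_team", "finance_daily", acl))  # False
```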
• We built a new cluster based on the Airflow master branch and onboarded a couple of new sensitive-data use cases.
‒ Each use case has its own repo.
‒ User-role relationships are source controlled in a YAML file.
• DAG owners specify the access control info in the DAG files.
• Gotchas
‒ New user onboarding
‒ Integration between FAB and Google authentication (OAuth)
‒ Integration with the internal ACL service
‒ ...
User registration flow
Next Step
• Support the Airflow DAG level access feature in beta internally.
• Integrate the Airflow RBAC / DAG level access feature with the internal ACL service (FAB issue).
• Migrate all the existing DAGs to this new cluster.
• Explore running Airflow with the k8s executor internally.
Summary
• The Airflow community has been growing a lot!
• We shared our experience operating Airflow at Lyft.
• We shared some of our upstream work.
‒ Improve the Airflow backfill experience
‒ Support Airflow DAG level access
Acknowledgement
• Members who maintain Airflow at Lyft
‒ Alagappan Sethuraman
‒ Andrew Stahlman
‒ Chao-han Tsai
‒ Jinhyuk Chang
‒ Junda Yang
‒ Max Payton
‒ Tao Feng
• Special thanks to Maxime Beauchemin, who provided numerous suggestions to us.
Tao Feng | @feng-tao
Slides at TBD
Blog at go.lyft.com/airflowblog
Icons under Creative Commons License from https://thenounproject.com/

Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 

Airflow at Lyft

  • 1. April 10th 2019 Tao Feng | @feng-tao | Software Engineer, Lyft Airflow @ Lyft
  • 2. Who ● Tao Feng ● Engineer at Lyft Data Platform ● Apache Airflow PMC and Committer ● Working on different data products (Airflow, Amundsen, etc.)
  • 3. Agenda • Airflow in general • Airflow @ Lyft • Upstream @ Lyft • Next Step • Summary
  • 5. Airflow in general • Airflow recently became an Apache top-level project (TLP). ‒ Total 20 PMCs / committers • Most recent releases: 1.10, 1.10.1, 1.10.2 (1.10.3 is coming). ‒ New Features: Airflow RBAC, Airflow K8S integration, etc. • New process in Airflow for proposing architecture changes. ‒ Airflow Improvement Proposals (currently 19+ proposals) • The community recently conducted an Airflow user survey (link). 11k+ GitHub stars, 740+ contributors, 250+ companies using
  • 7. Core Infra high level architecture @ Lyft
  • 9. Airflow Architecture @ Lyft • WebUI: the portal for users to view the status of the DAGs. • Metadata DB: the Airflow metastore for storing various job statuses. • Scheduler: a multi-process component that parses the DAG bag, creates DAG objects, and triggers the executor to run tasks whose dependencies are met. • Executor: a message queuing process that orchestrates worker processes to execute tasks. We use CeleryExecutor at Lyft. • TARS: Airflow development / backfill environment, which provides access to production data.
  • 10. Airflow Architecture @ Lyft • Main Cluster Config: Apache Airflow 1.8.2 with cherry-picks and numerous in-house Lyft patches. • Scale: Three sets of ASGs for workers. ‒ ASG #1: 15 worker nodes, each of the r5.4xlarge (16vcpu, 128g mem) type. This fleet of workers processes low-priority, memory-intensive tasks. ‒ ASG #2: 3 worker nodes, each of the m4.4xlarge (16vcpu, 64g mem) type. This fleet of workers is dedicated to DAGs with a strict SLA. ‒ ASG #3: 1 worker node of the m4.10xlarge (40vcpu, 160g mem) type. This single node processes the compute-intensive workloads from a critical team’s DAGs. ‒ Backfill Box (TARS): 1 node of the m4.16xlarge (64vcpu, 256g mem) type. This box is used for fast DAG prototyping and backfill.
  • 11. Airflow daily stats @ Lyft: 600+ DAGs, 800+ DagRuns, 25k+ TIs (task instances)
  • 13. Airflow Availability • Scheduler and worker health check ‒ Use a canary monitoring DAG. ‒ If no task has been scheduled for 10 minutes, that is considered downtime. • UI health check ‒ Leverage the Envoy membership health check. • Total system uptime percentage ‒ Airflow is down if any of the scheduler, workers, or web server is down.
  • 14. Schedule Delay • schedule delay = TI.start_time - TI.execution_date
  • 15. DAG last run time • The time that has elapsed since the DAG file was last processed. • If the time becomes too long, it means the scheduler has issues processing the DAG files. ‒ E.g., it could be due to parser threads being occupied by malicious DAG files.
  • 16. Executor parallelism • Parallelism: controls the number of concurrently running tasks. ‒ Please monitor your worker nodes’ CPU utilization before increasing the value.
  • 17. Airflow monitoring @ Lyft (stats name: meaning) ‒ dag_processing.last_run.seconds_ago.<dag_file>: seconds since <dag_file> was last processed. ‒ executor.open_slots: number of open slots on the executor (parallelism - number of running tasks). ‒ executor.queued_tasks: number of queued tasks on the executor. ‒ executor.running_tasks: number of running tasks on the executor. ‒ pool.starving_tasks.<pool_name>: number of starving tasks in the pool, i.e. how many tasks are starving due to the pool slot limit. ‒ ...
  • 19. Airflow customization @ Lyft • UI auditing • Extra link for task instance UI panel (AIRFLOW-161)
  • 20. Airflow customization @ Lyft • DAG dependency graph
  • 22. Improving Airflow Performance @ Lyft • Reduce Airflow UI page load time ‒ Change default_dag_run_display_number to 5. • Tunables that impact task execution parallelism ‒ Parallelism ‒ Concurrency ‒ max_active_runs ‒ Pool
  • 23. Improving Airflow Reliability at Lyft • Source control for pools ‒ All Airflow pools are defined in a source-controlled file on GitHub. ‒ Airflow pools are configured from that file at runtime. • Integration tests for DAGs to enforce best practices and improve reliability ‒ All DAGs should be loadable within a time threshold. ‒ All DAGs should have valid pools associated. ‒ External task sensors should be valid (the (dag_id, task_id) pair exists). ‒ Each pool is used by at least one DAG. ‒ Each sensor has a reasonable timeout. ‒ Each DAG has a non-dynamic start date. • Secure UI access
  • 25. Production Debug @ Lyft • We document every production issue investigation in a doc. • A couple of methodologies: ‒ View the centralized Airflow dashboard. ‒ Identify whether it is a UI issue or an Airflow scheduler (backend) issue. ‒ View the webserver log or scheduler log. ∙ If the log is not available on the machine, check the log in Kibana. ∙ To further identify issues, we sometimes even look at logs in S3. ‒ Use different tools for further investigation. ∙ If an exception is thrown, understand which part of the Airflow code throws it. ∙ If there is a CPU / memory alarm, use top to identify which DAG causes the issue. ∙ If the failure is related to Celery, log in to the Celery Flower UI to investigate further. ∙ ...
  • 26. Airflow Gotchas @ Lyft
  • 27. Airflow Gotchas at Lyft • DST ‒ The UI doesn’t have timezone support, even upstream. ‒ The scheduler in our internal version has no timezone support. • DAGs with a dynamic start date ‒ Hard to predict when the DAG is scheduled. • Long-running external task sensors that don’t have valid external tasks. • HivePartitionSensor doesn’t work for partial partitions ‒ It only checks whether data exists, not whether the data is fully loaded. • Backfill experience ‒ We use the local executor to backfill. • A long-running sensor occupies a task slot of the pool. • Users confuse DAG-level arguments with task-level arguments ‒ E.g., putting max_active_runs in the default task arguments. • Legacy high-abstraction framework over Airflow ‒ Hard to debug for the user and for us.
  • 30. Improve backfill experience • New options for backfill ‒ --reset_dagruns: if used, Airflow will first check whether any existing dag_runs / task_instances are associated with the backfill date range. If so, it will ask the user whether to clear those task_instances first. (AIRFLOW-2718) ‒ --rerun_failed_tasks: if used, Airflow will automatically try to rerun the failed tasks without requiring any user intervention. (AIRFLOW-2566) • Backfill respects pools for isolation (AIRFLOW-1557)
  • 31. Improve backfill experience: Support batch backfill • Use {{ prev_ds }} and {{ ds }} in SQL ‒ prev_ds equals ds - schedule_interval. ‒ Users could change the schedule_interval in the DAG file during backfill. • Users could override DAG params with the -c option during backfill. INSERT OVERWRITE TABLE {{ dest_db(default.superhero_data) }} SELECT supe.superhero_name AS superhero_name, pop.popularity AS popularity FROM {{ source_table(events.superheroes) }} supe WHERE ds >= {{ prev_ds }} AND ds < {{ ds }} airflow backfill superheroes -s 2018-05-01 -e 2018-05-08 -c '{"hive_cluster": "backfill_cluster"}'
  • 33. Airflow DAG level access @ Lyft • DAG access control has always been a real need at Lyft ‒ HR data, financial data, etc. ‒ The workaround is to build an isolated, dedicated cluster for each use case. • Airflow introduced the RBAC feature in 1.10 ‒ Airflow’s new webserver is based on Flask-AppBuilder. ‒ Ships with 5 static roles (Admin, User, Op, Viewer, Public). ‒ ... • Airflow DAG level access (AIRFLOW-2267) ‒ Provides additional granular access control at the DAG level.
  • 34. Airflow DAG level access @ Lyft • New Airflow UI migrates from Flask-Admin to Flask-AppBuilder (FAB). • FAB’s security model.
  • 35. Airflow DAG level access @ Lyft • Which Airflow version includes the change? ‒ 1.10.2 includes the initial implementation. ‒ 1.10.3 (upcoming) includes the enhancements. • How it works ‒ Two new perms: can_dag_read (read), can_dag_edit (write). ‒ A DAG-level role can be created through the CLI / UI by an Admin (doc). ‒ A DAG-level role can only see the DAGs it is allowed to view. ‒ Users can declare permissions in the DAG file (AIRFLOW-2694).
  • 36. Airflow DAG level access @ Lyft • We built a new cluster based on the Airflow master branch and onboarded a couple of new sensitive-data use cases. ‒ Each use case has its own repo. ‒ The user-role relationship is source controlled in a YAML file. • DAG owners specify the access control info in the DAG files. • Gotchas ‒ New user onboarding ‒ Integration between FAB and Google authentication (OAuth) ‒ Integration with the internal ACL service ‒ ... User registration flow
  • 38. Next Step • Support Airflow DAG level access feature in beta internally. • Integrate Airflow RBAC / DAG level feature with internal ACL service (FAB issue). • Migrate all the existing DAGs to this new cluster. • Explore running Airflow with k8s executor internally.
  • 40. Summary • The Airflow community has been growing a lot! • We shared our experience operating Airflow at Lyft. • We shared some of our upstream work ‒ Improve Airflow backfill experience ‒ Support Airflow DAG level access
  • 41. Acknowledgement • Members who maintain Airflow at Lyft ‒ Alagappan Sethuraman ‒ Andrew Stahlman ‒ Chao-han Tsai ‒ Jinhyuk Chang ‒ Junda Yang ‒ Max Payton ‒ Tao Feng • Special thanks to Maxime Beauchemin, who provided numerous suggestions to us.
  • 42. Tao Feng | @feng-tao Slides at TBD Blog at go.lyft.com/airflowblog Icons under Creative Commons License from https://thenounproject.com/
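
The two health signals from slides 13 and 14 (schedule delay, and "no task scheduled for 10 minutes means downtime") can be sketched roughly as follows. This is a minimal illustration in plain Python, not Lyft's monitoring code: `TaskInstance` is a hypothetical stand-in record rather than Airflow's model class, its field names simply follow the formula on slide 14, and `is_down` assumes the canary DAG's task start times are the signal being watched.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Threshold from slide 13: no task scheduled for 10 minutes counts as downtime.
DOWNTIME_THRESHOLD = timedelta(minutes=10)


@dataclass(frozen=True)
class TaskInstance:
    """Hypothetical stand-in for a task instance record (not Airflow's class)."""
    execution_date: datetime
    start_time: datetime


def schedule_delay(ti: TaskInstance) -> timedelta:
    """Slide 14: schedule delay = TI.start_time - TI.execution_date."""
    return ti.start_time - ti.execution_date


def is_down(canary_tis: Iterable[TaskInstance], now: datetime) -> bool:
    """Report downtime if no canary task started within the threshold."""
    latest: Optional[datetime] = max(
        (ti.start_time for ti in canary_tis), default=None
    )
    # No canary task at all is treated as downtime as well.
    return latest is None or now - latest > DOWNTIME_THRESHOLD
```

In practice the inputs would come from the metadata DB or from emitted stats; the sketch only fixes the arithmetic so the threshold and the delay definition are unambiguous.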

Editor's notes

  1. Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows. Workflows are defined as code. Growing community. Todo: mention the stats first, then the facts.
  2. What does the architecture for our core infra look like? Mobile application primarily… Raw events can come either from the client… or from back-end events triggered in the server… the data comes to our message bus… Kinesis/Kafka and then with light ELTing written to S3 where it persists… today we keep all the data in archival… then we develop data models and transform raw events to tables in Hive. We use Hive for long-running queries and Presto for interactive queries… People build dashboards on top of Hive and visualize for exploratory analysis in Presto… Airflow is used for scheduling (executive dashboard, metric aggregation, derived data generation, machine learning feature computation)
  3. https://github.com/dpgaspar/Flask-AppBuilder/issues/518 We are not the only team that manages Airflow, but we are the biggest team managing Airflow at Lyft. Previously, other teams with security requirements ran a separate cluster for their own use case.
  4. Parallelism set to 200. Instance types: r5.4xlarge (16vcpu, 128g mem), m4.4xlarge (16vcpu, 64g), m4.10xlarge (40vcpu, 160g), m4.16xlarge (64vcpu, 256g).
  5. Canary monitoring DAG: when we do Airflow maintenance, we check whether the canary DAG is running as the signal for whether there are any issues.
  6. Scheduler delay roughly equals the time it takes the scheduler to pick up the task (depends on the scheduling loop and task priority) + the time it takes a Celery worker to pick up the task from the Celery broker. Measured with the canary monitoring DAG.
  7. Open slots, running tasks, queued tasks
  8. At Lyft we mostly used ExternalTaskSensor and HivePartitionSensor. One of our interns’ summer projects built a DAG dependency graph based on ExternalTaskSensor and HivePartitionSensor. The info is generated in a daily Airflow DAG.
  9. Parallelism: This variable controls the number of task instances that the Airflow worker can run simultaneously. Users could increase the parallelism variable in airflow.cfg. We normally suggest users increase this value when doing backfill. Concurrency: The Airflow scheduler will run no more than concurrency task instances for your DAG at any given time. Concurrency is defined in your Airflow DAG as a DAG input argument. If you do not set the concurrency on your DAG, the scheduler will use the default value from the dag_concurrency entry in your airflow.cfg. max_active_runs: Airflow will run no more than max_active_runs DagRuns of your DAG at a given time. If you do not set max_active_runs on your DAG, Airflow will use the default value from the max_active_runs_per_dag entry in your airflow.cfg. We suggest that users not set depends_on_past to true and that they increase this setting during backfill. Pool: An Airflow pool is used to limit the execution parallelism. Users could increase the priority_weight for a task if it is a critical one.
  10. Todo: mention pool source control. Todo: need to have some examples for reliability.
  11. Todo: mention pool source control. Todo: need to have some examples for reliability.
  12. Data engineering handbook
  13. Provide a util that allows users to easily promote the partition to the table in the dest schema.
  14. Talk about backfill improvement
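
To make the interaction of the tunables in note 9 more concrete, here is a small sketch of how the tightest limit decides how many queued tasks can start. It is a deliberately simplified model under stated assumptions, not Airflow's scheduler logic: `Limits` and `startable_tasks` are hypothetical names, max_active_runs and priority_weight are left out, and only the parallelism value of 200 comes from the notes (the other numbers in the usage lines are made up).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Hypothetical bundle of the limits described in note 9."""
    parallelism: int      # overall cap on running task instances
    dag_concurrency: int  # per-DAG cap on running task instances
    pool_slots: int       # size of the pool the tasks run in


def startable_tasks(
    limits: Limits,
    running_total: int,
    running_in_dag: int,
    running_in_pool: int,
    queued: int,
) -> int:
    """Return how many queued tasks of one DAG could start right now.

    The smallest remaining headroom across the three limits wins.
    """
    headroom = min(
        limits.parallelism - running_total,
        limits.dag_concurrency - running_in_dag,
        limits.pool_slots - running_in_pool,
    )
    return max(0, min(queued, headroom))


# Illustrative numbers: the pool is the bottleneck, so only 2 tasks start.
limits = Limits(parallelism=200, dag_concurrency=16, pool_slots=10)
print(startable_tasks(limits, running_total=150, running_in_dag=4,
                      running_in_pool=8, queued=20))
```

Seen this way, raising parallelism alone during a backfill does not help if the DAG's concurrency or its pool is the limit that is actually binding.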