In Data Engineer's Lunch 89, Obioma Anomnachi will discuss how to manage and schedule Machine Learning operations via Airflow. Learn how you can write complete end-to-end pipelines, from retrieving raw data to serving ML predictions to end-users, entirely in Airflow.
Data Engineer's Lunch 89: Machine Learning Orchestration with Airflow
1. Version 1.0
Machine Learning Orchestration
with Airflow
Using Apache Airflow to manage and schedule machine
learning tasks.
Obioma Anomnachi
Engineer @ Anant
2. Airflow
● A tool for scheduling and automating workflows
● Good for automating repeated processes
● Write workflows in Python
○ Tasks are defined in Python but can include Operators for all sorts of external tools
○ Define dependencies between the tasks that make up a workflow
○ Create directed acyclic graphs (DAGs) of tasks
● Schedule workflows or execute processes manually
○ Cron syntax or preset frequency intervals (e.g. @daily)
● Monitor task progress and view logs in Airflow UI
3. ML Pipelines
● Machine Learning processes get broken down into small repeatable chunks
○ Most ML-associated tasks are batch processes
■ Batch processing tasks work on blocks of data at a time
● Even predictions can be bundled together into batches if results aren't time sensitive
○ This structure lines up really well with Airflow's ability to schedule, automate, and manage dependencies for tasks
○ Main sections are data processing, model training, and deployment
4. Data Preparation
● Data prep covers a number of data transformations that bring raw data in line with the needs of the model training process
○ Basically a set of ETL jobs
■ Different models require data to be in different forms
○ Still preferable to keep the actual data processing separate from the DAG code, since the Airflow scheduler and/or worker processes can get bogged down
■ Best to start separate processes on other systems like Spark
5. Train/Test Split
● Involves randomly splitting up processed data in preparation for model training
● In order to determine the efficacy of an ML model, we need to be able to test it on data whose real labels we know but that the model wasn't trained on
○ The standard method is a simple train/test split
○ More complex methods like cross validation can produce better models
● The split needs to be redone when new data comes in
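The basic idea can be sketched in plain Python (in practice a library routine such as scikit-learn's `train_test_split` would normally be used):

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Randomly split rows into a train set and a test set."""
    rng = random.Random(seed)     # fixed seed makes the split reproducible
    shuffled = rows[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # → 80 20
```

Re-running the split task whenever new data lands keeps the held-out test set representative of the current data.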
6. Model Training
● For the majority of ML algorithms, model training is a standard batch process, since the model needs to be trained on all the data and isn't usable until that training is complete
● This step also covers how and where the model is stored
○ Permanent model methods have the trained model stored somewhere, and new prediction requests use whatever the most recent model version is
■ Models get saved to a central location
■ There are a number of formats for saving ML models to disk; Python objects are all savable via pickling. Some models may even be small enough to
○ Transient model methods have each request trigger the training of a new model, which is used to make the prediction (which is stored); the model itself gets deleted afterwards
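A minimal sketch of the permanent-model pattern using pickling (the dict stands in for a real trained model object, and the temp-dir path stands in for a shared model store):

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any Python object works with pickle.
model = {"weights": [0.1, 0.7], "bias": -0.3}

# A training task saves the model to a central location...
path = os.path.join(tempfile.gettempdir(), "model_v1.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...and a later prediction task loads the most recent version.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == model)  # → True
```

In practice the "central location" would be object storage or a shared volume, and the filename would carry a version so predictions always pick up the latest model.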
7. Deployment
● Deployment involves using the completed model to serve predictions on inputs that were not part of the original data set
○ You could have Airflow answer those requests, but it isn't really meant for that
○ Better to use Airflow to keep the model up to date in an external system
■ If deployment in your system means using the model on blocks of data at once (doing analytics), the process can be scheduled similarly to data preprocessing
■ If individual requests come in and require service, it's best to build an API that will use the model to service those requests
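For the individual-request case, a small web API can serve the model that Airflow keeps up to date. A sketch using Flask, one common choice for this (the inline dict stands in for a model that a real service would load from the pickled file the pipeline publishes):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in model: a weighted sum. In a real deployment this would be
# reloaded whenever the Airflow pipeline publishes a new model version.
MODEL = {"weights": [2.0, -1.0], "bias": 0.5}


def predict(features):
    return sum(w * x for w, x in zip(MODEL["weights"], features)) + MODEL["bias"]


@app.route("/predict", methods=["POST"])
def predict_route():
    features = request.get_json()["features"]
    return jsonify({"prediction": predict(features)})
```

This keeps the division of labor clean: Airflow owns the batch work (prep, training, publishing the model), while the API owns low-latency, per-request serving.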
8. ML Ops
● ML Ops covers all the topics we've talked about, plus some extras
○ Three main phases
■ Design - understanding the business use case, the available data, and the structure of the software
■ Development - includes the data engineering for preprocessing steps and model selection and tuning
■ Operations - delivering (and continuing to deliver) developed ML models through testing, monitoring, and versioning