In Data Engineer's Lunch 89, Obioma Anomnachi will discuss how to manage and schedule Machine Learning operations via Airflow. Learn how you can write complete end-to-end pipelines, from retrieving raw data to serving ML predictions to end-users, entirely in Airflow.
Data Engineer's Lunch 89: Machine Learning Orchestration with Airflow
1. Version 1.0
Machine Learning Orchestration
with Airflow
Using Apache Airflow to manage and schedule machine
learning tasks.
Obioma Anomnachi
Engineer @ Anant
2. Airflow
● A tool for scheduling and automating workflows
● Good for automating repeated processes
● Write workflows in Python
○ Tasks are defined in Python but can include Operators for all sorts of external tools
○ Define dependencies between the tasks that make up a workflow
○ Create directed acyclic graphs (DAGs) of tasks
● Schedule workflows or execute processes manually
○ Cron syntax or preset frequency intervals (e.g. @daily)
● Monitor task progress and view logs in Airflow UI
3. ML Pipelines
● Machine Learning processes get broken down into small repeatable chunks
○ Most ML-associated tasks are batch processes
■ Batch processing tasks work on blocks of data at a time
● Even predictions can be bundled together into batches if results aren't time sensitive
○ This structure lines up really well with Airflow's ability to schedule, automate, and manage dependencies for tasks
○ Main sections are data processing, model training, and deployment
4. Data Preparation
● Data prep covers a number of data transformations that bring raw data in line with the needs of the model training process
○ Basically a set of ETL jobs
■ Different models require data to be in different forms
○ Still preferable to keep the actual data processing separate from the DAG code, since the Airflow scheduler and/or worker processes can get bogged down
■ Best to start separate processes on other systems like Spark
5. Train/Test Split
● Involves randomly splitting up processed data in preparation for model training
● In order to determine the efficacy of an ML model, we need to be able to test it on data whose real labels we know but that the model wasn't trained on
○ The standard method is a simple train/test split
○ More complex methods like cross validation can produce better models
● The split needs to be redone when new data comes in
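The basic idea can be sketched in plain Python (in practice a library routine such as scikit-learn's `train_test_split` would normally be used):

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Randomly split rows into a train set and a test set."""
    rng = random.Random(seed)     # fixed seed makes the split reproducible
    shuffled = rows[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # → 80 20
```

Re-running the split task whenever new data lands keeps the held-out test set representative of the current data.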
6. Model Training
● For the majority of ML algorithms, model training is a standard batch process, since the model needs to be trained on all the data and isn't usable until that training is complete
● This step also covers how and where the model is stored
○ Permanent model methods have the trained model stored somewhere, and new prediction requests use whatever the most recent model version is
■ Models get saved to a central location
■ There are a number of formats for saving ML models to disk; Python objects are all savable via pickling. Some models may even be small enough to
○ Transient model methods have each request trigger the training of a new model, which is used to make the prediction (which is stored); the model itself gets deleted afterwards
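A minimal sketch of the permanent-model pattern using pickling (the dict stands in for a real trained model object, and the temp-dir path stands in for a shared model store):

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any Python object works with pickle.
model = {"weights": [0.1, 0.7], "bias": -0.3}

# A training task saves the model to a central location...
path = os.path.join(tempfile.gettempdir(), "model_v1.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...and a later prediction task loads the most recent version.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == model)  # → True
```

In practice the "central location" would be object storage or a shared volume, and the filename would carry a version so predictions always pick up the latest model.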
7. Deployment
● Deployment involves using the completed model to serve predictions on inputs that were not part of the original data set
○ You could have Airflow answer those requests, but it isn't really meant for that
○ Better to use Airflow to keep the model up to date in an external system
■ If deployment in your system means using the model on blocks of data at once (doing analytics), the process can be scheduled similarly to data preprocessing
■ If individual requests come in and require service, it's best to build an API that will use the model to service those requests
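For the individual-request case, a small web API can serve the model that Airflow keeps up to date. A sketch using Flask, one common choice for this (the inline dict stands in for a model that a real service would load from the pickled file the pipeline publishes):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in model: a weighted sum. In a real deployment this would be
# reloaded whenever the Airflow pipeline publishes a new model version.
MODEL = {"weights": [2.0, -1.0], "bias": 0.5}


def predict(features):
    return sum(w * x for w, x in zip(MODEL["weights"], features)) + MODEL["bias"]


@app.route("/predict", methods=["POST"])
def predict_route():
    features = request.get_json()["features"]
    return jsonify({"prediction": predict(features)})
```

This keeps the division of labor clean: Airflow owns the batch work (prep, training, publishing the model), while the API owns low-latency, per-request serving.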
8. ML Ops
● ML Ops covers all the topics we've talked about, plus some extras
○ Three main phases
■ Design - understanding the business use case, the available data, and the structure of the software
■ Development - includes the data engineering for preprocessing steps and model selection and tuning
■ Operations - delivering (and continuing to deliver) developed ML models through testing, monitoring, and versioning