Introduce MLflow with Databricks
Liangjun Jiang, 10/03/2018
A tutorial to help you use MLflow in your work
Problems of Machine Learning Workflow
• Difficult to keep track of experiments
• Difficult to reproduce code
• No standard way to package and deploy models
Outline
• MLflow overview
• MLflow with Databricks
• CI/CD with Databricks
Part I: MLflow Overview
Open source platform for the
machine learning lifecycle
https://github.com/mlflow/mlflow
https://mlflow.org/
MLflow Components
Scalability and Big Data
MLflow supports scaling in three dimensions:
1. An individual MLflow run can execute on a distributed cluster, for example,
using Apache Spark. You can launch runs on the distributed infrastructure of your
choice and report results to a Tracking Server to compare them. MLflow includes a
built-in API to launch runs on Databricks.
2. MLflow supports launching multiple runs in parallel with different parameters, for
example, for hyperparameter tuning. You can simply use the Projects API to start
multiple runs and the Tracking API to track them.
3. MLflow Projects can take input from, and write output to, distributed storage
systems such as AWS S3 and DBFS. MLflow can automatically download such files
locally for projects that can only run on local files, or give the project a distributed
storage URI if it supports that. This means that you can write projects that build
large datasets, such as featurizing a 100 TB file.
MLflow Components - Tracking
• MLflow Tracking is an API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code, and for later visualizing the results. You can use MLflow Tracking in any environment (for example, a standalone script or a notebook) to log results to local files or to a server, then compare multiple runs. Teams can also use it to compare results from different users.
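For example, a minimal Tracking sketch in Python (the parameter, metric, and file names are illustrative, not from this deck):

import mlflow

# Logs to ./mlruns by default; set MLFLOW_TRACKING_URI to log to a
# remote Tracking Server instead.
with mlflow.start_run():
    mlflow.log_param("alpha", 0.4)             # a hyperparameter
    mlflow.log_metric("rmse", 0.78)            # an evaluation result
    with open("model_summary.txt", "w") as f:  # any output file...
        f.write("demo artifact")
    mlflow.log_artifact("model_summary.txt")   # ...logged as an artifact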
Tracking – cont’d
Multi-step Tracking
• A typical flow:
• Step 1: download data from a URL
• Step 2: transform and load the downloaded dataset to another place
• Step 3: use Spark (Databricks) to train your model
• Step 4: share your model with application developers
You can use the MLflow Tracking API to track information for each step, as sketched below.
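A hedged sketch of multi-step tracking, using one MLflow run per step with a shared tag so the steps can be compared side by side in the Tracking UI (the step functions are hypothetical placeholders):

import mlflow

def download_data():  pass  # step 1 (hypothetical)
def transform_data(): pass  # step 2 (hypothetical)
def train_model():    pass  # step 3 (hypothetical)

for step_name, step_fn in [("download", download_data),
                           ("etl", transform_data),
                           ("train", train_model)]:
    with mlflow.start_run(run_name=step_name):
        mlflow.set_tag("pipeline", "wine-quality")
        step_fn()  # each step logs its own params, metrics, and artifacts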
MLflow Components - Projects
• MLflow Projects are a standard format for packaging reusable data science code.
Each project is simply a directory with code or a Git repository,
and uses a descriptor file or simply convention to specify its
dependencies and how to run the code. For example, projects
can contain a conda.yaml file for specifying a
Python Conda environment. When you use the MLflow Tracking
API in a Project, MLflow automatically remembers the project
version (for example, Git commit) and any parameters. You can
easily run existing MLflow Projects from GitHub or your own Git
repository, and chain them into multi-step workflows.
mlflow run https://github.com/mlflow/mlflow-example.git -P alpha=0.4
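The descriptor is a small MLproject file at the project root; a sketch of roughly what the mlflow-example descriptor looks like (file names and defaults are assumptions):

# MLproject
name: wine-quality
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
    command: "python train.py --alpha {alpha}"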
MLflow Components - Models
• MLflow Models offer a convention for packaging machine learning models in multiple
flavors, and a variety of tools to help you deploy them. Each Model is
saved as a directory containing arbitrary files and a descriptor file
that lists several “flavors” the model can be used in. For example, a
TensorFlow model can be loaded as a TensorFlow DAG, or as a
Python function to apply to input data. MLflow provides tools to
deploy many common model types to diverse platforms: for
example, any model supporting the “Python function” flavor can be
deployed to a Docker-based REST server, to cloud platforms such as
Azure ML and AWS SageMaker, and as a user-defined function in
Apache Spark for batch and streaming inference. If you output
MLflow Models using the Tracking API, MLflow also automatically
remembers which Project and run they came from.
MLflow Models
• Storage Format
• MLflow defines several “standard” flavors that all of its built-in
deployment tools support, such as a “Python function” flavor that
describes how to run the model as a Python function.
• However, libraries can also define and use other flavors. For example,
MLflow’s mlflow.sklearn library allows loading models back as a scikit-learn Pipeline object for use in code that is aware of scikit-learn, or as
a generic Python function for use in tools that just need to apply the
model (for example, the mlflow sagemaker tool for deploying models
to Amazon SageMaker).
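A hedged save-and-load sketch for the scikit-learn flavor (the model and data are toy placeholders, and <RUN_ID> must be filled in from the Tracking UI):

import numpy as np
import mlflow
import mlflow.pyfunc
import mlflow.sklearn
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=0.4)
model.fit(np.array([[0.0], [1.0]]), np.array([0.0, 1.0]))  # toy data

with mlflow.start_run():
    mlflow.sklearn.log_model(model, artifact_path="model")

# Load back as a scikit-learn object (full estimator/Pipeline API):
sk_model = mlflow.sklearn.load_model("runs:/<RUN_ID>/model")

# Or as a generic python_function model for tools that only call predict():
pyfunc_model = mlflow.pyfunc.load_model("runs:/<RUN_ID>/model")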
Built-in Model Flavors
• Python Function
• R Function
• H2O
• Keras
• MLeap
• PyTorch
• Scikit-learn
• Spark MLlib
• TensorFlow
• ONNX
Saving & Serving Models
• MLflow includes a generic MLmodel format for saving models from a
variety of tools in diverse flavors. For example, many models can be
served as Python functions, so an MLmodel file can declare how each
model should be interpreted as a Python function in order to let
various tools serve it. MLflow also includes tools for running such
models locally and exporting them to Docker containers or
commercial serving platforms.
• For example, batch inference on Apache Spark and real-time serving through a REST API:
mlflow models serve -m runs:/<RUN_ID>/model
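The MLmodel file at the root of the saved model directory is a small YAML descriptor; a hedged sketch for a scikit-learn model (versions and file names are illustrative):

# MLmodel
artifact_path: model
run_id: <RUN_ID>
flavors:
  python_function:
    loader_module: mlflow.sklearn
    python_version: 3.6.8
  sklearn:
    pickled_model: model.pkl
    sklearn_version: 0.20.3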
Machine Learning Model as a Service
• A local RESTful service demo: Modeling wine preferences by data mining from physicochemical properties
• mlflow models serve -m runs:/ea388ca349964193a17f2823480fc6bf/model --port 5001
• curl -d '{"columns":["x"], "data":[[1], [-1]]}' -H 'Content-Type: application/json; format=pandas-split' -X POST localhost:5001/invocations
More about Model Serving
• RESTful service for batch or low-latency inference, with or without Databricks (Spark); a batch-scoring sketch follows below
• Deploy with Azure ML
• The mlflow.azureml module can package python_function models into Azure
ML container images.
• Example workflow using the Python API
• https://www.mlflow.org/docs/latest/models.html#built-in-deployment-tools
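For the batch path, a logged python_function model can be applied as a Spark UDF; a hedged sketch (the run URI, dataset path, and column name are assumptions):

from pyspark.sql import SparkSession
import mlflow.pyfunc

spark = SparkSession.builder.getOrCreate()

# Wrap the logged model as a UDF for distributed batch scoring.
predict = mlflow.pyfunc.spark_udf(spark, "runs:/<RUN_ID>/model")

df = spark.read.parquet("dbfs:/data/wine.parquet")  # hypothetical dataset
scored = df.withColumn("prediction", predict("x"))  # 'x' as in the curl demo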
Summary: Use Cases
• Individual Data Scientists can use MLflow Tracking to track experiments locally on their machine, organize code in projects for
future reuse, and output models that production engineers can then deploy using MLflow’s deployment tools. MLflow Tracking
just reads and writes files to the local file system by default, so there is no need to deploy a server.
• Data Science Teams can deploy an MLflow Tracking server to log and compare results across multiple users working on the same
problem. By setting up a convention for naming their parameters and metrics, they can try different algorithms to tackle the same
problem and then run the same algorithms again on new data to compare models in the future. Moreover, anyone can download
and run another model.
• Large Organizations can share projects, models, and results using MLflow. Any team can run another team’s code using MLflow
Projects, so organizations can package useful training and data preparation steps that other teams can use, or compare results
from many teams on the same task. Moreover, engineering teams can easily move workflows from R&D to staging to production.
• Production Engineers can deploy models from diverse ML libraries in the same way, store the models as files in a management
system of their choice, and track which run a model came from.
• Researchers and Open Source Developers can publish code to GitHub in the MLflow Project format, making it easy for anyone to
run their code using the mlflow run github.com/... command.
• ML Library Developers can output models in the MLflow Model format to have them automatically support deployment using
MLflow’s built-in tools. In addition, deployment tool developers (for example, a cloud vendor building a serving platform) can
automatically support a large variety of models.
Part II: Working with Databricks
• MLflow on Databricks integrates with the complete
Databricks Unified Analytics Platform, including
Notebooks, Jobs, Databricks Delta, and the Databricks
security model, enabling you to run your existing MLflow
jobs at scale in a secure, production-ready manner.
mlflow run git@github.com:mlflow/mlflow-example.git -P alpha=0.5 -b databricks --backend-config json-cluster-spec.json
Step by step guide: https://medium.com/@liangjunjiang/install-mlflow-on-databricks-55b11bc023fa
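The json-cluster-spec.json passed via --backend-config describes the cluster Databricks creates for the run; a minimal sketch (Spark version, node type, and size are illustrative):

{
  "spark_version": "5.2.x-scala2.11",
  "node_type_id": "i3.xlarge",
  "num_workers": 2
}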
Extra: CI/CD on Databricks
1. https://thedataguy.blog/ci-cd-with-databricks-and-azure-devops/
2. https://databricks.com/blog/2017/10/30/continuous-integration-continuous-delivery-databricks.html
• Databricks doesn’t support Enterprise GitHub (yet)
• Azure DevOps can do per-file git tracking
• Coding with Databricks needs support for: 1. your code, 2. the libraries you use, 3. the machine (cluster) you will run on
CI/CD with Databricks - Development
• A four-step development process (CLI sketch below)
1. Use git to manage your project locally (as usual). Your project should contain data, your notebook, the libraries you will use, etc.
2. Copy your project content to the Databricks Workspace
3. Once you are done with your notebook coding, export from the Databricks Workspace back to your local copy
4. Use git commands to push to the remote repo
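Steps 2 and 3 can be scripted with the (legacy) Databricks CLI; a hedged sketch with illustrative local and workspace paths:

# Step 2: copy the local project into the Databricks Workspace
databricks workspace import_dir ./my-project /Users/me@example.com/my-project
# Step 3: export edited notebooks back to the local working copy
databricks workspace export_dir /Users/me@example.com/my-project ./my-project --overwrite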
CI/CD with Databricks – Unit Test or Integration Test
• Solution 1 (sketch below):
• Rewrite your notebook code as Java/Scala classes or Python packages using an IDE
• Write unit tests for those classes in the IDE
• Remember to split the core logic from your library
• Import your library into Databricks and let your notebooks interact with it
• In the end, your code has two parts: libraries and notebooks
• Solution 2:
• Everything is in a package; use Spark to run it
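For Solution 1, the extracted core logic stays free of notebook and cluster dependencies, so it can be unit-tested locally before the library is imported into Databricks; a minimal sketch with hypothetical module and function names:

# mylib/features.py -- core logic, no Databricks dependency
def normalize(values):
    """Scale a list of numbers to the 0-1 range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# tests/test_features.py -- runs locally, e.g. with pytest
from mylib.features import normalize

def test_normalize_bounds():
    out = normalize([2.0, 4.0, 6.0])
    assert out[0] == 0.0 and out[-1] == 1.0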
CI/CD with Databricks – Build
• Similar to the development process
• Assign a dedicated production cluster
Alternative to this Presentation
• You probably should just watch this MLflow Spark + AI Summit keynote video
• https://vimeo.com/274266886#t=33s