ParallelM, Strata Data 2017
The Unspoken Truths of Deploying and
Scaling ML in Production
Nisha Talagala
CTO, ParallelM
2Confidential
Growth of Machine Learning and Deep Learning
• Data growth
• Easy access to scalable compute
• Open source algorithms, engines and tools
The ML Development and Deployment Cycle
• Bulk of effort today is in the left side of this process (development)
• Many tools, libraries, etc.
• Democratization of Data Science
• Auto-ML
In this talk
• Challenges of Production ML (and DL)
• Our approach and solution
• Demo
• Reference designs and more info
What makes ML uniquely challenging in production?
Part I: Dataset dependency
• ML is a ‘black box’: many inputs (algorithmic, human, dataset, etc.) go in to produce an output.
• It is difficult to get a reproducible, deterministically ‘correct’ result as the input data changes.
• ML in production may behave differently than in the developer sandbox because live data ≠ training data.
Example
• Public dataset for SLA violation detection (https://arxiv.org/pdf/1509.01386.pdf)

Load scenario      LibSVM accuracy   Pegasos SVM accuracy
flashcrowd_load    0.843             0.915
periodic_load      0.788             0.867
constant_load      0.999             0.999
poisson_load       0.963             0.963

Load (shift) scenario     Metric   LibSVM accuracy   Pegasos SVM accuracy
Flashcrowd to Periodic    ACC1     0.356             0.356
Flashcrowd to Periodic    ACC2     0.47              0.47
Periodic to Flashcrowd    ACC1     0.826             0.558
Periodic to Flashcrowd    ACC2     0.766             0.805
When trained on the right dataset, both algorithms do well.
When the dataset switches, accuracy suffers in algorithm-specific ways.
• ACC1: accuracy on the second load only
• ACC2: accuracy on both loads
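The effect in the tables above can be reproduced in miniature. The sketch below is a synthetic toy, not the SLA dataset or the LibSVM/Pegasos models from the slide: the `make_load` generator, the one-feature threshold classifier, and the boundary values 3.0 and 7.0 are all assumptions chosen only to show accuracy dropping when the load changes after training.

```python
import random

def make_load(n, boundary, seed):
    """Generate (value, violated) samples; a violation occurs above `boundary`."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        samples.append((x, x > boundary))
    return samples

def accuracy(threshold, samples):
    """Fraction of samples whose label matches the rule `x > threshold`."""
    correct = sum((x > threshold) == label for x, label in samples)
    return correct / len(samples)

def fit_threshold(samples):
    """Pick the candidate threshold with the best training accuracy."""
    candidates = [i / 10 for i in range(0, 101)]
    return max(candidates, key=lambda t: accuracy(t, samples))

# Hypothetical loads: same feature range, different decision boundary.
train = make_load(2000, boundary=3.0, seed=1)
same_load = make_load(2000, boundary=3.0, seed=2)
shifted_load = make_load(2000, boundary=7.0, seed=3)

model = fit_threshold(train)
acc_same = accuracy(model, same_load)
acc_shifted = accuracy(model, shifted_load)
print(f"threshold={model:.1f} same={acc_same:.3f} shifted={acc_shifted:.3f}")
```

The model is never retrained here, so the gap between the two printed accuracies comes entirely from the data changing underneath it.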
What makes ML uniquely challenging in production?
Part II: Training/Inference
• Retraining required to keep up with changing data: manage training & inference pipelines in parallel
• Feature engineering pipelines must match for training and inference
• Pipelines need to be orchestrated factoring in such dependencies
• Further complexity if ensembles etc. are used
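A common way to keep feature engineering identical on both sides is to fit the transformation once at training time and ship its parameters with the model. The following is a minimal sketch under that assumption; the `fit_scaler` and `transform` helpers and the JSON string standing in for an artifact store are hypothetical and are not taken from the talk.

```python
import json
import statistics

def fit_scaler(column):
    """Compute standardization parameters from training data only."""
    std = statistics.pstdev(column) or 1.0  # avoid dividing by zero
    return {"mean": statistics.fmean(column), "std": std}

def transform(column, params):
    """Apply the same parameters at training and at inference time."""
    return [(x - params["mean"]) / params["std"] for x in column]

# Training pipeline: fit the parameters, then persist them next to the model.
train_column = [10.0, 12.0, 14.0, 16.0, 18.0]
params = fit_scaler(train_column)
artifact = json.dumps(params)  # stands in for a versioned artifact store

# Inference pipeline: load the persisted parameters instead of refitting.
loaded = json.loads(artifact)
live_column = [14.0, 30.0]
features = transform(live_column, loaded)
print(features)
```

Refitting the scaler on live data would silently change what a given feature value means to the model, which is the mismatch the slide warns about.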
What makes ML uniquely challenging in production?
Part III: Heterogeneity in Training/Inference
• Possibly differing engines (Spark, TensorFlow, Caffe, PyTorch, scikit-learn, etc.)
• Different languages (Python, Java, Scala, R, ...)
• Inference vs. training engines
• Training is frequently batch
• Inference (prediction, model serving) can be a REST endpoint/custom code, streaming engine, micro-batch, etc.
• Feature manipulation done at training needs to be replicated (or factored in) at inference
What makes ML uniquely challenging in production?
Part IV: Collaboration, Process
Collaboration:
• Expertise mismatch between Data Science & Ops complicates handoff and continuous management and optimization
Process:
• Many objects to be tracked and managed (algorithms, models, pipelines, versions, etc.)
• Emerging requirements for reproducibility, process audit, etc.
What we need
• Accelerate deployment & facilitate collaboration between Data & Ops teams
• Monitor validity of ML predictions, diagnose data and ML performance issues
• Orchestrate training, update, and configuration of ML pipelines across
distributed, heterogeneous infrastructure with tracking
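The talk does not say how prediction validity is monitored, so the following is only one possible sketch: compare the mean of a live batch against the training distribution and flag the batch when it drifts too far. The `drift_score` and `is_healthy` names and the threshold of 3.0 are assumptions made for illustration.

```python
import math
import statistics

def drift_score(train_values, live_values):
    """Z-score of the live batch mean relative to the training distribution."""
    mu = statistics.fmean(train_values)
    sigma = statistics.pstdev(train_values)
    live_mean = statistics.fmean(live_values)
    if sigma == 0:
        return 0.0 if live_mean == mu else math.inf
    standard_error = sigma / math.sqrt(len(live_values))
    return abs(live_mean - mu) / standard_error

def is_healthy(train_values, live_values, threshold=3.0):
    """Treat a batch as healthy when its mean stays within the threshold."""
    return drift_score(train_values, live_values) <= threshold

train_values = [float(v) for v in range(100)]
similar_batch = [45.0, 50.0, 55.0, 48.0]
drifted_batch = [120.0, 130.0, 125.0, 135.0]
healthy_similar = is_healthy(train_values, similar_batch)
healthy_drifted = is_healthy(train_values, drifted_batch)
print(healthy_similar, healthy_drifted)
```

A check like this needs no labels, so it can run on live inference data where ground truth arrives late or never.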
Our Approach
MLOps Workspace visualizes pipeline
deployment, business impact, alerts, and
ML Health predictions
MLOps Agents
• Attaches to each engine instance
• Agents for Spark, TensorFlow, Flink etc.
• Manages local communication with Engine
MLOps Server builds & deploys ML Apps,
processes data, orchestrates policies, and
manages distributed MLOps Agents
[Diagram: Workspace, MLOps Server, and four MLOps Agents, each paired with an Analytics Engine]
Operational Abstraction
• Link pipelines (training and inference) via an “Intelligence Overlay Network (ION)”
• Basically a directed graph representation with allowance for cycles
• Pipelines are DAGs within each engine
• Distributed execution over heterogeneous engines, programming languages and geographies
[Diagram: Example: KMeans Batch Training Plus Streaming Inference Anomaly Detection; label: Always Update]
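The distinction between the two graph levels can be sketched with Python's standard `graphlib` module (Python 3.9+). The stage names, the `is_dag` helper, and the particular feedback edge from inference back to training are assumptions for illustration; the slide only states that pipelines are DAGs while the overlay allows cycles.

```python
from graphlib import CycleError, TopologicalSorter

def is_dag(graph):
    """Return True if {node: set_of_predecessors} contains no cycle."""
    try:
        tuple(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False

# Stages inside one engine; each stage maps to the stages it depends on.
training_pipeline = {
    "feature_engineering": {"read_input"},
    "kmeans_training": {"feature_engineering"},
    "save_model": {"kmeans_training"},
}

# Overlay linking whole pipelines; the feedback edge is hypothetical.
overlay = {
    "inference": {"training"},  # trained models flow to inference
    "training": {"inference"},  # feedback from inference reaches training
}

pipeline_is_dag = is_dag(training_pipeline)
overlay_is_dag = is_dag(overlay)
print(pipeline_is_dag, overlay_is_dag)
```

Because the overlay may contain cycles, it cannot simply be scheduled in topological order the way a single pipeline can; it needs policies that decide when an edge fires.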
Integrating with Analytics Engines (Spark) - Examples
• Job Management
• Via SparkLauncher: A library to control launching, monitoring and terminating jobs
• PM Agent communicates with Spark through this library for job management (also uses Java
API to launch child processes)
• Statistics
• Via SparkListener: A Spark-driver callback service
• SparkListener taps into all accumulators, which are one of the popular ways to expose statistics
• PM agent communicates with the Spark driver and exposes statistics via a REST endpoint
• ML Health / Model collection and updates
• PM Agent delivers and receives health events, health objects and models via sockets from
custom PM components in the ML Pipeline
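Exposing collected statistics over REST can be sketched with the Python standard library alone. This is not the PM Agent's implementation: the `/stats` path, the counter names in `STATS`, and the `fetch_stats_once` helper are hypothetical, and the Spark side (SparkListener, accumulators) is replaced by a static dictionary.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical counters an agent might have collected from an engine.
STATS = {"records_processed": 1200, "models_received": 3}

class StatsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/stats":
            self.send_error(404)
            return
        body = json.dumps(STATS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet

def fetch_stats_once():
    """Start the server on a free port, issue one GET, then shut down."""
    server = HTTPServer(("127.0.0.1", 0), StatsHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        connection = HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        connection.request("GET", "/stats")
        payload = json.loads(connection.getresponse().read())
        connection.close()
        return payload
    finally:
        server.shutdown()
        server.server_close()

result = fetch_stats_once()
print(result)
```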
Integrating with Analytics Engines (TensorFlow) - Examples
• Job Management
• TensorFlow Python programs run as standalone applications
• Standard OS-based process control mechanisms are used to monitor and
control TensorFlow programs
• Statistics Collection
• PM Agent parses TensorBoard log files to extract meaningful statistics
and events that data scientists added
• ML Health / Model collection
• Generation of models and health objects is recorded on a shared medium
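OS-level process control of a standalone Python program can look like the sketch below, which uses only the standard `subprocess` module. The `launch`, `is_running`, and `stop` helpers are hypothetical names, and a sleeping child process stands in for a real TensorFlow training script.

```python
import subprocess
import sys

def launch(script_args):
    """Start a standalone Python program as a child process."""
    return subprocess.Popen([sys.executable, *script_args])

def is_running(process):
    """poll() returns None while the child is still alive."""
    return process.poll() is None

def stop(process, timeout=5.0):
    """Ask the process to terminate; force-kill it if it does not exit in time."""
    process.terminate()
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        return process.wait()

# Stand-in for a long-running training script.
job = launch(["-c", "import time; time.sleep(60)"])
running_before = is_running(job)
exit_code = stop(job)
running_after = is_running(job)
print(running_before, exit_code, running_after)
```

The exact exit code after termination is platform dependent, so a monitor should treat any non-zero value as "did not finish normally" rather than match a specific number.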
Demo
DEMO Configuration
• MLOps Server (3 nodes)
• MLOps Agent + Spark Engine (1 node)
• MLOps Agent + Flink Engine (1 node)
• MLOps Center
Dataset:
• Training: 10 attributes, 600K samples
• Inference A: 1M samples @ ~1K samples/sec
• Inference B: 1M samples @ ~1K samples/sec
[Diagram: MLOps Workspace, MLOps Server, two MLOps Agents, Spark, Flink, NFS]
Demo example use case: Anomaly Detection
Multivariate Anomaly Detection
• Training: HDFS → Feature Engineering → K-Means (Training) → Saved Model (PMML)
• Inference: Kafka → Feature Engineering → K-Means Anomaly Detection
[Diagram label: Always Update]
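The slide does not say how the K-Means model turns into an anomaly decision. One common approach, assumed here, is to flag a sample when its distance to the nearest centroid exceeds a threshold. The tiny dataset, the initial centroids, and the threshold of 1.0 are all made up for illustration, and the pure-Python Lloyd's algorithm stands in for the Spark training job.

```python
import math

def nearest(point, centroids):
    """Return (index, distance) of the closest centroid."""
    distances = [math.dist(point, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm starting from the given centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)[0]].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids

def is_anomaly(point, centroids, threshold):
    """Flag points that are far from every learned cluster center."""
    return nearest(point, centroids)[1] > threshold

training = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (8.0, 8.0), (8.2, 7.8), (7.8, 8.2)]
model = kmeans(training, centroids=[(0.0, 0.0), (10.0, 10.0)])

normal_flag = is_anomaly((1.1, 0.9), model, threshold=1.0)
anomaly_flag = is_anomaly((4.5, 4.5), model, threshold=1.0)
print(model, normal_flag, anomaly_flag)
```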
Demo Baseline
[Diagram: MLOps Server and MLOps Center; K-Means training on Spark (batch); K-Means anomaly detection on Flink (streaming); NFS / HDFS; feeder; model; inference samples; anomalous inference samples]
• ION launched and run
• MLOps Center orchestrates Spark and Flink pipelines
• Spark training pipeline periodically generates new trained models
• Trained models are sent to and updated in the inference pipeline
We use a “feeder” program to send different datasets into the above flow to generate various event types and show MLOps Center features
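The flow of trained models into a running inference pipeline can be sketched as a hot swap. This reads the "Always Update" label as "install every newer model as soon as it arrives", which is an interpretation and not something the slides define; the `InferencePipeline` class, the version numbers, and the lambda models are hypothetical.

```python
import threading

class InferencePipeline:
    """Scores samples with whatever model is currently installed."""

    def __init__(self, model, version):
        self._lock = threading.Lock()
        self._model = model
        self._version = version

    def update_model(self, model, version):
        """Install the model if it is newer than the current one."""
        with self._lock:
            if version > self._version:
                self._model, self._version = model, version
                return True
            return False

    def predict(self, sample):
        """Return (prediction, model version used)."""
        with self._lock:
            model, version = self._model, self._version
        return model(sample), version

# Models are plain callables here; a real pipeline would load saved artifacts.
pipeline = InferencePipeline(lambda x: x > 5.0, version=1)
first = pipeline.predict(7.0)
accepted = pipeline.update_model(lambda x: x > 9.0, version=2)
second = pipeline.predict(7.0)
stale = pipeline.update_model(lambda x: x > 1.0, version=1)
print(first, accepted, second, stale)
```

Returning the model version with each prediction makes it possible to attribute a change in anomaly rate to a specific model update afterwards.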
High Performance Deep Learning in Production
Reference Design: Mellanox and ParallelM
https://community.mellanox.com/docs/DOC-3001
• Image recognition use cases in security, retail, etc.
• Utilizing DL algorithms with TensorFlow 1.3 and Flink
• Hardware configuration optimized for accelerated distributed training with support for leading tools and frameworks right out of the box
• The ability to rapidly move to and manage in production while ensuring ML prediction quality in a dynamic environment
For more information
http://www.parallelm.com/
“Deploying A Scalable Deep Learning Solution in Production with Tensorflow: A Reference Design with Mellanox and ParallelM”
https://community.mellanox.com/docs/DOC-3001
“TensorFlow: Tips for Getting Started” at ParallelM booth at Strata and online
Reference design for Edge/Cloud ML: https://www.linkedin.com/pulse/showcasing-edgecloud-machine-learning-management-mec-2017-das/?trackingId=nuAps6ixyIcobHNJbnHe5g%3D%3D
Examples of Spark and Flink scaling with Online ML algorithms: http://sf.flink-forward.org/kb_sessions/experiences-with-streaming-vs-micro-batch-for-online-learning/
New O'Reilly book: “Deep Learning with TensorFlow”
Thank You
nisha.talagala@parallelm.com

Contenu connexe

Tendances

Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to productionGeorg Heiler
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In ProductionSamir Bessalah
 
NLP Text Recommendation System Journey to Automated Training
NLP Text Recommendation System Journey to Automated TrainingNLP Text Recommendation System Journey to Automated Training
NLP Text Recommendation System Journey to Automated TrainingDatabricks
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionProvectus
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowDatabricks
 
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
 Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep... Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...Databricks
 
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...Databricks
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityDatabricks
 
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionData Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionFormulatedby
 
Managing the Machine Learning Lifecycle with MLflow
Managing the Machine Learning Lifecycle with MLflowManaging the Machine Learning Lifecycle with MLflow
Managing the Machine Learning Lifecycle with MLflowDatabricks
 
High Performance Transfer Learning for Classifying Intent of Sales Engagement...
High Performance Transfer Learning for Classifying Intent of Sales Engagement...High Performance Transfer Learning for Classifying Intent of Sales Engagement...
High Performance Transfer Learning for Classifying Intent of Sales Engagement...Databricks
 
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...Sri Ambati
 
AI-Assisted Feature Selection for Big Data Modeling
AI-Assisted Feature Selection for Big Data ModelingAI-Assisted Feature Selection for Big Data Modeling
AI-Assisted Feature Selection for Big Data ModelingDatabricks
 
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle ManagementMLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle ManagementDatabricks
 
What are the Unique Challenges and Opportunities in Systems for ML?
What are the Unique Challenges and Opportunities in Systems for ML?What are the Unique Challenges and Opportunities in Systems for ML?
What are the Unique Challenges and Opportunities in Systems for ML?Matei Zaharia
 
Hamburg Data Science Meetup - MLOps with a Feature Store
Hamburg Data Science Meetup - MLOps with a Feature StoreHamburg Data Science Meetup - MLOps with a Feature Store
Hamburg Data Science Meetup - MLOps with a Feature StoreMoritz Meister
 
Detecting Financial Fraud at Scale with Machine Learning
Detecting Financial Fraud at Scale with Machine LearningDetecting Financial Fraud at Scale with Machine Learning
Detecting Financial Fraud at Scale with Machine LearningDatabricks
 
Is that a Time Machine? Some Design Patterns for Real World Machine Learning ...
Is that a Time Machine? Some Design Patterns for Real World Machine Learning ...Is that a Time Machine? Some Design Patterns for Real World Machine Learning ...
Is that a Time Machine? Some Design Patterns for Real World Machine Learning ...Justin Basilico
 
Accelerating Production Machine Learning with MLflow with Matei Zaharia
Accelerating Production Machine Learning with MLflow with Matei ZahariaAccelerating Production Machine Learning with MLflow with Matei Zaharia
Accelerating Production Machine Learning with MLflow with Matei ZahariaDatabricks
 

Tendances (20)

Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In Production
 
NLP Text Recommendation System Journey to Automated Training
NLP Text Recommendation System Journey to Automated TrainingNLP Text Recommendation System Journey to Automated Training
NLP Text Recommendation System Journey to Automated Training
 
MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in Production
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLow
 
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
 Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep... Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
 
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
 
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to ProductionData Science Salon: A Journey of Deploying a Data Science Engine to Production
Data Science Salon: A Journey of Deploying a Data Science Engine to Production
 
Managing the Machine Learning Lifecycle with MLflow
Managing the Machine Learning Lifecycle with MLflowManaging the Machine Learning Lifecycle with MLflow
Managing the Machine Learning Lifecycle with MLflow
 
High Performance Transfer Learning for Classifying Intent of Sales Engagement...
High Performance Transfer Learning for Classifying Intent of Sales Engagement...High Performance Transfer Learning for Classifying Intent of Sales Engagement...
High Performance Transfer Learning for Classifying Intent of Sales Engagement...
 
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...
 
Machine Learning with Apache Spark
Machine Learning with Apache SparkMachine Learning with Apache Spark
Machine Learning with Apache Spark
 
AI-Assisted Feature Selection for Big Data Modeling
AI-Assisted Feature Selection for Big Data ModelingAI-Assisted Feature Selection for Big Data Modeling
AI-Assisted Feature Selection for Big Data Modeling
 
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle ManagementMLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
MLflow and Azure Machine Learning—The Power Couple for ML Lifecycle Management
 
What are the Unique Challenges and Opportunities in Systems for ML?
What are the Unique Challenges and Opportunities in Systems for ML?What are the Unique Challenges and Opportunities in Systems for ML?
What are the Unique Challenges and Opportunities in Systems for ML?
 
Hamburg Data Science Meetup - MLOps with a Feature Store
Hamburg Data Science Meetup - MLOps with a Feature StoreHamburg Data Science Meetup - MLOps with a Feature Store
Hamburg Data Science Meetup - MLOps with a Feature Store
 
Detecting Financial Fraud at Scale with Machine Learning
Detecting Financial Fraud at Scale with Machine LearningDetecting Financial Fraud at Scale with Machine Learning
Detecting Financial Fraud at Scale with Machine Learning
 
Is that a Time Machine? Some Design Patterns for Real World Machine Learning ...
Is that a Time Machine? Some Design Patterns for Real World Machine Learning ...Is that a Time Machine? Some Design Patterns for Real World Machine Learning ...
Is that a Time Machine? Some Design Patterns for Real World Machine Learning ...
 
Accelerating Production Machine Learning with MLflow with Matei Zaharia
Accelerating Production Machine Learning with MLflow with Matei ZahariaAccelerating Production Machine Learning with MLflow with Matei Zaharia
Accelerating Production Machine Learning with MLflow with Matei Zaharia
 

Similaire à Strata parallel m-ml-ops_sept_2017

Making Data Science Scalable - 5 Lessons Learned
Making Data Science Scalable - 5 Lessons LearnedMaking Data Science Scalable - 5 Lessons Learned
Making Data Science Scalable - 5 Lessons LearnedLaurenz Wuttke
 
Open, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI PipelinesOpen, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI PipelinesNick Pentreath
 
Legion - AI Runtime Platform
Legion -  AI Runtime PlatformLegion -  AI Runtime Platform
Legion - AI Runtime PlatformAlexey Kharlamov
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
 MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ... MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...Databricks
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
C19013010 the tutorial to build shared ai services session 1
C19013010  the tutorial to build shared ai services session 1C19013010  the tutorial to build shared ai services session 1
C19013010 the tutorial to build shared ai services session 1Bill Liu
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleDatabricks
 
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...Databricks
 
Consolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest AirportsConsolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest AirportsDatabricks
 
Data Production Pipelines: Legacy, practices, and innovation
Data Production Pipelines: Legacy, practices, and innovationData Production Pipelines: Legacy, practices, and innovation
Data Production Pipelines: Legacy, practices, and innovationNatalino Busa
 
Databricks for MLOps Presentation (AI/ML)
Databricks for MLOps Presentation (AI/ML)Databricks for MLOps Presentation (AI/ML)
Databricks for MLOps Presentation (AI/ML)Knoldus Inc.
 
EPAM ML/AI Accelerator - ODAHU
EPAM ML/AI Accelerator - ODAHUEPAM ML/AI Accelerator - ODAHU
EPAM ML/AI Accelerator - ODAHUDmitrii Suslov
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabszekeLabs Technologies
 
Productionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureProductionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureDatabricks
 
Mohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowMohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowLviv Startup Club
 
Mohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowMohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowEdunomica
 
Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data StreamsMachine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data StreamsLightbend
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Databricks
 

Similaire à Strata parallel m-ml-ops_sept_2017 (20)

Making Data Science Scalable - 5 Lessons Learned
Making Data Science Scalable - 5 Lessons LearnedMaking Data Science Scalable - 5 Lessons Learned
Making Data Science Scalable - 5 Lessons Learned
 
Open, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI PipelinesOpen, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI Pipelines
 
Legion - AI Runtime Platform
Legion -  AI Runtime PlatformLegion -  AI Runtime Platform
Legion - AI Runtime Platform
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
 MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ... MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
MLflow: Infrastructure for a Complete Machine Learning Life Cycle with Mani ...
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
C19013010 the tutorial to build shared ai services session 1
C19013010  the tutorial to build shared ai services session 1C19013010  the tutorial to build shared ai services session 1
C19013010 the tutorial to build shared ai services session 1
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
 
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
 
Consolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest AirportsConsolidating MLOps at One of Europe’s Biggest Airports
Consolidating MLOps at One of Europe’s Biggest Airports
 
Data Production Pipelines: Legacy, practices, and innovation
Data Production Pipelines: Legacy, practices, and innovationData Production Pipelines: Legacy, practices, and innovation
Data Production Pipelines: Legacy, practices, and innovation
 
Databricks for MLOps Presentation (AI/ML)
Databricks for MLOps Presentation (AI/ML)Databricks for MLOps Presentation (AI/ML)
Databricks for MLOps Presentation (AI/ML)
 
EPAM ML/AI Accelerator - ODAHU
EPAM ML/AI Accelerator - ODAHUEPAM ML/AI Accelerator - ODAHU
EPAM ML/AI Accelerator - ODAHU
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
 
Productionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureProductionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices Architecture
 
Mohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowMohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with Kubeflow
 
Mohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with KubeflowMohamed Sabri: Operationalize machine learning with Kubeflow
Mohamed Sabri: Operationalize machine learning with Kubeflow
 
Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data StreamsMachine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
 

Dernier

Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...amitlee9823
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Strata parallel m-ml-ops_sept_2017

  • 1. ParallelM, Strata Data 2017
      The Unspoken Truths of Deploying and Scaling ML in Production
      Nisha Talagala, CTO, ParallelM
  • 2. Growth of Machine Learning and Deep Learning
      • Data growth
      • Easy access to scalable compute
      • Open source algorithms, engines and tools
  • 3. The ML Development and Deployment Cycle
      • Bulk of effort today is in the left side of this process (development)
      • Many tools, libraries, etc.
      • Democratization of Data Science
      • Auto-ML
  • 4. In this talk
      • Challenges of Production ML (and DL)
      • Our approach and solution
      • Demo
      • Reference designs and more info
  • 5. What makes ML uniquely challenging in production? Part I: Dataset dependency
      • ML is a 'black box' into which many inputs (algorithmic, human, dataset etc.) go to produce an output.
      • It is difficult to get a reproducible, deterministically 'correct' result as input data changes.
      • ML in production may behave differently than in the developer sandbox because live data ≠ training data.
  • 6. Example
      • Public dataset for SLA violation detection (https://arxiv.org/pdf/1509.01386.pdf)

      Load scenario      LibSVM accuracy   Pegasos SVM accuracy
      flashcrowd_load    0.843             0.915
      periodic_load      0.788             0.867
      constant_load      0.999             0.999
      poisson_load       0.963             0.963

      Load (shift) scenario    Metric   LibSVM accuracy   Pegasos SVM accuracy
      Flashcrowd to Periodic   ACC1     0.356             0.356
      Flashcrowd to Periodic   ACC2     0.47              0.47
      Periodic to Flashcrowd   ACC1     0.826             0.558
      Periodic to Flashcrowd   ACC2     0.766             0.805

      When trained on the right dataset, both algorithms do well.
      When the dataset switches, accuracy suffers in algorithm-specific ways.
      • ACC1: accuracy on load 2 (the second load) only
      • ACC2: accuracy on both loads
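The effect in the shift table can be reproduced in miniature. The following is a minimal sketch on synthetic data, not the SLA-violation dataset or the SVM models cited above: a one-dimensional threshold classifier is fitted on one hypothetical load and then scored on a second load whose labelling rule has moved. The names `load_a`, `load_b`, `fit_threshold` and `accuracy` are illustrative assumptions.

```python
# Synthetic, illustrative data only; these are not the measurements from the slide.


def fit_threshold(samples):
    """Learn a 1-D decision threshold as the midpoint between the two class means."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, samples):
    """Fraction of samples whose predicted label (x > threshold) matches the true label."""
    hits = sum(1 for x, y in samples if int(x > threshold) == y)
    return hits / len(samples)


# Hypothetical "load A": a violation (label 1) occurs once the metric reaches 5.
load_a = [(x, int(x >= 5)) for x in range(10)]
# Hypothetical "load B": same metric, but violations only start at 8.
load_b = [(x, int(x >= 8)) for x in range(10)]

threshold = fit_threshold(load_a)
acc_same_load = accuracy(threshold, load_a)
acc_shifted_load = accuracy(threshold, load_b)
print(f"threshold={threshold}, same load={acc_same_load}, shifted load={acc_shifted_load}")
```

The model is unchanged between the two measurements; only the data moved, which is the situation the ACC1 and ACC2 columns describe.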
  • 7. What makes ML uniquely challenging in production? Part II: Training/Inference
      • Retraining required to keep up with changing data: manage training & inference pipelines in parallel
      • Feature engineering pipelines must match for Training and Inference
      • Pipelines need to be orchestrated factoring in such dependencies
      • Further complexity if ensembles etc. are used
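One common way to keep feature engineering identical on both sides is to fit the transform once during training and ship its parameters alongside the model. The sketch below assumes a hypothetical `Standardizer` class and a JSON artifact; neither comes from the talk, and a real pipeline would more likely rely on the engine's own pipeline serialization.

```python
import json


class Standardizer:
    """Hypothetical feature transform: scale a value by the training mean and std."""

    def __init__(self, mean=0.0, std=1.0):
        self.mean = mean
        self.std = std

    def fit(self, values):
        n = len(values)
        self.mean = sum(values) / n
        variance = sum((v - self.mean) ** 2 for v in values) / n
        self.std = variance ** 0.5 or 1.0  # fall back to 1.0 on constant data
        return self

    def transform(self, value):
        return (value - self.mean) / self.std

    def to_json(self):
        return json.dumps({"mean": self.mean, "std": self.std})

    @classmethod
    def from_json(cls, text):
        params = json.loads(text)
        return cls(params["mean"], params["std"])


# Training side: fit the transform and export its parameters with the model.
training_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
train_scaler = Standardizer().fit(training_values)
artifact = train_scaler.to_json()

# Inference side: rebuild the same transform from the artifact instead of
# re-fitting on live data, so both pipelines see identical features.
serve_scaler = Standardizer.from_json(artifact)
print(serve_scaler.transform(9.0))
```

Re-fitting the scaler on live traffic would silently change the feature values the model sees, which is one way the training and inference pipelines drift apart.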
  • 8. What makes ML uniquely challenging in production? Part III: Heterogeneity in Training/Inference
      • Possibly differing engines (Spark, TensorFlow, Caffe, PyTorch, scikit-learn, etc.)
      • Different languages (Python, Java, Scala, R, ...)
      • Inference vs Training engines
          • Training is frequently batch
          • Inference (Prediction, Model Serving) can be a REST endpoint/custom code, streaming engine, micro-batch, etc.
      • Feature manipulation done at training needs to be replicated (or factored in) at inference
  • 9. What makes ML uniquely challenging in production? Part IV: Collaboration, Process
      Collaboration:
      • Expertise mismatch between Data Science & Ops complicates handoff and continuous management and optimization
      Process:
      • Many objects to be tracked and managed (algorithms, models, pipelines, versions etc.)
      • Emerging requirements for reproducibility, process audit etc.
  • 10. What we need
      • Accelerate deployment & facilitate collaboration between Data & Ops teams
      • Monitor validity of ML predictions, diagnose data and ML performance issues
      • Orchestrate training, update, and configuration of ML pipelines across distributed, heterogeneous infrastructure with tracking
  • 11. Our Approach
      • MLOps Workspace visualizes pipeline deployment, business impact, alerts, and ML Health predictions
      • MLOps Agents
          • Attach to each engine instance
          • Agents for Spark, TensorFlow, Flink etc.
          • Manage local communication with the engine
      • MLOps Server builds & deploys ML Apps, processes data, orchestrates policies, and manages distributed MLOps Agents
      [Diagram: Workspace; MLOps Server; four MLOps Agents, each paired with an Analytics Engine]
  • 12. Operational Abstraction
      • Link pipelines (training and inference) via an "Intelligence Overlay Network (ION)"
          • Basically a Directed Graph representation with allowance for cycles
          • Pipelines are DAGs within each engine
      • Distributed execution over heterogeneous engines, programming languages and geographies
      [Diagram labels: "Always Update"; Example: KMeans batch training plus streaming inference anomaly detection]
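The distinction between the overlay graph (cycles allowed) and the per-engine pipelines (DAGs) can be sketched with a small adjacency-list structure. This is an illustration of the idea only, not ParallelM's ION implementation; `PipelineGraph`, the pipeline names and the `retrain_request` edge are assumptions made for the example.

```python
from collections import defaultdict


class PipelineGraph:
    """Hypothetical stand-in for an ION: nodes are pipelines, edges carry models or events."""

    def __init__(self):
        self.edges = defaultdict(list)

    def link(self, source, target, payload):
        self.edges[source].append((target, payload))
        self.edges[target]  # register the target as a node even if it has no outgoing edges

    def has_cycle(self):
        """Depth-first search that reports whether any pipeline can reach itself."""
        visiting, done = set(), set()

        def visit(node):
            if node in done:
                return False
            if node in visiting:
                return True  # back edge: we returned to a node on the current path
            visiting.add(node)
            found = any(visit(target) for target, _ in self.edges[node])
            visiting.discard(node)
            done.add(node)
            return found

        return any(visit(node) for node in list(self.edges))


ion = PipelineGraph()
# Training feeds models forward to inference: still acyclic.
ion.link("kmeans_training", "anomaly_inference", "model")
feed_forward_only = ion.has_cycle()
# A hypothetical feedback edge from inference back to training closes a loop,
# which an overlay graph "with allowance for cycles" would accept.
ion.link("anomaly_inference", "kmeans_training", "retrain_request")
with_feedback = ion.has_cycle()
print(feed_forward_only, with_feedback)
```

A scheduler built on such a graph would presumably reject cycles inside a single engine's pipeline while permitting them between pipelines.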
  • 13. Integrating with Analytics Engines (Spark) - Examples
      • Job Management
          • Via SparkLauncher: a library to control launching, monitoring and terminating jobs
          • PM Agent communicates with Spark through this library for job management (also uses the Java API to launch child processes)
      • Statistics
          • Via SparkListener: a Spark-driver callback service
          • SparkListener taps into all accumulators, which are one of the popular ways to expose statistics
          • PM Agent communicates with the Spark driver and exposes statistics via a REST endpoint
      • ML Health / Model collection and updates
          • PM Agent delivers and receives health events, health objects and models via sockets from custom PM components in the ML Pipeline
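The "exposes statistics via a REST endpoint" step can be pictured with Python's standard library alone. This sketch leaves out Spark entirely: the statistics are recorded by hand, whereas the agent described above would receive them from listener callbacks. The `/stats` route, the statistic names and the helper functions are all hypothetical.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical statistics store that an agent might fill from engine callbacks.
STATS = {}
STATS_LOCK = threading.Lock()


def record_stat(name, value):
    """Store the latest value of a named statistic."""
    with STATS_LOCK:
        STATS[name] = value


def stats_payload():
    """Serialize a snapshot of the current statistics as JSON bytes."""
    with STATS_LOCK:
        return json.dumps(STATS, sort_keys=True).encode("utf-8")


class StatsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/stats":  # hypothetical route name
            self.send_error(404)
            return
        body = stats_payload()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=0):
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), StatsHandler)


record_stat("records_processed", 1200)
record_stat("models_received", 3)
print(stats_payload().decode("utf-8"))
```

Calling `make_server(8080).serve_forever()` would then serve the snapshot at `/stats` on localhost.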
  • 14. Integrating with Analytics Engines (TensorFlow) - Examples
      • Job Management
          • TensorFlow Python programs run as standalone applications
          • Standard OS process control mechanisms are used to monitor and control TensorFlow programs
      • Statistics Collection
          • PM Agent parses TensorBoard log files to extract meaningful statistics and events that data scientists added
      • ML Health / Model collection
          • Generation of models and health objects is recorded on a shared medium
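Process-level job management of a standalone Python program might look roughly like the following, using only `subprocess` from the standard library. The child here is a sleeping placeholder rather than a TensorFlow script, and `launch`, `status` and `stop` are hypothetical helper names, not part of any agent API.

```python
import subprocess
import sys

# Stand-in for a standalone training script; a real agent would launch the actual program.
CHILD_CODE = "import time; time.sleep(60)"


def launch(code):
    """Start a Python program as a separate OS process."""
    return subprocess.Popen([sys.executable, "-c", code])


def status(proc):
    """Report 'running' while the process is alive, else its exit code."""
    code = proc.poll()
    return "running" if code is None else f"exited({code})"


def stop(proc, timeout=5.0):
    """Ask the process to terminate, and kill it if it does not exit in time."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    return proc.returncode


job = launch(CHILD_CODE)
state_while_running = status(job)
exit_code = stop(job)
state_after_stop = status(job)
print(state_while_running, state_after_stop)
```

The exit code reported after `stop` is platform dependent, so the sketch only distinguishes "running" from "exited".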
  • 16. DEMO Configuration
      • MLOps Server (3 nodes)
      • MLOps Agent + Spark Engine (1 node)
      • MLOps Agent + Flink Engine (1 node)
      • MLOps Center
      Dataset:
      • Training: 10 attributes, 600K samples
      • Inference:
          • A: 1M samples at ~1K samples/sec
          • B: 1M samples at ~1K samples/sec
      [Diagram labels: MLOps Workspace; MLOps Server; MLOps Agent (Spark); MLOps Agent (Flink); NFS]
  • 17. Demo example use case: Anomaly Detection
      [Diagram labels: HDFS; Feature Engineering; K-Means (Training); Saved Model (PMML); Always Update; Kafka; Feature Engineering; K-Means Anomaly Detection; Multivariate Anomaly Detection]
  • 18. Demo
      • ION launched and run
      • MLOps Center orchestrates Spark and Flink pipelines
      • Spark training pipeline periodically generates new trained models
      • Trained models are sent to and updated into the inference pipeline
      We use a "feeder" program to send different datasets into the above flow to generate various event types and show MLOps Center features.
      [Diagram labels: Baseline; K-Means Training (Spark - Batch); Anomaly Detection K-Means (Flink - Streaming); Model; ML Ops Server; ML Ops Center; NFS / HDFS; Feeder; Inference samples; Anomalous Inference samples]
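The demo flow (batch K-Means training, streaming anomaly detection, models swapped in as they arrive) can be compressed into a single-process sketch. This is an assumption-laden illustration, not the demo code: it uses a plain Lloyd's loop instead of Spark, a Python object instead of a Flink job and a PMML file, and a hypothetical fixed distance threshold to decide what counts as anomalous.

```python
import math


def nearest_distance(point, centroids):
    """Euclidean distance from a point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)


def train_kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; seeding with the first k points is naive and order dependent."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


class AnomalyDetector:
    """Inference side: flags samples that sit far from every centroid."""

    def __init__(self, centroids, threshold):
        self.centroids = centroids
        self.threshold = threshold  # hypothetical fixed distance cut-off

    def update_model(self, centroids):
        # "Always update": swap in each newly trained model as it arrives.
        self.centroids = centroids

    def is_anomaly(self, sample):
        return nearest_distance(sample, self.centroids) > self.threshold


# Synthetic baseline batch with two operating regions.
baseline = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
detector = AnomalyDetector(train_kmeans(baseline, k=2), threshold=3.0)
flag_before = detector.is_anomaly((20.0, 20.0))

# A later synthetic batch also covers a third region, so the retrained model accepts it.
later_batch = [(0.0, 0.0), (10.0, 10.0), (20.0, 20.0), (0.0, 1.0), (10.0, 11.0), (20.0, 21.0)]
detector.update_model(train_kmeans(later_batch, k=3))
flag_after = detector.is_anomaly((20.0, 20.0))
print(flag_before, flag_after)
```

The same sample is anomalous under the baseline model and normal after the update, which is the kind of event a feeder program that switches datasets would provoke.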
  • 19. High Performance Deep Learning in Production. Reference Design: Mellanox and ParallelM
      • Image recognition use cases in security, retail etc.
      • Utilizing DL algorithms with TensorFlow 1.3 and Flink
      • Hardware configuration optimized for accelerated distributed training with support for leading tools and frameworks right out of the box
      • The ability to rapidly move to and manage in production while ensuring ML prediction quality in a dynamic environment
      https://community.mellanox.com/docs/DOC-3001
  • 20. For more information
      • http://www.parallelm.com/
      • "Deploying A Scalable Deep Learning Solution in Production with Tensorflow: A Reference Design with Mellanox and ParallelM": https://community.mellanox.com/docs/DOC-3001
      • "TensorFlow: Tips for Getting Started" at ParallelM booth at Strata and online
      • Reference design for Edge/Cloud ML: https://www.linkedin.com/pulse/showcasing-edgecloud-machine-learning-management-mec-2017-das/?trackingId=nuAps6ixyIcobHNJbnHe5g%3D%3D
      • Examples of Spark and Flink scaling with Online ML algorithms: http://sf.flink-forward.org/kb_sessions/experiences-with-streaming-vs-micro-batch-for-online-learning/
      • New O'Reilly book: "Deep Learning with TensorFlow"