A survey on
Machine Learning In Production
Arnab Biswas
July 2018
Model Development != Software Development
• The moment a model is in production it starts degrading
• Same model rarely gets deployed twice
• It's hard to know how well the model is doing
• Often, the real modeling work only starts in production
Source: https://www.slideshare.net/DavidTalby/when-models-go-rogue-hard-earned-lessons-about-using-machine-learning-in-production
Learning & Prediction
• Learning
• Offline
• Model trained with historical data. Re-train to keep it fresh.
• Online
• Model is constantly being updated as new data arrives.
• Predictions
• Batch
• Based on its input data, the model generates a table of predictions. Schedule a service to run regularly & output predictions to a DB.
• On demand
• Predictions are made in real time using input data available at prediction time, e.g. via a web service or real-time streaming analytics.
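The two prediction modes above can be sketched in Python (the model, feature names, and values are all illustrative):

```python
from datetime import datetime

def predict(features):
    """Toy model: score is a weighted sum of two features (illustrative only)."""
    return 0.7 * features["clicks"] + 0.3 * features["dwell_time"]

def batch_predict(rows):
    """Batch mode: score a whole table on a schedule and return records
    ready to be written to a predictions DB."""
    return [
        {"id": row["id"], "score": predict(row),
         "scored_at": datetime.utcnow().isoformat()}
        for row in rows
    ]

def on_demand_predict(features):
    """On-demand mode: score a single request with whatever input data
    is available at prediction time (e.g. behind a web service endpoint)."""
    return {"score": predict(features)}

rows = [
    {"id": 1, "clicks": 10, "dwell_time": 30},
    {"id": 2, "clicks": 2, "dwell_time": 5},
]
table = batch_predict(rows)          # scheduled job: many rows -> DB
single = on_demand_predict(rows[0])  # real time: one request -> one response
```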
ML E2E Work Flow
1. Generate Data
1. Fetch Data
2. Clean Data
3. Prepare/Transform Data
2. Train Model
1. Train Model
2. Evaluate Model
3. Deploy Model
Source: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-mlconcepts.html
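The workflow steps above can be sketched as a chain of functions (the data, the toy "model", and the registry are illustrative stand-ins, not any vendor's API):

```python
def fetch_data():
    # Stand-in for pulling raw records from a source system.
    return [{"x": 1.0, "y": 2.0}, {"x": None, "y": 4.0}, {"x": 3.0, "y": 6.0}]

def clean_data(rows):
    # Drop records with missing values.
    return [r for r in rows if all(v is not None for v in r.values())]

def transform_data(rows):
    # Turn records into (features, label) pairs.
    return [([r["x"]], r["y"]) for r in rows]

def train_model(examples):
    # Toy "training": fit y = w * x by averaging y/x ratios.
    w = sum(y / x[0] for x, y in examples) / len(examples)
    return {"w": w}

def evaluate_model(model, examples):
    # Mean absolute error (in practice on held-out data, not training data).
    return sum(abs(model["w"] * x[0] - y) for x, y in examples) / len(examples)

def deploy_model(model, registry):
    registry["current"] = model
    return registry

registry = {}
data = transform_data(clean_data(fetch_data()))
model = train_model(data)
mae = evaluate_model(model, data)
deploy_model(model, registry)
```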
ML Work Flow (diagrams from slides 1-7 not reproduced)
Source: https://www.slideshare.net/fridiculous/managing-and-versioning-machine-learning-models-in-python
Amazon Sagemaker : ML Work Flow
- Jupyter Notebook is used for data exploration, preprocessing, training and evaluation
- Model is deployed independently, decoupled from app code (as a service)
- After model deployment
- Monitor predictions, collect "ground truth"
- Evaluate model to identify drift
- Update training data to include newly collected ground truth
- Retrain the model with the new dataset (increased accuracy)
Amazon Sagemaker
- Two components
- Model Training
- Model Deployment
- Training
- Post training, model artifact will be saved in S3
- Validation
- Offline Testing
- Deploy the trained model to an alpha endpoint
- Use historical data to test
- Online Testing
- Candidate model is actually deployed
- 10% of live traffic goes to the model
- Deployment
- Create model using artifact in S3 (trained model)
- Model for inference becomes available over HTTP
- Endpoint can be configured to elastically scale the deployed ML compute instances
- Append user's input data & ground truth (if available) to already existing training data
- Multiple variants of same model may get deployed
Source: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf
Uber - Michelangelo
- Provides scalable, reliable, reproducible, easy-to-use, and automated tools to address
- Manage data
- Train models
- Evaluate models
- Deploy models
- Make predictions
- Monitor predictions
- Tech Stack
- HDFS, Spark, Samza, Cassandra, MLlib, XGBoost & TensorFlow
Source: https://eng.uber.com/michelangelo/
Michelangelo : Get Data
- Data Pipeline
- Generates feature & label (outcome) data sets for (re)training
- Generates feature-only data sets for predicting
- Monitor for data quality
- Offline Pipeline
- Batch model training & prediction
- Transactional & log data into HDFS data lake
- Regular Spark, SQL jobs compute features & publish to feature store
- Online Pipeline
- Online, low latency prediction
- Online models can't access HDFS
- Batch pre-compute: load historical features from HDFS to Cassandra (every few hours / once a day)
- Near real-time compute: store low latency features to Cassandra & log back to HDFS
- Same data generation/preparation process during training and predicting
Michelangelo : Feature Store
• Allows teams to share, discover, and use a highly curated set of features
• Reduces duplication & increases data quality
• Users can add a feature with metadata (owner, description, SLA)
• Feature can be invoked by a model using canonical names
• System handles joining in the correct HDFS data sets for model training or batch prediction and fetching the right value from Cassandra for online predictions
• Feature Store features are automatically calculated and updated daily
• Custom DSL has been created for feature selection and transformation
• Filling missing values, normalization
• Create different features from a time-stamp
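The kinds of transformations such a feature DSL covers (missing-value filling, normalization, timestamp-derived features) can be sketched in plain Python; function names and defaults are illustrative, not Michelangelo's actual DSL:

```python
from datetime import datetime

def fill_missing(values, default=0.0):
    # Missing-value imputation: replace None with a default
    # (a mean or median would also be common).
    return [default if v is None else v for v in values]

def normalize(values):
    # Min-max normalization to [0, 1].
    lo, hi = min(values), max(values)
    span = hi - lo or 1.0  # avoid division by zero for constant features
    return [(v - lo) / span for v in values]

def timestamp_features(ts):
    # Derive several model features from a single timestamp.
    dt = datetime.fromisoformat(ts)
    return {"hour": dt.hour, "day_of_week": dt.weekday(), "month": dt.month}

raw = [3.0, None, 9.0]
filled = fill_missing(raw)    # [3.0, 0.0, 9.0]
scaled = normalize(filled)    # values mapped into [0, 1]
ts_feats = timestamp_features("2018-07-15T14:30:00")
```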
Michelangelo : Train Models
- Offline, large-scale, distributed training
- DT, GLM, K-Means, ARIMA, DL
- Spark MLlib, XGBoost, Caffe, TensorFlow
- Model identified by Model Configuration
- Model Type
- Hyper Parameters
- Data Source
- Feature DSL expressions
- Compute resources
- Training jobs run on a YARN or Mesos cluster
- Trained model is evaluated based on error metrics
- Performance metrics saved in an evaluation report
- Save model to repository
- Model configuration
- Learned parameters
- Evaluation report
- (Re)training jobs can be configured/managed using WebUI or API (using Jupyter Notebooks)
Michelangelo : Evaluate Models
- Hundreds of models are trained before deciding on the final model
- Each model is versioned and stored in Cassandra
- Owner's name
- Start/End time of training
- Model configuration
- Reference to training & test data sets
- Distribution of relative importance of each feature
- Full learned parameters
- Accuracy metrics
- Standard charts for each model type (ROC curve, PR curve, confusion matrix)
- Multiple models should be comparable over WebUI, API
Michelangelo : Deploy Models
- Model deployment using UI/API
- Offline Deployment
- Deployed to offline container. Used for batch prediction
- Online Deployment
- Used by online prediction service
- Library Deployment
- Used by Java app as a library
- Model artifacts (metadata files, model parameter files, compiled DSL expressions) are packaged in a ZIP archive and copied to the relevant hosts
- The prediction containers automatically load the new models from disk and start handling prediction requests.
Michelangelo : Make Predictions
- Makes predictions based on feature data loaded from a pipeline or from a client service
- Raw features pass through DSL expressions, which modify raw features and fetch additional features
- Final feature vector is passed to the model for scoring
- Online models return predictions to the client
- Offline models write predictions back to Hive and Kafka
Michelangelo : Monitor Predictions
- Monitoring ensures
- Pipelines are sending accurate data
- Production data has not changed much from training data
- Live measurement of accuracy
- Holds back a certain percentage of predictions
- Later joins these predictions to observed outcomes generated by the pipeline
- Calculates error metrics (R-Squared, RMSLE, RMSE etc.)
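The hold-back-and-join scheme above can be sketched as follows (the ids, values, and the choice of RMSE are illustrative):

```python
import math

# Held-back predictions logged at serving time, keyed by request id.
held_back = {101: 4.0, 102: 2.5, 103: 7.0}

# Observed outcomes that arrive later from the data pipeline.
outcomes = {101: 5.0, 103: 7.0, 104: 3.0}

def live_rmse(predictions, observed):
    # Join predictions to outcomes on id; ids without an outcome yet are skipped.
    joined = [(predictions[i], observed[i]) for i in predictions if i in observed]
    return math.sqrt(sum((p - o) ** 2 for p, o in joined) / len(joined))

error = live_rmse(held_back, outcomes)
```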
Michelangelo : Partitioned Models
- One model per city
- Hierarchy of models: city, country, global
- When one level has insufficient data to train a model, it falls back to its parent or ancestor node
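The parent/ancestor fallback can be sketched as a lookup over a partition hierarchy (node names and model identifiers are illustrative):

```python
# Models keyed by partition node; a city without its own model falls back
# to its country, then to the global model.
models = {
    ("global",): "global-model",
    ("global", "US"): "us-model",
    ("global", "US", "San Francisco"): "sf-model",
}

def resolve_model(path):
    # Walk up the hierarchy until a trained model is found.
    node = tuple(path)
    while node:
        if node in models:
            return models[node]
        node = node[:-1]
    raise KeyError("no model trained for any ancestor of %r" % (path,))

sf = resolve_model(["global", "US", "San Francisco"])  # city-level model
nyc = resolve_model(["global", "US", "New York"])      # falls back to country
paris = resolve_model(["global", "FR", "Paris"])       # falls back to global
```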
Michelangelo : Referencing Models
• More than one model can be deployed on a serving container
• Transition from old to new models, side by side A/B Testing
• A model is specified by UUID & tag (optional) during deployment
• Online Model
• Client sends feature vector along with model UUID or tag
• For a tag, the model most recently deployed to that tag will make the prediction
• Batch Model
• All deployed models are used to score each batch data set
• Prediction records contain the model UUID and optional tag
• User can replace an old model with a new model without changing tag
• Model gets updated without changing client code
• User can upload a new model (new UUID)
• Traffic can be gradually switched from old model to new one
• A/B testing of models
• Users deploy competing models either via UUIDs or tags
• Send portions of the traffic to each model
• Track performance metrics
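A minimal sketch of UUID/tag deployment and A/B traffic splitting, assuming an in-process registry (all names are illustrative, not Michelangelo's actual API):

```python
import uuid
import random

class ModelRegistry:
    """Deploy models by UUID with an optional tag; the most recent
    deployment to a tag serves that tag, so clients keep working
    when the model behind a tag is replaced."""

    def __init__(self):
        self.by_uuid = {}
        self.by_tag = {}

    def deploy(self, model, tag=None):
        model_id = str(uuid.uuid4())
        self.by_uuid[model_id] = model
        if tag:
            self.by_tag[tag] = model_id
        return model_id

    def predict(self, features, model_id=None, tag=None):
        # Clients send a feature vector along with a model UUID or tag.
        if model_id is None:
            model_id = self.by_tag[tag]
        return self.by_uuid[model_id](features)

def ab_route(tag_a, tag_b, fraction_b, rng=random.random):
    # Send roughly `fraction_b` of traffic to the challenger model.
    return tag_b if rng() < fraction_b else tag_a

registry = ModelRegistry()
registry.deploy(lambda f: f["x"] * 2, tag="ranker")  # old model
registry.deploy(lambda f: f["x"] * 3, tag="ranker")  # replaces it, same tag
score = registry.predict({"x": 2}, tag="ranker")     # served by the new model
```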
Michelangelo : Scaling
• ML Models are stateless
• Online Models
• Add more nodes to cluster
• Offline Models
• Add more Spark executors
Michelangelo : Learnings
• There could be failure at every stage. Debuggability is important.
• Fault tolerance is important
• Abstract Machine Learning libraries
• Ease of Use : Automating ML WorkFlow through UI. Ability to train,
test, deploy models with mouse clicks
• ML Systems must be extensible & easy for other systems to interact
with programmatically (REST).
Best Practices By Google
• TensorFlow Extended (TFX), a TensorFlow based general-purpose
machine learning platform implemented at Google [KDD : 2017]
• Challenges of building Machine Learning Platforms
• One machine learning platform for many different learning tasks (in terms of data representation, storage infrastructure, and machine learning tasks)
• Continuous Training and Serving
• Continuous training over evolving data (e.g. a moving window over the latest n days)
• Human in loop
• Simple interface for deployment and monitoring
• Production-level reliability and scalability
• Resilient to inconsistent data, failures in underlying execution environment
• Should be able to handle high data volume during training
• Should be able to handle increase in traffic in serving system
http://www.kdd.org/kdd2017/papers/view/tfx-a-tensorflow-based-production-scale-machine-learning-platform
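The moving-window idea for continuous training can be sketched as follows (field names and dates are illustrative):

```python
from datetime import date, timedelta

def training_window(examples, today, n_days):
    # Continuous training over evolving data: keep only examples whose
    # timestamp falls within the latest n days.
    cutoff = today - timedelta(days=n_days)
    return [e for e in examples if e["day"] >= cutoff]

examples = [
    {"day": date(2018, 7, 1), "x": 1},
    {"day": date(2018, 7, 10), "x": 2},
    {"day": date(2018, 7, 14), "x": 3},
]
window = training_window(examples, today=date(2018, 7, 15), n_days=7)
```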
TFX : Pipeline
• Models are only as good as their training data
• Understanding data and finding anomalies early is critical for
preventing data errors downstream
• Rigorous checks for data quality should be a part of any long running
development of a machine learning platform
• Small bugs in the data can significantly degrade model quality over a
period of time in a way that is hard to detect and diagnose
TFX : Data Analysis, Transformation & Validation
TFX : Data Analysis
• Component processes each dataset fed to the system and generates a set of descriptive statistics on the included features
• Statistics about presence of each feature in the data
• Number of examples with and without the feature
• Distribution of the number of values per example
• Statistics over feature values
• For continuous features: quantiles, equi-width histograms, mean, standard deviation
• For discrete features: top-K values by frequency
• Statistics about slices of data (e.g., negative and positive classes)
• Cross-feature statistics (correlation and covariance between features)
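These descriptive statistics can be sketched with the Python standard library (feature names and values are illustrative, not TFX's implementation):

```python
from collections import Counter
import statistics

def continuous_stats(values):
    # Descriptive statistics for a continuous feature.
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),
        "quartiles": statistics.quantiles(values, n=4),
    }

def discrete_stats(values, k=2):
    # Top-K values by frequency for a discrete feature.
    return Counter(values).most_common(k)

def presence_stats(examples, feature):
    # Number of examples with and without the feature.
    present = sum(1 for e in examples if feature in e)
    return {"with": present, "without": len(examples) - present}

cont = continuous_stats([1.0, 2.0, 3.0, 4.0])
disc = discrete_stats(["US", "US", "FR", "US", "DE"])
pres = presence_stats([{"a": 1}, {"a": 2}, {"b": 3}], "a")
```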
TFX : Data Transformation
• A suite of data transformations to allow feature wrangling for model
training and serving
• Transformation logic must be consistent during training and serving
TFX : Data Validation
• Is the data healthy or are there anomalies to be flagged?
• Validation is done using a versioned schema
• Features present in the data
• Expected type of each feature
• Expected presence of each feature (minimum count & fraction
of examples that must contain feature)
• Expected valency of the feature in each example (minimum and
maximum number of values)
• Expected domain of a feature (Values for a string feature or
range for an integer feature)
• Validate the properties of specific (training and serving)
datasets, flag any deviations from the schema as
potential anomalies
• Provide actionable suggestions to fix the anomaly
• Teams are responsible for maintaining the schema
• Users should treat data issues as bugs
• Different versions of the schema show the evolution of the data
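The schema checks listed above can be sketched as follows (the schema format and feature names are illustrative, not TFX's actual schema):

```python
# A versioned schema entry per feature: expected type, presence, and domain.
schema = {
    "country": {"type": str, "required": True, "domain": {"US", "FR", "DE"}},
    "age": {"type": int, "required": True, "min": 0, "max": 130},
}

def validate(example, schema):
    # Flag any deviation from the schema as a potential anomaly.
    anomalies = []
    for name, spec in schema.items():
        if name not in example:
            if spec.get("required"):
                anomalies.append("missing required feature: %s" % name)
            continue
        value = example[name]
        if not isinstance(value, spec["type"]):
            anomalies.append("%s: expected %s" % (name, spec["type"].__name__))
            continue
        if "domain" in spec and value not in spec["domain"]:
            anomalies.append("%s: %r outside expected domain" % (name, value))
        if "min" in spec and value < spec["min"]:
            anomalies.append("%s: below expected range" % name)
        if "max" in spec and value > spec["max"]:
            anomalies.append("%s: above expected range" % name)
    return anomalies

ok = validate({"country": "US", "age": 33}, schema)    # no anomalies
bad = validate({"country": "XX", "age": 200}, schema)  # two anomalies
```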
TFX : Model Training
• Automate and streamline training of
production quality models
• Often training needs huge datasets (Time &
Resource intensive)
• Training dataset of 100B data points. Takes
several days to train
• Warm Starting (Continuous Training)
• High Level Model Specification API
• Higher level abstraction layer that hides
implementation
TFX : Model Evaluation & Validation
• A reusable component that automatically evaluates and validates models
to ensure that they are "good" before serving them to users
• Can help prevent unexpected degradations in the user experience
• Avoids real issue in production
• Definition of A Good Model
• Model is safe to serve
• Model should not crash or cause errors in the serving system when being loaded or when sent unexpected inputs
• Model shouldn't use too many resources (CPU, RAM)
• Has desired production quality
• User satisfaction and product health on live traffic
• Better measure than the objective function on the training data
TFX : Model Evaluation
• Evaluation : Human Facing Metrics of model quality
• A/B testing on live traffic is costly and time consuming
• Models are evaluated offline on held-out data to determine if they are promising
enough to start an online A/B testing
• Provides AUC or cost-weighted error
• Once satisfied with models' offline performance, A/B testing can be performed with live
traffic
• Validation : Machine Facing judgment of model goodness
• Once model is launched in production, automated validation is used to ensure updated
model is good
• Updated Model receives a small amount of traffic (canary process)
• Prediction quality is evaluated by comparing the model quality against a fixed threshold as
well as against a baseline model (e.g., the current production model)
• Any new model failing these checks is not pushed to serving
• Challenges
• Canary process will not catch all potential errors
• With changing training data, some variation in model behavior is expected, but it is hard to distinguish from a genuine problem
• Slicing
• Useful in both evaluating and validating models
• Metrics are computed on a small slice of data with a particular feature value (e.g., country == "US")
• Metrics on the entire dataset can fail to reflect performance on small slices
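Slicing can be sketched as follows (the features, predictions, and choice of accuracy as the metric are illustrative):

```python
def accuracy(examples):
    return sum(1 for e in examples if e["pred"] == e["label"]) / len(examples)

def slice_metric(examples, feature, value, metric=accuracy):
    # Compute the metric only on the slice where feature == value.
    return metric([e for e in examples if e[feature] == value])

examples = [
    {"country": "US", "pred": 1, "label": 1},
    {"country": "US", "pred": 0, "label": 1},
    {"country": "FR", "pred": 1, "label": 1},
    {"country": "FR", "pred": 1, "label": 1},
]
overall = accuracy(examples)                       # looks fine overall
us_only = slice_metric(examples, "country", "US")  # much worse on this slice
```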
TFX : Model Serving
• TensorFlow Serving provides a complete serving solution for machine-learned models to be deployed in production environments
• Serving systems for production environments require low latency, high efficiency, horizontal scalability, reliability & robustness
• Multitenancy
• A single instance of the server serves multiple machine-learned models concurrently
• Performance characteristics of one model have minimal impact on other models
Other ML Tools (As of July, 2018)
• Google DataLabs
• TensorFlow Serving
• Azure ML Service (Not Azure ML Studio)
• Azure ML Experimentation
• Azure ML Model Management
• Azure ML Workbench
• MLFlow by Databricks
• prediction.io
• clipper.ai
• SKIL - deeplearning4j
• & More…
Model Management (As of July, 2018)
• ML : Rapid experimentation + iteration
• Model History tracks the progress
• Versioning, Management, Monitoring, Deployment, Hosting
• Reproducibility, Rollback
• Open Source
• ModelDB
• MLFlow - E2E ML
• Pachyderm
• H2O Steam (No active development)
• comet.ml (Cloud Service)
• Do it yourself
• Source Code
• SW/HW Environment
• Data
• Visualizations/Reports
• Weight Files (Trained Models)
• Configurations (Model Type, Hyperparameters, Features)
• Performance Metrics
• …..
Best Practices
• Separate training and prediction/serving pipeline
• Automate model re-training. Keep the model fresh.
• Save actual data used at prediction time to DB. Keep expanding it.
• Use this data for retraining
• Evaluate a trained model before deployment
• Verify the model crosses a threshold error score on a predefined dataset
• Post deployment
• Monitor the performance of the model
• Business Objective
• Error Metrics
• Identify cases where the model is struggling
• Check if those use cases are needed?
• Improve data collection?
• Monitor customer feedback to identify cases to improve upon
• Monitor distribution of features fed into model at training/prediction time
• Identify errors at the data extractors
• Nature of the data may be eventually changing
• A/B Testing
• Between training and deployment, there should be a staging environment
Questions to Ask
• What kind of business problem are we going to solve?
• Description
• What kind of ML (Batch/Online)?
• Scalability?
• How to ensure quality of labeled data?
• How do I find out the effectiveness of the model?
• What do we need?
• Complete ML pipeline covering the whole lifecycle OR
• Just separate model training from prediction using existing frameworks
• How do we want to use ML (as a service or as a library)?
• How important is reproducibility & interpretation?
• How to handle customer specific data/behavior?
Questions?

Introduction to Machine learning
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
 
MLops on Vertex AI Presentation (AI/ML).pptx
MLops on Vertex AI Presentation (AI/ML).pptxMLops on Vertex AI Presentation (AI/ML).pptx
MLops on Vertex AI Presentation (AI/ML).pptx
 
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar...
 
World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018
 
Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...
 

Dernier

Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 

Dernier (20)

Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 

A survey on Machine Learning In Production (July 2018)

Amazon Sagemaker
- Two components
  - Model Training
  - Model Deployment
- Training
  - Post training, the model artifact is saved in S3
- Validation
  - Offline testing
    - Deploy the trained model to an alpha endpoint
    - Use historical data to test
  - Online testing
    - Candidate model is actually deployed
    - 10% of live traffic goes to the model
- Deployment
  - Create a model using the artifact in S3 (trained model)
  - Model for inference becomes available over HTTP
  - Endpoint can be configured to elastically scale the deployed ML compute instances
  - Append the user's input data & ground truth (if available) to the existing training data
  - Multiple variants of the same model may get deployed
Source: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf
Uber - Michelangelo
- Provides scalable, reliable, reproducible, easy-to-use, and automated tools to
  - Manage data
  - Train models
  - Evaluate models
  - Deploy models
  - Make predictions
  - Monitor predictions
- Tech stack
  - HDFS, Spark, Samza, Cassandra, MLlib, XGBoost & TensorFlow
https://eng.uber.com/michelangelo/
Michelangelo : Get Data
- Data pipeline
  - Generates feature & label (outcome) data sets for (re)training
  - Generates feature-only data sets for predicting
  - Monitors data quality
- Offline pipeline
  - Batch model training & prediction
  - Transactional & log data flows into the HDFS data lake
  - Regular Spark and SQL jobs compute features & publish them to the Feature Store
- Online pipeline
  - Online, low latency prediction
  - Online models can't access HDFS
  - Batch pre-compute: load historical features from HDFS to Cassandra (every few hours / once a day)
  - Near real time compute: store low latency features to Cassandra & log them back to HDFS
- Same data generation/preparation process during training and predicting
Michelangelo : Feature Store
• Allows teams to share, discover, and use a highly curated set of features
• Reduces duplication & increases data quality
• Users can add a feature with metadata (owner, description, SLA)
• A feature can be invoked by a model using its canonical name
• The system handles joining in the correct HDFS data sets for model training or batch prediction, and fetching the right value from Cassandra for online predictions
• Feature Store features are automatically calculated and updated daily
• A custom DSL has been created for feature selection and transformation
  • Filling missing values, normalization
  • Creating different features from a time-stamp
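The kinds of transformations the feature DSL handles can be sketched in plain Python. This is an illustrative sketch, not Uber's API: the function names (`fill_missing`, `expand_timestamp`) and the feature names are invented for the example.

```python
from datetime import datetime

def fill_missing(features, name, default):
    """Return a copy of the feature dict with a default for a missing value."""
    out = dict(features)
    if out.get(name) is None:
        out[name] = default
    return out

def expand_timestamp(features, name):
    """Derive hour-of-day and day-of-week features from an ISO timestamp."""
    ts = datetime.fromisoformat(features[name])
    out = dict(features)
    out[name + "_hour"] = ts.hour
    out[name + "_weekday"] = ts.weekday()  # Monday == 0
    return out

raw = {"trip_distance": None, "pickup_time": "2018-07-15T18:30:00"}
fv = expand_timestamp(fill_missing(raw, "trip_distance", 0.0), "pickup_time")
```

Running the same transformation code at training and prediction time is what keeps the two feature vectors consistent.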
Michelangelo : Train Models
- Offline, large-scale, distributed training
  - DT, GLM, K-Means, ARIMA, DL
  - Spark MLlib, XGBoost, Caffe, TensorFlow
- A model is identified by a Model Configuration
  - Model type
  - Hyperparameters
  - Data source
  - Feature DSL expressions
  - Compute resources
- Training jobs run on a YARN or Mesos cluster
- The trained model is evaluated based on error metrics
  - Performance metrics are saved in an evaluation report
- Save model to repository
  - Model configuration
  - Learned parameters
  - Evaluation report
- (Re)training jobs can be configured/managed using a Web UI or API (using Jupyter Notebooks)
Michelangelo : Evaluate Models
- Hundreds of models are trained before deciding on the final model
- Each model is versioned and stored in Cassandra with
  - Owner's name
  - Start/end time of training
  - Model configuration
  - References to the training & test data sets
  - Distribution of relative importance of each feature
  - Full learned parameters
  - Accuracy metrics
  - Standard charts for each model type (ROC curve, PR curve, confusion matrix)
- Multiple models should be comparable over the Web UI and API
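A minimal in-memory sketch of such a versioned model registry, assuming one metadata record per trained model (Michelangelo stores these in Cassandra; the record fields and class names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ModelRecord:
    name: str
    version: int
    owner: str
    config: dict
    metrics: dict

class ModelRegistry:
    def __init__(self):
        self._records = {}  # (name, version) -> ModelRecord

    def register(self, record):
        self._records[(record.name, record.version)] = record

    def latest(self, name):
        versions = [v for (n, v) in self._records if n == name]
        return self._records[(name, max(versions))]

    def compare(self, name, metric):
        """Metric value per version, e.g. to compare AUC across versions."""
        return {v: r.metrics[metric]
                for (n, v), r in self._records.items() if n == name}

reg = ModelRegistry()
reg.register(ModelRecord("eta", 1, "arnab", {"type": "GLM"}, {"auc": 0.71}))
reg.register(ModelRecord("eta", 2, "arnab", {"type": "XGBoost"}, {"auc": 0.78}))
```

The `compare` view is what makes "multiple models comparable over the Web UI and API" possible: every candidate keeps its metrics alongside the exact configuration that produced them.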
Michelangelo : Deploy Models
- Model deployment using UI/API
- Offline deployment
  - Deployed to an offline container; used for batch prediction
- Online deployment
  - Used by the online prediction service
- Library deployment
  - Used by a Java app as a library
- Model artifacts (metadata files, model parameter files, compiled DSL expressions) are packaged in a ZIP archive and copied to the relevant hosts
- The prediction containers automatically load the new models from disk and start handling prediction requests
Michelangelo : Make Predictions
- Makes predictions based on feature data loaded from the pipeline or from a client service
- Raw features pass through DSL expressions, which can modify raw features and fetch additional features
- The final feature vector is passed to the model for scoring
- Online models return predictions to the client
- Offline models write predictions back to Hive and Kafka
Michelangelo : Monitor Predictions
- Monitoring ensures
  - Pipelines are sending accurate data
  - Production data has not changed much from training data
- Live measurement of accuracy
  - Holds back a certain percentage of predictions
  - Later joins these predictions to the observed outcomes generated by the pipeline
  - Calculates error metrics (R-squared, RMSLE, RMSE etc.)
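The live-accuracy loop can be sketched as a join between held-back predictions and the outcomes that arrive later, followed by an error metric over the matched pairs. The record layout (id → value) and the ETA numbers are invented for illustration:

```python
import math

held_back = {"trip-1": 12.0, "trip-2": 8.5, "trip-3": 20.0}   # predicted ETA (min)
outcomes  = {"trip-1": 11.0, "trip-3": 23.0}                  # observed so far

def rmse_on_joined(predictions, observed):
    """RMSE over predictions whose outcome has already arrived."""
    pairs = [(predictions[k], observed[k]) for k in predictions if k in observed]
    if not pairs:
        return None
    return math.sqrt(sum((p - o) ** 2 for p, o in pairs) / len(pairs))

error = rmse_on_joined(held_back, outcomes)
```

Predictions without an outcome yet (`trip-2` above) simply stay out of the metric until their ground truth lands.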
Michelangelo : Partitioned Models
- One model per city
- Hierarchy of models: city, country, global
- When one level has insufficient data to train a model, it falls back to its parent or ancestor node
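The fallback walk can be sketched as a keyed lookup, assuming models are keyed by (city, country) and a model exists only where there was enough data; the city and model names are made up:

```python
models = {
    ("san_francisco", "US"): "model-sf",
    (None, "US"): "model-us",        # country-level model
    (None, None): "model-global",    # global fallback
}

def resolve_model(city, country):
    """Walk up the city -> country -> global hierarchy."""
    for key in [(city, country), (None, country), (None, None)]:
        if key in models:
            return models[key]
```

A city without its own model falls back to the country-level model, and an unseen country falls all the way back to the global one.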
Michelangelo : Referencing Models
• More than one model can be deployed on a serving container
  • Transition from old to new models, side-by-side A/B testing
• A model is specified by UUID & an optional tag during deployment
• Online models
  • The client sends a feature vector along with a model UUID or tag
  • For a tag, the model most recently deployed to that tag makes the prediction
• Batch models
  • All deployed models are used to score each batch data set
  • Prediction records contain the model UUID and optional tag
• A user can replace an old model with a new model without changing the tag
  • The model gets updated without changing client code
• A user can upload a new model (new UUID)
  • Traffic can be gradually switched from the old model to the new one
• A/B testing of models
  • Users deploy competing models via UUIDs or tags
  • Send portions of the traffic to each model
  • Track performance metrics
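Gradual traffic switching can be sketched with deterministic hash-based routing, so a given request id consistently hits the same variant while the weights are shifted toward the new model. The model ids and the 90/10 split are illustrative, not Michelangelo's actual mechanism:

```python
import hashlib

routes = [("old-model-uuid", 0.9), ("new-model-uuid", 0.1)]

def pick_model(request_id, routes):
    """Map the request id to [0, 1) and walk the cumulative weights."""
    digest = hashlib.sha256(request_id.encode()).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for model_id, weight in routes:
        cumulative += weight
        if point < cumulative:
            return model_id
    return routes[-1][0]  # guard against float rounding
```

Hashing instead of random sampling keeps each user's experience stable during the experiment, which also makes per-variant metrics cleaner to attribute.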
Michelangelo : Scaling
• ML models are stateless
• Online models
  • Add more nodes to the cluster
• Offline models
  • Add more Spark executors
Michelangelo : Learnings
• There can be failure at every stage, so debuggability is important
• Fault tolerance is important
• Abstract the machine learning libraries
• Ease of use: automate the ML workflow through a UI, with the ability to train, test, and deploy models with mouse clicks
• ML systems must be extensible & easy for other systems to interact with programmatically (REST)
Best Practices By Google
• TensorFlow Extended (TFX), a TensorFlow-based general-purpose machine learning platform implemented at Google [KDD 2017]
• Challenges of building machine learning platforms
  • One machine learning platform for many different learning tasks (in terms of data representation, storage infrastructure, and machine learning tasks)
  • Continuous training and serving
    • Continuous training over evolving data (e.g. a moving window over the latest n days)
  • Human in the loop
    • Simple interface for deployment and monitoring
  • Production-level reliability and scalability
    • Resilient to inconsistent data and failures in the underlying execution environment
    • Should be able to handle high data volume during training
    • Should be able to handle increases in traffic in the serving system
http://www.kdd.org/kdd2017/papers/view/tfx-a-tensorflow-based-production-scale-machine-learning-platform
TFX : Data Analysis, Transformation & Validation
• Models are only as good as their training data
• Understanding data and finding anomalies early is critical for preventing data errors downstream
• Rigorous checks for data quality should be a part of any long-running development of a machine learning platform
• Small bugs in the data can significantly degrade model quality over a period of time in a way that is hard to detect and diagnose
TFX : Data Analysis
• The component processes each dataset fed to the system and generates a set of descriptive statistics on the included features
  • Statistics about the presence of each feature in the data
    • Number of examples with and without the feature
    • Distribution of the number of values per example
  • Statistics over feature values
    • For continuous features: quantiles, equi-width histograms, mean, standard deviation
    • For discrete features: top-K values by frequency
  • Statistics about slices of data (e.g., negative and positive classes)
  • Cross-feature statistics (correlation and covariance between features)
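A minimal sketch of this kind of per-feature statistic, computed over a list of example dicts; the feature names and example values are invented, and TFX itself computes these at scale rather than in-memory like this:

```python
from collections import Counter
from statistics import mean, stdev

examples = [
    {"fare": 10.0, "city": "sf"},
    {"fare": 12.5, "city": "sf"},
    {"fare": 8.0, "city": "nyc"},
    {"city": "sf"},  # 'fare' missing in this example
]

def feature_stats(examples, name):
    """Presence counts plus mean/stdev (continuous) or top-K (discrete)."""
    values = [ex[name] for ex in examples if name in ex]
    stats = {"present": len(values), "missing": len(examples) - len(values)}
    if values and isinstance(values[0], (int, float)):
        stats["mean"] = mean(values)
        stats["stdev"] = stdev(values) if len(values) > 1 else 0.0
    else:
        stats["top_k"] = Counter(values).most_common(2)
    return stats
```

Surfacing the presence counts is often the most useful part: a feature that suddenly goes missing in a fraction of examples is exactly the kind of upstream bug that silently degrades a model.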
TFX : Data Transformation
• A suite of data transformations to allow feature wrangling for model training and serving
• Transformation logic must be consistent during training and serving
TFX : Data Validation
• Is the data healthy, or are there anomalies to be flagged?
• Validation is done using a versioned schema that captures
  • Features present in the data
  • Expected type of each feature
  • Expected presence of each feature (minimum count & fraction of examples that must contain the feature)
  • Expected valency of the feature in each example (minimum and maximum number of values)
  • Expected domain of a feature (values for a string feature, or range for an integer feature)
• Validates the properties of specific (training and serving) datasets and flags any deviations from the schema as potential anomalies
• Provides actionable suggestions to fix the anomaly
• Teams are responsible for maintaining the schema
• Users should treat data issues as bugs
• Different versions of the schema show the evolution of the data
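Schema-based validation in this spirit can be sketched as follows; the schema format (a dict of expected type and domain per feature) is invented for illustration and is much simpler than TFX's actual schema:

```python
schema = {
    "city": {"type": str, "domain": {"sf", "nyc", "la"}},
    "fare": {"type": float, "domain": (0.0, 500.0)},  # inclusive range
}

def validate(example, schema):
    """Return a list of human-readable anomalies; empty means healthy."""
    anomalies = []
    for name, spec in schema.items():
        if name not in example:
            anomalies.append(f"{name}: missing feature")
            continue
        value = example[name]
        if not isinstance(value, spec["type"]):
            anomalies.append(f"{name}: expected {spec['type'].__name__}")
        elif isinstance(spec["domain"], tuple):
            lo, hi = spec["domain"]
            if not lo <= value <= hi:
                anomalies.append(f"{name}: {value} outside [{lo}, {hi}]")
        elif value not in spec["domain"]:
            anomalies.append(f"{name}: unexpected value {value!r}")
    return anomalies
```

Because the schema is data, it can be versioned alongside the models, which is how the evolution of the data stays visible over time.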
TFX : Model Training
• Automates and streamlines training of production-quality models
• Training often needs huge datasets (time & resource intensive)
  • A training dataset of 100B data points can take several days to train
• Warm starting (continuous training)
• High-level model specification API
  • A higher-level abstraction layer that hides the implementation
TFX : Model Evaluation & Validation
• A reusable component that automatically evaluates and validates models to ensure that they are "good" before serving them to users
  • Can help prevent unexpected degradations in the user experience
  • Avoids real issues in production
• Definition of a good model
  • The model is safe to serve
    • It should not crash or cause errors in the serving system when being loaded or when sent unexpected inputs
    • It shouldn't use too many resources (CPU, RAM)
  • It has the desired production quality
    • User satisfaction and product health on live traffic
    • A better measure than the objective function on the training data
TFX : Model Evaluation
• Evaluation: human-facing metrics of model quality
• A/B testing on live traffic is costly and time consuming
  • Models are evaluated offline on held-out data to determine if they are promising enough to start an online A/B test
  • Provides AUC or cost-weighted error
• Once satisfied with a model's offline performance, A/B testing can be performed with live traffic
TFX : Model Validation
• Validation: machine-facing judgment of model goodness
• Once a model is launched in production, automated validation is used to ensure each updated model is good
  • The updated model receives a small amount of traffic (canary process)
  • Prediction quality is evaluated by comparing the model quality against a fixed threshold as well as against a baseline model (e.g., the current production model)
  • Any new model failing these checks is not pushed to serving
• Challenges
  • The canary process will not catch all potential errors
  • With changing training data, some variation in model behavior is expected, but it is hard to distinguish from a real regression
• Slicing
  • Useful in both evaluating and validating models
  • Metrics are computed on a small slice of data with a particular feature value (e.g., country == "US")
  • Metrics on the entire dataset can fail to reflect performance on small slices
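The validation gate described above reduces to two checks: a fixed quality bar and a no-regression comparison against the baseline. This sketch assumes AUC as the metric; the threshold and regression margin are illustrative defaults, not values from the TFX paper:

```python
def validate_candidate(candidate_auc, baseline_auc,
                       threshold=0.70, max_regression=0.01):
    """Return True only if the candidate is safe to push to serving."""
    if candidate_auc < threshold:
        return False                      # fails the fixed quality bar
    if candidate_auc < baseline_auc - max_regression:
        return False                      # regresses too far vs. production
    return True
```

In practice the same gate would run per slice as well, since a model can pass on the full dataset while regressing badly on one country or segment.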
TFX : Model Serving
• TensorFlow Serving provides a complete serving solution for machine-learned models to be deployed in production environments
• Serving systems for production environments require low latency, high efficiency, horizontal scalability, reliability & robustness
• Multitenancy
  • A single instance of the server serves multiple machine-learned models concurrently
  • The performance characteristics of one model have minimal impact on the other models
Other ML Tools (As of July, 2018)
• Google Cloud Datalab
• TensorFlow Serving
• Azure ML Service (not Azure ML Studio)
  • Azure ML Experimentation
  • Azure ML Model Management
  • Azure ML Workbench
• MLflow by Databricks
• prediction.io
• clipper.ai
• SKIL - deeplearning4j
• & more…
Model Management (As of July, 2018)
• ML means rapid experimentation + iteration
  • Model history tracks the progress
• Versioning, management, monitoring, deployment, hosting
  • Reproducibility, rollback
• Open source
  • ModelDB
  • MLflow - E2E ML
  • Pachyderm
  • H2O Steam (no active development)
  • comet.ml (cloud service)
• Do it yourself: track
  • Source code
  • SW/HW environment
  • Data
  • Visualizations/reports
  • Weight files (trained models)
  • Configurations (model type, hyperparameters, features)
  • Performance metrics
  • …
Best Practices
• Separate the training and prediction/serving pipelines
• Automate model re-training; keep the model fresh
• Save the actual data used at prediction time to a DB and keep expanding it
  • Use this data for retraining
• Evaluate a trained model before deployment
  • Verify the model crosses a threshold error score on a predefined dataset
• Post deployment
  • Monitor the performance of the model
    • Business objective
    • Error metrics
  • Identify cases where the model is struggling
    • Check whether those use cases are needed
    • Improve data collection?
  • Monitor customer feedback to identify cases to improve upon
  • Monitor the distribution of features fed into the model at training/prediction time
    • Identify errors at the data extractors
    • The nature of the data may be changing over time
  • A/B testing
• Between training and deployment, there should be a staging environment
Questions to Ask
• What kind of business problem are we going to solve?
  • Description
• What kind of ML (batch/online)?
• Scalability?
• How to ensure the quality of labeled data?
• How do I find out the effectiveness of the model?
• What do we need?
  • A complete ML pipeline covering the whole lifecycle, OR
  • Just separating model training from prediction using existing frameworks
• How do we want to use ML (as a service or as a library)?
• How important are reproducibility & interpretation?
• How to handle customer-specific data/behavior?