Microsoft Azure Machine Learning
Anatomy of a machine learning service
Sharat Chikkerur, Senior Software Engineer, Microsoft
(On behalf of AzureML team)
Microsoft Azure Machine Learning (AzureML)
• AzureML is a cloud-hosted tool for creating and deploying machine
learning models
• Browser-based, zero-installation and cross platform
• Describe workflows graphically
• Workflows are versioned and support reproducibility
• Models can be programmatically retrained
• Models can be deployed to Azure as a scalable web service
• Can be scaled to 1000+ endpoints × 200 response containers per service
• Supports versioning, collaboration & monetization
Outline
• Distinguishing features (functional components) of AzureML
• Architectural components of AzureML
• Implementation details
• Lessons learned
Distinguishing features
MLStudio: Graphical authoring environment
AzureML Entities
Workspaces
Experiments
Graphs
Datasets
Assets
Actions
Web services
Versioning
• Each run of an experiment is versioned
• Can go back in time and examine historical results
• Intermediate results cached across experiments in workspace
• Each dataset has a unique source transformation
Collaboration
• Workspaces can be shared between multiple users
• Two users cannot however edit the same experiment simultaneously
• Any experiment can be pushed to a common AzureML gallery
• Allows experiments, models and transforms to be easily shared with the
AzureML user community
External Language Support
• Full-fidelity support for R, Python and SQL (via SQLite)
• AzureML datasets marshalled transparently
• R models marshalled into AzureML models
• Scripts available as part of operationalized web services
• Code isolation
• External language modules are executed within drawbridge (container)
• “Batteries included”
• R 3.1.0 with ~500 packages, Anaconda Python 2.7 with ~120 packages
Operationalization
• An experiment to be operationalized must be converted into a
“scoring” experiment
• Training and scoring experiments are “linked”
• A successful scoring experiment can be published as a web service
• Published web services are automatically managed, scaled out and load-balanced
• Web service available in two flavors
• Request/Response: Low-latency endpoint for scoring a single row at a time
• Batch: Endpoint for scoring a collection of records from Azure storage
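A minimal sketch of how a client might prepare a Request/Response call that scores a single row. The endpoint URL, API key, column names and JSON payload layout are placeholders assumed for illustration; the real values and schema come from the published service's own API help page. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical placeholders: substitute the URL and key of your own service.
ENDPOINT_URL = "https://example.invalid/workspaces/WORKSPACE_ID/services/SERVICE_ID/execute"
API_KEY = "YOUR_API_KEY"


def build_scoring_request(column_names, row, url=ENDPOINT_URL, api_key=API_KEY):
    """Build (but do not send) a POST request that scores a single row."""
    # Assumed payload layout: one named input holding column names and one row.
    payload = {
        "Inputs": {
            "input1": {"ColumnNames": list(column_names), "Values": [list(row)]}
        },
        "GlobalParameters": {},
    }
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_scoring_request(["age", "income"], [42, 55000])
# To call a live endpoint: urllib.request.urlopen(request).read()
```

A batch call would instead point the service at a collection of records in Azure storage rather than embedding rows in the request body.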
Monetization
• Data marketplace (http://datamarket.azure.com) allows users to
monetize data models
• Supports
• Web services published through AzureML
• Stand alone web services
• Integration
• Python/R modules can query external web services (including marketplace
APIs) allowing functional composition
Architectural components
Component services
• Studio (UX)
• Experimentation Service (ES)
• Composed of microservices
• Job Execution Service (JES)
• Single Node Runtime (SNR)
• Request response service (RRS)
• Batch execution service (BES)
[Architecture diagram: User, UX, ES, JES, SNR, RRS, BES]
Studio (UX)
• Primary UX layer
• Single page application
• Asset Palette
• Datasets
• Algorithms
• Trained models
• External language modules
• Experiment canvas
• DAG consisting of modules
• Module properties
• Parameters
• Action bar
• Commands to ES
Experimentation Service (ES)
• Primary backend
• Orchestrates all component services
• Handles events to/from UX
• Programmatic access
• RESTful API (UX communicates this way)
• Features
• Experiment introspection
• Experiment manipulation/creation
• Consists of micro services
• UX, assets, authentication, packing etc.
Job Execution Service (JES)
• Primary job scheduler
• Dependency tracking
• Experiment DAG defines dependencies between modules.
• Topological sort used to determine the order of execution
• Parallel Execution
• Different experiments can be executed in parallel
• Modules that sit at the same depth in the graph can be scheduled in parallel
• Note: JES itself does not execute the task payload. Tasks are
dispatched to a task queue
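The scheduling idea above can be sketched with Kahn's algorithm for topological sorting, grouping modules into waves that could be dispatched in parallel. This is an illustration only, not the JES implementation; the module names and the `schedule_levels` helper are hypothetical.

```python
from collections import defaultdict


def schedule_levels(edges, modules):
    """Group DAG modules into waves; modules in one wave do not depend on each other."""
    indegree = {m: 0 for m in modules}
    children = defaultdict(list)
    for upstream, downstream in edges:
        children[upstream].append(downstream)
        indegree[downstream] += 1

    ready = sorted(m for m in modules if indegree[m] == 0)
    levels = []
    scheduled = 0
    while ready:
        levels.append(ready)
        scheduled += len(ready)
        next_ready = []
        for module in ready:
            for child in children[module]:
                indegree[child] -= 1
                if indegree[child] == 0:  # all upstream modules are done
                    next_ready.append(child)
        ready = sorted(next_ready)
    if scheduled != len(modules):
        raise ValueError("experiment graph contains a cycle")
    return levels


modules = ["read_a", "read_b", "clean_a", "clean_b", "join", "train"]
edges = [
    ("read_a", "clean_a"),
    ("read_b", "clean_b"),
    ("clean_a", "join"),
    ("clean_b", "join"),
    ("join", "train"),
]
levels = schedule_levels(edges, modules)
for depth, wave in enumerate(levels):
    print(depth, wave)
```

Each wave corresponds to a set of tasks that a scheduler could place on the task queue at the same time.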
Single Node Runtime (SNR)
• Executes tasks dispatched from JES
• Consumes tasks from a queue
• Tasks consist of an input specification along with module parameters
• Stateless: data required for execution is copied over
• Each SNR contains a copy of Runtime + modules
• Runtime: DataTables, array implementation, I/O, base classes, etc.
• Modules: machine learning algorithms
• SNR pool shared across deployment
• Size of the pool can be scaled based on demand
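A minimal sketch of a stateless worker in the spirit of the SNR description: it pulls self-contained tasks from a queue and keeps no state between them. The module registry, task fields and `run_worker` function are hypothetical names chosen for illustration.

```python
import queue

# Hypothetical module registry: each worker holds its own copy of the modules.
MODULES = {
    "add_columns": lambda data, params: [a + b for a, b in zip(data["x"], data["y"])],
    "scale": lambda data, params: [v * params["factor"] for v in data["x"]],
}


def run_worker(task_queue, results):
    """Drain the queue; every task carries all inputs and parameters it needs."""
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return
        module = MODULES[task["module"]]
        results[task["task_id"]] = module(task["inputs"], task["parameters"])
        task_queue.task_done()


tasks = queue.Queue()
tasks.put({"task_id": "t1", "module": "add_columns",
           "inputs": {"x": [1, 2], "y": [10, 20]}, "parameters": {}})
tasks.put({"task_id": "t2", "module": "scale",
           "inputs": {"x": [1, 2, 3]}, "parameters": {"factor": 2}})

results = {}
run_worker(tasks, results)
print(results)
```

Because a task contains everything needed to run it, any worker in the pool can pick it up, which is what makes scaling the pool up or down straightforward.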
Machine learning algorithms
• Sources of machine learning module assets
• Microsoft Research
• Infer.NET (http://research.microsoft.com/en-us/um/cambridge/projects/infernet/)
• Vowpal Wabbit (http://hunch.net)
• Open source
• LibSVM
• Pegasos
• OpenCV
• R
• Scikit-learn
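Category | Subcategory | Module | Reference
Supervised | Binary classification | Average Perceptron | (Freund & Schapire, 1999)
Supervised | Binary classification | Bayes point machine | (Herbrich, Graepel, & Campbell, 2001)
Supervised | Binary classification | Boosted decision tree | (Burges, 2010)
Supervised | Binary classification | Decision jungle | (Shotton et al., 2013)
Supervised | Binary classification | Locally Deep SVM | (Jose & Goyal, 2013)
Supervised | Binary classification | Logistic regression | (Duda, Hart, & Stork, 2000)
Supervised | Binary classification | Neural network | (Bishop, 1995)
Supervised | Binary classification | Online SVM | (Shalev-Shwartz et al., 2011)
Supervised | Binary classification | Vowpal Wabbit | (Langford et al., 2007)
Supervised | Multiclass | Decision Forest | (Criminisi, 2011)
Supervised | Multiclass | Decision Jungle | (Shotton et al., 2013)
Supervised | Multiclass | Multinomial regression | (Andrew & Gao, 2007)
Supervised | Multiclass | Neural network | (Bishop, 1995)
Supervised | Multiclass | One-vs-all | (Rifkin & Klautau, 2004)
Supervised | Multiclass | Vowpal Wabbit | (Langford et al., 2007)
Supervised | Regression | Bayesian linear regression | (Herbrich et al., 2001)
Supervised | Regression | Boosted decision tree regression | (Burges, 2010)
Supervised | Regression | Linear regression (batch and online) | (Bottou, 2010)
Supervised | Regression | Decision Forest regression | (Criminisi, 2011)
Supervised | Regression | Random forest based quantile regression | (Criminisi, 2011)
Supervised | Regression | Neural network based regression | (Bishop, 1995)
Supervised | Regression | Ordinal regression | (McCullagh, 1980)
Supervised | Regression | Poisson regression | (Nelder & Wedderburn, 1972)
Supervised | Recommendation | Matchbox recommender | (Stern et al., 2009)
Unsupervised | Clustering | K-means clustering | (Jain, 2010)
Unsupervised | Anomaly detection | One class SVM | (Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 2001)
Unsupervised | Anomaly detection | PCA based anomaly detection | (Duda et al., 2000)
Feature selection | Filter | Filter based feature selection | (Guyon & Elisseeff, 2003)
Text analytics | Topic modeling | Online LDA using Vowpal Wabbit | (Hoffman, Blei, & Bach, 2010)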
Request response service (RRS)
Batch Execution Service (BES)
• RRS
• Handles RESTful requests for single prediction
• Requests may execute full graph
• Can include data transformation before and after prediction
• Distinguishing feature compared to other web services
• Models and required datasets in graph are compiled to a static package
• Executes in-memory and on a single machine
• Can scale based on volume of requests
• BES
• Optimized for batch requests; similar to the training workflow
Implementation details
Implementation details: Data representation
• “DataTable”
• Similar to R/Pandas dataframe
• Column major organization with sliced and random access
• Has a rich schema
• Names: Allows re-ordering
• Purpose: Weights, Features, Labels etc.
• Stored as compressed 2D tiles
• “wide” tiles enable streaming access
• “narrow” tiles enable full column access
• Interoperability
• Can be marshalled in/out as R/Pandas dataframe
• Can be egressed out as CSV, TSV, SQL
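A rough sketch of the tiling idea, assuming a table held as a dict of column lists. Wide tiles (all columns, few rows) suit streaming over rows, while narrow tiles (one column, all rows) suit full-column access. The `tile_table` helper and the sample data are hypothetical, and compression is left out.

```python
def tile_table(columns, rows_per_tile, cols_per_tile):
    """Split a column-major table (name -> list of values) into 2D tiles.

    Returns a dict keyed by (row_block, column_block).
    """
    names = list(columns)
    n_rows = len(columns[names[0]]) if names else 0
    tiles = {}
    for c in range(0, len(names), cols_per_tile):
        for r in range(0, n_rows, rows_per_tile):
            tiles[(r // rows_per_tile, c // cols_per_tile)] = {
                name: columns[name][r:r + rows_per_tile]
                for name in names[c:c + cols_per_tile]
            }
    return tiles


table = {
    "age": [31, 45, 27, 52],
    "income": [40, 85, 32, 90],
    "label": [0, 1, 0, 1],
}

# Wide tiles: every column, two rows at a time (streaming access).
wide = tile_table(table, rows_per_tile=2, cols_per_tile=3)
# Narrow tiles: one column, all rows (full column access).
narrow = tile_table(table, rows_per_tile=4, cols_per_tile=1)

print(sorted(wide))
print(narrow[(0, 1)])
```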
[Figure: DataTable storage layout, shown as index/block pairs 1 to 3]
Implementation details: Modules
• Functional units in an experiment graph
• Encapsulates: data sources & sinks, models, algorithms,
scripts
• Categories
• Data ingress
• Supported sources: CSV, TSV, ARFF, LibSVM, SQL, Hive
• Type guessing for CSV, TSV (allows override)
• Data manipulation
• Cleaning missing values, SQL Transformation, R & Python scripts
• Modeling
• Machine learning algorithm
• Supervised: binary classification, multiclass classification, linear
regression, ordinal regression, recommendation
• Unsupervised: PCA, k-means
• Optimization
• Parameter sweep
Implementation details: Modules
• Ports
• Define input and output contracts
• Allows multiple input formats per port
• I/O handling is done externally to the
module through pluggable port handlers
• Allows UX to validate inputs at design
time
• Parameters
• Strongly typed
• Supports conditional parameters
• Can be marked as ‘web service’
parameter – substituted at query time
• Supports ranges (for parameter sweep)
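One way to picture the parameter properties listed above is a small typed descriptor. This is a sketch under assumed names (`Parameter`, `resolve`, `enabled_when`), not the AzureML module contract: it checks types, enforces a range, hides conditional parameters, and lets only parameters marked as web-service parameters be overridden at query time.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Parameter:
    name: str
    kind: type                      # strongly typed: values must be of this type
    default: object
    sweep_range: Optional[Tuple[float, float]] = None
    web_service: bool = False       # may be substituted at query time
    enabled_when: Optional[Tuple[str, object]] = None  # (controlling name, value)

    def validate(self, value):
        if not isinstance(value, self.kind):
            raise TypeError(f"{self.name} expects {self.kind.__name__}")
        if self.sweep_range is not None:
            low, high = self.sweep_range
            if not low <= value <= high:
                raise ValueError(f"{self.name} must lie in [{low}, {high}]")
        return value


def resolve(parameters, design_values, query_values):
    """Merge design-time values with query-time overrides, then drop
    conditional parameters whose controlling parameter does not match."""
    resolved = {}
    for p in parameters:
        value = design_values.get(p.name, p.default)
        if p.web_service and p.name in query_values:
            value = query_values[p.name]
        resolved[p.name] = p.validate(value)
    return {
        p.name: resolved[p.name]
        for p in parameters
        if p.enabled_when is None
        or resolved.get(p.enabled_when[0]) == p.enabled_when[1]
    }


PARAMETERS = [
    Parameter("trainer", str, "sgd"),
    Parameter("learning_rate", float, 0.1, sweep_range=(0.001, 1.0),
              web_service=True, enabled_when=("trainer", "sgd")),
    Parameter("max_depth", int, 5, enabled_when=("trainer", "tree")),
]

settings = resolve(PARAMETERS, {"trainer": "sgd"}, {"learning_rate": 0.05})
print(settings)
```

Declaring parameters as data like this is also what would let a UI validate values at design time, before any module code runs.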
Implementation detail: Testing
• Standard tests
• UX tests
• Web services penetration testing
• Services integration test
• AzureML Specific tests
• Module properties tests
• Schema propagation tests
• E2E experiment tests
• Operationalized experiment tests
• “Runners” test
• Machine learning tests
• Accuracy tests
• Fuzz testing (boundary values testing)
• Golden values tests
• Auto-generated tests
Lessons learned
Lesson: Data wrangling is important
• More time is spent on data wrangling than on model building
• “A data scientist spends nearly 80% of the time cleaning data” – NY Times
(http://nyti.ms/1t8IzfE)
• Data manipulation modules are very popular
• Internal ranking
• “Execute R script”, “SQL Transform” modules are more popular than machine learning modules.
• It is hard to anticipate all data pre-processing needs
• Need to provide custom processing support
• SQL Transform
• Execute R script
• Execute Python script
Lesson: Make big data possible, but small data efficient
• Distributed machine learning comes with a large overhead (Zaharia et al. 2010)
• Typical data science workflows enable exploration with small
amounts of data
• Should make this effortless and intuitive
• AzureML approach: “Make big data possible, but small data efficient”
• Make sure all experiment graphs can handle large data sizes.
• Support ingress of large data – SQL, Azure
• Support features to pre-process big data
• Feature selection
• Feature hashing
• Learning by counts – reduces high dimensional data to lower dimensional historic
counts/rates
• Support streaming algorithms for big data (e.g. “Train Vowpal Wabbit”)
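Feature hashing, one of the pre-processing options listed above, can be sketched in a few lines: arbitrary tokens are mapped into a fixed number of buckets, so the feature vector size no longer grows with the vocabulary. The `hash_features` helper and the bucket count are hypothetical, and MD5 is a stand-in hash chosen only because it ships with the standard library.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Map tokens to a fixed-length count vector using a stable hash."""
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % n_buckets
        vector[index] += 1  # colliding tokens share a bucket
    return vector


tokens = "the quick brown fox jumps over the lazy dog".split()
features = hash_features(tokens)
print(len(features), sum(features))
```

Collisions are the price of the fixed size; a larger bucket count lowers their rate at the cost of a wider vector.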
Lesson: Feature gaps are inevitable
• Cannot cover all possible pre-processing scenarios
• Cannot provide all algorithms
• Support for scripting (R, Python, SQL)
• Allow custom data manipulation
• Allow users to bring in external libraries
• Allow users to call into other web services
• Isolate user code
• Support during operationalization
• Support custom modules
• Allow user to author first class “modules”
• Allow users to mix custom modules in the workflow
Lesson: Data science workflows should be reproducible
• Data science workflows are iterative, explorative and collaborative
• Need to provide a way to version and capture the workflow, settings, inputs etc.
• Make it easy to repeat the same experiment
• Reproducibility
• Capture random number seeds as part of the experiment.
• Same settings should produce the same results
• Re-running parts of the graph should be efficient.
• “Determinism”
• Modules are tagged as deterministic (e.g. SQL transform) or non-deterministic (e.g. Hive query)
• A graph can also be labeled as deterministic or non-deterministic
• Caching
• Outputs from deterministic modules are cached to make re-runs efficient.
• Only changed parts of the graph are re-executed.
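The caching behaviour described above can be sketched by keying each module output on the module name, its parameters and the keys of its inputs. The `CachingRunner` class and the toy pipeline are assumptions for illustration, not the AzureML implementation. A non-deterministic module gets a fresh key on every run, so it and everything downstream of it are re-executed.

```python
import hashlib
import json
import uuid


def cache_key(module_name, parameters, input_keys, salt=""):
    blob = json.dumps(
        {"module": module_name, "parameters": parameters,
         "inputs": input_keys, "salt": salt},
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class CachingRunner:
    def __init__(self):
        self.cache = {}
        self.executed = []  # names of modules that actually ran

    def run(self, name, func, parameters, inputs, deterministic=True):
        """inputs is a list of (key, value) pairs returned by upstream run() calls."""
        salt = "" if deterministic else uuid.uuid4().hex
        key = cache_key(name, parameters, [k for k, _ in inputs], salt)
        if deterministic and key in self.cache:
            return key, self.cache[key]
        value = func(*[v for _, v in inputs], **parameters)
        self.executed.append(name)
        if deterministic:
            self.cache[key] = value
        return key, value


runner = CachingRunner()


def pipeline(scale):
    source = runner.run("load", lambda: [1, 2, 3], {}, [])
    scaled = runner.run("scale", lambda rows, factor: [r * factor for r in rows],
                        {"factor": scale}, [source])
    return runner.run("total", lambda rows: sum(rows), {}, [scaled])[1]


first = pipeline(2)    # everything runs
second = pipeline(2)   # nothing re-runs
third = pipeline(3)    # only "scale" and "total" re-run
print(first, second, third, runner.executed)
```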
Summary
• AzureML provides distinguishing features
• Visual authoring
• Versioning and reproducibility
• Collaboration
• Architecture
• Multiple scalable services
• Implementation details
• Extensible data format that can interoperate with R & Python
• Modules provide a way to package data & code
• Lessons learned
• Data wrangling is important
• Allow user code to mitigate feature gaps
• Support big data but make small data efficient
Logistics: Getting access to AzureML
• http://azure.com/ml
• https://studio.azureml.net
• Guest access w/o sign in
• Free access with sign-in ($200 credit)
• Paid access with Azure subscription
• https://manage.windowsazure.com
• Manage end points, storage accounts and workspaces
Thanks
shchikke@microsoft.com
Developing a predictive model is hard
Challenges
• Data processing
• Different sources, formats, schemas
• Missing values, noisy data
• Modeling
• Modeling choice
• Feature engineering
• Parameter tuning
• Tracking & collaboration
• Deployment & Retraining
• Productionizing/deployment of the
model
• Replication, scaling out
Solutions
• Data processing
• Languages: SQL, R, Python
• Frameworks: dplyr, pandas
• Stacks: Hadoop, Spark, MapReduce
• Modeling
• Libraries: Weka, VW, MLlib, LibSVM
• Feature engineering: gensim, NLTK
• Tuning: Spearmint, Whetlab
• Tracking & collaboration: ipynb + github
• Deployment & Retraining
• Machine learning web services
Implementation detail: Schema propagation
• Schema is associated with
datasets/learners
• Dataset attributes
• Required columns for learners etc.
• Design time validation
• Module execution has latency overhead
• Schema is computed and propagated before
executing module code.
• Method: pre-determined schema calculus
• Each module class has well defined modification
of the schema
• One-off modules are encoded as exceptions
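A minimal sketch of schema propagation as described above: each module class contributes a rule that maps an input schema to an output schema, so a graph can be validated at design time without executing any module code. The rule names, the `propagate` helper and the linear pipeline are hypothetical simplifications.

```python
def select_columns(schema, params):
    missing = [c for c in params["columns"] if c not in schema]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    return {c: schema[c] for c in params["columns"]}


def score_model(schema, params):
    if params["label"] not in schema:
        raise ValueError(f"label column {params['label']!r} is missing")
    output = dict(schema)
    output["score"] = "float"  # scoring appends one column
    return output


# One schema rule per module class.
SCHEMA_RULES = {"select_columns": select_columns, "score_model": score_model}


def propagate(input_schema, steps):
    """Push a schema through a linear chain of (module, params) steps."""
    schema = dict(input_schema)
    for module, params in steps:
        schema = SCHEMA_RULES[module](schema, params)
    return schema


dataset_schema = {"age": "int", "income": "float", "label": "bool"}
steps = [
    ("select_columns", {"columns": ["age", "label"]}),
    ("score_model", {"label": "label"}),
]
output_schema = propagate(dataset_schema, steps)
print(output_schema)
```

A mistake such as dropping the label column before scoring surfaces as a `ValueError` here, before any data is read.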
JES SNR interaction
[Diagram: User, Workspace, Experimentation Service, Jobs Queue, JES FE, JES Worker, Jobs State, Tasks Queue, SNR FE, SNR Worker, Tasks State]
• Stateless design, easy scalability,
failover simplicity
• Optimistic concurrency,
scheduling/locking overhead
• Separate shared storage, holding
transient job/tasks state
• Task cache management to speed
up execution and facilitate
iterative experimentation
• Throttling to limit the resource
usage per customer/workspace
• Plugin architecture for task
handlers and schedulers

Contenu connexe

Tendances

Resume_Achhar_Kalia
Resume_Achhar_KaliaResume_Achhar_Kalia
Resume_Achhar_Kalia
Achhar Kalia
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 

Tendances (20)

Machine Learning Use Cases with Azure
Machine Learning Use Cases with AzureMachine Learning Use Cases with Azure
Machine Learning Use Cases with Azure
 
Apply MLOps at Scale by H&M
Apply MLOps at Scale by H&MApply MLOps at Scale by H&M
Apply MLOps at Scale by H&M
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
 
Resume_Achhar_Kalia
Resume_Achhar_KaliaResume_Achhar_Kalia
Resume_Achhar_Kalia
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
 
V like Velocity, Predicting in Real-Time with Azure ML
V like Velocity, Predicting in Real-Time with Azure MLV like Velocity, Predicting in Real-Time with Azure ML
V like Velocity, Predicting in Real-Time with Azure ML
 
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
 
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
Deep Learning with Microsoft Cognitive Toolkit
Deep Learning with Microsoft Cognitive ToolkitDeep Learning with Microsoft Cognitive Toolkit
Deep Learning with Microsoft Cognitive Toolkit
 
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
 Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep... Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
 
CI/CD for Machine Learning with Daniel Kobran
CI/CD for Machine Learning with Daniel KobranCI/CD for Machine Learning with Daniel Kobran
CI/CD for Machine Learning with Daniel Kobran
 
Azure Machine Learning tutorial
Azure Machine Learning tutorialAzure Machine Learning tutorial
Azure Machine Learning tutorial
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
 
END-TO-END MACHINE LEARNING STACK
END-TO-END MACHINE LEARNING STACKEND-TO-END MACHINE LEARNING STACK
END-TO-END MACHINE LEARNING STACK
 
ML studio overview v1.1
ML studio overview v1.1ML studio overview v1.1
ML studio overview v1.1
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
 
Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016
 
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDistributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
 

En vedette

Short film research
Short film researchShort film research
Short film research
saimaaauddin
 

En vedette (19)

Insider's introduction to microsoft azure machine learning: 201411 Seattle Bu...
Insider's introduction to microsoft azure machine learning: 201411 Seattle Bu...Insider's introduction to microsoft azure machine learning: 201411 Seattle Bu...
Insider's introduction to microsoft azure machine learning: 201411 Seattle Bu...
 
Azure machine learning overview
Azure machine learning overviewAzure machine learning overview
Azure machine learning overview
 
Building Python Applications on Windows Azure
Building Python Applications on Windows AzureBuilding Python Applications on Windows Azure
Building Python Applications on Windows Azure
 
Developing Python Apps on Windows Azure
Developing Python Apps on Windows AzureDeveloping Python Apps on Windows Azure
Developing Python Apps on Windows Azure
 
Microsoft azure machine learning
Microsoft azure machine learningMicrosoft azure machine learning
Microsoft azure machine learning
 
Large scale predictive analytics for anomaly detection - Nicolas Hohn
Large scale predictive analytics for anomaly detection - Nicolas HohnLarge scale predictive analytics for anomaly detection - Nicolas Hohn
Large scale predictive analytics for anomaly detection - Nicolas Hohn
 
Simple machine learning for the masses - Konstantin Davydov
Simple machine learning for the masses - Konstantin DavydovSimple machine learning for the masses - Konstantin Davydov
Simple machine learning for the masses - Konstantin Davydov
 
Azure Machine Learning - A Full Journey
Azure Machine Learning - A Full JourneyAzure Machine Learning - A Full Journey
Azure Machine Learning - A Full Journey
 
DL on Azure ML with Python where type DL = Deep Learning | Deep LOVE
DL on Azure ML with Python where type DL = Deep Learning | Deep LOVEDL on Azure ML with Python where type DL = Deep Learning | Deep LOVE
DL on Azure ML with Python where type DL = Deep Learning | Deep LOVE
 
Short film research
Short film researchShort film research
Short film research
 
U-SQL - Azure Data Lake Analytics for Developers
U-SQL - Azure Data Lake Analytics for DevelopersU-SQL - Azure Data Lake Analytics for Developers
U-SQL - Azure Data Lake Analytics for Developers
 
Azure Data Lake and U-SQL
Azure Data Lake and U-SQLAzure Data Lake and U-SQL
Azure Data Lake and U-SQL
 
Cortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data LakeCortana Analytics Workshop: Azure Data Lake
Cortana Analytics Workshop: Azure Data Lake
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
 
What’s new on the Microsoft Azure Data Platform
What’s new on the Microsoft Azure Data Platform What’s new on the Microsoft Azure Data Platform
What’s new on the Microsoft Azure Data Platform
 
How to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointHow to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePoint
 
Azure Data Lake Analytics Deep Dive
Azure Data Lake Analytics Deep DiveAzure Data Lake Analytics Deep Dive
Azure Data Lake Analytics Deep Dive
 
Building A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWSBuilding A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWS
 
What's New with AWS Lambda
What's New with AWS LambdaWhat's New with AWS Lambda
What's New with AWS Lambda
 

Similaire à [Research] azure ml anatomy of a machine learning service - Sharat Chikkerur

Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
monsonc
 

Similaire à [Research] azure ml anatomy of a machine learning service - Sharat Chikkerur (20)

Azure basics
Azure basicsAzure basics
Azure basics
 
Frameworks Galore: A Pragmatic Review
Frameworks Galore: A Pragmatic ReviewFrameworks Galore: A Pragmatic Review
Frameworks Galore: A Pragmatic Review
 
Node.js
Node.jsNode.js
Node.js
 
Microservices in Azure
Microservices in AzureMicroservices in Azure
Microservices in Azure
 
Cnam azure ze cloud resource manager
Cnam azure ze cloud  resource managerCnam azure ze cloud  resource manager
Cnam azure ze cloud resource manager
 
Microservices in Azure
Microservices in AzureMicroservices in Azure
Microservices in Azure
 
Azure Stream Analytics
Azure Stream AnalyticsAzure Stream Analytics
Azure Stream Analytics
 
Exploring microservices in a Microsoft landscape
Exploring microservices in a Microsoft landscapeExploring microservices in a Microsoft landscape
Exploring microservices in a Microsoft landscape
 
Azure SQL Database
Azure SQL Database Azure SQL Database
Azure SQL Database
 
Oracle OpenWorld 2014 Review Part Four - PaaS Middleware
Oracle OpenWorld 2014 Review Part Four - PaaS MiddlewareOracle OpenWorld 2014 Review Part Four - PaaS Middleware
Oracle OpenWorld 2014 Review Part Four - PaaS Middleware
 
Developing SharePoint Framework Solutions for the Enterprise - SEF 2019
Developing SharePoint Framework Solutions for the Enterprise - SEF 2019Developing SharePoint Framework Solutions for the Enterprise - SEF 2019
Developing SharePoint Framework Solutions for the Enterprise - SEF 2019
 
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
Teradata Partners 2011 - Utilizing Teradata Express For Development And Sandb...
 
DSD-INT 2020 Scripting a Delft-FEWS configuration - Verkade
DSD-INT 2020 Scripting a Delft-FEWS configuration - VerkadeDSD-INT 2020 Scripting a Delft-FEWS configuration - Verkade
DSD-INT 2020 Scripting a Delft-FEWS configuration - Verkade
 
Making Data Scientists Productive in Azure
Making Data Scientists Productive in AzureMaking Data Scientists Productive in Azure
Making Data Scientists Productive in Azure
 
Software variability management - 2017
Software variability management - 2017Software variability management - 2017
Software variability management - 2017
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
8. Software Development Security
8. Software Development Security8. Software Development Security
8. Software Development Security
 
Azure Cosmos DB - The Swiss Army NoSQL Cloud Database
Azure Cosmos DB - The Swiss Army NoSQL Cloud DatabaseAzure Cosmos DB - The Swiss Army NoSQL Cloud Database
Azure Cosmos DB - The Swiss Army NoSQL Cloud Database
 
CosmosDB.pptx
CosmosDB.pptxCosmosDB.pptx
CosmosDB.pptx
 
Paa sing a java ee 6 application kshitiz saxena
Paa sing a java ee 6 application   kshitiz saxenaPaa sing a java ee 6 application   kshitiz saxena
Paa sing a java ee 6 application kshitiz saxena
 

Plus de PAPIs.io

Predictive APIs: What about Banking? - Natalino Busa @ PAPIs Connect
Predictive APIs: What about Banking? - Natalino Busa @ PAPIs ConnectPredictive APIs: What about Banking? - Natalino Busa @ PAPIs Connect
Predictive APIs: What about Banking? - Natalino Busa @ PAPIs Connect
PAPIs.io
 
How to predict the future of shopping - Ulrich Kerzel @ PAPIs Connect
How to predict the future of shopping - Ulrich Kerzel @ PAPIs ConnectHow to predict the future of shopping - Ulrich Kerzel @ PAPIs Connect
How to predict the future of shopping - Ulrich Kerzel @ PAPIs Connect
PAPIs.io
 
The emergent opportunity of Big Data for Social Good - Nuria Oliver @ PAPIs C...
The emergent opportunity of Big Data for Social Good - Nuria Oliver @ PAPIs C...The emergent opportunity of Big Data for Social Good - Nuria Oliver @ PAPIs C...
The emergent opportunity of Big Data for Social Good - Nuria Oliver @ PAPIs C...
PAPIs.io
 

Plus de PAPIs.io (20)

Shortening the time from analysis to deployment with ml as-a-service — Luiz A...
Shortening the time from analysis to deployment with ml as-a-service — Luiz A...Shortening the time from analysis to deployment with ml as-a-service — Luiz A...
Shortening the time from analysis to deployment with ml as-a-service — Luiz A...
 
Feature engineering — HJ Van Veen (Nubank) @@PAPIs Connect — São Paulo 2017
Feature engineering — HJ Van Veen (Nubank) @@PAPIs Connect — São Paulo 2017Feature engineering — HJ Van Veen (Nubank) @@PAPIs Connect — São Paulo 2017
Feature engineering — HJ Van Veen (Nubank) @@PAPIs Connect — São Paulo 2017
 
Extracting information from images using deep learning and transfer learning ...
Extracting information from images using deep learning and transfer learning ...Extracting information from images using deep learning and transfer learning ...
Extracting information from images using deep learning and transfer learning ...
 
Discovering the hidden treasure of data using graph analytic — Ana Paula Appe...
Discovering the hidden treasure of data using graph analytic — Ana Paula Appe...Discovering the hidden treasure of data using graph analytic — Ana Paula Appe...
Discovering the hidden treasure of data using graph analytic — Ana Paula Appe...
 
Deep learning for sentiment analysis — André Barbosa (elo7) @PAPIs Connect — ...
Deep learning for sentiment analysis — André Barbosa (elo7) @PAPIs Connect — ...Deep learning for sentiment analysis — André Barbosa (elo7) @PAPIs Connect — ...
Deep learning for sentiment analysis — André Barbosa (elo7) @PAPIs Connect — ...
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
Building machine learning applications locally with Spark — Joel Pinho Lucas ...
Building machine learning applications locally with Spark — Joel Pinho Lucas ...Building machine learning applications locally with Spark — Joel Pinho Lucas ...
Building machine learning applications locally with Spark — Joel Pinho Lucas ...
 
Battery log data mining — Ramon Oliveira (Datart) @PAPIs Connect — São Paulo ...
Battery log data mining — Ramon Oliveira (Datart) @PAPIs Connect — São Paulo ...Battery log data mining — Ramon Oliveira (Datart) @PAPIs Connect — São Paulo ...
Battery log data mining — Ramon Oliveira (Datart) @PAPIs Connect — São Paulo ...
 
A tensorflow recommending system for news — Fabrício Vargas Matos (Hearst tv)...
A tensorflow recommending system for news — Fabrício Vargas Matos (Hearst tv)...A tensorflow recommending system for news — Fabrício Vargas Matos (Hearst tv)...
A tensorflow recommending system for news — Fabrício Vargas Matos (Hearst tv)...
 


[Research] azure ml anatomy of a machine learning service - Sharat Chikkerur

  • 1. Microsoft Azure Machine Learning Anatomy of a machine learning service Sharat Chikkerur, Senior Software Engineer, Microsoft (On behalf of AzureML team)
  • 2. Microsoft Azure Machine Learning (AzureML) • AzureML is a cloud-hosted tool for creating and deploying machine learning models • Browser-based, zero-installation and cross-platform • Describe workflows graphically • Workflows are versioned and support reproducibility • Models can be programmatically retrained • Models can be deployed to Azure as a scalable web service • Can be scaled to 1000+ endpoints x 200 response containers per service • Supports versioning, collaboration & monetization
  • 3. Outline • Distinguishing features (functional components) of AzureML • Architectural components of AzureML • Implementation details • Lessons learned
  • 7. Versioning • Each run of an experiment is versioned • Can go back in time and examine historical results • Intermediate results cached across experiments in workspace • Each dataset has a unique source transformation
  • 8. Collaboration • Workspaces can be shared between multiple users • Two users cannot however edit the same experiment simultaneously • Any experiment can be pushed to a common AzureML gallery • Allows experiments, models and transforms to be easily shared with the AzureML user community
  • 9. External Language Support • Full-fidelity support for R, Python and SQL (via SQLite) • AzureML datasets marshalled transparently • R models marshalled into AzureML models • Scripts available as part of operationalized web services • Code isolation • External language modules are executed within Drawbridge (container) • “Batteries included” • R 3.1.0 with ~500 packages, Anaconda Python 2.7 with ~120 packages
  • 10. Operationalization • An experiment to be operationalized must be converted into a “scoring” experiment • Training and scoring experiments are “linked”
  • 11. Operationalization • A successful scoring experiment can be published as a web service • Published web services are automatically managed, scaled out and load-balanced • Web service available in two flavors • Request/Response: Low-latency endpoint for scoring a single row at a time • Batch: Endpoint for scoring a collection of records from Azure storage
  • 12. Monetization • Data marketplace (http://datamarket.azure.com) allows users to monetize data models • Supports • Web services published through AzureML • Standalone web services • Integration • Python/R modules can query external web services (including marketplace APIs), allowing functional composition
  • 14. Component services • Studio (UX) • Experimentation Service (ES) • Composed of micro-services • Job Execution Service (JES) • Single Node Runtime (SNR) • Request response service (RRS) • Batch execution service (BES) (Diagram of components: User, UX, ES, JES, SNR, RRS, BES)
  • 15. Studio (UX) • Primary UX layer • Single page application • Asset Palette • Datasets • Algorithms • Trained models • External language modules • Experiment canvas • DAG consisting of modules • Module properties • Parameters • Action bar • Commands to ES
  • 16. Experimentation Service (ES) • Primary backend • Orchestrates all component services • Handles events to/from UX • Programmatic access • RESTful API (UX communicates this way) • Features • Experiment introspection • Experiment manipulation/creation • Consists of micro-services • UX, assets, authentication, packing etc.
  • 17. Job Execution Service (JES) • Primary job scheduler • Dependency tracking • Experiment DAG defines dependencies between modules. • Topological sort used to determine the order of execution • Parallel Execution • Different experiments can be executed in parallel • Modules that exist at the same depth in the DAG can be scheduled in parallel • Note: JES itself does not execute the task payload. Tasks are dispatched to a task queue
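The dependency tracking described on slide 17 can be sketched as a level-by-level topological sort. This is only an illustration, not the JES implementation: the function name `schedule_levels`, the dictionary format and the example module names are all assumptions made for this sketch. Each returned wave holds modules whose dependencies are already satisfied, so they could be dispatched to the task queue together.

```python
from collections import defaultdict


def schedule_levels(dependencies):
    """Group modules into waves; modules in one wave can run in parallel.

    `dependencies` maps a module name to the names of the modules it needs.
    """
    nodes = set(dependencies)
    for deps in dependencies.values():
        nodes.update(deps)

    indegree = {node: 0 for node in nodes}
    dependents = defaultdict(list)
    for node, deps in dependencies.items():
        for dep in set(deps):
            indegree[node] += 1
            dependents[dep].append(node)

    # Kahn's algorithm, emitting one whole "depth" of the DAG at a time.
    ready = sorted(node for node in nodes if indegree[node] == 0)
    levels, scheduled = [], 0
    while ready:
        levels.append(ready)
        scheduled += len(ready)
        unlocked = []
        for node in ready:
            for child in dependents[node]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    unlocked.append(child)
        ready = sorted(unlocked)

    if scheduled != len(nodes):
        raise ValueError("experiment graph contains a cycle")
    return levels


# Hypothetical experiment: two sources cleaned independently, then joined.
EXAMPLE_GRAPH = {
    "read_a": [],
    "read_b": [],
    "clean_a": ["read_a"],
    "clean_b": ["read_b"],
    "join": ["clean_a", "clean_b"],
    "train": ["join"],
}

for depth, wave in enumerate(schedule_levels(EXAMPLE_GRAPH)):
    print(depth, wave)
```

In this sketch the two readers form the first wave and the two cleaning steps the second, which mirrors the "same depth" rule on the slide.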
  • 18. Single Node Runtime (SNR) • Executes tasks dispatched from JES • Consumes tasks from a queue • Tasks consist of an input specification along with module parameters • Stateless: Data required for execution is copied over • Each SNR contains a copy of Runtime + modules • Runtime – DataTables, Array implementation, IO, BaseClasses etc. • Modules – machine learning algorithms • SNR pool shared across deployment • Size of the pool can be scaled based on demand
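A minimal sketch of the queue-consuming, stateless worker pattern from slide 18, using threads from the Python standard library in place of real nodes. The task format, the `MODULES` registry and the `run_pool` helper are hypothetical names introduced here; the actual SNR task schema is not given in the slides.

```python
import queue
import threading

# Hypothetical module registry; every worker holds the same copy.
MODULES = {
    "scale": lambda rows, factor: [[v * factor for v in row] for row in rows],
    "sum_rows": lambda rows: [[sum(row)] for row in rows],
}


def worker(tasks, results):
    """Consume tasks until a None sentinel arrives."""
    while True:
        task = tasks.get()
        if task is None:
            break
        # Stateless: the worker copies the input it needs and keeps nothing.
        data = [list(row) for row in task["input"]]
        module = MODULES[task["module"]]
        results[task["id"]] = module(data, **task["params"])


def run_pool(task_list, pool_size=2):
    """Run tasks on a pool whose size could be changed with demand."""
    tasks, results = queue.Queue(), {}
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(pool_size)
    ]
    for thread in threads:
        thread.start()
    for task in task_list:
        tasks.put(task)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results


EXAMPLE_TASKS = [
    {"id": "t1", "module": "scale", "input": [[1, 2], [3, 4]],
     "params": {"factor": 10}},
    {"id": "t2", "module": "sum_rows", "input": [[1, 2, 3]], "params": {}},
]

print(run_pool(EXAMPLE_TASKS))
```

Because each task carries its own input specification and parameters, any worker in the pool can pick it up, which is what makes resizing the pool straightforward.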
  • 19. Machine learning algorithms • Sources of machine learning module assets • Microsoft Research • Infer.NET (http://research.microsoft.com/en-us/um/cambridge/projects/infernet/) • Vowpal Wabbit (http://hunch.net) • Open source • LibSVM • PegaSOS • OpenCV • R • Scikit-learn
  • 20. Modules by Category / Sub category: Module (Reference) • Supervised / Binary classification: Average Perceptron (Freund & Schapire, 1999); Bayes point machine (Herbrich, Graepel, & Campbell, 2001); Boosted decision tree (Burges, 2010); Decision jungle (Shotton et al., 2013); Locally Deep SVM (Jose & Goyal, 2013); Logistic regression (Duda, Hart, & Stork, 2000); Neural network (Bishop, 1995); Online SVM (Shalev-Shwartz et al., 2011); Vowpal Wabbit (Langford et al., 2007) • Supervised / Multiclass: Decision Forest (Criminisi, 2011); Decision Jungle (Shotton et al., 2013); Multinomial regression (Andrew & Gao, 2007); Neural network (Bishop, 1995); One-vs-all (Rifkin & Klautau, 2004); Vowpal Wabbit (Langford et al., 2007) • Supervised / Regression: Bayesian linear regression (Herbrich et al., 2001); Boosted decision tree regression (Burges, 2010); Linear regression (batch and online) (Bottou, 2010); Decision Forest regression (Criminisi, 2011); Random forest based quantile regression (Criminisi, 2011); Neural network based regression (Bishop, 1995); Ordinal regression (McCullagh, 1980); Poisson regression (Nelder & Wedderburn, 1972) • Supervised / Recommendation: Matchbox recommender (Stern et al., 2009) • Unsupervised / Clustering: K-means clustering (Jain, 2010) • Unsupervised / Anomaly detection: One class SVM (Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 2001); PCA based anomaly detection (Duda et al., 2000) • Feature selection / Filter: Filter based feature selection (Guyon & Elisseeff, 2003) • Text analytics / Topic modeling: Online LDA using Vowpal Wabbit (Hoffman, Blei, & Bach, 2010)
  • 21. Request response service (RRS) and Batch Execution Service (BES) • RRS • Handles RESTful requests for a single prediction • Requests may execute the full graph • Can include data transformation before and after prediction • Distinguishing feature compared to other web services • Models and required datasets in the graph are compiled to a static package • Executes in-memory and on a single machine • Can scale based on volume of requests • BES • Optimized for batch requests. Similar to the training workflow
  • 23. Implementation details: Data representation • “DataTable” • Similar to R/Pandas dataframe • Column major organization with sliced and random access • Has a rich schema • Names: Allows re-ordering • Purpose: Weights, Features, Labels etc. • Stored as compressed 2D tiles • “wide” tiles enable streaming access • “narrow” tiles enable full column access • Interoperability • Can be marshalled in/out as R/Pandas dataframe • Can be egressed out as CSV, TSV, SQL (Figure: storage layout as index/block pairs 1 to 3)
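The tiling idea on slide 23 can be illustrated with a toy column-major table split into compressed 2D tiles. The slides do not describe the actual DataTable format, so everything here is an assumption: `make_tiles`, `read_column`, the JSON plus zlib encoding, and the reading of "wide" as few rows by many columns and "narrow" as many rows by few columns.

```python
import json
import zlib


def make_tiles(columns, tile_rows, tile_cols):
    """Split a column-major table into compressed tiles.

    `columns` maps column name -> list of values (all the same length).
    Returns {(row_offset, col_offset): (column_names, compressed_block)}.
    """
    names = list(columns)
    n_rows = len(columns[names[0]]) if names else 0
    tiles = {}
    for r0 in range(0, n_rows, tile_rows):
        for c0 in range(0, len(names), tile_cols):
            tile_names = tuple(names[c0:c0 + tile_cols])
            block = {n: columns[n][r0:r0 + tile_rows] for n in tile_names}
            blob = zlib.compress(json.dumps(block).encode("utf-8"))
            tiles[(r0, c0)] = (tile_names, blob)
    return tiles


def read_column(tiles, name):
    """Return (values, tiles_decompressed) for one full column."""
    values, touched = [], 0
    for key in sorted(tiles):  # sorted keys keep rows in order
        tile_names, blob = tiles[key]
        if name in tile_names:
            touched += 1
            block = json.loads(zlib.decompress(blob).decode("utf-8"))
            values.extend(block[name])
    return values, touched


TABLE = {
    "age": [31, 45, 27, 52],
    "income": [40.0, 85.5, 32.0, 97.25],
    "label": [0, 1, 0, 1],
}

narrow = make_tiles(TABLE, tile_rows=4, tile_cols=1)  # one column per tile
wide = make_tiles(TABLE, tile_rows=2, tile_cols=3)    # all columns per tile

print(read_column(narrow, "income"))
print(read_column(wide, "income"))
```

Under these assumptions a full-column read touches a single narrow tile, while wide tiles let a reader stream a few complete rows at a time without loading the rest of the table.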
  • 24. Implementation details: Modules • Functional units in an experiment graph • Encapsulates: data sources & sinks, models, algorithms, scripts • Categories • Data ingress • Supported sources: CSV, TSV, ARFF, LibSVM, SQL, Hive • Type guessing for CSV, TSV (allows override) • Data manipulation • Cleaning missing values, SQL Transformation, R & Python scripts • Modeling • Machine learning algorithm • Supervised: binary classification, multiclass classification, linear regression, ordinal regression, recommendation • Unsupervised: PCA, k-means • Optimization • Parameter sweep
  • 25. Implementation details: Modules • Ports • Define input and output contracts • Allows multiple input formats per port • I/O handling is done externally to the module through pluggable port handlers • Allows UX to validate inputs at design time • Parameters • Strongly typed • Supports conditional parameters • Can be marked as ‘web service’ parameter – substituted at query time • Supports ranges (for parameter sweep)
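One way to picture the parameter features on slide 25 (strong typing, conditional parameters and sweep ranges) is the sketch below. The `Parameter` class, its field names and the SVM-style example are hypothetical; the slides do not show how AzureML declares parameters, and the "web service" flag is left out for brevity.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Parameter:
    name: str
    type_: type
    default: Any
    sweep_values: Sequence[Any] = ()  # candidate values for a sweep
    # (other parameter, required value): active only when that setting holds
    active_when: Optional[Tuple[str, Any]] = None

    def validate(self, value):
        if not isinstance(value, self.type_):
            raise TypeError(
                f"{self.name} expects {self.type_.__name__}, got {value!r}"
            )
        return value


def resolve(parameters, overrides):
    """Type-check the settings and drop inactive conditional parameters."""
    settings = {
        p.name: p.validate(overrides.get(p.name, p.default))
        for p in parameters
    }
    return {
        p.name: settings[p.name]
        for p in parameters
        if p.active_when is None
        or settings.get(p.active_when[0]) == p.active_when[1]
    }


def sweep(parameters):
    """Enumerate distinct settings over every parameter's sweep values."""
    names = [p.name for p in parameters]
    grids = [list(p.sweep_values) or [p.default] for p in parameters]
    distinct = []
    for combo in product(*grids):
        settings = resolve(parameters, dict(zip(names, combo)))
        if settings not in distinct:
            distinct.append(settings)
    return distinct


PARAMS = [
    Parameter("kernel", str, "linear", sweep_values=("linear", "rbf")),
    Parameter("gamma", float, 0.1, sweep_values=(0.01, 0.1),
              active_when=("kernel", "rbf")),
    Parameter("C", float, 1.0),
]

for settings in sweep(PARAMS):
    print(settings)
```

Because `gamma` only applies when `kernel` is `"rbf"`, the sweep collapses the linear-kernel combinations into a single setting, which is the kind of check a design-time UX could also perform.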
  • 26. Implementation detail: Testing • Standard tests • UX tests • Web services penetration testing • Services integration test • AzureML Specific tests • Module properties tests • Schema propagation tests • E2E experiment tests • Operationalized experiment tests • “Runners” test • Machine learning tests • Accuracy tests • Fuzz testing (boundary values testing) • Golden values tests • Auto-generated tests
  • 28. Lesson: Data wrangling is important • More time is spent on data wrangling than on model building • “A data scientist spends nearly 80% of the time cleaning data” – NY Times (http://nyti.ms/1t8IzfE) • Data manipulation modules are very popular • Internal ranking • “Execute R script”, “SQL Transform” modules are more popular than machine learning modules. • It is hard to anticipate all data pre-processing needs • Need to provide custom processing support • SQL Transform • Execute R script • Execute Python script
  • 29. Lesson: Make big data possible, but small data efficient • Distributed machine learning comes with a large overhead (Zaharia et al. 2010) • Typical data science workflows enable exploration with small amounts of data • Should make this effortless and intuitive • AzureML approach: “Make big data possible, but small data efficient” • Make sure all experiment graphs can handle data size. • Support ingress of large data – SQL, Azure • Support features to pre-process big data • Feature selection • Feature hashing • Learning by counts – reduces high dimensional data to lower dimensional historic counts/rates • Support streaming algorithms for big data (e.g. “Train Vowpal Wabbit”)
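Two of the pre-processing features named on slide 29, feature hashing and learning by counts, can be sketched in a few lines. These are generic textbook versions under simplifying assumptions (unsigned hashing, binary labels), not the AzureML modules; `hash_features` and `count_features` are names chosen for this example.

```python
import hashlib
from collections import defaultdict


def hash_features(tokens, n_buckets=16):
    """Map any number of distinct tokens to a fixed-length count vector."""
    vector = [0] * n_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        vector[index] += 1
    return vector


def count_features(values, labels):
    """Replace a categorical value by its historic label counts and rate.

    Returns {value: (negatives, positives, positive_rate)} for 0/1 labels.
    """
    counts = defaultdict(lambda: [0, 0])
    for value, label in zip(values, labels):
        counts[value][label] += 1
    return {
        value: (neg, pos, pos / (neg + pos))
        for value, (neg, pos) in counts.items()
    }


print(hash_features(["red", "green", "red"], n_buckets=8))
print(count_features(["a", "a", "b"], [1, 0, 1]))
```

Both techniques keep the feature dimension fixed no matter how many distinct values appear in the data, which is why they suit large, high-cardinality inputs.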
  • 30. Lesson: Feature gaps are inevitable • Cannot cover all possible pre-processing scenarios • Cannot provide all algorithms • Support for scripting (R, Python, SQL) • Allow custom data manipulation • Allow users to bring in external libraries • Allow users to call into other web services • Isolate user code • Support during operationalization • Support custom modules • Allow users to author first class “modules” • Allow users to mix custom modules in the workflow
  • 31. Lesson: Data science workflows should be reproducible • Data science workflows are iterative, explorative and collaborative • Need to provide a way to version and capture the workflow, settings, inputs etc. • Make it easy to repeat the same experiment • Reproducibility • Capture random number seeds as part of the experiment. • Same settings should produce the same results • Re-running parts of the graph should be efficient. • “Determinism” • Modules are tagged as deterministic (e.g. SQL transform) or non-deterministic (e.g. Hive query) • A graph can also be labeled as deterministic or non-deterministic • Caching • Outputs from deterministic modules are cached to make re-runs efficient. • Only changed parts of the graph are re-executed.
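The caching rule on slide 31 can be sketched by keying each module's output on its operation, its parameters (including any seed) and the keys of its inputs. This is an assumed scheme for illustration only; `cache_key`, `run_graph` and the module dictionary format are hypothetical and not taken from AzureML.

```python
import hashlib
import itertools
import json

_run_counter = itertools.count()


def cache_key(module, input_keys):
    """Content-style key for deterministic modules, fresh key otherwise."""
    if not module["deterministic"]:
        return f"nondeterministic-{next(_run_counter)}"
    payload = json.dumps(
        [module["op"], module["params"], input_keys], sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_graph(modules, ops, cache):
    """Run modules listed in topological order, reusing cached outputs."""
    keys, outputs, executed = {}, {}, []
    for name, module in modules.items():
        key = cache_key(module, [keys[i] for i in module["inputs"]])
        if key not in cache:
            args = [outputs[i] for i in module["inputs"]]
            cache[key] = ops[module["op"]](*args, **module["params"])
            executed.append(name)
        keys[name], outputs[name] = key, cache[key]
    return outputs, executed


OPS = {
    "load": lambda seed: list(range(seed, seed + 5)),
    "scale": lambda rows, factor: [r * factor for r in rows],
    "total": lambda rows: sum(rows),
}

EXPERIMENT = {
    "load": {"op": "load", "params": {"seed": 1}, "inputs": [],
             "deterministic": True},
    "scale": {"op": "scale", "params": {"factor": 2}, "inputs": ["load"],
              "deterministic": True},
    "total": {"op": "total", "params": {}, "inputs": ["scale"],
              "deterministic": True},
}

cache = {}
print(run_graph(EXPERIMENT, OPS, cache))  # first run executes everything
print(run_graph(EXPERIMENT, OPS, cache))  # identical re-run executes nothing

# Changing one parameter re-executes only that module and its descendants.
CHANGED = {**EXPERIMENT,
           "scale": {**EXPERIMENT["scale"], "params": {"factor": 3}}}
print(run_graph(CHANGED, OPS, cache))
```

A module tagged non-deterministic always gets a fresh key in this sketch, so it and everything downstream of it run again on every execution.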
  • 32. Summary • AzureML provides distinguishing features • Visual authoring • Versioning and reproducibility • Collaboration • Architecture • Multiple scalable services • Implementation details • Extensible data format that can interoperate with R & Python • Modules provide a way to package data & code • Lessons learned • Data wrangling is important • Allow user code to mitigate feature gaps • Support big data but make small data efficient
  • 33. Logistics: Getting access to AzureML • http://azure.com/ml • https://studio.azureml.net • Guest access w/o sign in • Free access with sign-in ($200 credit) • Paid access with azure subscription • https://manage.windowsazure.com • Manage end points, storage accounts and workspaces
  • 36. Developing a predictive model is hard Challenges • Data processing • Different sources, formats, schemas • Missing values, noisy data • Modeling • Modeling choice • Feature engineering • Parameter tuning • Tracking & collaboration • Deployment & Retraining • Productionizing/deployment of the model • Replication, scaling out Solutions • Data processing • Languages: SQL, R, Python • Frameworks: dplyr, pandas • Stacks: Hadoop, Spark, MapReduce • Modeling • Libraries: Weka, VW, MLlib, LibSVM • Feature engineering: gensim, NLTK • Tuning: Spearmint, Whetlab • Tracking & collaboration: ipynb + GitHub • Deployment & Retraining • Machine learning web services
  • 37. Implementation detail: Schema propagation • Schema is associated with datasets/learners • Dataset attributes • Required columns for learners etc. • Design time validation • Module execution has latency overhead • Schema is computed and propagated before executing module code. • Method: pre-determined schema calculus • Each module class has a well-defined modification of the schema • One-off modules are encoded as exceptions
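The schema calculus on slide 37 can be sketched as one small schema-to-schema function per module class, applied in graph order without touching any data. The rule names, the string type labels and the `propagate` helper are assumptions made for this example, not the AzureML rules.

```python
def select_columns(schema, columns):
    """Schema rule for a column-selection module."""
    missing = [c for c in columns if c not in schema]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    return {c: schema[c] for c in columns}


def score_model(schema, label):
    """Schema rule for a scoring module: needs the label, adds a score."""
    if label not in schema:
        raise ValueError(f"required label column {label!r} is missing")
    return {**schema, "score": "float"}


# One well-defined schema modification per (hypothetical) module class.
SCHEMA_RULES = {
    "select_columns": select_columns,
    "score_model": score_model,
}


def propagate(schema, steps):
    """Compute the output schema at design time; no module code runs."""
    for module, params in steps:
        schema = SCHEMA_RULES[module](schema, **params)
    return schema


INPUT_SCHEMA = {"age": "int", "income": "float", "city": "str", "label": "int"}
STEPS = [
    ("select_columns", {"columns": ["age", "income", "label"]}),
    ("score_model", {"label": "label"}),
]

print(propagate(INPUT_SCHEMA, STEPS))
```

A mis-wired graph, for example one that scores against a column dropped earlier, fails inside `propagate` immediately, before any slow module execution is attempted.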
  • 38. JES SNR interaction (Diagram: user workspace, Experimentation Service, JES FE, JES worker, jobs queue, jobs state, SNR FE, SNR worker, tasks queue, tasks state) • Stateless design, easy scalability, failover simplicity • Optimistic concurrency, scheduling/locking overhead • Separate shared storage, holding transient job/tasks state • Task cache management to speed up execution and facilitate iterative experimentation • Throttling to limit the resource usage per customer/workspace • Plugin architecture for task handlers and schedulers
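The optimistic concurrency mentioned on slide 38 is commonly implemented with versioned writes: a stateless worker reads the shared state, then writes back only if nobody else changed it in between. The sketch below shows that general pattern with an in-memory stand-in for the shared storage. `VersionedStore`, `claim_task` and the state fields are hypothetical, and a real store would need the compare-and-write step to be atomic.

```python
class VersionedStore:
    """In-memory stand-in for shared storage holding job/task state."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def write(self, key, expected_version, value):
        """Write only if the stored version is the one the caller read."""
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            return False  # someone else updated the state first
        self._data[key] = (version + 1, value)
        return True


def claim_task(store, task_id, worker_name):
    """Try to move a queued task to 'running' without taking a lock."""
    version, state = store.read(task_id)
    if state is None or state["status"] != "queued":
        return False
    return store.write(
        task_id, version, {"status": "running", "owner": worker_name}
    )


store = VersionedStore()
store.write("task-1", 0, {"status": "queued"})
print(claim_task(store, "task-1", "worker-a"))
print(claim_task(store, "task-1", "worker-b"))
print(store.read("task-1"))
```

Workers never hold a lock in this scheme; a worker that loses the race simply sees its write rejected and moves on to the next task.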