Explore Microsoft's Team Data Science Process, an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently
Learn best practices for managing and creating value from data science projects
5. “Our strategy is to build best-in-class platforms and productivity services for an intelligent cloud and an intelligent edge infused with artificial intelligence (‘AI’).”
— Microsoft Form 10-K, 2016
11. What is the business problem that needs to be solved, independent of the technology solution?
What is the decision or action that has to be taken, and how can it be informed by data?
13. Understanding the Decision Process
Key Decision: Should I service this piece of equipment?
Data Science Question: What is the probability this equipment will fail within the next X days?
15. Business Scenario | Key Decision | Data Science Question
Energy Forecasting | Should I buy or sell energy contracts? | What will be the long/short-term demand for energy in a region?
Customer Churn | Which customers should I prioritize to reduce churn? | What is the probability of churn within X days for each customer?
Personalized Marketing | What product should I offer first? | What is the probability that the customer will purchase each product?
Product Feedback | Which service/product needs attention? | What is the social media sentiment for each service/product?
Framing the Data Science Question Based on the Scenario
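The customer-churn row above can be sketched in code: a per-customer churn probability (the data science question) feeding a prioritization decision (the key decision). The scoring function and its weights below are illustrative placeholders, not a real trained model.

```python
# Hypothetical sketch: a business decision backed by a data science question.
# churn_probability() stands in for a trained model; weights are made up.

def churn_probability(customer):
    weights = {"support_tickets": 0.08, "months_inactive": 0.15}
    score = sum(weights[k] * customer.get(k, 0) for k in weights)
    return min(1.0, score)  # clamp to a valid probability

def prioritize_for_retention(customers, threshold=0.5):
    """Key decision: which customers should I prioritize to reduce churn?"""
    scored = [(c["id"], churn_probability(c)) for c in customers]
    return [cid for cid, p in sorted(scored, key=lambda t: -t[1]) if p >= threshold]

customers = [
    {"id": "A", "support_tickets": 1, "months_inactive": 0},
    {"id": "B", "support_tickets": 5, "months_inactive": 4},
]
print(prioritize_for_retention(customers))  # -> ['B']
```

The decision layer (threshold, ranking) stays separate from the model, so the same probability output can inform different business actions.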
25. 1. Establish a qualitative objective
2. Translate it into a quantifiable metric
3. Quantify the metric improvement considered useful (e.g., 10% fewer failures, savings of $1MM/year)
4. Establish a baseline (e.g., current failure rate = 10% per year)
5. Establish how to measure the improvement in the metric with the data science solution (e.g., 80% of the equipment maintained based on the predictive model)
Using Performance Metrics
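The slide's own numbers (10% baseline failure rate, 10% fewer failures as the useful improvement) can be turned into a small metric calculation, a minimal sketch of the baseline-and-target bookkeeping:

```python
# Illustrative numbers from the slide: baseline failure rate = 10%/year,
# a 10% relative reduction in failures is considered a useful improvement.

baseline_failure_rate = 0.10   # current failure rate = 10% per year
target_improvement = 0.10      # 10% fewer failures considered useful

target_failure_rate = baseline_failure_rate * (1 - target_improvement)

def measured_improvement(observed_failure_rate):
    """Relative reduction in failures versus the baseline."""
    return (baseline_failure_rate - observed_failure_rate) / baseline_failure_rate

print(round(target_failure_rate, 3))           # 0.09
print(round(measured_improvement(0.085), 3))   # 0.15 -> exceeds the 10% goal
```

Fixing the baseline and the target up front makes "did the data science solution help?" a measurable question rather than a judgment call.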
27. Tips:
1. Embed the data science team within the business
2. Allow exploring multiple problem formulations to get to the end metric goal
3. Go past the goal within a set time period
4. Ensure reproducibility
29. 1. Set up the end-to-end solution and the metrics
2. Launch with a baseline/simple model
3. Act on the recommendations of the solution
4. Measure and iterate
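Step 2 above — launch with a baseline/simple model — can be as trivial as predicting the most frequent class. A minimal sketch (names and data are illustrative):

```python
from collections import Counter

# Launch the end-to-end pipeline with a trivial baseline (predict the most
# frequent class) so the metric plumbing works before serious modeling.

def fit_baseline(labels):
    most_common, _ = Counter(labels).most_common(1)[0]
    return lambda _features: most_common  # ignores features by design

def accuracy(model, features, labels):
    preds = [model(f) for f in features]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

train_labels = ["ok", "ok", "ok", "fail"]
baseline = fit_baseline(train_labels)
print(accuracy(baseline, [None] * 4, train_labels))  # 0.75
```

Any later model must beat this number on the same metric, which is exactly what "measure and iterate" requires.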
32. • Empower ALL to perform like the BEST
• Automate repetitive human tasks
• Embed expert knowledge into the solution
33. • How to interpret the model?
• Importance of Features
• Bias in the model
• Interpreting predictions per instance
• What-if analysis
Users don’t trust black-box models
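For a linear model, a per-instance explanation of the kind listed above can simply be each feature's contribution (weight × value). A hedged sketch with made-up weights and feature names:

```python
# Per-instance interpretation for a linear model: rank features by the
# absolute size of their contribution. Weights here are illustrative.

weights = {"tenure_months": -0.03, "support_tickets": 0.20, "bias": -0.5}

def explain(instance):
    contributions = {f: weights[f] * v for f, v in instance.items()}
    contributions["bias"] = weights["bias"]
    return sorted(contributions.items(), key=lambda kv: -abs(kv[1]))

instance = {"tenure_months": 24, "support_tickets": 6}
for feature, contrib in explain(instance):
    print(f"{feature:>16}: {contrib:+.2f}")
```

The same decomposition supports what-if analysis: change one feature value, re-run `explain`, and compare contributions.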
37. 1. Learn from experiments
• Why?
• Both Successes or Failures
2. Share the learnings
3. Promote successful experiments to production
4. Move on to the next hypothesis to experiment
38. • Failure is a valid outcome of an
experiment
• Learn and refine the next experiment
40. A process specifies a detailed sequence of activities
necessary to perform specific business tasks.
It is used to standardize procedures and
establish best practices.
41. Microsoft’s Team Data Science Process
https://aka.ms/tdsp
• Standard project lifecycle
• Standardized document templates, project structure
• Shared, distributed resources
• Productivity tools, shared utilities
45. • Data science virtual machines (DSVMs) as the fundamental development platform on the cloud
• Use Visual Studio Team Services (VSTS)
  • Work item tracking and scrum planning
  • Git repositories
• Shared data science utilities in a Git repository
• Use cloud-based Azure resources as needed
47. • Terminology:
• Feature: a project
• Story: a stage in the E2E process of a DS project
• Tasks: specific coding/documentation/other activities needed to complete a story
• Iteration: usually a 2-week sprint
50. [Architecture diagram: ML lifecycle management (processes, templates, permissions). The data scientist works in an IDE and a training environment, trains and tests models with CNTK/TF/scikit-learn/Keras, and publishes them to model storage. The app developer writes app code in an IDE with source control, builds and tests via CI/CD pipelines, embeds the model, and ships to cloud services and edge devices, where consumers use the apps. App telemetry flows into a data lake for A/B testing and continuous retraining. Example model output: [ { "cat": 0.99218, "feline": 0.81242, "puma": 0.45456 } ]]
51. Model Source Control
• Processes and procedures to make models
reproducible (from source control to data
retention policies)
• Make it easy to work on multiple models
(consistent process)
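One concrete way to make a model run reproducible, in the spirit of the slide above, is to record the random seed plus a content hash of the training data alongside each run. The manifest fields below are illustrative, not a TDSP-prescribed schema:

```python
import hashlib
import json
import random

# Sketch: reproducibility metadata recorded alongside a training run --
# the random seed plus a content hash of the training data.

def run_manifest(data_rows, seed):
    random.seed(seed)  # fix randomness so training is repeatable
    payload = json.dumps(data_rows, sort_keys=True).encode()
    return {
        "seed": seed,
        "data_sha256": hashlib.sha256(payload).hexdigest(),
        "n_rows": len(data_rows),
    }

rows = [{"x": 1, "y": 0}, {"x": 2, "y": 1}]
m1 = run_manifest(rows, seed=42)
m2 = run_manifest(rows, seed=42)
print(m1 == m2)  # True: same data + seed => identical manifest
```

Checking the manifest into source control with the training code ties a model back to the exact data and randomness that produced it.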
52. Model Validation
• Unit testing, functional testing, and performance testing
• Validation needs to be performed both in isolation and when embedded in an application
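Validating a model in isolation can start with unit tests asserting basic output contracts before the model is embedded in an application. A minimal sketch; `score()` is a stand-in for a real model:

```python
# Unit-test sketch for a model in isolation: assert output contracts
# (valid probability range, expected monotonicity) before embedding.

def score(features):
    raw = 0.1 * features.get("usage", 0)
    return max(0.0, min(1.0, raw))  # clamp to a valid probability

def test_score_is_probability():
    for usage in (0, 5, 50):
        p = score({"usage": usage})
        assert 0.0 <= p <= 1.0

def test_score_monotone_in_usage():
    assert score({"usage": 10}) >= score({"usage": 1})

test_score_is_probability()
test_score_monotone_in_usage()
print("validation checks passed")
```

The same tests then run again in integration, with the model called through the application's interface rather than directly.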
53. Model Versioning & Storage
• Provide a consistent way to store and share models, plus a way to track where models are embedded / running
• Provide a consistent model format
• Provide traceability on where a model came from (which data, which experiment, where’s the code / notebook)
• Control who has access to which models
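The traceability fields the slide calls for — which data, which experiment, where the code lives, where the model runs — map naturally onto a registry entry. A sketch with illustrative placeholder names (not a specific Azure registry API):

```python
import time

# Sketch of a model registry entry carrying traceability metadata.
registry = {}

def register_model(name, version, data_ref, experiment_id, code_ref):
    registry[(name, version)] = {
        "data": data_ref,          # which data the model was trained on
        "experiment": experiment_id,
        "code": code_ref,          # notebook / commit the model came from
        "registered_at": time.time(),
        "deployments": [],         # track where the model is running
    }

register_model("churn", "1.0", "sales_2016.csv", "exp-007", "git:abc123")
registry[("churn", "1.0")]["deployments"].append("scoring-api-prod")
print(registry[("churn", "1.0")]["deployments"])  # ['scoring-api-prod']
```

Access control would then be enforced at the registry boundary, satisfying the last bullet.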
54. Model Deployment
• Provide an efficient process to get a model built into an application or service and leveraged to light up an end-user scenario
• Simplify the process of interacting with the model (through code generation, API specifications / interfaces, or other methods)
• Support a variety of inferencing targets (cloud / app / edge), including FPGAs and dedicated frameworks like CoreML and WinML
• Provide secrets / service endpoint management to remove friction from configuring the release process
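The second bullet — a simple, specified interface to the model — can be sketched as a JSON-in / JSON-out scoring function that app code consumes without knowing the model's internals. The endpoint shape and field names below are illustrative, not a specific Azure API:

```python
import json

# Minimal scoring-interface sketch: JSON request in, JSON response out.
# predict() stands in for a deployed model; the rule is a placeholder.

def predict(features):
    return {"will_fail": features.get("hours_since_service", 0) > 500}

def handle_request(body: str) -> str:
    features = json.loads(body)
    return json.dumps(predict(features))

response = handle_request('{"hours_since_service": 720}')
print(response)  # {"will_fail": true}
```

Because the contract is just JSON over a function boundary, the same interface can sit behind a cloud service, an in-app call, or an edge deployment.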
56. • Data exploration
• RFM – user behavior modeling
• Hyperparameter tuning
• Automated featurization
Note: Domain expertise is still helpful
Building an Org’s Toolbox