This module addresses critical business aspects of launching a predictive analytics project. It discusses how to establish the relationship with business KPIs, introduces the notion of a data hunt for planning and acquiring external data that improves predictions, and explains model quality and its role in the ROI of data and prediction tasks. The module concludes with a glimpse of how collaborative data challenges can rapidly improve predictive model quality.
3. Levels of transformation through data
• reporting: what happened in the past? (reflection)
• dashboards and real-time monitoring: what is happening now? (reactivity)
• prediction: what will happen next? (pro-activity)
4. How can we accelerate a digital transformation process by leveraging data?
5. Building value-driven data projects
The following questions need to be answered in this order:
• What knowledge would increase our profits?
• What data do we need to collect?
• What ML methods are appropriate?
6. Discussion
• Are standard innovation methodologies fit for digital transformation projects?
• (Can we apply C-K to this?)
7. Two key aspects
• Do I have all the relevant data?
Strategic data watch: is there any new source of data I can use?
• Do I have the best predictive accuracy?
How do I make sure that I'm working with the best possible predictive models?
9. Exercise: Data hunt
During your transition to predictive analytics, you may need to update your databases to include more variables with potential explanatory power.
• A travel IT systems company has some air traffic / passenger data.
• They are interested in predicting passenger flow between 20 airports in the US.
• Data covers 720 days, for each pair of airports.
• So, "one" variable.
• How can we augment this dataset?
• Which variables can be added?
• Where can we find the data?
10. Potential sources for relevant factors
K1 Events
K2 Plane accidents
K3 Calendar
K4 Delay causes
K5 Alternative transportation
K6 Safety
K7 Data on airports
K8 Similar data
K9 Oil price
K10 Average domestic air fares
K11 Town's population
K12 Town's attractiveness
20+ participants (students), analysed by Yohann Sitruk
12. Plan
1. Train & test paradigm
2. Prediction error and quality metrics
3. ROI in data science projects
13. Building a data science model
• …involves a great deal of trial and error
• little if any theory-based, model-based design
• even research (the development of new algorithms) is mostly trial and error
• the data scientist's best friend is a well-designed experimental studio that facilitates fast iterations
• How can we control the quality of the ensuing model?
14. Train & test paradigm
• Data-driven predictors should work well on future (unseen) data
• use historical data to select and fit a model, then use the model to make predictions on new data
• but we only have historical data: how do we "simulate" past and future on existing data?
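The split can be sketched in plain Python (the data set and the 25% test fraction below are illustrative, not from the slides):

```python
import random

def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle the data, then hold out a fraction as the test ("future") set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# Hypothetical historical records: 100 (feature, label) pairs.
records = [(i, i % 2) for i in range(100)]
train, test = train_test_split(records)
print(len(train), len(test))  # 75 25
```

The model is fitted only on `train`; `test` stands in for future, unseen data.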
15. Train & test paradigm
Split the data into a training set and a test set: develop the model on the training set, test it on the test set, then change the test set and repeat.
16. Train & test paradigm
Cycling through the data in this manner is called cross-validation. This is a powerful and important concept for building robust models.
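A minimal sketch of how the cross-validation folds can be generated (plain Python, no ML library assumed):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs, cycling the held-out fold through the data."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

# 10 examples, 5 folds: each example lands in the test set exactly once.
folds = list(k_fold_indices(10, 5))
print(len(folds))  # 5
```

Fitting and scoring the model once per fold, then averaging the scores, gives a more robust quality estimate than a single train/test split.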
17. Question
• Assume your management has decided to outsource your predictive model building activity.
• How would you evaluate various partners?
18. Plan
1. Train & test paradigm
2. Prediction error and (quality) metrics
3. ROI in data science projects
19. Back to classification
• A simple linear model: many red and blue items are misclassified.
• A complex non-linear model: better separation of the data (again).
What would be a suitable metric to characterise model performance in the above case?
20. Prediction error
One candidate metric: the number of misclassified points (red or blue). According to this criterion, M1 seems worse than M2, assuming both models avoid over/under-fitting (is that the case here?).
21. Model performance
A list of metrics from scikit-learn (a widely used ML software library). The choice of metric is important; ideally, it should be tied to a business objective.
22. Model performance - a simple case -
Two basic notions:
- False positives
- False negatives
Examples:
1. the model predicts cancer for a patient who does not have cancer (a false positive)
2. the model predicts that a patient does not have cancer while she actually does (a false negative)
Note that the costs of these errors are not identical. This is true in most cases.
Can you give other examples?
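The two error types can be counted and weighted by their (unequal) costs; a minimal sketch with made-up labels and costs:

```python
def error_costs(y_true, y_pred, cost_fp, cost_fn):
    """Count false positives / negatives and weight each by its own cost."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn, fp * cost_fp + fn * cost_fn

# Toy screening example: 1 = "has the condition".
# Missing a real case (FN) is assumed far more costly than a false alarm (FP).
y_true = [1, 0, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 0]
fp, fn, cost = error_costs(y_true, y_pred, cost_fp=100, cost_fn=10_000)
print(fp, fn, cost)  # 1 2 20100
```

With symmetric costs a plain error count would suffice; here the single metric hides that the two false negatives dominate the total cost.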
23. Plan
1. Train & test paradigm
2. Prediction error and (quality) metrics
3. ROI in data science projects
24. Calculating ROI for improving predictive accuracy
Think about ad-targeting companies. Assume, for the sake of example, the following (fictitious) figures.
The company monitors 100 million page loads per hour by internet users. Within the short duration of a page load, the company must predict whether the user will click on an advertisement.
The company pays $0.10 to show the advertisement in the dedicated zone of the page, and makes $0.17 if the user clicks on the ad. How does model performance affect profitability?
Assume the model causes 5% false positives and 10% false negatives over 100 million predictions.
17 million wrong predictions, per hour!
The cost of the FPs: 100M × 0.05 × $0.10 = $500,000
The cost of the FNs: 100M × 0.10 × $0.07 = $700,000 (where $0.07 = $0.17 − $0.10, the net profit missed per false negative)
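The hourly cost arithmetic above can be written out directly (all figures are the slide's fictitious ones):

```python
page_loads   = 100_000_000  # predictions per hour
cost_to_show = 0.10         # $ paid per ad shown
click_value  = 0.17         # $ earned per click

fp_rate, fn_rate = 0.05, 0.10

# False positive: ad shown but no click -> the display cost is wasted.
fp_cost = page_loads * fp_rate * cost_to_show
# False negative: ad withheld from a would-be clicker -> net profit is lost.
fn_cost = page_loads * fn_rate * (click_value - cost_to_show)
print(fp_cost, fn_cost)  # ≈ 500,000 and 700,000 dollars lost, per hour
```

Halving either error rate translates directly into hundreds of thousands of dollars per hour, which is what makes the ROI of model improvement easy to argue here.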
25. Calculating ROI for improving predictive accuracy
The previous example was for (binary) classification. What happens in the case of "regression"?
Example: predicting the remaining lifetime of devices.
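For regression there are no false positives or negatives; the error is the distance between the predicted and actual value. Two common metrics, sketched in plain Python (the lifetime numbers are made up):

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average size of the prediction miss."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalises large misses more heavily."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

# Hypothetical remaining-lifetime predictions (in days).
actual    = [100, 250, 40]
predicted = [110, 240, 70]
print(mae(actual, predicted), rmse(actual, predicted))
```

As with classification, the business cost may be asymmetric: over-estimating a device's remaining lifetime (missed maintenance, failure in service) is usually costlier than under-estimating it (premature maintenance), so a custom asymmetric loss may fit the ROI better than MAE or RMSE.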
27. How to reach the best predictive accuracy?
Better Predictions = More Value
• Customer Analytics: churn, pricing, lead scoring, credit scoring, up- & cross-sales
• Risk & Production: fraud / insurance, compliance, safety analysis, cyber-security, manufacturing
• Operations: maintenance, fault analysis, logistics, HR, procurement
Integrating & increasing data science capabilities is hard
[Org chart: Finance, Sales, Marketing, Engineering, Purchasing, HR, Accounting, Manufacturing, Planning, IT, DS, R&D]
Main obstacles:
- Skill gap: shortage of data scientists and skilled people; PhDs are expensive and in high demand (McKinsey, 2016); unawareness of the latest techniques and experimental methods
- Development gap: lack of adapted infrastructures and systems; limited resources and time; lack of management practices and appropriate experimental tools
- Deployment gap: it takes months to go from development to deployment, and by the time a model is ready to be deployed in production, the world has changed (distribution shifts); 78% of companies have no automated procedures and 50% recode from scratch (Dataiku Production Survey Report)
Most companies operate with under-performing models.
Ex. a 10% improvement in sales prediction = a 1% decrease in stock-outs = a €100M increase in sales for a retail giant
28. Developing a predictive model is an experimental process
ML has produced a large variety of algorithms:
- Linear Regression
- Logistic Regression
- Decision Tree
- SVM
- Naive Bayes
- KNN
- K-Means
- Random Forest
- Dimensionality Reduction
- Gradient Boost & AdaBoost
- …
Each of these has tunable parameters; the number of such (hyper)parameters can vary anywhere from 1 to ~100. Trying every combination is not possible.
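Since trying every combination is impossible, practitioners often sample the search space instead of enumerating it. A minimal random-search sketch (the score function here is a toy stand-in for a real train-and-evaluate step on held-out data):

```python
import random

def random_search(score_fn, space, n_trials=50, seed=0):
    """Sample hyperparameter combinations at random; keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space for some tree-based model.
space = {"depth": [2, 4, 8, 16], "lr": [0.01, 0.1, 1.0]}
# Toy score: peaks at depth=8, lr=0.1 (in practice: cross-validated accuracy).
toy_score = lambda p: -abs(p["depth"] - 8) - abs(p["lr"] - 0.1)
best, score = random_search(toy_score, space)
print(best, score)
```

With ~100 hyperparameters the grid explodes combinatorially, which is why random or adaptive search (and a fast experimental loop) matters so much in practice.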
29. Classification for discovery (B. Kégl / AppStat@LAL, Learning to discover)
A particular instrument for extending the "search" for the best model is crowdsourcing: hundreds of models were produced and tested by the participants, yielding a 20% improvement over the baseline model used by physicists (from 3.2 to 3.8) in detecting Higgs particles.
35. Some numbers
• 100+ participants, working on the same problem
• 411+ models, in just 3 days
• Starting kit scores: Combined = 0.131, Err = 0.090, MARE = 0.212
• Final best submission: Combined = 0.032 (75% better), Err = 0.015 (80%), MARE = 0.065 (~70%)
• The blended model is even better: 0.023 on the combined score (better than Saclay, hooray!)
• These improvements are remarkable
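The "blended model" above combines many submitted models; a common scheme is simply averaging their predictions. A minimal sketch (the numbers are toy values, not the challenge data):

```python
def blend(predictions_per_model):
    """Average the predictions of several models, element-wise."""
    n_models = len(predictions_per_model)
    return [sum(p) / n_models for p in zip(*predictions_per_model)]

# Three hypothetical models predicting the same 4 targets.
model_preds = [
    [0.9, 0.2, 0.4, 0.7],
    [1.1, 0.4, 0.2, 0.6],
    [1.0, 0.3, 0.3, 0.8],
]
print(blend(model_preds))  # ≈ [1.0, 0.3, 0.3, 0.7]
```

When the individual models make partly independent errors, their average tends to cancel some of those errors out, which is why the blend can beat every single submission.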
39. Workshop
• Assume you are all working in various branches of the same group.
• The executive committee decides to run a company-wide initiative to elaborate a roadmap for accelerating the digital transition.
• Steps:
• Split into 5 teams of 8 people
• 30-45 min: each group generates as many prediction problems as possible, with direct relevance to their work (any of the company's branches)
• 60 min: build a priority list, based on:
• the availability or accessibility of the required data
• ROI and potential gain (it's OK to be approximate, but try to come up with informed estimates)
• 30 min: choose 3 applications and report to the whole group (debriefing)