This module addresses critical business aspects of launching a predictive analytics project. It discusses how to establish the relationship with business KPIs, introduces the notion of a data hunt for planning and acquiring external data that improves predictions, and explains model quality and its role in the ROI of data and prediction tasks. The module concludes with a glimpse of how collaborative data challenges can rapidly improve predictive model quality.
3. Levels of transformation through data
• reporting: what happened in the past? (reflection)
• dashboards and real-time monitoring: what is happening now? (reactivity)
• prediction: what will happen next? (pro-activity)
4. How can we accelerate a digital transformation process by leveraging data?
5. Building value-driven data projects
The following questions need to be answered in this order:
• What knowledge would increase our profits?
• What data do we need to collect?
• What ML methods are appropriate?
6. Discussion
• Are standard innovation methodologies fit for digital transformation projects?
• (Can we apply C-K to this?)
7. Two key aspects
• Do I have all the relevant data?
Strategic data watch: is there any new source of data I can use?
• Do I have the best predictive accuracy?
How do I make sure that I'm working with the best possible predictive models?
9. Exercise: Data hunt
During your transition to predictive analytics, you may need to update your databases to include more variables with potential explanatory power.
• A travel IT systems company has some air traffic / passenger data.
• They are interested in predicting passenger flow between 20 airports in the US.
• Data covers 720 days, for each pair of airports.
• So, "one" variable.
• How can we augment this dataset?
• Which variables can be added?
• Where can we find the data?
10. Potential sources for relevant factors
K1 Events
K2 Plane accidents
K3 Calendar
K4 Delay causes
K5 Alternative transportation
K6 Safety
K7 Data on airports
K8 Similar data
K9 Oil price
K10 Average domestic air fares
K11 Town's population
K12 Town's attractiveness
20+ participants (students), analysed by Yohann Sitruk
12. Plan
1. Train & test paradigm
2. Prediction error and quality metrics
3. ROI in data science projects
13. Building a data science model
• …involves a great deal of trial and error
• little if any theory-based, model-based design
• even research (the development of new algorithms) is mostly trial and error
• the data scientist's best friend is a well-designed experimental studio that facilitates fast iterations
• How can we control the quality of the ensuing model?
14. Train & test paradigm
• Data-driven predictors should work well on future (unseen) data
• use historical data to select and fit a model, then use the model to make predictions on new data
• but we only have historical data: how do we "simulate" past and future on existing data?
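The split can be sketched in plain Python (the data set and the 25% test fraction below are illustrative, not from the slides):

```python
import random

def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle the data, then hold out a fraction as the test ("future") set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# Hypothetical historical records: 100 (feature, label) pairs.
records = [(i, i % 2) for i in range(100)]
train, test = train_test_split(records)
print(len(train), len(test))  # 75 25
```

The model is fitted only on `train`; `test` stands in for future, unseen data.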
15. Train & test paradigm
Split the data into a training set and a test set: develop the model on the training set, test it on the test set, then change the test set and repeat.
16. Train & test paradigm
Cycling through the data in this manner is called cross-validation. This is a powerful and important concept for building robust models.
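A minimal sketch of how the cross-validation folds can be generated (plain Python, no ML library assumed):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs, cycling the held-out fold through the data."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

# 10 examples, 5 folds: each example lands in the test set exactly once.
folds = list(k_fold_indices(10, 5))
print(len(folds))  # 5
```

Fitting and scoring the model once per fold, then averaging the scores, gives a more robust quality estimate than a single train/test split.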
17. Question
• Assume your management has decided to outsource your predictive model building activity.
• How would you evaluate various partners?
18. Plan
1. Train & test paradigm
2. Prediction error and (quality) metrics
3. ROI in data science projects
19. Back to classification
• A simple linear model: many red and blue items are misclassified.
• A complex non-linear model: better separation of the data (again).
What would be a suitable metric to characterise model performance in the above case?
20. Prediction error
One candidate metric: the number of misclassified points (red or blue). According to this criterion, M1 seems worse than M2, assuming both models avoid over/under-fitting (is that the case here?).
21. Model performance
A list of metrics from scikit-learn (a widely used ML software library). The choice of metric is important; ideally, it should be tied to a business objective.
22. Model performance - a simple case -
Two basic notions:
- False positives
- False negatives
Examples:
1. the model predicts cancer for a patient who does not have cancer (a false positive)
2. the model predicts that a patient does not have cancer while she actually does (a false negative)
Note that the costs of these errors are not identical. This is true in most cases.
Can you give other examples?
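The two error types can be counted and weighted by their (unequal) costs; a minimal sketch with made-up labels and costs:

```python
def error_costs(y_true, y_pred, cost_fp, cost_fn):
    """Count false positives / negatives and weight each by its own cost."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp, fn, fp * cost_fp + fn * cost_fn

# Toy screening example: 1 = "has the condition".
# Missing a real case (FN) is assumed far more costly than a false alarm (FP).
y_true = [1, 0, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 0]
fp, fn, cost = error_costs(y_true, y_pred, cost_fp=100, cost_fn=10_000)
print(fp, fn, cost)  # 1 2 20100
```

With symmetric costs a plain error count would suffice; here the single metric hides that the two false negatives dominate the total cost.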
23. Plan
1. Train & test paradigm
2. Prediction error and (quality) metrics
3. ROI in data science projects
24. Calculating ROI for improving predictive accuracy
Think about ad-targeting companies. Assume, for the sake of example, the following (fictitious) figures.
The company monitors 100 million page loads per hour by internet users. Within the short duration of a page load, the company must predict whether the user will click on an advertisement.
The company pays $0.10 to show the advertisement in the dedicated zone of the page, and makes $0.17 if the user clicks on the ad. How does model performance affect profitability?
Assume the model causes 5% false positives and 10% false negatives over 100 million predictions.
17 million wrong predictions, per hour!
The cost of the FPs: 100M × 0.05 × $0.10 = $500,000
The cost of the FNs: 100M × 0.10 × $0.07 = $700,000 (where $0.07 = $0.17 − $0.10, the net profit missed per false negative)
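The hourly cost arithmetic above can be written out directly (all figures are the slide's fictitious ones):

```python
page_loads   = 100_000_000  # predictions per hour
cost_to_show = 0.10         # $ paid per ad shown
click_value  = 0.17         # $ earned per click

fp_rate, fn_rate = 0.05, 0.10

# False positive: ad shown but no click -> the display cost is wasted.
fp_cost = page_loads * fp_rate * cost_to_show
# False negative: ad withheld from a would-be clicker -> net profit is lost.
fn_cost = page_loads * fn_rate * (click_value - cost_to_show)
print(fp_cost, fn_cost)  # ≈ 500,000 and 700,000 dollars lost, per hour
```

Halving either error rate translates directly into hundreds of thousands of dollars per hour, which is what makes the ROI of model improvement easy to argue here.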
25. Calculating ROI for improving predictive accuracy
The previous example was for (binary) classification. What happens in the case of "regression"?
Example: predicting the remaining lifetime of devices.
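For regression there are no false positives or negatives; the error is the distance between the predicted and actual value. Two common metrics, sketched in plain Python (the lifetime numbers are made up):

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average size of the prediction miss."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalises large misses more heavily."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

# Hypothetical remaining-lifetime predictions (in days).
actual    = [100, 250, 40]
predicted = [110, 240, 70]
print(mae(actual, predicted), rmse(actual, predicted))
```

As with classification, the business cost may be asymmetric: over-estimating a device's remaining lifetime (missed maintenance, failure in service) is usually costlier than under-estimating it (premature maintenance), so a custom asymmetric loss may fit the ROI better than MAE or RMSE.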
27. How to reach the best predictive accuracy?
Better Predictions = More Value
• Customer Analytics: churn, pricing, lead scoring, credit scoring, up- & cross-sales
• Risk & Production: fraud / insurance, compliance, safety analysis, cyber-security, manufacturing
• Operations: maintenance, fault analysis, logistics, HR, procurement
Integrating & increasing data science capabilities is hard
[Org chart: Finance, Sales, Marketing, Engineering, Purchasing, HR, Accounting, Manufacturing, Planning, IT, DS, R&D]
Main obstacles:
- Skill gap: shortage of data scientists and skilled people; PhDs are expensive and in high demand (McKinsey, 2016); unawareness of the latest techniques and experimental methods
- Development gap: lack of adapted infrastructures and systems; limited resources and time; lack of management practices and appropriate experimental tools
- Deployment gap: it takes months to go from development to deployment, and by the time a model is ready to be deployed in production, the world has changed (distribution shifts); 78% of companies have no automated procedures and 50% recode from scratch (Dataiku Production Survey Report)
Most companies operate with under-performing models.
Ex. a 10% improvement in sales prediction = a 1% decrease in stock-outs = a €100M increase in sales for a retail giant
28. Developing a predictive model is an experimental process
ML has produced a large variety of algorithms:
- Linear Regression
- Logistic Regression
- Decision Tree
- SVM
- Naive Bayes
- KNN
- K-Means
- Random Forest
- Dimensionality Reduction
- Gradient Boost & AdaBoost
- …
Each of these has tunable parameters; the number of such (hyper)parameters can vary anywhere from 1 to ~100. Trying every combination is not possible.
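Since trying every combination is impossible, practitioners often sample the search space instead of enumerating it. A minimal random-search sketch (the score function here is a toy stand-in for a real train-and-evaluate step on held-out data):

```python
import random

def random_search(score_fn, space, n_trials=50, seed=0):
    """Sample hyperparameter combinations at random; keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space for some tree-based model.
space = {"depth": [2, 4, 8, 16], "lr": [0.01, 0.1, 1.0]}
# Toy score: peaks at depth=8, lr=0.1 (in practice: cross-validated accuracy).
toy_score = lambda p: -abs(p["depth"] - 8) - abs(p["lr"] - 0.1)
best, score = random_search(toy_score, space)
print(best, score)
```

With ~100 hyperparameters the grid explodes combinatorially, which is why random or adaptive search (and a fast experimental loop) matters so much in practice.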
29. Classification for discovery (B. Kégl / AppStat@LAL, Learning to discover)
A particular instrument for extending the "search" for the best model is crowdsourcing: hundreds of models were produced and tested by the participants, yielding a 20% improvement over the baseline model used by physicists (from 3.2 to 3.8) in detecting Higgs particles.
35. Some numbers
• 100+ participants, working on the same problem
• 411+ models, in just 3 days
• Starting kit scores: Combined = 0.131, Err = 0.090, MARE = 0.212
• Final best submission: Combined = 0.032 (75% better), Err = 0.015 (80%), MARE = 0.065 (~70%)
• The blended model is even better: 0.023 on the combined score (better than Saclay, hooray!)
• These improvements are remarkable
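The "blended model" above combines many submitted models; a common scheme is simply averaging their predictions. A minimal sketch (the numbers are toy values, not the challenge data):

```python
def blend(predictions_per_model):
    """Average the predictions of several models, element-wise."""
    n_models = len(predictions_per_model)
    return [sum(p) / n_models for p in zip(*predictions_per_model)]

# Three hypothetical models predicting the same 4 targets.
model_preds = [
    [0.9, 0.2, 0.4, 0.7],
    [1.1, 0.4, 0.2, 0.6],
    [1.0, 0.3, 0.3, 0.8],
]
print(blend(model_preds))  # ≈ [1.0, 0.3, 0.3, 0.7]
```

When the individual models make partly independent errors, their average tends to cancel some of those errors out, which is why the blend can beat every single submission.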
39. Workshop
• Assume you are all working in various branches of the same group.
• The executive committee decides to run a company-wide initiative to elaborate a roadmap for accelerating the digital transition.
• Steps:
• Split into 5 teams of 8 people
• 30-45 min: each group generates as many prediction problems as possible, with direct relevance to their work (any of the company's branches)
• 60 min: build a priority list, based on:
• the availability or accessibility of the required data
• ROI and potential gain (it's OK to be approximate, but try to come up with informed estimates)
• 30 min: choose 3 applications and report to the whole group (debriefing)