Predictive Analytics
Peter Bruce
THE INSTITUTE FOR STATISTICS EDUCATION
at Statistics.com
peter.bruce@statistics.com
About Statistics.com
THE INSTITUTE FOR STATISTICS EDUCATION
• 100+ courses, introductory and advanced
• Traditional statistics, data mining, machine
learning, text mining, clinical
trials, optimization, use of R
• All online
• Typically 4 weeks, scheduled dates
• No need to be online at particular times or days
• Private discussion forum with instructors - noted
authors & experts
A man walks into a Target® store…
Predictive Analytics
• In marketing, used for model-driven, targeted
sales efforts
• Also used elsewhere: will a loan default, what is the
diagnosis (given symptoms), is a tax return fraudulent, …
Market Research
• Traditionally surveys, analysis, information
gathering, strategy
• Moving online increases the amount of
data, speeds its flow, and makes it more
accessible
Washington Post (web)
• 35 different reports tracking traffic daily
• Midday report “are we on track for visitors?”
• Number of visitors from key domains: .gov, .mil, .senate,
or .house
Daily Mail (UK web)
• A traditional ingredient is stories about
animals – tracked on web
• “The animals that do best are
monkeys, dogs, and cats, in that order…”
Martin Clark (editor)
Back to Target
Predictive Analytics
• Goes beyond the obvious, capturing
complexity
• Implemented for real-time behavior and
decisions
Pregnant?
• Obvious retail clues – maternity clothes, baby
food, baby clothes, crib …
• These may be too late
• Earlier clues are less obvious:
lotions, supplements, and especially combinations
of and changes in purchase patterns
• Data mining algorithms can capture these less
obvious, more complex signals
Training the Model
• Bridal registry
• Women of similar demographic not on bridal
registry
• Together, the training set
– Known outcome
– Purchase data over time
Hypothetical Data
Cust #   zinc10   zinc90   mag10   mag90   cotton10   cotton90   Registry?
1011     1        1        1       1       0          1          0
1012     1        0        1       0       1          0          1
1013     1        1        0       1       1          0          1
1014     0        1        1       0       1          1          0
1015     1        1        0       1       1          0          0
1016     0        0        1       0       1          0          1
Classification Algorithms
• K-nearest neighbors (involves 3 notions)
– Distance measure
– Centroid
– Majority vote or average
K-NN
Cust #   zinc10   zinc90   mag10   mag90   cotton10   cotton90   Registry?
1        1        1        1       1       0          1          0
2        1        1        1       1       0          1          0
3        1        1        0       1       0          1          0
4        0        1        1       0       1          1          0
5        1        1        0       1       1          0          1
6        0        0        1       0       1          0          1
NEW      1        0        1       1       0          1          ?
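The K-NN table can be worked through in a few lines of Python. This is a minimal sketch, not the presenter's implementation: the slide names a distance measure without choosing one, so Hamming distance (the count of differing 0/1 columns) and k = 3 are assumptions, as are the names `hamming_distance` and `knn_predict`.

```python
from collections import Counter

# Rows from the K-NN slide: six purchase indicators, then the registry label.
TRAINING = [
    ((1, 1, 1, 1, 0, 1), 0),
    ((1, 1, 1, 1, 0, 1), 0),
    ((1, 1, 0, 1, 0, 1), 0),
    ((0, 1, 1, 0, 1, 1), 0),
    ((1, 1, 0, 1, 1, 0), 1),
    ((0, 0, 1, 0, 1, 0), 1),
]
NEW_CUSTOMER = (1, 0, 1, 1, 0, 1)


def hamming_distance(a, b):
    """Count the positions where two 0/1 records differ (assumed distance measure)."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(training, record, k=3):
    """Majority vote among the k training records closest to `record`."""
    nearest = sorted(training, key=lambda row: hamming_distance(row[0], record))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


predicted = knn_predict(TRAINING, NEW_CUSTOMER, k=3)
print("Predicted registry value for NEW:", predicted)
```

With k = 3 the closest rows are customers 1, 2, and 3 (distances 1, 1, and 2), all labeled 0, so the sketch predicts 0 for the NEW row. A different distance measure or a different k could change the vote.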
Classification Algorithms, cont.
• Logistic Regression
• CART
• Discriminant Analysis
• Neural Network
• Naïve Bayes
The Overfit Problem
[Chart: Revenue vs. Expenditure; axis ticks run 0 to 1600 and 0 to 1000]
Complex function - overfit
[Chart: Revenue vs. Expenditure with a complex fitted function; axis ticks run 0 to 1600 and 0 to 1000]
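The overfit idea can be illustrated numerically. The sketch below uses invented expenditure and revenue figures (the slide's actual data points are not recoverable), fits a straight line and a polynomial that passes through every training point, and compares RMSE on the training data with RMSE on separate hold-out points. The function names and the choice of a Lagrange interpolant as the "complex function" are assumptions made for illustration.

```python
import math

# Hypothetical (expenditure, revenue) pairs, invented for illustration only.
TRAIN = [(100, 250), (300, 600), (500, 700), (700, 1200), (900, 1300)]
VALID = [(200, 400), (400, 650), (600, 950), (800, 1250)]


def fit_line(points):
    """Ordinary least-squares line; returns a prediction function."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolant(points):
    """Lagrange polynomial through every training point (the complex function)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            weight = 1.0
            for j, (xj, _) in enumerate(points):
                if i != j:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def rmse(model, points):
    """Root mean squared error of `model` on `points`."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in points) / len(points))


line = fit_line(TRAIN)
wiggly = fit_interpolant(TRAIN)
for name, model in [("line", line), ("interpolant", wiggly)]:
    print(f"{name:12s} train RMSE = {rmse(model, TRAIN):8.1f}   "
          f"validation RMSE = {rmse(model, VALID):8.1f}")
```

On this made-up data the interpolant reproduces the training points exactly, yet its error on the hold-out points is larger than the straight line's. That gap between training and hold-out performance is the symptom of overfit.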
Therefore: Validate the Model
• Partition the original data
– Training
– Validation
• Fit the model to the training data
• Assess performance using the validation data
Performance Metrics
• Continuous
– RMSE
• Categorical (often binary)
– % accurate (confusion matrix)
– Lift
Confusion Matrix and Cutoff Control
Training Data scoring - Summary Report
Cutoff probability value for success (updatable): 0.5
Classification Confusion Matrix
                 Predicted Class
Actual Class     1       0
1                43      8
0                6       247
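The counts in the confusion matrix are enough to compute the usual summary rates. The sketch below is illustrative only: the variable names are invented, and `classify` assumes that a record is assigned to class 1 when its estimated probability is at or above the cutoff, which the slide does not spell out.

```python
# Counts from the slide's confusion matrix (cutoff 0.5).
TRUE_POS, FALSE_NEG = 43, 8      # actual class 1
FALSE_POS, TRUE_NEG = 6, 247     # actual class 0


def classify(probability, cutoff=0.5):
    """Assign class 1 when the estimated probability reaches the cutoff (assumed rule)."""
    return 1 if probability >= cutoff else 0


total = TRUE_POS + FALSE_NEG + FALSE_POS + TRUE_NEG
accuracy = (TRUE_POS + TRUE_NEG) / total
sensitivity = TRUE_POS / (TRUE_POS + FALSE_NEG)      # share of actual 1s caught
precision = TRUE_POS / (TRUE_POS + FALSE_POS)        # share of predicted 1s that are right
all_zero_accuracy = (FALSE_POS + TRUE_NEG) / total   # predicting 0 for everyone

print(f"records           = {total}")
print(f"accuracy          = {accuracy:.3f}")
print(f"sensitivity       = {sensitivity:.3f}")
print(f"precision         = {precision:.3f}")
print(f"all-zero accuracy = {all_zero_accuracy:.3f}")
```

Overall accuracy here is 290 of 304 records, about 95%. Predicting class 0 for every record would already be right for 253 of 304, about 83%, which is why accuracy alone says little when the class of interest is rare. Lowering the cutoff would move records from the predicted-0 column to the predicted-1 column, trading false negatives for false positives.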
Lift
• In classifying “pregnant” vs. “not-pregnant”,
labeling everyone “not-pregnant” has
very high overall accuracy
• Need a metric that reflects the greater importance
of the “pregnant” category, which is rare
• Lift is the model’s improvement over average
random selection
Decile Lift Chart
[Chart: decile-wise lift chart (validation dataset); x-axis: deciles 1 to 10; y-axis: decile mean / global mean, 0 to 6]
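A decile-wise lift chart plots, for each tenth of the records ranked by model score, the mean outcome in that decile divided by the global mean. A minimal sketch of that calculation follows; the scores and outcomes are invented, the function name `decile_lift` is hypothetical, and the code assumes at least as many records as bins and at least one positive outcome.

```python
def decile_lift(scores, outcomes, n_bins=10):
    """Mean outcome per bin divided by the global mean, best-scored bin first."""
    pairs = sorted(zip(scores, outcomes), key=lambda p: p[0], reverse=True)
    ranked = [outcome for _, outcome in pairs]
    n = len(ranked)
    global_mean = sum(ranked) / n
    lifts = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        lifts.append((sum(chunk) / len(chunk)) / global_mean)
    return lifts


# Hypothetical validation data: 20 records, 4 of them positive (global mean 0.2).
SCORES = [(20 - i) / 20 for i in range(20)]          # 1.0, 0.95, ..., 0.05
OUTCOMES = [1, 1, 1, 0, 0, 0, 1, 0] + [0] * 12

for decile, lift in enumerate(decile_lift(SCORES, OUTCOMES), start=1):
    print(f"decile {decile:2d}: lift = {lift:.2f}")
```

In this toy data both records in the top decile are positive, so the decile mean is 1.0 against a global mean of 0.2 and the lift is 5. A model no better than random selection would show lift near 1 in every decile.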
Validate the Model
• Compare one model to another
• Avoid overfit
• Solution: apply model to hold-out sample
– Assess performance of different models
– Fine tune parameters of individual models
Partitioning
• Randomly split the initial data into 2 or 3
groups
– Training
– Validation
– Test
• Repeated use of the validation data to compare
and fine-tune models leads to overfitting the
validation data, in addition to the training data
– “Test” partition used only once, at the end
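The random split can be sketched with the standard library alone. The 50/30/20 proportions, the seed, the customer numbers, and the function name `partition` are all assumptions for illustration; the slides do not prescribe split sizes.

```python
import random


def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Shuffle a copy of `records` and cut it into training, validation, and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]     # touched only once, at the end
    return train, valid, test


customer_ids = list(range(1001, 1101))      # 100 hypothetical customer numbers
train, valid, test = partition(customer_ids)
print(len(train), len(valid), len(test))
```

Each record lands in exactly one partition. Models are fit on `train`, compared and tuned on `valid`, and the chosen model is scored on `test` a single time.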
Software
• SAS Enterprise Miner $$$$
• IBM SPSS Modeler (Clementine) $$$$
• XLMiner (Excel add-in) $
• Statistica Data Miner $$
• Salford Systems $$
• RapidMiner $$ (free open-source version available)
• R (open source, free)
Data Mining - More
• Clustering (segmentation)
• Profiling (explanatory models)
• Time series
• Affinity (recommender systems)
• Text analytics (NLP, sentiment analysis)
Skill Shortage
• McKinsey “Big Data” report
– Supply gap of 140,000-190,000 people with “deep
analytical talent”
• Emergence of “Analytics” master’s programs
(Northwestern, NC State, …)
