Extension and Validation of Moro et al.
By: Tapan Oza

Goals
• Validation
  • Repeatable results
  • Use same data
  • Use same protocol
• Extension
  • Same data, new protocols
    • Averaged one-dependence estimators (AODE)
    • Random Forest
• Tools used: Weka

Original Paper
• "Using data mining for bank direct marketing: An application of the CRISP-DM methodology." Moro et al.
• CRISP-DM: CRoss-Industry Standard Process for Data Mining
• Paper uses data from a Portuguese bank
  • Acquired via call center in 17 different campaigns
  • Large number of features
  • Large number of cases
• Classification methodologies:
  • Naïve Bayes
  • Decision Tree
  • Support Vector Machine

CRISP-DM

Classification Methodologies
• Naïve Bayes
  • Assumes features are independent given the class
  • Classification using Bayes' rule
  • Apply a decision rule to the resulting class probabilities
• Decision Tree
  • Many ways to build a tree
  • Common method splits on information gain
• Support Vector Machine
  • Basic (hard-margin) form requires linearly separable data
  • Identifies the separating hyperplane with the largest margin

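Speaker note: the "splits on information gain" bullet can be made concrete with a small sketch. The helper names (`entropy`, `information_gain`) and the toy rows are hypothetical illustrations, not taken from the paper, the bank dataset, or Weka; the sketch only assumes categorical features stored as dicts.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one categorical feature.

    rows is a list of dicts; feature must be a key present in every dict.
    """
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Weighted average entropy of the groups created by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data (assumption for illustration only).
rows = [
    {"contact": "cellular", "housing": "yes"},
    {"contact": "cellular", "housing": "no"},
    {"contact": "telephone", "housing": "yes"},
    {"contact": "telephone", "housing": "no"},
]
labels = ["yes", "yes", "no", "no"]

gain_contact = information_gain(rows, labels, "contact")  # separates the classes fully
gain_housing = information_gain(rows, labels, "housing")  # carries no information here
```

A tree builder of this kind would split on the feature with the larger gain ("contact" in this toy case) and then repeat the procedure on each resulting group.
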
Performance: Accuracy vs Speed
• Why accuracy?
  • Data mining is strategic
  • Computation costs are falling (Amazon EC2)
  • Without accuracy, the model is useless
• What do we use to measure accuracy?
  • Area under the receiver operating characteristic curve (AUROC)
  • Higher AUROC = better separation of the two classes, so more confidence in the classification

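Speaker note: one way to read AUROC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way by direct pairwise comparison. The function name `auroc` and the example scores are hypothetical, and this is not how Weka computes the figure internally; it is only meant to show what the number measures.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 1 (positive) or 0 (negative); a tied pair counts as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores (assumption, not results from the study).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
example_auc = auroc(scores, labels)  # 8 of the 9 pairs are ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the measure does not depend on any single decision threshold.
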
Extensions
• AODE
  • Modified Naïve Bayes
  • Weaker independence assumption than Naïve Bayes
  • Higher computational cost
  • Computation is cheap
• Random Forest
  • Many trees, one classification
  • Every tree “votes” on classification
  • Class with most “votes” is chosen
  • Impressive accuracy

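Speaker note: the voting step described for Random Forest can be sketched in a few lines. The function `majority_vote` and the example votes are hypothetical; real implementations may combine trees differently (for example by averaging class probabilities), and tie-breaking is left unspecified in this sketch.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    if not tree_predictions:
        raise ValueError("need at least one tree prediction")
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for a single client record.
votes = ["no", "yes", "no", "no", "yes"]
forest_prediction = majority_vote(votes)
```

Here three of the five trees vote "no", so the forest as a whole classifies the record as "no".
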
Results: Validation
• Paper doesn’t specify tree type
  • Average two different tree results
• 2 out of 3 validated
  • SVM not validated

AUROC        SVM     NB      Decision Tree
Original     0.938   0.870   0.868
Validation   0.583   0.861   0.863

Results: Extension
• Extension was to have two models
  • AODE
  • Random forest
• Weka output for AODE was incomplete
  • Cause unknown
  • Could be Weka
• Random forest AUROC is 0.9
  • Best result among the algorithms run for this study

Lessons Learned
• Random forest has impressive accuracy
• Naïve Bayes, Decision Tree, Random Forest are accurate enough for deployment
• Make sure you have the same tools when validating
• Make sure you use multiple tools when testing extensions

References:
• Moro, Sérgio, Raul Laureano, and Paulo Cortez. "Using data mining for bank direct marketing: An application of the CRISP-DM methodology." (2011).
• Breiman, Leo. "Random forests." Machine Learning 45.1 (2001): 5-32.
• Webb, Geoffrey I., Janice R. Boughton, and Zhihai Wang. "Not so naive Bayes: Aggregating one-dependence estimators." Machine Learning 58.1 (2005): 5-24.

Questions?
