A Guideline for Statistical and Machine Learning
Alexandre Alves, June 12, 2014
Define your Goal
Are you interested in prediction or in inference?
Prediction is a black-box method: given values for the features X1, …, Xp, it
predicts the value of the response Y.
Inference is a white-box method: it explains how the response Y is affected as the
features X1, …, Xp change.
People tend to think they need to predict, but more often than not inference will give
them more insight:
In an advertisement campaign, which media contributed most to sales?
Analyzing a business process failure, which attribute of the process contributes the
most to a negative outcome?
Given an increase in height, what is the expected increase in weight?
You must have a goal in mind in the form of a Question to be answered by means of
analyzing the Observations in your data.
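The height and weight question can be sketched with a simple least-squares fit. The data below is made up for illustration, and `fit_simple_linear` is a hypothetical helper, not something from the slides. The same fitted line serves both goals: the slope answers the inference question, and plugging in a new height gives a prediction.

```python
# Hypothetical data: heights in cm and weights in kg (made up for illustration).
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
weights = [52.0, 58.0, 67.0, 75.0, 83.0]


def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_simple_linear(heights, weights)

# Inference: how does the response change as the feature changes?
print(f"Expected weight increase per extra cm of height: {slope:.2f} kg")

# Prediction: what response do we expect for a new feature value?
new_height = 175.0
predicted_weight = intercept + slope * new_height
print(f"Predicted weight at {new_height:.0f} cm: {predicted_weight:.1f} kg")
```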
Define the Model
Looking at the Observations, is the Response present in the data?
In a history of transactions, the outcome of fraud or not fraud is recorded in the
transactions themselves.
If so, then you are looking at a Supervised model, and there is a Response variable.
Or is the Response not in the data?
In a financial market Exchange, which stocks are hot? The trade transactions do not include
a variable specifying if the stock is hot or not hot!
In this case, you are looking at an Unsupervised model.
Supervised Models
Is the Response variable quantitative?
What’s the weight? What’s the price? What’s the income?
You are dealing with a Regression problem.
Or is the Response variable qualitative (categorical)?
Is it fraud? What’s the gender - male or female? What’s the brand - A, B, C?
You are looking into a Classification problem.
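The questions so far form a small decision tree, which can be sketched as a function. `choose_problem_type` is a hypothetical helper written for this illustration. It assumes categories are stored as strings or booleans, so numerically coded categories (for example 0/1 for fraud) would need to be flagged by hand.

```python
def choose_problem_type(response_values):
    """Map a response column to the kind of problem the slides describe.

    response_values: list of observed responses, or None when the data
    contains no response at all.
    """
    if response_values is None:
        return "unsupervised"
    # bool is a subclass of int in Python, so exclude it explicitly:
    # True/False responses are categories, not quantities.
    is_quantitative = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in response_values
    )
    return "regression" if is_quantitative else "classification"


print(choose_problem_type([72.5, 80.1, 65.0]))      # weights
print(choose_problem_type(["fraud", "not fraud"]))  # labels
print(choose_problem_type(None))                    # no response recorded
```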
Regression Problems
Is there a somewhat linear relationship between the features and the response?
Example: gas consumption as a function of horsepower.
Fit a Linear model to your Observations.
Is there no clear relationship or form between the features and the response?
Example: gas consumption as a function of the car's model year.
Prefer a flexible, non-linear method, such as Regression Splines or Generalized Additive
Models.
Classification Problems
Is the Response made of only two categories (e.g. yes/no)?
Fit a Logistic regression model to your Observations.
Is there a somewhat linear boundary between the categories of the Response?
Use Linear Discriminant Analysis.
Is there no clear form to the boundary between the categories, but the probability distribution of the features within each category is known (or can reasonably be assumed)?
Use a Naive Bayes Classifier.
Otherwise, if there is no clear boundary and the distribution is not known:
Use K-Nearest Neighbors.
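As an illustration of the last option, here is a minimal k-nearest-neighbors sketch using only the Python standard library. The transaction features and labels are hypothetical, and `knn_predict` is a name chosen for this example. It assumes the features are already on comparable scales, which real data usually needs a rescaling step to achieve.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical transactions: (amount in thousands, hour of day / 24).
points = [(0.2, 0.5), (0.3, 0.6), (0.25, 0.4), (5.0, 0.1), (6.5, 0.05), (4.8, 0.15)]
labels = ["ok", "ok", "ok", "fraud", "fraud", "fraud"]

print(knn_predict(points, labels, (0.4, 0.5)))
print(knn_predict(points, labels, (5.5, 0.1)))
```

No boundary shape or distribution is assumed here: the prediction depends only on which labeled observations are closest.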
Unsupervised Models
Unsupervised learning is a relatively new field.
Is there a desired number of groups or categories?
Hot stocks (financial derivatives) and not-so-hot stocks
K-Means Clustering
Otherwise if number of groups is not known:
Stocks A and B trend together, stocks C and D trend together, stocks E and F…
Hierarchical Clustering
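For the case where the number of groups is chosen up front, a bare-bones k-means (Lloyd's algorithm) might look like the sketch below. The stock figures are invented, and starting from the first k points is an assumption made only to keep the example reproducible. Hierarchical clustering is not shown.

```python
import math


def k_means(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) with a deterministic start.

    The first k points serve as initial centroids; this is an assumption
    made to keep the sketch reproducible, real uses randomize the start.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical stocks: (daily traded volume in millions, price change in %).
stocks = [(9.0, 4.5), (1.0, 0.2), (8.5, 5.0), (1.2, 0.1), (9.5, 4.0), (0.8, 0.3)]
centroids, assignments = k_means(stocks, k=2)
print(assignments)
print(centroids)
```

The algorithm never sees a "hot" label. It only groups observations that sit close together, and naming the groups is left to the analyst.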
Train (and Re-train) the Model
Assessing the Model
The model is created by fitting the Observations.
The Accuracy of the model must be assessed:
If a regression problem, then measure the mean squared error.
If a classification problem, then measure the error rate.
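Once accuracy can be measured, resampling methods give a more reliable estimate of it and a basis for comparing candidate models:
Cross-validate: repeatedly hold out k observations as test data and fit on the rest.
Bootstrap: refit on samples drawn with replacement from the Observations.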
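The two accuracy measures and a simple k-fold cross-validation loop can be sketched as follows. The horsepower and consumption numbers are hypothetical, the helper names are chosen for this example, and the fold assignment (every `folds`-th observation) is one arbitrary choice among many. The bootstrap is not shown.

```python
def mean_squared_error(actual, predicted):
    """Regression accuracy: average squared gap between truth and prediction."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def error_rate(actual, predicted):
    """Classification accuracy: fraction of observations labeled wrongly."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_validated_mse(xs, ys, folds):
    """Average test MSE over `folds` held-out slices (k-fold cross-validation)."""
    n = len(xs)
    scores = []
    for f in range(folds):
        test_idx = set(range(f, n, folds))  # every folds-th observation
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        test_x = [x for i, x in enumerate(xs) if i in test_idx]
        test_y = [y for i, y in enumerate(ys) if i in test_idx]
        intercept, slope = fit_simple_linear(train_x, train_y)
        predictions = [intercept + slope * x for x in test_x]
        scores.append(mean_squared_error(test_y, predictions))
    return sum(scores) / folds


# Hypothetical data: horsepower and fuel consumption (litres per 100 km).
horsepower = [60, 75, 90, 110, 130, 150, 180, 200, 240, 300]
consumption = [5.1, 5.6, 6.0, 6.9, 7.4, 8.3, 9.1, 10.2, 11.8, 14.5]
cv_mse = cross_validated_mse(horsepower, consumption, folds=5)
print(f"5-fold CV MSE: {cv_mse:.3f}")

# Hypothetical fraud labels: truth versus a classifier's output.
truth = ["fraud", "ok", "ok", "ok"]
guess = ["fraud", "ok", "fraud", "ok"]
print(f"Error rate: {error_rate(truth, guess):.2f}")
```

Each observation is predicted by a model that never saw it during fitting, which is what makes the averaged score an honest estimate of accuracy on new data.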
Improving the Model
The possible findings are:
Change the features used in the Model:
Car color has no correlation with gas consumption, so remove it from the Model.
Change the form of the features or the interactions between them:
The relationship between horsepower and gas consumption is not strictly linear, so square the horsepower variable.
Change the model:
Persistently low accuracy is a good indication that the selected Model is a poor fit for the data.
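The effect of squaring a feature can be seen in a small sketch. The data is synthetic and built so that consumption grows exactly with the square of horsepower, so the improvement here is by construction. Only training error is compared, whereas a fair comparison would use held-out data.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def training_mse(xs, ys):
    """Fit a line to (xs, ys) and return its mean squared error on the same data."""
    intercept, slope = fit_simple_linear(xs, ys)
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data in which consumption grows with the square of horsepower.
horsepower = [50, 100, 150, 200, 250, 300]
consumption = [4.0 + 0.0001 * hp ** 2 for hp in horsepower]

mse_linear = training_mse(horsepower, consumption)
mse_squared = training_mse([hp ** 2 for hp in horsepower], consumption)

print(f"MSE with horsepower:   {mse_linear:.4f}")
print(f"MSE with horsepower^2: {mse_squared:.4f}")
```

The model is still linear in its coefficients. Only the feature fed into it has changed.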
Trade-offs
Models that tend to have high accuracy are often hard to interpret and therefore less suitable for inference:
Linear regressions are easy to interpret, but tend to have lower accuracy when the true relationship is not linear.
Support Vector Machines are very flexible, but can’t be easily interpreted.
Models that tend to be flexible are less biased, but they have higher variance, meaning their fit changes more when the training data changes:
Linear regressions are biased towards a linear form, but are stable under changes to the
training data.
k-NN with a small k has very low bias, but high variance as the training data changes.
Flexibility versus Interpretability, Bias versus Variance
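The variance half of the trade-off can be illustrated with two slightly different training samples of the same process. Both samples are invented for this sketch, and the helper names are hypothetical. The comparison shows how far each method's prediction at one point moves when the training data changes.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_linear(xs, ys, x0):
    """Prediction at x0 from a fitted straight line."""
    intercept, slope = fit_simple_linear(xs, ys)
    return intercept + slope * x0


def predict_1nn(xs, ys, x0):
    """Prediction at x0 from the single nearest training observation (k = 1)."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


# Two hypothetical samples of the same noisy process, observed at the same x values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_a = [2.1, 3.9, 6.2, 7.8, 10.1]
sample_b = [1.9, 4.2, 5.7, 8.3, 9.9]
x0 = 3.0

linear_shift = abs(predict_linear(xs, sample_a, x0) - predict_linear(xs, sample_b, x0))
knn_shift = abs(predict_1nn(xs, sample_a, x0) - predict_1nn(xs, sample_b, x0))

print(f"Linear prediction moved by {linear_shift:.2f}")
print(f"1-NN prediction moved by   {knn_shift:.2f}")
```

The straight line averages over all observations, so noise in any single one barely moves it, while 1-NN copies whichever observation happens to be nearest.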
“In God we trust, all others bring data.”
– W. Edwards Deming

“All models are wrong, some are useful.”
– George Box

“We are drowning in information and starving for knowledge.”
– Rutherford D. Rogers
