Location:
QuantUniversity Meetup
August 24th 2016
Boston MA
Machine Learning: An intuitive foundation
2016 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
www.QuantUniversity.com
sri@quantuniversity.com
Slides and Code will be available at:
http://www.analyticscertificate.com
- Analytics Advisory services
- Custom training programs
- Architecture assessments, advice and audits
• Founder of QuantUniversity LLC. and
www.analyticscertificate.com
• Advisory and Consultancy for Financial Analytics
• Prior Experience at MathWorks, Citigroup and
Endeca and 25+ financial services and energy
customers.
• Regular Columnist for the Wilmott Magazine
• Author of forthcoming book
“Financial Modeling: A case study approach”
published by Wiley
• Chartered Financial Analyst and Certified Analytics
Professional
• Teaches Analytics in the Babson College MBA
program and at Northeastern University, Boston
Sri Krishnamurthy
Founder and CEO
Quantitative Analytics and Big Data Analytics Onboarding
• Trained more than 500 students in
Quantitative methods, Data Science
and Big Data Technologies using
MATLAB, Python and R
• Launching the Analytics Certificate
Program in September
Quantitative Analytics and Big Data Analytics Onboarding
• Apply at:
www.analyticscertificate.com
• Program starting September 18th
• Module 1:
▫ Sep 18th, 25th, Oct 2nd, 9th
• Module 2:
▫ Oct 16th, 23rd, 30th, Nov 6th
• Module 3:
▫ Nov 13th, 20th, Dec 4th, Dec 11th
• Capstone + Certification Ceremony
▫ Dec 18th
• September
▫ 11th, 12th : Spark Workshop, Boston
- www.analyticscertificate.com/SparkWorkshop
- Sponsored by IBM
▫ 19th, 20th : Anomaly Detection Workshop, New York
- www.analyticscertificate.com/AnomalyNYC
- Sponsored by Microsoft
Events of Interest
Agenda
1. Data
2. Goals
3. Machine learning algorithms
4. Process
5. Performance Evaluation
Dataset, variable and Observations
Dataset: A rectangular array with rows as observations and
columns as variables
Variable: A characteristic of members of a population (Age, State,
etc.)
Observation: The list of variable values for one member of the
population
Variables
• A variable could be:
▫ Categorical
- Yes/No flags
- AAA, BB ratings for bonds
▫ Numerical
- 35 mpg
- $170K salary
Datasets
• Longitudinal
▫ Observations are dependent
▫ Temporal-continuity is required
• Cross-sectional
▫ Observations are independent
Data
• Cross sectional
▫ Numerical
▫ Categorical
• Longitudinal
▫ Numerical
Summary
• Descriptive Statistics
▫ Goal is to describe the data at hand
▫ Backward looking
▫ Statistical techniques employed here
• Predictive Analytics
▫ Goal is to use historical data to build a model for prediction
▫ Forward looking
▫ Machine learning techniques employed here
Goal
• How do you summarize numerical variables?
• How do you summarize categorical variables?
• How do you describe variability in numerical variables?
• How do you summarize relationships between categorical and
numerical variables?
• How do you summarize relationships between 2 numerical
variables?
Descriptive Statistics – Cross sectional datasets
See Data Analysis Taxonomy.xlsx
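The questions above can be sketched with the Python standard library. This is a minimal illustration, not the contents of the referenced spreadsheet; the `salaries` and `ratings` values and all function names are made up for the example.

```python
import math
import statistics
from collections import Counter

# Hypothetical cross-sectional data: one salary ($K) and one bond rating per observation.
salaries = [170, 95, 120, 88, 143]
ratings = ["AAA", "BB", "AAA", "BB", "AAA"]

def summarize_numerical(values):
    """Central tendency and variability of a numerical variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "range": max(values) - min(values),
    }

def summarize_categorical(values):
    """Frequency count for each level of a categorical variable."""
    return dict(Counter(values))

def mean_by_category(numbers, categories):
    """Categorical vs numerical: mean of the numerical variable per category."""
    groups = {}
    for x, c in zip(numbers, categories):
        groups.setdefault(c, []).append(x)
    return {c: statistics.mean(xs) for c, xs in groups.items()}

def pearson_r(xs, ys):
    """Numerical vs numerical: Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

numeric_summary = summarize_numerical(salaries)
rating_counts = summarize_categorical(ratings)
salary_by_rating = mean_by_category(salaries, ratings)
```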
• Goal is to extract the various components of the series (for example, trend and seasonality)
Longitudinal datasets
• Given a dataset, build a model that captures the similarities in
different observations and assigns them to different buckets.
• Given a set of variables, predict the value of another variable in a
given data set
▫ Predict Salaries given work experience, education etc.
▫ Predict whether a loan would be approved given FICO score, current
loans, employment status, etc.
Predictive Analytics : Cross sectional datasets
• Given a time series dataset, build a model that can be used to
forecast values in the future
Predictive Analytics : Time series datasets
Goal
• Descriptive Statistics
▫ Cross sectional
- Numerical
- Categorical
- Numerical vs Categorical
- Categorical vs Categorical
- Numerical vs Numerical
▫ Time series
• Predictive Analytics
▫ Cross-sectional
- Segmentation
- Prediction: predict a number, predict a category
▫ Time-series
Summary
Machine Learning Algorithms
• Supervised Algorithms
▫ Given a set of variables $x_i$, predict the value of another variable $y$ in a
given data set such that $y = F(X)$
▫ If y is numeric => Prediction
▫ If y is categorical => Classification
Machine Learning
x1, x2, x3, … → Model F(X) → y
• Unsupervised Algorithms
▫ Given a dataset with variables $x_i$, build a model that captures the
similarities in different observations and assigns them to different
buckets => Clustering
Machine Learning
Obs1, Obs2, Obs3, etc. → Model → Obs1: Class 1, Obs2: Class 2, Obs3: Class 1
Supervised Learning algorithms
• Parametric models
• Non-Parametric models
Supervised learning Algorithms - Prediction
• Parametric models
▫ Assume some functional form
▫ Fit coefficients
• Examples : Linear Regression, Neural Networks
Supervised Learning models - Prediction
$Y = \beta_0 + \beta_1 X_1$
[Figures: Linear Regression Model; Neural network Model]
• Non-Parametric models
▫ No functional form assumed
• Examples : K-nearest neighbors, Decision Trees
Supervised Learning models
[Figures: K-nearest neighbor Model; Decision tree Model]
• Given estimates $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$, we can make predictions using the
formula
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots + \hat{\beta}_p x_p$
• The parameters are estimated using the least squares approach to
minimize the sum of squared errors
$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Multiple linear regression
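The least squares fit can be sketched in plain Python by solving the normal equations. This is one possible way to minimize RSS, chosen here for self-containment; the function names `fit_ols`, `predict` and `rss` and the toy data are assumptions for illustration, not code from the talk.

```python
def fit_ols(X, y):
    """Least squares estimates [b0, b1, ..., bp] via the normal equations.

    X is a list of rows, one per observation; an intercept column is added here.
    """
    rows = [[1.0] + [float(v) for v in row] for row in X]
    k = len(rows[0])
    # Augmented system (X'X | X'y).
    A = [[sum(r[a] * r[b] for r in rows) for b in range(k)]
         + [sum(r[a] * yi for r, yi in zip(rows, y))]
         for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; predictors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

def predict(beta, row):
    """y_hat = b0 + b1*x1 + ... + bp*xp for one observation."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))

def rss(beta, X, y):
    """Residual sum of squares for the fitted coefficients."""
    return sum((yi - predict(beta, row)) ** 2 for row, yi in zip(X, y))

# Toy data generated from y = 1 + 2*x1 + 3*x2 with no noise.
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [1 + 2 * a + 3 * b for a, b in X]
beta = fit_ols(X, y)
```

Because the toy data has no noise, the fitted coefficients should recover the generating values and the RSS should be close to zero.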
• Parametric models
▫ Assume some functional form
▫ Fit coefficients
• Examples : Logistic Regression, Neural Networks
Supervised Learning models - Classification
[Figures: Logistic Regression Model; Neural network Model]
• Non-Parametric models
▫ No functional form assumed
• Examples : K-nearest neighbors, Decision Trees
Supervised Learning models
[Figures: K-nearest neighbor Model; Decision tree Model]
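A non-parametric classifier assumes no functional form and predicts directly from the stored observations. A minimal K-nearest neighbors sketch follows; the function name `knn_classify`, the Euclidean metric, `k=3` and the toy loan labels are assumptions for illustration.

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-variable observations with a categorical outcome.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["rejected", "rejected", "rejected", "approved", "approved", "approved"]

label_low = knn_classify(train_X, train_y, (2, 2))
label_high = knn_classify(train_X, train_y, (6, 5))
```

In practice the variables would usually be put on comparable scales first, since the distance is dominated by whichever variable has the largest range.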
K-means clustering
• These methods partition the data into K clusters by assigning each data point to
its closest cluster centroid, minimizing the within-cluster sum of squares
(WSS), which is:
$WSS = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{j=1}^{P} (x_{ij} - \mu_{kj})^2$
where $S_k$ is the set of observations in the kth cluster and $\mu_{kj}$ is the mean of the jth
variable over the kth cluster (the jth coordinate of its cluster center).
• For anomaly detection, the top n points that are farthest from their nearest
cluster centers are then selected as outliers.
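The clustering and outlier steps can be sketched as follows. This uses Lloyd's algorithm with fixed starting centroids so the result is reproducible; the function names, starting centroids and data points are assumptions for illustration.

```python
import math

def kmeans(points, centroids, iterations=20):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)
        # New centroid = per-variable mean of the members (kept as-is if empty).
        new_centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[k]
            for k, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

def wss(centroids, clusters):
    """Within-cluster sum of squares over all clusters and variables."""
    return sum(
        (x - m) ** 2
        for c, members in zip(centroids, clusters)
        for p in members
        for x, m in zip(p, c)
    )

def top_n_outliers(points, centroids, n):
    """The n points farthest from their nearest cluster center."""
    def nearest_distance(p):
        return min(math.dist(p, c) for c in centroids)
    return sorted(points, key=nearest_distance, reverse=True)[:n]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (5, 14)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
outliers = top_n_outliers(points, centroids, n=1)
```

Real implementations typically choose starting centroids at random and rerun several times, since the result depends on the initialization.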
Anomaly Detection vs Unsupervised Learning
Distance functions
• Euclidean distance:
$D = \sqrt{(X_A - X_B)^2 + (Y_A - Y_B)^2}$
• Manhattan distance:
$D = |X_A - X_B| + |Y_A - Y_B|$
• Correlation distance:
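The three distance functions can be sketched as below. The Euclidean and Manhattan versions follow the standard definitions. The correlation distance is taken here as 1 minus the Pearson correlation, which is a common definition but an assumption, since the slide's own formula did not survive in the text.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def correlation_distance(a, b):
    """Assumed definition: 1 - Pearson correlation of the two vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1 - cov / (sd_a * sd_b)

point_a, point_b = (0, 0), (3, 4)
d_euclid = euclidean(point_a, point_b)
d_manhattan = manhattan(point_a, point_b)
```

Under this definition, two perfectly correlated vectors are at distance 0 and two perfectly anti-correlated vectors are at distance 2, regardless of their magnitudes.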
Machine Learning Algorithms
• Supervised
▫ Prediction
- Parametric: Linear Regression, Neural Networks
- Non-parametric: KNN, Decision Trees
▫ Classification
- Parametric: Logistic Regression, Neural Networks
- Non-parametric: Decision Trees, KNN
• Unsupervised algorithms
▫ K-means
▫ Associative rule mining
The Process
1. Data cleansing
2. Feature Engineering
3. Training and Testing
4. Model building
5. Model selection
• What transformations do I need for the x and y variables?
• Which are the best features to use?
▫ Dimension Reduction – PCA
▫ Best subset selection
- Forward selection
- Backward elimination
- Stepwise regression
Feature Engineering
Data
• Training: 80%
• Testing: 20%
Training the model
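An 80/20 split can be sketched with the standard library. The function name `train_test_split`, the fixed seed and the placeholder observations are assumptions for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the observations and hold out a fraction for testing."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

observations = list(range(100))  # stand-in for 100 rows of a dataset
train, test = train_test_split(observations)
```

Random shuffling suits cross-sectional data, where observations are independent. For longitudinal data, where temporal continuity is required, the split would instead keep the earlier period for training and the later period for testing.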
Evaluating Machine learning algorithms
• Supervised - Prediction
▫ R-square
▫ RMS
▫ MAE
▫ MAPE
• Supervised - Classification
▫ Confusion Matrix
▫ ROC Curves
Evaluation framework
• The prediction error for record i is defined as the difference
between its actual y value and its predicted y value
$e_i = y_i - \hat{y}_i$
• $R^2$ indicates how well the data fit the statistical model
$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
where $\bar{y}$ is the mean of the observed y values
Prediction Accuracy Measures
• Fit measures in classical regression modeling:
• Adjusted $R^2$ has been adjusted for the number of predictors. It increases only
when the improvement in the model is more than one would expect to see by chance
(p is the total number of explanatory variables)
$Adjusted\ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n - p - 1)}{\sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1)}$
• MAE or MAD (mean absolute error/deviation) gives the magnitude of the
average absolute error
$MAE = \frac{1}{n} \sum_{i=1}^{n} |e_i|$
Prediction Accuracy Measures
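These fit measures translate directly into code. The sketch below uses made-up actual and predicted values; the function names are assumptions for illustration.

```python
def r_squared(y, y_hat):
    """1 - (residual sum of squares / total sum of squares)."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, p):
    """R-squared penalized for p explanatory variables.

    Algebraically the same as 1 - [SS_res / (n - p - 1)] / [SS_tot / (n - 1)].
    """
    n = len(y)
    return 1 - (1 - r_squared(y, y_hat)) * (n - 1) / (n - p - 1)

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)

# Hypothetical actual and predicted values from a one-predictor model.
y_actual = [3, 5, 7, 9, 11]
y_predicted = [2, 5, 8, 9, 11]

r2 = r_squared(y_actual, y_predicted)
adj_r2 = adjusted_r_squared(y_actual, y_predicted, p=1)
mean_abs_error = mae(y_actual, y_predicted)
```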
• MAPE (mean absolute percentage error) gives a percentage score of
how predictions deviate on average
$MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{e_i}{y_i} \right| \times 100\%$
• RMSE (root-mean-squared error) is computed on the training and
validation data
$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} e_i^2}$
Prediction Accuracy Measures
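MAPE and RMSE can be computed the same way. The values below are made up, and the function names are assumptions for illustration. Note that MAPE is undefined when any actual value is zero.

```python
import math

def mape(y, y_hat):
    """Mean absolute percentage error, in percent. Requires every actual value != 0."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root-mean-squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y))

# Hypothetical actual and predicted values.
y_actual = [100, 200, 400]
y_predicted = [110, 180, 400]

mape_pct = mape(y_actual, y_predicted)
rmse_value = rmse(y_actual, y_predicted)
```

Computing RMSE separately on the training and validation data, as the slide suggests, gives a quick check for overfitting: a validation RMSE much larger than the training RMSE is a warning sign.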
• Consider a two-class case with classes $C_0$ and $C_1$
• Classification matrix:
Classification matrix
Actual Class | Predicted $C_0$ | Predicted $C_1$
$C_0$ | $n_{0,0}$ = number of $C_0$ cases classified correctly | $n_{0,1}$ = number of $C_0$ cases classified incorrectly as $C_1$
$C_1$ | $n_{1,0}$ = number of $C_1$ cases classified incorrectly as $C_0$ | $n_{1,1}$ = number of $C_1$ cases classified correctly
• Estimated misclassification rate (overall error rate) is a main
accuracy measure
$err = \frac{n_{0,1} + n_{1,0}}{n_{0,0} + n_{0,1} + n_{1,0} + n_{1,1}} = \frac{n_{0,1} + n_{1,0}}{n}$
• Overall accuracy:
$Accuracy = 1 - err = \frac{n_{0,0} + n_{1,1}}{n}$
Accuracy Measures
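The classification matrix and the two rates can be sketched as follows, with classes coded 0 and 1. The labels and function names are made up for illustration.

```python
def classification_matrix(actual, predicted):
    """n[i][j] = number of class-i cases that were predicted as class j."""
    n = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        n[a][p] += 1
    return n

def error_rate(n):
    """Share of cases off the diagonal (misclassified)."""
    total = n[0][0] + n[0][1] + n[1][0] + n[1][1]
    return (n[0][1] + n[1][0]) / total

def accuracy(n):
    """Share of cases on the diagonal (classified correctly)."""
    return 1 - error_rate(n)

# Hypothetical actual and predicted classes for ten cases.
actual = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

matrix = classification_matrix(actual, predicted)
```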
• The ROC curve plots the pairs {sensitivity, 1-specificity}
as the cutoff value increases from 0 to 1
• Sensitivity (also called the true positive rate, or
the recall in some fields) measures the proportion of
positives that are correctly identified (e.g., the
percentage of sick people who are correctly identified
as having the condition).
• Specificity (also called the true negative rate) measures
the proportion of negatives that are correctly identified
as such (e.g., the percentage of healthy people who are
correctly identified as not having the condition).
• Better performance is reflected by curves that are
closer to the top left corner
ROC Curve
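The points on a ROC curve can be sketched by sweeping the cutoff over predicted scores. The scores, labels and the rule "predict positive when score >= cutoff" are assumptions for illustration, and the sketch assumes both classes are present in the data.

```python
def roc_points(actual, scores, cutoffs):
    """Return (1 - specificity, sensitivity) for each cutoff value."""
    points = []
    for c in cutoffs:
        tp = sum(1 for a, s in zip(actual, scores) if a == 1 and s >= c)
        fn = sum(1 for a, s in zip(actual, scores) if a == 1 and s < c)
        tn = sum(1 for a, s in zip(actual, scores) if a == 0 and s < c)
        fp = sum(1 for a, s in zip(actual, scores) if a == 0 and s >= c)
        sensitivity = tp / (tp + fn)  # true positive rate
        specificity = tn / (tn + fp)  # true negative rate
        points.append((1 - specificity, sensitivity))
    return points

# Hypothetical labels (1 = positive) and model scores in [0, 1].
actual = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

curve = roc_points(actual, scores, cutoffs=[0.0, 0.5, 1.0])
```

With a cutoff of 0 every case is called positive, giving the top right corner of the plot, and with a cutoff of 1 none is, giving the bottom left. Points in between that sit closer to the top left indicate better performance.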
Q&A
Slides, code and details about the Apache Spark Workshop
at: http://www.analyticscertificate.com/SparkWorkshop/
Thank you!
Members & Sponsors!
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Contact
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be
distributed or used in any other publication without the prior written consent of QuantUniversity LLC.
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 

Dernier (20)

Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 

Machine learning meetup

  • 1. Machine Learning: An intuitive foundation. Location: QuantUniversity Meetup, Boston MA, August 24th 2016. Presented by: Sri Krishnamurthy, CFA, CAP. www.QuantUniversity.com, sri@quantuniversity.com. Copyright 2016 QuantUniversity LLC.
  • 2. Slides and code will be available at: http://www.analyticscertificate.com
  • 3. - Analytics Advisory services - Custom training programs - Architecture assessments, advice and audits
  • 4. Sri Krishnamurthy, Founder and CEO • Founder of QuantUniversity LLC. and www.analyticscertificate.com • Advisory and consultancy for financial analytics • Prior experience at MathWorks, Citigroup and Endeca, and with 25+ financial services and energy customers • Regular columnist for the Wilmott Magazine • Author of the forthcoming book “Financial Modeling: A case study approach”, published by Wiley • Chartered Financial Analyst and Certified Analytics Professional • Teaches analytics in the Babson College MBA program and at Northeastern University, Boston
  • 5. Quantitative Analytics and Big Data Analytics Onboarding • Trained more than 500 students in quantitative methods, data science and big data technologies using MATLAB, Python and R • Launching the Analytics Certificate Program in September
  • 7. Quantitative Analytics and Big Data Analytics Onboarding • Apply at: www.analyticscertificate.com • Program starting September 18th • Module 1: ▫ Sep 18th, 25th, Oct 2nd, 9th • Module 2: ▫ Oct 16th, 23rd, 30th, Nov 6th • Module 3: ▫ Nov 13th, 20th, Dec 4th, 11th • Capstone + Certification Ceremony ▫ Dec 18th
  • 8. Events of Interest • September ▫ 11th, 12th: Spark Workshop, Boston (www.analyticscertificate.com/SparkWorkshop), sponsored by IBM ▫ 19th, 20th: Anomaly Detection Workshop, New York (www.analyticscertificate.com/AnomalyNYC), sponsored by Microsoft
  • 10. Agenda 1. Data 2. Goals 3. Machine learning algorithms 4. Process 5. Performance Evaluation
  • 12. Dataset, Variable and Observation • Dataset: a rectangular array with rows as observations and columns as variables • Variable: a characteristic of members of a population (age, state, etc.) • Observation: the list of variable values for one member of the population
  • 13. Variables • A variable could be: ▫ Categorical: Yes/No flags; AAA, BB ratings for bonds ▫ Numerical: 35 mpg; $170K salary
  • 14. Datasets • Longitudinal ▫ Observations are dependent ▫ Temporal-continuity is required • Cross-sectional ▫ Observations are independent
  • 17. Goal • Descriptive Statistics ▫ Goal is to describe the data at hand ▫ Backward looking ▫ Statistical techniques employed here • Predictive Analytics ▫ Goal is to use historical data to build a model for prediction ▫ Forward looking ▫ Machine learning techniques employed here
  • 18. Descriptive Statistics – Cross sectional datasets • How do you summarize numerical variables? • How do you summarize categorical variables? • How do you describe variability in numerical variables? • How do you summarize relationships between categorical and numerical variables? • How do you summarize relationships between two numerical variables? (See Data Analysis Taxonomy.xlsx)
  • 19. Longitudinal datasets • Goal is to extract the various components
  • 20. Predictive Analytics: Cross sectional datasets • Given a dataset, build a model that captures the similarities in different observations and assigns them to different buckets. • Given a set of variables, predict the value of another variable in a given dataset ▫ Predict salaries given work experience, education, etc. ▫ Predict whether a loan would be approved given FICO score, current loans, employment status, etc.
  • 21. Predictive Analytics: Time series datasets • Given a time series dataset, build a model that can be used to forecast values in the future
  • 22. Summary • Goal ▫ Descriptive Statistics: Cross-sectional (Numerical; Categorical; Numerical vs Categorical; Categorical vs Categorical; Numerical vs Numerical); Time series ▫ Predictive Analytics: Cross-sectional (Segmentation; Prediction: predict a number, predict a category); Time series
  • 24. Machine Learning Algorithms • Goal ▫ Descriptive Statistics: Cross-sectional (Numerical; Categorical; Numerical vs Categorical; Categorical vs Categorical; Numerical vs Numerical); Time series ▫ Predictive Analytics: Cross-sectional (Segmentation; Prediction: predict a number, predict a category); Time series
  • 25. Machine Learning • Supervised Algorithms ▫ Given a set of variables $x_i$, predict the value of another variable $y$ in a given dataset, such that: ▫ If $y$ is numeric => Prediction ▫ If $y$ is categorical => Classification • Diagram: $x_1, x_2, x_3, \ldots$ → Model $F(X)$ → $y$
  • 26. Machine Learning • Unsupervised Algorithms ▫ Given a dataset with variables $x_i$, build a model that captures the similarities in different observations and assigns them to different buckets => Clustering • Diagram: Obs1, Obs2, Obs3, etc. → Model → Obs1: Class 1; Obs2: Class 2; Obs3: Class 1
  • 28. Supervised Learning models - Prediction • Parametric models ▫ Assume some functional form ▫ Fit coefficients • Examples: Linear Regression, Neural Networks • Linear regression model: $Y = \beta_0 + \beta_1 X_1$ (Figures: linear regression model; neural network model)
  • 29. Supervised Learning models • Non-parametric models ▫ No functional form assumed • Examples: K-nearest neighbors, Decision Trees (Figures: K-nearest neighbor model; decision tree model)
  • 30. Multiple linear regression • Given estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$, we can make predictions using the formula $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p$ • The parameters are estimated using the least squares approach, which minimizes the residual sum of squares $RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
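The least squares fit on slide 30 can be sketched in plain Python. This is only an illustration, not the code from the talk: the function names (`fit_least_squares`, `predict`, `rss`) and the toy data are hypothetical, and solving the normal equations by Gauss-Jordan elimination is one of several ways to minimize the RSS.

```python
def fit_least_squares(X, y):
    """Return [b0, b1, ..., bp] that minimizes RSS, via the normal equations."""
    n, p = len(X), len(X[0])
    # Design matrix with a leading column of ones for the intercept b0.
    A = [[1.0] + [float(v) for v in row] for row in X]
    m = p + 1
    # Augmented system [A^T A | A^T y].
    M = [[sum(A[i][r] * A[i][c] for i in range(n)) for c in range(m)]
         + [sum(A[i][r] * y[i] for i in range(n))] for r in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(m):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[r][m] / M[r][r] for r in range(m)]


def predict(beta, x):
    """y_hat = b0 + b1*x1 + ... + bp*xp for one observation x."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


def rss(beta, X, y):
    """Residual sum of squares of the fitted model on (X, y)."""
    return sum((yi - predict(beta, xi)) ** 2 for xi, yi in zip(X, y))


# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (an assumption for the demo).
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [1 + 2 * a + 3 * b for a, b in X]
beta = fit_least_squares(X, y)
```

Because the toy data contain no noise, the fitted coefficients should recover the generating values up to floating-point error, and the RSS should be close to zero.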
  • 31. Supervised Learning models - Classification • Parametric models ▫ Assume some functional form ▫ Fit coefficients • Examples: Logistic Regression, Neural Networks (Figures: logistic regression model; neural network model)
  • 32. Supervised Learning models • Non-parametric models ▫ No functional form assumed • Examples: K-nearest neighbors, Decision Trees (Figures: K-nearest neighbor model; decision tree model)
  • 33. Machine Learning • Unsupervised Algorithms ▫ Given a dataset with variables $x_i$, build a model that captures the similarities in different observations and assigns them to different buckets => Clustering • Diagram: Obs1, Obs2, Obs3, etc. → Model → Obs1: Class 1; Obs2: Class 2; Obs3: Class 1
  • 34. K-means clustering • These methods partition the data into $K$ clusters by assigning each data point to its closest cluster centroid, minimizing the within-cluster sum of squares (WSS): $\sum_{k=1}^{K} \sum_{i \in S_k} \sum_{j=1}^{P} (x_{ij} - \mu_{kj})^2$, where $S_k$ is the set of observations in the $k$th cluster and $\mu_{kj}$ is the mean of the $j$th variable at the center of the $k$th cluster. • Then, they select the top $n$ points that are the farthest away from their nearest cluster centers as outliers.
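A minimal sketch of the procedure on slide 34, assuming Lloyd's alternating assign/update iteration (the slide does not say which optimizer is used) and caller-supplied starting centers. The names `kmeans`, `wss` and `top_n_outliers` and the sample points are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def kmeans(points, init_centers, n_iter=100):
    """Lloyd's iteration: assign to the nearest center, then recompute means."""
    centers = [list(c) for c in init_centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the closest center for every point.
        labels = [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[k])  # keep the center of an empty cluster
        if new_centers == centers:  # converged: centers no longer move
            break
        centers = new_centers
    return centers, labels


def wss(points, centers, labels):
    """Within-cluster sum of squares for a given assignment."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


def top_n_outliers(points, centers, labels, n):
    """The n points farthest from their own cluster center."""
    ranked = sorted(range(len(points)),
                    key=lambda i: sq_dist(points[i], centers[labels[i]]),
                    reverse=True)
    return [points[i] for i in ranked[:n]]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (4, 4)]
centers, labels = kmeans(points, init_centers=[points[0], points[3]])
outliers = top_n_outliers(points, centers, labels, n=1)
```

With these starting centers the point (4, 4) joins the cluster near the origin but sits far from its center, so it is the one flagged as an outlier. Results of K-means generally depend on the initial centers, which is why they are passed in explicitly here.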
  • 35. Anomaly Detection vs Unsupervised Learning
  • 37. Distance functions • Manhattan distance: $D = |X_A - X_B| + |Y_A - Y_B|$
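As a small illustration of the Manhattan distance on slide 37, the sketch below sums the absolute coordinate differences. It generalizes the two-coordinate formula to any number of dimensions, which is an extension beyond what the slide states; the function name is hypothetical.

```python
def manhattan_distance(a, b):
    """Sum of absolute coordinate differences between points a and b."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(u - v) for u, v in zip(a, b))


# Points A = (1, 2) and B = (4, 6): |1 - 4| + |2 - 6| = 3 + 4 = 7
d = manhattan_distance((1, 2), (4, 6))
```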
  • 39. Machine Learning Algorithms • Machine Learning ▫ Supervised: Prediction (Parametric: Linear Regression, Neural Networks; Non-parametric: KNN, Decision Trees); Classification (Parametric: Logistic Regression, Neural Networks; Non-parametric: Decision Trees, KNN) ▫ Unsupervised algorithms: K-means, Association rule mining
  • 42. Feature Engineering • What transformations do I need for the x and y variables? • Which are the best features to use? ▫ Dimension reduction: PCA ▫ Best subset selection: forward selection, backward elimination, stepwise regression
  • 45. Evaluation framework • Evaluating machine learning algorithms ▫ Supervised – Prediction: R-square, RMSE, MAE, MAPE ▫ Supervised – Classification: Confusion Matrix, ROC Curves
  • 46. Prediction Accuracy Measures • The prediction error for record $i$ is defined as the difference between its actual $y$ value and its predicted $y$ value: $e_i = y_i - \hat{y}_i$ • $R^2$ indicates how well the data fit the statistical model: $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$, where $\bar{y}$ is the mean of the actual values
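The $R^2$ definition on slide 46 translates directly into code. The sketch below uses a hypothetical function name and made-up actual and predicted values purely for illustration.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (sum of squared errors) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


y_true = [3, 5, 7, 9]          # actual values (made up)
y_pred = [2.5, 5.5, 7.0, 9.0]  # model predictions (made up)

# Errors e_i are 0.5, -0.5, 0, 0, so the squared errors sum to 0.5;
# the mean of y_true is 6, so the total sum of squares is 9 + 1 + 1 + 9 = 20.
r2 = r_squared(y_true, y_pred)  # 1 - 0.5 / 20 = 0.975
```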
  • 47. Prediction Accuracy Measures • Fit measures in classical regression modeling: • Adjusted $R^2$ has been adjusted for the number of predictors. It increases only when the improvement in the model is more than one would expect to see by chance ($p$ is the total number of explanatory variables): $\text{Adjusted } R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n - p - 1)}{\sum_{i=1}^{n} (y_i - \bar{y})^2 / (n - 1)}$ • MAE or MAD (mean absolute error/deviation) gives the magnitude of the average absolute error: $MAE = \frac{1}{n} \sum_{i=1}^{n} |e_i|$
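A short sketch of the two measures on slide 47, with hypothetical function names and made-up values; `p` is the number of explanatory variables, as on the slide.

```python
def adjusted_r_squared(y_true, y_pred, p):
    """Adjusted R^2 for a model with p explanatory variables."""
    n = len(y_true)
    if n - p - 1 <= 0:
        raise ValueError("need more observations than parameters")
    mean_y = sum(y_true) / n
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))


def mean_absolute_error(y_true, y_pred):
    """MAE = average of |e_i| = |y_i - y_hat_i|."""
    return sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / len(y_true)


y_true = [3, 5, 7, 9]          # actual values (made up)
y_pred = [2.5, 5.5, 7.0, 9.0]  # predictions from a one-predictor model (made up)

adj_r2 = adjusted_r_squared(y_true, y_pred, p=1)  # 1 - (0.5 / 2) / (20 / 3) = 0.9625
mae = mean_absolute_error(y_true, y_pred)         # (0.5 + 0.5 + 0 + 0) / 4 = 0.25
```

On the same data the unadjusted $R^2$ is 0.975, so the adjustment for one predictor and four observations lowers it slightly.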
  • 48. Prediction Accuracy Measures • MAPE (mean absolute percentage error) gives a percentage score of how predictions deviate on average: $MAPE = \frac{1}{n} \sum_{i=1}^{n} |e_i / y_i| \times 100\%$ • RMSE (root-mean-squared error) is computed on the training and validation data: $RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} e_i^2}$
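MAPE and RMSE from slide 48 can be sketched the same way. Function names and data are hypothetical; note that MAPE is undefined when any actual value is zero, which the sketch checks explicitly.

```python
import math


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if any(y == 0 for y in y_true):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * sum(abs((y - yh) / y) for y, yh in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-squared error."""
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true))


y_true = [3, 5, 7, 9]          # actual values (made up)
y_pred = [2.5, 5.5, 7.0, 9.0]  # model predictions (made up)

mape_pct = mape(y_true, y_pred)  # (0.5/3 + 0.5/5) / 4 * 100, about 6.67
rmse_val = rmse(y_true, y_pred)  # sqrt(0.5 / 4), about 0.354
```

In practice the same functions would be called once on the training data and once on the validation data, so the two RMSE values can be compared.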
  • 49. Classification matrix • Consider a two-class case with classes $C_0$ and $C_1$ • Classification matrix (rows: actual class; columns: predicted class): ▫ Actual $C_0$, predicted $C_0$: $n_{0,0}$ = number of $C_0$ cases classified correctly ▫ Actual $C_0$, predicted $C_1$: $n_{0,1}$ = number of $C_0$ cases classified incorrectly as $C_1$ ▫ Actual $C_1$, predicted $C_0$: $n_{1,0}$ = number of $C_1$ cases classified incorrectly as $C_0$ ▫ Actual $C_1$, predicted $C_1$: $n_{1,1}$ = number of $C_1$ cases classified correctly
  • 50. Accuracy Measures • Estimated misclassification rate (overall error rate) is a main accuracy measure: $err = \frac{n_{0,1} + n_{1,0}}{n_{0,0} + n_{0,1} + n_{1,0} + n_{1,1}} = \frac{n_{0,1} + n_{1,0}}{n}$ • Overall accuracy: $Accuracy = 1 - err = \frac{n_{0,0} + n_{1,1}}{n}$
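The classification matrix and the two rates on slides 49 and 50 can be sketched as follows. Classes are encoded as the integers 0 and 1 (for $C_0$ and $C_1$), which is an assumption of this example, as are the function names and the label lists.

```python
def classification_matrix(actual, predicted):
    """Return n where n[a][p] counts cases of actual class a predicted as class p."""
    n = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        n[a][p] += 1
    return n


def error_rate(n):
    """Overall misclassification rate: (n01 + n10) / n."""
    total = n[0][0] + n[0][1] + n[1][0] + n[1][1]
    return (n[0][1] + n[1][0]) / total


actual    = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # made-up true classes
predicted = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]  # made-up model output

n = classification_matrix(actual, predicted)  # [[3, 1], [2, 4]]
err = error_rate(n)                           # (1 + 2) / 10 = 0.3
accuracy = 1 - err                            # (3 + 4) / 10 = 0.7
```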
  • 51. ROC Curve • The ROC curve plots the pairs {sensitivity, 1 − specificity} as the cutoff value increases from 0 to 1 • Sensitivity (also called the true positive rate, or the recall in some fields) measures the proportion of positives that are correctly identified (e.g., the percentage of sick people who are correctly identified as having the condition). • Specificity (also called the true negative rate) measures the proportion of negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition). • Better performance is reflected by curves that are closer to the top left corner
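To make the ROC construction on slide 51 concrete, the sketch below computes one (1 − specificity, sensitivity) pair per cutoff. It assumes a record is classified as positive when its score is greater than or equal to the cutoff; the function name, labels and scores are hypothetical.

```python
def roc_points(labels, scores, cutoffs):
    """Return a list of (1 - specificity, sensitivity) pairs, one per cutoff."""
    pos = sum(1 for lab in labels if lab == 1)
    neg = len(labels) - pos
    points = []
    for c in cutoffs:
        # Count positives and negatives that are classified as positive.
        tp = sum(1 for lab, s in zip(labels, scores) if lab == 1 and s >= c)
        fp = sum(1 for lab, s in zip(labels, scores) if lab == 0 and s >= c)
        sensitivity = tp / pos
        false_positive_rate = fp / neg  # equals 1 - specificity
        points.append((false_positive_rate, sensitivity))
    return points


labels = [0, 0, 1, 1]             # made-up true classes
scores = [0.1, 0.4, 0.35, 0.8]    # made-up predicted probabilities of class 1
cutoffs = [0.0, 0.25, 0.5, 0.75, 1.0]

curve = roc_points(labels, scores, cutoffs)
# [(1.0, 1.0), (0.5, 1.0), (0.0, 0.5), (0.0, 0.5), (0.0, 0.0)]
```

As the cutoff rises from 0 to 1, the points move from the top right corner (everything classified positive) to the bottom left (nothing classified positive).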
  • 52. Agenda 1. Data 2. Goals 3. Machine learning algorithms 4. Process 5. Performance Evaluation
  • 54. Goal • Goal ▫ Descriptive Statistics: Cross-sectional (Numerical; Categorical; Numerical vs Categorical; Categorical vs Categorical; Numerical vs Numerical); Time series ▫ Predictive Analytics: Cross-sectional (Segmentation; Prediction: predict a number, predict a category); Time series
  • 55. Machine Learning Algorithms • Machine Learning ▫ Supervised: Prediction (Parametric: Linear Regression, Neural Networks; Non-parametric: KNN, Decision Trees); Classification (Parametric: Logistic Regression, Neural Networks; Non-parametric: Decision Trees, KNN) ▫ Unsupervised algorithms: K-means, Association rule mining
  • 57. Evaluation framework • Evaluating machine learning algorithms ▫ Supervised – Prediction: R-square, RMSE, MAE, MAPE ▫ Supervised – Classification: Confusion Matrix, ROC Curves
  • 61. Q&A • Slides, code and details about the Apache Spark Workshop at: http://www.analyticscertificate.com/SparkWorkshop/
  • 62. Thank you! Members & Sponsors! • Contact: Sri Krishnamurthy, CFA, CAP, Founder and CEO, QuantUniversity LLC. srikrishnamurthy, www.QuantUniversity.com • Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.