Machine Learning and NLP
• Machine Learning Review
▪ Supervised vs. Unsupervised Learning, Classification vs. Regression
▪ Simple Classification Example
• Text Classification Examples
▪ Logistic Regression
▪ Naive Bayes
▪ Comparing Methods: Classification Metrics
What is Machine Learning?
Machine learning allows computers to learn and infer from data: algorithms are applied to observed data and then used to make predictions about new data.
Machine Learning Vocabulary
• Target/Outcome: predicted category or value
of the data (column to predict)
• Features: properties of the data used for
prediction (non-target columns)
• Example: a single data point within the data
(one row)
• Label: the target value for a single data point
Types of Machine Learning
• Supervised: the data points have a known outcome. We train the model on data with the correct answers; the model learns from these and then predicts the outcome for new data.
• Unsupervised: the data points have an unknown outcome. No right answers are provided to the model; it has to make sense of the data on its own, and it can teach you something about the dataset you were probably not aware of.
Types of Supervised and Unsupervised Learning
• Supervised: Classification, Regression
• Unsupervised: Clustering, Recommendation
Supervised Learning
In supervised learning, the data points have a known outcome.
Supervised Learning Example: Housing Prices
[Figure: a table of housing data annotated with the vocabulary above: each non-price column is a Feature, the price column is the Target/Outcome, each row is an Example, and a single price value is a Label.]
Machine Learning and NLP
• Machine Learning Review
▪ Supervised vs. Unsupervised Learning
▪ Classification vs. Regression
• Text Classification Examples
▪ Logistic Regression
▪ Naive Bayes
▪ Comparing Methods: Classification Metrics
Types of Supervised Learning
• Regression: outcome is continuous (numerical)
• Classification: outcome is a category
Regression vs Classification Problems
Some supervised learning questions we can ask regarding movie data:
• Regression Questions
▪ Can you predict the gross earnings for a movie? $1.1 billion
• Classification Questions
▪ Can you predict whether a movie will win an Oscar or not? Yes | No
▪ Can you predict what the MPAA rating is for a movie? G | PG | PG-13 | R
Regression
Predict a real numeric value for an entity with a given set of features.
[Figure: property attributes (address, type, age, parking, school, transit, total sqft, lot size, bathrooms, bedrooms, yard, pool, fireplace) fed into a linear regression model that predicts price ($) as a function of sqft.]
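To make this concrete, here is a minimal sketch of the idea: fitting a linear regression that predicts price from square footage. The data points are invented for illustration only.

# Minimal sketch: linear regression predicting price ($) from total sqft.
# The numbers below are made up for illustration, not real housing data.
from sklearn.linear_model import LinearRegression

sqft  = [[850], [900], [1200], [1500], [2000]]       # feature: total sqft
price = [100000, 110000, 150000, 185000, 250000]     # target: sale price

reg = LinearRegression().fit(sqft, price)
print(reg.predict([[1100]]))   # predicted price for an unseen 1,100 sqft home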
Supervised Learning Overview
[Diagram: fit step: data with answers + model → fitted model; predict step: fitted model + data without answers → predicted answers.]
Classification: Categorical Answers
[Diagram: fit step: emails labeled as spam/not spam + model → fitted model; predict step: fitted model + unlabeled emails → spam or not spam.]
Machine Learning and NLP
• Machine Learning Review
▪ Supervised vs. Unsupervised Learning, Classification vs. Regression
▪ Classification Example
• Text Classification Examples
▪ Logistic Regression
▪ Naive Bayes
▪ Comparing Methods: Classification Metrics
Logistic Regression
One of the most popular machine learning techniques for binary classification.
Binary classification = how do you best split this data into two groups?
[Running example plot: movies plotted by Number of Animals (x-axis), labeled Kid-Friendly vs. Not Kid-Friendly.]

The most basic regression technique is linear regression:
y = β1x + β0
Problem: the y values of the line go from -∞ to +∞.
Let's try applying a transformation to the line to limit the y values to between 0 and 1.

The sigmoid function (aka the "S" curve) solves this problem:
p = 1 / (1 + e^(-(β1x + β0)))
If you input the Number of Animals (x), the equation gives you the probability that the movie is Kid-Friendly (p), which is a number between 0 and 1.
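As a quick illustration of how the sigmoid squashes the line's output into a probability, here is a small sketch with hypothetical (not fitted) coefficients:

import numpy as np

def sigmoid(z):
    # maps any real number to a value between 0 and 1
    return 1 / (1 + np.exp(-z))

b1, b0 = 1.2, -3.0                     # hypothetical coefficients, for illustration
animals = np.array([0, 1, 2, 3, 5])    # number of animals in each movie
p_kid_friendly = sigmoid(b1 * animals + b0)
print(p_kid_friendly)                  # probabilities climb toward 1 as x grows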
Logistic Regression Examples
Predict a label for an entity with a given set of features. Examples: spam prediction, sentiment analysis.
Steps for classification with NLP
1. Prepare the data: Read in labeled data and preprocess it
2. Split the data: Separate inputs and outputs, then split each into a training set and a test set
3. Numerically encode inputs: Use a Count Vectorizer or TF-IDF Vectorizer
4. Fit a model: Fit a model on the training data and apply the fitted model to the test set
5. Evaluate the model: Decide how good the model is by calculating various error metrics
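Taken together, the five steps look roughly like this in scikit-learn. This is only a compact sketch of the walkthrough that follows; it assumes the same SMSSpamCollection.txt file used below.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# 1. prepare the data
data = pd.read_table('SMSSpamCollection.txt', header=None, names=['label', 'text'])
# 2. split into inputs/outputs and into train/test sets
X_train, X_test, y_train, y_test = train_test_split(data.text, data.label, test_size=0.3)
# 3. numerically encode the text
cv = CountVectorizer(stop_words='english')
X_train_cv, X_test_cv = cv.fit_transform(X_train), cv.transform(X_test)
# 4. fit a model on the training data
model = LogisticRegression().fit(X_train_cv, y_train)
# 5. evaluate on the test set
print(f1_score(y_test, model.predict(X_test_cv), pos_label='spam'))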
Building a Logistic Regression model
A classic use of text analytics is to flag messages as spam.
Below is a sample of the SMS Spam Collection dataset, a set of over 5K English text messages that have been labeled as spam or ham (legitimate).
Step 1: Prepare the data

Text Message | Label
Nah I don't think he goes to usf, he lives around here though | ham
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's | spam
WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only. | spam
I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today. | ham
I HAVE A DATE ON SUNDAY WITH WILL!! | ham
… | …
Step 1: Prepare the data [Code]

Input:
# make sure the data is labeled
import pandas as pd
data = pd.read_table('SMSSpamCollection.txt', header=None)
data.columns = ['label', 'text']
print(data.head())  # print function requires Python 3

Output:
[The first five rows of the DataFrame: a 'label' column (ham/spam) and a 'text' column.]
Step 1: Prepare the data [Code]

Input:
# remove words containing numbers, then remove punctuation and capital letters
import re
import string
alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', x)  # drop any word that contains a digit
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())
data['text'] = data.text.map(alphanumeric).map(punc_lower)
print(data.head())

Output:
[The same five rows with the cleaned, lowercase 'text' column.]
To fit a model, the data needs to be split into inputs and outputs
The inputs and output of these models have various names
• Inputs: Features, Predictors, Independent Variables, X’s
• Outputs: Outcome, Response, Dependent Variable, Y
Step 2: Split the data (into inputs and outputs)
# label congrats eat tonight winner chicken dinner wings
0 ham 0 1 0 0 0 0 0
1 ham 0 1 1 0 0 0 0
2 spam 0 0 0 1 0 0 0
. … … … … … … … …
Step 2: Split the data [Code]

Input:
# split the data into inputs and outputs
X = data.text   # inputs into model
y = data.label  # output of model
Step 2: Split the data (into a training and test set)
Why do we need to split data into training and test sets?
• Let’s say we had a data set with 100 observations and we found a model that fit
the data perfectly
• What if you were to use that model on a brand new data set?
[Figure: the classic overfitting illustration; the blue curve overfits the data points, the black curve is the correct fit. Source: https://en.wikipedia.org/wiki/Overfitting]
Step 2: Split the data (into a training and test set)
To prevent the issue of overfitting, we divide observations into two sets
• A model is fit on the training data and it is evaluated on the test data
• This way, you can see if the model generalizes well
label congrats eat tonight winner chicken dinner wings
0 ham 0 1 0 0 0 0 0
1 ham 0 1 1 0 0 0 0
2 spam 0 0 0 1 0 0 0
3 spam 1 0 0 0 0 0 0
4 ham 0 0 0 0 0 1 0
5 ham 0 0 1 0 0 0 0
6 ham 0 0 0 0 0 0 0
7 spam 0 0 0 0 0 0 0
8 ham 0 0 0 0 0 1 0
9 ham 0 0 0 0 1 1 0
10 spam 0 0 0 0 0 0 0
11 ham 0 0 0 0 0 0 1
(In the slide, the top ~70-80% of the rows form the Training Set and the bottom ~20-30% form the Test Set.)
Step 2: Split the data [Code]

Input:
# split the data into a training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# test size = 30% of observations, which means training size = 70% of observations
# random state = 42, so we all get the same random train / test split
Step 3: Numerically encode the input data [Code]

Input:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english')
X_train_cv = cv.fit_transform(X_train)  # fit_transform learns the vocab and encodes each message as word counts
X_test_cv = cv.transform(X_test)        # transform encodes using that same vocab

# print the dimensions of the training set (text messages, terms)
print(X_train_cv.toarray().shape)

Output:
(3900, 6103)
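To see what CountVectorizer actually produces, here is a toy example on two made-up messages; each row is a message and each column counts one vocabulary word:

from sklearn.feature_extraction.text import CountVectorizer

toy = ['winner winner chicken dinner', 'eat chicken tonight']
cv_toy = CountVectorizer()
matrix = cv_toy.fit_transform(toy)

print(cv_toy.get_feature_names_out())   # ['chicken' 'dinner' 'eat' 'tonight' 'winner']
print(matrix.toarray())
# [[1 1 0 0 2]
#  [1 0 1 1 0]]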
Step 4: Fit model and predict outcomes [Code]

Input:
# Use a logistic regression model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

# Train the model
lr.fit(X_train_cv, y_train)

# Take the model that was trained on the X_train_cv data and apply it to the X_test_cv data
y_pred_cv = lr.predict(X_test_cv)
y_pred_cv  # The output is all of the predictions

Output:
array(['ham', 'ham', 'ham', ..., 'ham', 'spam', 'ham'], dtype=object)
Step 5: Evaluate the model
After fitting a model on the training data and predicting outcomes for the test data, how do you know if the model is a good fit?

Consider ten test messages (spam = positive class, ham = negative class):

#   Actual  Predicted  Result
1   ham     ham        true negative
2   ham     ham        true negative
3   spam    spam       true positive
4   spam    spam       true positive
5   ham     ham        true negative
6   ham     spam       false positive
7   ham     ham        true negative
8   ham     ham        true negative
9   ham     ham        true negative
10  spam    spam       true positive

Confusion Matrix (rows = actual, columns = predicted):

               Predicted ham        Predicted spam
Actual ham     True Negative (6)    False Positive (1)
Actual spam    False Negative (0)   True Positive (3)
Step 5: Evaluate the model
From the confusion matrix above, we can calculate error metrics (P = Precision, R = Recall):
• Accuracy = (TP + TN) / All = (3 + 6) / 10 = 0.9
• Precision = TP / (TP + FP) = 3 / (3 + 1) = 0.75
• Recall = TP / (TP + FN) = 3 / (3 + 0) = 1
• F1 Score = 2*(P*R)/(P+R) = 2*(0.75*1)/(0.75+1) ≈ 0.86
Step 5: Evaluate the model [Code]
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
cm = confusion_matrix(y_test, y_pred_cv)
sns.heatmap(cm, xticklabels=['predicted_ham', ‘predicted_spam'], yticklabels=['actual_ham', ‘actual_spam'],
annot=True, fmt='d', annot_kws={'fontsize':20}, cmap=“YlGnBu");
true_neg, false_pos = cm[0]
false_neg, true_pos = cm[1]
accuracy = round((true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg),3)
precision = round((true_pos) / (true_pos + false_pos),3)
recall = round((true_pos) / (true_pos + false_neg),3)
f1 = round(2 * (precision * recall) / (precision + recall),3)
print('Accuracy: {}'.format(accuracy))
print('Precision: {}'.format(precision))
print('Recall: {}'.format(recall))
print(‘F1 Score: {}’.format(f1))
Input:
Step 5: Evaluate the model [Code]
Output:
Accuracy: 0.986
Precision: 1.0
Recall: 0.893
F1 Score: 0.943
Logistic Regression Checkpoint

What was our original goal?
To classify text messages as spam or ham

How did we do that?
By collecting labeled data, cleaning the data, splitting it into a training and test set, numerically encoding it using Count Vectorizer, fitting a logistic regression model on the training data, and evaluating the results of the model applied to the test set

Was it a good model?
It seems good, but let's see if we can get better metrics with another classification technique, Naive Bayes
Machine Learning and NLP
• Machine Learning Review
▪ Supervised vs. Unsupervised Learning
▪ Classification vs. Regression
• Text Classification Examples
▪ Logistic Regression
▪ Naive Bayes
▪ Comparing Methods: Classification Metrics
Naive Bayes
One of the simpler and less computationally intensive classification techniques, Naive Bayes tends to perform well on text classification.
1. Conditional Probability and Bayes' Theorem
2. Independence and Naive Bayes
3. Apply Naive Bayes to the Spam Example and Compare with Logistic Regression
Naive Bayes: Bayes' Theorem
Conditional Probability = what's the probability that something will happen, given that something else has happened?
Spam Example = what's the probability that this text message is spam, given that it contains the word "cash"?

Bayes' Theorem: P(A|B) = P(B|A) x P(A) / P(B)

Applied to the example: P(spam | cash) = P(cash | spam) x P(spam) / P(cash)
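Plugging hypothetical numbers into the spam version of the formula (these probabilities are invented for illustration, not estimated from the dataset):

# Hypothetical probabilities, for illustration only:
p_cash_given_spam = 0.30   # P(cash | spam): "cash" appears in 30% of spam messages
p_spam = 0.20              # P(spam): 20% of all messages are spam
p_cash = 0.10              # P(cash): "cash" appears in 10% of all messages

p_spam_given_cash = p_cash_given_spam * p_spam / p_cash
print(p_spam_given_cash)   # 0.6 -- seeing "cash" raises the spam probability from 0.2 to 0.6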
Naive Bayes: Independent Events
Naive Bayes assumes that each event is independent, i.e., that it has no effect on other events. This is a naive assumption, but it provides a simplified approach.
Spam Example: Naive Bayes assumes that each word in a spam message (like "cash" and "winner") is unrelated to the others. This is unrealistic, but it's the naive assumption.

P(spam | "winner of some cash") ∝ P(winner | spam) x P(of | spam) x P(some | spam) x P(cash | spam) x P(spam)
P(ham | "winner of some cash") ∝ P(winner | ham) x P(of | ham) x P(some | ham) x P(cash | ham) x P(ham)

(The shared denominator P(winner of some cash) is dropped, since it is the same for both classes.) The higher probability wins, as sketched below.
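A small sketch of this comparison with invented word likelihoods. In practice, MultinomialNB estimates these from training counts (with smoothing) and works with log-probabilities to avoid underflow.

# Hypothetical per-class word likelihoods and priors, for illustration only
p_word_given_spam = {'winner': 0.05, 'of': 0.02, 'some': 0.01, 'cash': 0.04}
p_word_given_ham  = {'winner': 0.001, 'of': 0.03, 'some': 0.02, 'cash': 0.002}
p_spam, p_ham = 0.2, 0.8

message = ['winner', 'of', 'some', 'cash']

score_spam, score_ham = p_spam, p_ham
for word in message:
    score_spam *= p_word_given_spam[word]   # multiply likelihoods (independence assumption)
    score_ham  *= p_word_given_ham[word]

print('spam' if score_spam > score_ham else 'ham')   # the higher score wins -> 'spam'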
Naive Bayes: Fit model [Code]

Input:
# Use a Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

# Train the model
nb.fit(X_train_cv, y_train)

# Take the model that was trained on the X_train_cv data and apply it to the X_test_cv data
y_pred_cv_nb = nb.predict(X_test_cv)
y_pred_cv_nb  # The output is all of the predictions

Output:
array(['ham', 'ham', 'ham', ..., 'ham', 'spam', 'ham'], dtype='<U4')
Naive Bayes: Results
Accuracy: 0.986
Precision: 0.939
Recall: 0.952
F1 Score: 0.945
Compare Logistic Regression and Naive Bayes

Logistic Regression
Accuracy: 0.986
Precision: 1.0
Recall: 0.893
F1 Score: 0.943

Naive Bayes
Accuracy: 0.986
Precision: 0.939
Recall: 0.952
F1 Score: 0.945

Both models reach the same accuracy. Logistic regression has perfect precision (it never flags ham as spam), while Naive Bayes catches more of the spam (higher recall) and edges ahead on F1 score.
Machine Learning and NLP Review
• Machine Learning Review
▪ Supervised Learning > Classification Techniques > Logistic Regression & Naive Bayes
▪ Error Metrics for Classification > Accuracy | Precision | Recall | F1 Score
• Classification with NLP
▪ Make sure the data is labeled and split into a training and test set
▪ Clean the data and use a count vectorizer to put it in a numeric format for modeling
▪ Fit the model on the training data, apply the model to the test data and evaluate