This presentation covers NLP classifiers: feature extraction with tfidf and word2vec, and deep learning models such as feedforward neural networks, CNNs, and siamese networks. Details on important metrics such as precision, recall, and AUC are also given.
2. OUTLINE
Models
• Tfidf features
• Word2vec features
• Simple feedforward NN classifier
• CNN
• Word based
• Character based
• Siamese Networks
Metrics
3. Text Classification
Pipeline: Text Pre-processing → Collecting Training Data → Model Building → Model Evaluation (offline, with SME input); the trained model then serves users online.
Text Pre-processing
• Reduces noise
• Ensures quality
• Improves overall performance
Collecting Training Data
• Training data collection / examples of the classes that we are trying to model
• Model performance is directly correlated with the quality of the training data
Model Building
• Model selection
• Architecture
• Parameter tuning
4. Preprocessing (project specific)
Applicable for text-based classification:
• Removing special characters.
• Cleaning numbers.
• Correcting misspellings:
Peter Norvig's spell checker: https://norvig.com/spell-correct.html
Using the Google word2vec vocabulary to identify misspelled words: https://mlwhiz.com/blog/2019/01/17/deeplearning_nlp_preprocess/
• Expanding contracted words --- contraction_dict = {"ain't": "is not", "aren't": "are not", "can't": "cannot", …} (see the sketch below).
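A minimal preprocessing sketch in Python, assuming these steps are applied before feature extraction; the helper names and regexes are illustrative, not from the deck:

import re

contraction_dict = {"ain't": "is not", "aren't": "are not", "can't": "cannot"}

def expand_contractions(text, mapping=contraction_dict):
    # Replace each contraction with its expanded form.
    for contraction, expanded in mapping.items():
        text = text.replace(contraction, expanded)
    return text

def clean_text(text):
    text = expand_contractions(text.lower())
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove special characters
    text = re.sub(r"\d+", " ", text)           # clean numbers
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Ain't this a #1 example?"))  # -> "is not this a example"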
5. TFIDF Features
• ngram_range: (1,3) --- implies unigrams,
bigrams, and trigrams will be taken into account
while creating features.
• min_df: Minimum number of times an n-gram must appear in the corpus to be used as a feature.
Tfidf features can be used with any ML classifier, such as logistic regression (LR).
When using LR for NLP tasks, L1 regularization tends to perform better since tfidf features are sparse (see the sketch below).
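A minimal sketch of this setup with scikit-learn; the training variables (train_texts, train_labels, test_texts) and the min_df value are assumptions for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), min_df=5),            # unigrams to trigrams
    LogisticRegression(penalty="l1", solver="liblinear"),     # L1 works well on sparse tfidf
)
clf.fit(train_texts, train_labels)        # assumed lists of documents and labels
probs = clf.predict_proba(test_texts)     # assumed held-out documents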
6. Transfer Learning – word2vec features
Word2vec learns word vectors either by using the context to predict a target word (a method known as continuous bag of words, or CBOW), or by using a word to predict a target context, which is called skip-gram.
https://medium.com/@zafaralibagh6/a-simple-word2vec-tutorial-61e64e38a6a1
Applying tfidf weighting to word vectors boosts overall model performance
https://towardsdatascience.com/supercharging-word-vectors-be80ee5513d
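A rough sketch of tfidf-weighted averaging of word vectors, assuming a gensim KeyedVectors model w2v and a dict of per-word tfidf weights; the names and the exact weighting scheme are illustrative, not from the deck:

import numpy as np

def doc_vector(tokens, w2v, tfidf_weights, dim=300):
    # Weight each word vector by its tfidf weight, then average over the document.
    vecs, weights = [], []
    for tok in tokens:
        if tok in w2v:
            vecs.append(w2v[tok])
            weights.append(tfidf_weights.get(tok, 1.0))
    if not vecs:
        return np.zeros(dim)
    return np.average(vecs, axis=0, weights=weights)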
12. Loss is minimized using Gradient Descent
• Find the network parameters such that the loss is minimized.
• This is done by taking derivatives of the loss with respect to the parameters.
• Next, the parameters are updated by subtracting the learning rate times the derivative (see the sketch below).
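A toy numpy sketch of the update rule; the parameter and gradient values are made up for illustration:

import numpy as np

w = np.array([2.0, -1.0])        # current parameters
grad = np.array([0.5, -0.3])     # derivative of the loss with respect to w
learning_rate = 0.1

w = w - learning_rate * grad     # new parameters = old parameters - lr * gradient
print(w)                         # [ 1.95 -0.97]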
13. Commonly used loss functions
Regression Loss Functions
• Mean Squared Error Loss
• Mean Squared Logarithmic Error Loss
• Mean Absolute Error Loss
Binary Classification Loss Functions
• Binary Cross-Entropy
• Hinge Loss
• Squared Hinge Loss
Multi-Class Classification Loss Functions
• Multi-Class Cross-Entropy Loss
• Sparse Multiclass Cross-Entropy Loss
• Kullback-Leibler Divergence Loss
15. Dropout -- avoid overfitting
• Large weights in a neural network are a
sign of a more complex network that has
overfit the training data.
• Probabilistically dropping out nodes in the
network is a simple and effective
regularization method.
• A large network with more training and the
use of a weight constraint are suggested
when using dropout.
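A minimal Keras sketch of dropout combined with a weight constraint, as suggested above; the layer sizes and the max-norm value are illustrative assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import MaxNorm

model = Sequential([
    Dense(128, activation="relu", input_shape=(1000,), kernel_constraint=MaxNorm(3)),
    Dropout(0.5),                    # probabilistically drop 50% of the nodes during training
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])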
23. Start with an Embedding Layer
• The Embedding layer of Keras takes the previously calculated integer sequences and maps each integer to a dense embedding vector.
o Parameters
input_dim: the size of the vocabulary
output_dim: the size of the dense vector
input_length: the length of the sequence
Example from the linked answer: sentences such as "Hope to see you soon" and "Nice to see you again" are converted to integer sequences and, after training, mapped to dense vectors.
https://stats.stackexchange.com/questions/270546/how-does-keras-embedding-layer-work
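A minimal sketch of the layer with the three parameters listed above; the vocabulary size, vector size, and sequence length are illustrative values:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

vocab_size, embedding_dim, maxlen = 5000, 50, 100
model = Sequential([
    Embedding(input_dim=vocab_size,       # size of the vocabulary
              output_dim=embedding_dim,   # size of the dense vector
              input_length=maxlen),       # length of the input sequence
])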
24. Add a pooling layer
• MaxPooling1D/AveragePooling1D or
a GlobalMaxPooling1D/GlobalAveragePooling1D layer
• A way to downsample (reduce the size of) the incoming feature vectors.
• Global max/average pooling takes the maximum/average of all
features whereas in the other case you have to define the pool size.
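A sketch of adding a global pooling layer on top of the embedding layer from the previous slide; sizes are illustrative:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalMaxPooling1D, Dense

model = Sequential([
    Embedding(input_dim=5000, output_dim=50, input_length=100),
    GlobalMaxPooling1D(),            # take the maximum of each feature over all positions
    Dense(1, activation="sigmoid"),
])
# MaxPooling1D(pool_size=2) would instead only halve the sequence length.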
26. Training
Using pre-trained word embeddings leads to an accuracy of 0.82 in the linked tutorial. This is a case of transfer learning.
https://realpython.com/python-keras-text-classification
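A sketch of plugging pre-trained vectors into the Embedding layer; embedding_matrix is an assumed (vocab_size x embedding_dim) array built from e.g. word2vec or GloVe, not something defined in the deck:

from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(input_dim=vocab_size,
                            output_dim=embedding_dim,
                            weights=[embedding_matrix],   # pre-trained word vectors
                            input_length=maxlen,
                            trainable=False)              # freeze, or set True to fine-tune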
28. What is a CNN?
In a traditional feedforward neural network we connect each
input neuron to each output neuron in the next layer. That’s
also called a fully connected layer, or affine layer.
• We use convolutions over the input layer to compute the
output. This results in local connections, where each region
of the input is connected to a neuron in the output. Each
layer applies different filters and combines the results.
• During the training phase, a CNN automatically learns the
values of its filters based on the task you want to perform.
• Key inputs: the number of filters (n_filters) and the kernel size (e.g. 2). See the sketch below.
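A minimal word-level CNN sketch with the inputs just mentioned (kernel_size=2); the other layer sizes are illustrative assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

model = Sequential([
    Embedding(input_dim=5000, output_dim=50, input_length=100),
    Conv1D(filters=128, kernel_size=2, activation="relu"),   # learns local 2-gram filters
    GlobalMaxPooling1D(),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])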
31. Advantages of CNN
• Character Based CNN
• Has the ability to deal with out of vocabulary
words. This makes it particularly suitable for
user generated raw text.
• Works for multiple languages.
• Model size is small since the tokens are
limited to the number of characters ~ 70.
This makes real life deployments easier and
faster.
• Does not need a lot of data cleaning
• Networks with convolutional and pooling layers
are useful for classification tasks in which we
expect to find strong local clues regarding class
membership.
https://machinelearningmastery.com/best-practices-document-classification-deep-learning/
32. Siamese Networks
A siamese neural network is a class of neural network architectures that contains two or more identical subnetworks ---- they have the same configuration and the same parameters & weights, and parameter updates are mirrored across the subnetworks.
• More robust to class imbalance.
• Ensembling with a classifier yields better results.
• Creates more meaningful embeddings.
A sketch of the shared-encoder idea follows below.
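A rough Keras sketch of a siamese setup: one shared encoder applied to two inputs, a dot-product similarity, and a cross-entropy loss (as described in the notes); the encoder layers and sizes are illustrative, not the deck's exact architecture:

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dot, Activation

def build_encoder(vocab_size=5000, maxlen=100):
    inp = Input(shape=(maxlen,))
    x = Embedding(vocab_size, 50)(inp)
    x = Conv1D(128, 2, activation="relu")(x)
    x = GlobalMaxPooling1D()(x)
    return Model(inp, x)

encoder = build_encoder()                               # the same weights encode both inputs
left, right = Input(shape=(100,)), Input(shape=(100,))
score = Dot(axes=1)([encoder(left), encoder(right)])    # dot-product similarity
prob = Activation("sigmoid")(score)                     # squash to (0, 1) for cross-entropy
siamese = Model([left, right], prob)
siamese.compile(optimizer="adam", loss="binary_crossentropy")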
36. Thresholding --- Coverage
In a binary classification, choosing randomly gives a probability of 0.5 of belonging to a class.
By only acting on predictions below a low threshold (e.g. 0.3) or above a high threshold (e.g. 0.7), it is possible to improve the percentage of correct results at the cost of coverage.
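A small sketch of trading coverage for accuracy by thresholding predicted probabilities; the probabilities are made-up values and the 0.3 / 0.7 cut-offs are the illustrative thresholds from the slide:

import numpy as np

probs = np.array([0.10, 0.35, 0.55, 0.80, 0.95])    # assumed model outputs
low, high = 0.3, 0.7

confident = (probs <= low) | (probs >= high)         # only act on confident predictions
coverage = confident.mean()                          # fraction of examples we answer
print(coverage, probs[confident])                    # 0.6 [0.1  0.8  0.95]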
38. ROC & AUC
ROC – Receiver Operating Characteristic
An ROC curve (receiver operating characteristic curve) is a graph
showing the performance of a classification model at all
classification thresholds.
AUC – Area Under the Curve.
• AUC is scale-invariant. It measures how well predictions
are ranked, rather than their absolute values.
• AUC is classification-threshold-invariant. It measures the
quality of the model's predictions irrespective of what
classification threshold is chosen.
• Works better for imbalanced datasets.
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
• TPR = TP/(TP+FN)
• FPR = FP/(FP+TN)
(The diagonal line in the ROC plot corresponds to a random classifier.)
https://datascience.stackexchange.com/questions/806/advantages-of-auc-vs-standard-accuracy
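A minimal scikit-learn sketch of computing the ROC curve and AUC; y_true and y_scores are assumed arrays of true labels and predicted probabilities:

from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_true, y_scores)    # FPR = FP/(FP+TN), TPR = TP/(TP+FN)
auc = roc_auc_score(y_true, y_scores)                 # area under the ROC curve
print(auc)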
39. Summary
• Tfidf & word2vec provide simple feature extraction techniques
• As the amount of training data increases, using deep learning becomes a logical choice:
• Feed forward Network
• CNN
• Siamese Networks
• It is important to determine which metrics are important before
training data collection and modeling.
41. Word Vectors with Context!
• In a context-free embedding, "crisp" in the sentence "The morning air is getting crisp" and in "getting burned to a crisp" would have the same vector: f(crisp)
• In a context-aware model the embedding would be augmented by the context in which the word appears:
• f(crisp, context)
https://www.gocomics.com/frazz/
A single node corresponds to two operations: the computation of z, which is a linear combination of the inputs (a) and weights (w), and the computation of the activation function sigma(z).
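Written out in standard notation (the bias term b is added here as an assumption, since the note does not mention it):

z = w^{T} a + b
\sigma(z) = \frac{1}{1 + e^{-z}}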
In a fully connected layer, we connect each input neuron to each output neuron in the next layer.
If y=1 and your model predicts 0, you are penalized heavily. Conversely, if y=0 and your model predicts 1, the penalization is infinite.
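This behaviour comes from the binary cross-entropy loss; the standard definition (not spelled out in the notes) is:

L(y, \hat{y}) = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]

When y = 1 the loss is -log(y-hat), which grows without bound as y-hat approaches 0; when y = 0 the loss is -log(1 - y-hat), which grows without bound as y-hat approaches 1.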
When you build your neural network, one of the choices you get to make is what activation function to use in the hidden layers, as well as in the output units of your neural network. So far, we have just been using the sigmoid activation function.
Sigmoid --- used in the output layer because if y is either 0 or 1, it makes sense for y-hat to be a number between 0 and 1 rather than between -1 and 1.
With CountVectorizer, we had stacked vectors of word counts, and each vector was the same length (the size of the total corpus vocabulary). With Tokenizer, the resulting vectors equal the length of each text, and the numbers don’t denote counts, but rather correspond to the word values from the dictionary tokenizer.word_index.
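A quick sketch contrasting the two representations just described; the example sentences are the ones from the embedding slide, reused for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["hope to see you soon", "nice to see you again"]

counts = CountVectorizer().fit_transform(texts).toarray()
# every row has the same length: one count per word in the corpus vocabulary

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
# e.g. [[4, 1, 2, 3, 5], [6, 1, 2, 3, 7]] --- indices from tokenizer.word_index,
# one integer per word, so each sequence is as long as its text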
Power of generalization --- embeddings are able to share information across similar features.
Fewer nodes with zero values.
We define two different tasks for optimization. One of them is to match the front of the card with the back of the card; we use the CNN model defined in the previous slide, the dot product as the similarity function, and a cross-entropy loss. For the classification problem we feed the CNN model into a softmax layer to predict the courses. Both tasks are optimized simultaneously.