This presentation covers NLP classifiers: feature extraction with tfidf and word2vec, and deep learning models such as feedforward neural networks, CNNs, and siamese networks. Details on important metrics such as precision, recall, and AUC are also given.
2. OUTLINE
Models
• Tfidf features
• Word2vec features
• Simple feedforward NN classifier
• CNN
• Word based
• Character based
• Siamese Networks
Metrics
3. Text Classification
Pipeline: Text Pre-processing → Collecting Training Data → Model Building → Model Evaluation (offline, with SME input); the trained model then serves users online.
Text Pre-processing
• Reduces noise
• Ensures quality
• Improves overall performance
Collecting Training Data
• Training data collection / examples of the classes that we are trying to model
• Model performance is directly correlated with the quality of the training data
Model Building
• Model selection
• Architecture
• Parameter tuning
4. Preprocessing (project specific)
Applicable for text-based classification:
• Removing special characters.
• Cleaning numbers.
• Correcting misspellings:
Peter Norvig's spell checker: https://norvig.com/spell-correct.html
Using the Google word2vec vocabulary to identify misspelled words: https://mlwhiz.com/blog/2019/01/17/deeplearning_nlp_preprocess/
• Expanding contracted words --- contraction_dict = {"ain't": "is not", "aren't": "are not", "can't": "cannot", …} (see the sketch below).
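A minimal preprocessing sketch in Python, assuming these steps are applied before feature extraction; the helper names and regexes are illustrative, not from the deck:

import re

contraction_dict = {"ain't": "is not", "aren't": "are not", "can't": "cannot"}

def expand_contractions(text, mapping=contraction_dict):
    # Replace each contraction with its expanded form.
    for contraction, expanded in mapping.items():
        text = text.replace(contraction, expanded)
    return text

def clean_text(text):
    text = expand_contractions(text.lower())
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove special characters
    text = re.sub(r"\d+", " ", text)           # clean numbers
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Ain't this a #1 example?"))  # -> "is not this a example"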
5. TFIDF Features
• ngram_range: (1,3) --- implies unigrams,
bigrams, and trigrams will be taken into account
while creating features.
• min_df: Minimum number of times an n-gram must appear in the corpus to be used as a feature.
Tfidf features can be used with any ML classifier, such as logistic regression (LR).
When using LR for NLP tasks, L1 regularization tends to perform better since tfidf features are sparse (see the sketch below).
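A minimal sketch of this setup with scikit-learn; the training variables (train_texts, train_labels, test_texts) and the min_df value are assumptions for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), min_df=5),            # unigrams to trigrams
    LogisticRegression(penalty="l1", solver="liblinear"),     # L1 works well on sparse tfidf
)
clf.fit(train_texts, train_labels)        # assumed lists of documents and labels
probs = clf.predict_proba(test_texts)     # assumed held-out documents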
6. Transfer Learning – word2vec features
Word2vec learns word vectors either by using the context to predict a target word (a method known as continuous bag of words, or CBOW), or by using a word to predict a target context, which is called skip-gram.
https://medium.com/@zafaralibagh6/a-simple-word2vec-tutorial-61e64e38a6a1
Applying tfidf weighting to word vectors boosts overall model performance
https://towardsdatascience.com/supercharging-word-vectors-be80ee5513d
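A rough sketch of tfidf-weighted averaging of word vectors, assuming a gensim KeyedVectors model w2v and a dict of per-word tfidf weights; the names and the exact weighting scheme are illustrative, not from the deck:

import numpy as np

def doc_vector(tokens, w2v, tfidf_weights, dim=300):
    # Weight each word vector by its tfidf weight, then average over the document.
    vecs, weights = [], []
    for tok in tokens:
        if tok in w2v:
            vecs.append(w2v[tok])
            weights.append(tfidf_weights.get(tok, 1.0))
    if not vecs:
        return np.zeros(dim)
    return np.average(vecs, axis=0, weights=weights)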
12. Loss is minimized using Gradient Descent
• Find the network parameters such that the loss is minimized.
• This is done by taking derivatives of the loss with respect to the parameters.
• Next, the parameters are updated by subtracting the learning rate times the derivative (see the sketch below).
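A toy numpy sketch of the update rule; the parameter and gradient values are made up for illustration:

import numpy as np

w = np.array([2.0, -1.0])        # current parameters
grad = np.array([0.5, -0.3])     # derivative of the loss with respect to w
learning_rate = 0.1

w = w - learning_rate * grad     # new parameters = old parameters - lr * gradient
print(w)                         # [ 1.95 -0.97]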
13. Commonly used loss functions
Regression Loss Functions
• Mean Squared Error Loss
• Mean Squared Logarithmic Error Loss
• Mean Absolute Error Loss
Binary Classification Loss Functions
• Binary Cross-Entropy
• Hinge Loss
• Squared Hinge Loss
Multi-Class Classification Loss Functions
• Multi-Class Cross-Entropy Loss
• Sparse Multiclass Cross-Entropy Loss
• Kullback-Leibler Divergence Loss
15. Dropout -- avoid overfitting
• Large weights in a neural network are a
sign of a more complex network that has
overfit the training data.
• Probabilistically dropping out nodes in the
network is a simple and effective
regularization method.
• A large network with more training and the
use of a weight constraint are suggested
when using dropout.
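A minimal Keras sketch of dropout combined with a weight constraint, as suggested above; the layer sizes and the max-norm value are illustrative assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import MaxNorm

model = Sequential([
    Dense(128, activation="relu", input_shape=(1000,), kernel_constraint=MaxNorm(3)),
    Dropout(0.5),                    # probabilistically drop 50% of the nodes during training
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])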
23. Start with an Embedding Layer
• The Embedding layer of Keras takes the previously calculated integer sequences and maps each integer to a dense embedding vector.
o Parameters
input_dim: the size of the vocabulary
output_dim: the size of the dense vector
input_length: the length of the sequence
Example from the linked answer: sentences such as "Hope to see you soon" and "Nice to see you again" are converted to integer sequences and, after training, mapped to dense vectors.
https://stats.stackexchange.com/questions/270546/how-does-keras-embedding-layer-work
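A minimal sketch of the layer with the three parameters listed above; the vocabulary size, vector size, and sequence length are illustrative values:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

vocab_size, embedding_dim, maxlen = 5000, 50, 100
model = Sequential([
    Embedding(input_dim=vocab_size,       # size of the vocabulary
              output_dim=embedding_dim,   # size of the dense vector
              input_length=maxlen),       # length of the input sequence
])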
24. Add a pooling layer
• MaxPooling1D/AveragePooling1D or
a GlobalMaxPooling1D/GlobalAveragePooling1D layer
• A way to downsample (reduce the size of) the incoming feature vectors.
• Global max/average pooling takes the maximum/average of all
features whereas in the other case you have to define the pool size.
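A sketch of adding a global pooling layer on top of the embedding layer from the previous slide; sizes are illustrative:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalMaxPooling1D, Dense

model = Sequential([
    Embedding(input_dim=5000, output_dim=50, input_length=100),
    GlobalMaxPooling1D(),            # take the maximum of each feature over all positions
    Dense(1, activation="sigmoid"),
])
# MaxPooling1D(pool_size=2) would instead only halve the sequence length.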
26. Training
Using pre-trained word embeddings leads to an accuracy of 0.82 in the linked tutorial. This is a case of transfer learning.
https://realpython.com/python-keras-text-classification
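A sketch of plugging pre-trained vectors into the Embedding layer; embedding_matrix is an assumed (vocab_size x embedding_dim) array built from e.g. word2vec or GloVe, not something defined in the deck:

from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(input_dim=vocab_size,
                            output_dim=embedding_dim,
                            weights=[embedding_matrix],   # pre-trained word vectors
                            input_length=maxlen,
                            trainable=False)              # freeze, or set True to fine-tune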
28. What is a CNN?
In a traditional feedforward neural network we connect each
input neuron to each output neuron in the next layer. That’s
also called a fully connected layer, or affine layer.
• We use convolutions over the input layer to compute the
output. This results in local connections, where each region
of the input is connected to a neuron in the output. Each
layer applies different filters and combines the results.
• During the training phase, a CNN automatically learns the
values of its filters based on the task you want to perform.
• Key inputs: the number of filters (n_filters) and the kernel size (e.g. 2). See the sketch below.
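A minimal word-level CNN sketch with the inputs just mentioned (kernel_size=2); the other layer sizes are illustrative assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

model = Sequential([
    Embedding(input_dim=5000, output_dim=50, input_length=100),
    Conv1D(filters=128, kernel_size=2, activation="relu"),   # learns local 2-gram filters
    GlobalMaxPooling1D(),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])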
31. Advantages of CNN
• Character Based CNN
• Has the ability to deal with out of vocabulary
words. This makes it particularly suitable for
user generated raw text.
• Works for multiple languages.
• Model size is small since the tokens are
limited to the number of characters ~ 70.
This makes real life deployments easier and
faster.
• Does not need a lot of data cleaning
• Networks with convolutional and pooling layers
are useful for classification tasks in which we
expect to find strong local clues regarding class
membership.
https://machinelearningmastery.com/best-practices-document-classification-deep-learning/
32. Siamese Networks
A siamese neural network is a class of neural network architectures that contains two or more identical subnetworks ---- they have the same configuration and the same parameters & weights, and parameter updates are mirrored across the subnetworks.
• More robust to class imbalance.
• Ensembling with a classifier yields better results.
• Creates more meaningful embeddings.
A sketch of the shared-encoder idea follows below.
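A rough Keras sketch of a siamese setup: one shared encoder applied to two inputs, a dot-product similarity, and a cross-entropy loss (as described in the notes); the encoder layers and sizes are illustrative, not the deck's exact architecture:

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dot, Activation

def build_encoder(vocab_size=5000, maxlen=100):
    inp = Input(shape=(maxlen,))
    x = Embedding(vocab_size, 50)(inp)
    x = Conv1D(128, 2, activation="relu")(x)
    x = GlobalMaxPooling1D()(x)
    return Model(inp, x)

encoder = build_encoder()                               # the same weights encode both inputs
left, right = Input(shape=(100,)), Input(shape=(100,))
score = Dot(axes=1)([encoder(left), encoder(right)])    # dot-product similarity
prob = Activation("sigmoid")(score)                     # squash to (0, 1) for cross-entropy
siamese = Model([left, right], prob)
siamese.compile(optimizer="adam", loss="binary_crossentropy")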
36. Thresholding --- Coverage
In a binary classification, choosing randomly gives a probability of 0.5 of belonging to a class.
By only acting on predictions below a low threshold (e.g. 0.3) or above a high threshold (e.g. 0.7), it is possible to improve the percentage of correct results at the cost of coverage.
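A small sketch of trading coverage for accuracy by thresholding predicted probabilities; the probabilities are made-up values and the 0.3 / 0.7 cut-offs are the illustrative thresholds from the slide:

import numpy as np

probs = np.array([0.10, 0.35, 0.55, 0.80, 0.95])    # assumed model outputs
low, high = 0.3, 0.7

confident = (probs <= low) | (probs >= high)         # only act on confident predictions
coverage = confident.mean()                          # fraction of examples we answer
print(coverage, probs[confident])                    # 0.6 [0.1  0.8  0.95]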
38. ROC & AUC
ROC – Receiver Operating Characteristic
An ROC curve (receiver operating characteristic curve) is a graph
showing the performance of a classification model at all
classification thresholds.
AUC – Area Under the Curve.
• AUC is scale-invariant. It measures how well predictions
are ranked, rather than their absolute values.
• AUC is classification-threshold-invariant. It measures the
quality of the model's predictions irrespective of what
classification threshold is chosen.
• Works better for imbalanced datasets.
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
• TPR = TP/(TP+FN)
• FPR = FP/(FP+TN)
(The diagonal line in the ROC plot corresponds to a random classifier.)
https://datascience.stackexchange.com/questions/806/advantages-of-auc-vs-standard-accuracy
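A minimal scikit-learn sketch of computing the ROC curve and AUC; y_true and y_scores are assumed arrays of true labels and predicted probabilities:

from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_true, y_scores)    # FPR = FP/(FP+TN), TPR = TP/(TP+FN)
auc = roc_auc_score(y_true, y_scores)                 # area under the ROC curve
print(auc)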
39. Summary
• Tfidf & word2vec provide simple feature extraction techniques
• As the amount of training data increases, using deep learning becomes a logical choice:
• Feed forward Network
• CNN
• Siamese Networks
• It is important to determine which metrics are important before
training data collection and modeling.
41. Word Vectors with Context!
• In a context-free embedding, "crisp" in the sentence "The morning air is getting crisp" and in "getting burned to a crisp" would have the same vector: f(crisp)
• In a context-aware model the embedding would be augmented by the context in which the word appears:
• f(crisp, context)
https://www.gocomics.com/frazz/
A single node corresponds to two operations: the computation of z, which is a linear combination of the inputs (a) and weights (w), and the computation of the activation function sigma(z).
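Written out in standard notation (the bias term b is added here as an assumption, since the note does not mention it):

z = w^{T} a + b
\sigma(z) = \frac{1}{1 + e^{-z}}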
In a fully connected layer, we connect each input neuron to each output neuron in the next layer.
If y=1 and your model predicts 0, you are penalized heavily. Conversely, if y=0 and your model predicts 1, the penalization is infinite.
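This behaviour comes from the binary cross-entropy loss; the standard definition (not spelled out in the notes) is:

L(y, \hat{y}) = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]

When y = 1 the loss is -log(y-hat), which grows without bound as y-hat approaches 0; when y = 0 the loss is -log(1 - y-hat), which grows without bound as y-hat approaches 1.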
When you build your neural network, one of the choices you get to make is what activation function to use in the hidden layers, as well as in the output units of your neural network. So far, we have just been using the sigmoid activation function.
Sigmoid --- used in the output layer because if y is either 0 or 1, it makes sense for y-hat to be a number between 0 and 1 rather than between -1 and 1.
With CountVectorizer, we had stacked vectors of word counts, and each vector was the same length (the size of the total corpus vocabulary). With Tokenizer, the resulting vectors equal the length of each text, and the numbers don’t denote counts, but rather correspond to the word values from the dictionary tokenizer.word_index.
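A quick sketch contrasting the two representations just described; the example sentences are the ones from the embedding slide, reused for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["hope to see you soon", "nice to see you again"]

counts = CountVectorizer().fit_transform(texts).toarray()
# every row has the same length: one count per word in the corpus vocabulary

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
# e.g. [[4, 1, 2, 3, 5], [6, 1, 2, 3, 7]] --- indices from tokenizer.word_index,
# one integer per word, so each sequence is as long as its text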
Power of generalization --- embeddings are able to share information across similar features.
Fewer nodes with zero values.
We define two different tasks for optimization. One of them is to match the front of the card with the back of the card; we use the CNN model defined in the previous slide, the dot product as the similarity function, and a cross-entropy loss. For the classification problem we feed the CNN model into a softmax layer to predict the courses. Both tasks are optimized simultaneously.