Dataset
The datasets for this project were obtained from Kaggle. The files are:
1. Train set:
a. Size: 21.17 MB
b. Number of question pairs: 404,290
2. Test set:
a. Size: 112.47 MB
b. Number of question pairs: 2,345,796
Word Embeddings:
1. Word2Vec:
○ Dimension: 300d
○ Number of words: 3 million
○ Training set: 100 billion words from the Google News dataset
2. GloVe vector representation:
○ Dimensions: 50d, 100d, 200d, 300d
○ Number of words: 400k
○ Training set: 6 billion tokens from Wikipedia 2014 + Gigaword 5
○ We used the 100d GloVe vectors.
APIs used:
1. Keras
2. TensorFlow
3. Gensim
4. scikit-learn (sklearn)
5. Plotly, Seaborn (visualizations)
Implementation of simpler models
1. TF-IDF was used to extract features from the question text.
2. The following models were implemented using the sklearn library (a minimal sketch follows this list):
1. Random forest
2. Logistic regression
3. Naive Bayes
4. Decision tree
5. SVM
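A minimal sketch of this baseline pipeline, assuming the Kaggle train.csv column names (question1, question2, is_duplicate) and a simple strategy of concatenating the two questions before TF-IDF; the actual feature construction may have differed. Logistic regression stands in for the five baselines, which all follow the same fit/predict pattern:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss

# Load the Kaggle training data (column names assumed from the competition).
df = pd.read_csv("train.csv").fillna("")
text = df["question1"] + " " + df["question2"]   # simple concatenation of the pair
labels = df["is_duplicate"]

# TF-IDF features over the concatenated question pairs.
vectorizer = TfidfVectorizer(max_features=50000, stop_words="english")
X = vectorizer.fit_transform(text)

X_train, X_val, y_train, y_val = train_test_split(X, labels, test_size=0.1, random_state=42)

# One of the five baselines; the others are swapped in the same way.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

val_pred = model.predict_proba(X_val)[:, 1]
print("Validation accuracy:", accuracy_score(y_val, val_pred > 0.5))
print("Validation log-loss:", log_loss(y_val, val_pred))
```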
Results

Model | Training accuracy | Validation accuracy | Training loss | Validation loss
Random forest | 0.775 | 0.721 | 0.685315 | 0.697242
Logistic regression | 0.685 | 0.669 | 0.72543 | 0.741672
Decision tree | 0.672 | 0.695 | 0.75232 | 0.74327
Naive Bayes | 0.574 | 0.633 | 0.8472 | 0.8253
SVM | 0.429 | 0.553 | 0.8723 | 0.8244
Neural Network Architectures
1. Convolutional Neural Network (CNN)
2. Long Short-Term Memory Network (LSTM)
3. Bidirectional LSTM
Embedding layer (first layer): This is the first layer in deep NLP architectures. It maps each input token to its corresponding word embedding. The output of the embedding layer is fed into the convolution layer.
Conv1D layer: Performs 1D convolution over the word embeddings produced by the embedding layer. When the convolutional layer is defined, the parameters that need to be specified are the number of filters, the kernel size, the stride length, and the input shape.
Max-pooling layer: Max pooling is a sample-based discretization process. The objective is to down-sample an input representation, reducing its dimensionality and allowing assumptions to be made about the features contained in the sub-regions.
Dropout: Dropout is applied to prevent overfitting.
Flatten: The flatten layer is used to flatten the output of the CNN layers.
Merge layer: The merge layer combines the vector outputs of the two CNN branches. Each CNN produces a sentence vector; the merge layer combines them into a single vector.
Dense layer: A regular densely connected layer. It converts the vector output of the merge layer into a number between 0 and 1, which represents the similarity predicted by the network.
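A minimal Keras sketch of this Siamese CNN stack. The filter count, kernel size, vocabulary size, and dropout rate are assumptions (the report does not list the exact values); the embedding matrix would be filled with GloVe vectors in practice:

```python
import numpy as np
from keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                          Dropout, Flatten, concatenate, Dense)
from keras.models import Model

MAX_LEN = 30        # question length used later in the report
EMBED_DIM = 100     # GloVe 100d
VOCAB_SIZE = 20000  # assumed vocabulary size
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))  # filled from GloVe in practice

# Shared layers so both questions pass through identical weights (Siamese setup).
embed = Embedding(VOCAB_SIZE, EMBED_DIM, weights=[embedding_matrix],
                  input_length=MAX_LEN, trainable=False)
conv = Conv1D(filters=64, kernel_size=3, strides=1, activation="relu")
pool = MaxPooling1D(pool_size=2)

def encode(question_input):
    # Embedding -> Conv1D -> MaxPooling -> Dropout -> Flatten = one sentence vector.
    x = embed(question_input)
    x = conv(x)
    x = pool(x)
    x = Dropout(0.5)(x)
    return Flatten()(x)

q1_in = Input(shape=(MAX_LEN,))
q2_in = Input(shape=(MAX_LEN,))
merged = concatenate([encode(q1_in), encode(q2_in)])   # merge layer
output = Dense(1, activation="sigmoid")(merged)        # similarity in [0, 1]

model = Model(inputs=[q1_in, q2_in], outputs=output)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```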
Parameters within the CNN:

Model | Number of layers | Dropout | Validation loss | Training loss
CNN | 2 | 0.5 | 0.54606734 | 0.544829
CNN | 3 | 0.4 | 0.6147397 | 0.61473978
Observations from several CNN models with varying dropout and number of layers, after training for 20 epochs:
Unlike images, where more layers lead to more abstraction, for this text task two layers gave the best results.
This is because most of the features are learnt in the first layer itself, so the first layer's activations are simply passed through and nothing new is learnt by adding deeper layers.
Comparison: Word2Vec vs GloVe

Word representation | Model | Validation log-loss (20 epochs) | Time taken
Word2Vec | Siamese LSTM | 0.423 | 80 min per epoch
GloVe | Siamese LSTM | 0.434 | 25 min per epoch
Observations:
1. Comparable log-loss.
2. GloVe is considerably faster than Word2Vec.
● We used the 100-dimensional GloVe vectors for the rest of the analysis (fewer dimensions, lower complexity, faster implementation); a loading sketch follows.
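A minimal sketch of how the pre-trained GloVe 100d vectors can be loaded into a Keras Embedding layer. The file name glove.6B.100d.txt and the toy question list are assumptions; in practice the tokenizer would be fitted on all train and test questions:

```python
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding

EMBED_DIM = 100
MAX_LEN = 30

# In practice: every question from the train and test sets.
all_questions = ["how do i learn python", "what is the best way to learn python"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_questions)
word_index = tokenizer.word_index

# Read the pre-trained GloVe 100d vectors (6B tokens, 400k words).
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        glove[values[0]] = np.asarray(values[1:], dtype="float32")

# Build the embedding matrix; words not found in GloVe stay as all-zero rows.
embedding_matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
for word, i in word_index.items():
    if word in glove:
        embedding_matrix[i] = glove[word]

embedding_layer = Embedding(len(word_index) + 1, EMBED_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_LEN, trainable=False)
```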
Bidirectional LSTM
[Figure: Bidirectional LSTM vs LSTM. Top: each word embedding is fed to forward and backward LSTMs, the two hidden states are concatenated per word, and the results are aggregated. Bottom: word embeddings are fed to a single LSTM and aggregated.]
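A minimal Keras sketch of the two encoders being compared. The LSTM size of 100 and the dropout values match the parameters reported below; the merge strategy, the dense output layer, and the small stand-in embedding layer are assumptions (the GloVe layer from the earlier sketch would be used in practice):

```python
from keras.layers import Input, Embedding, LSTM, Bidirectional, concatenate, Dense
from keras.models import Model

MAX_LEN, EMBED_DIM, UNITS = 30, 100, 100
embedding_layer = Embedding(20000, EMBED_DIM, input_length=MAX_LEN)  # or the GloVe layer above

def siamese(encoder):
    # Encode both questions with the same (shared) encoder and compare them.
    q1_in, q2_in = Input(shape=(MAX_LEN,)), Input(shape=(MAX_LEN,))
    q1_vec = encoder(embedding_layer(q1_in))
    q2_vec = encoder(embedding_layer(q2_in))
    out = Dense(1, activation="sigmoid")(concatenate([q1_vec, q2_vec]))
    return Model([q1_in, q2_in], out)

# Unidirectional LSTM encoder.
lstm_model = siamese(LSTM(UNITS, dropout=0.2, recurrent_dropout=0.2))

# Bidirectional LSTM encoder: forward and backward passes, concatenated.
bilstm_model = siamese(Bidirectional(LSTM(UNITS, dropout=0.2, recurrent_dropout=0.2)))
```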
Comparison between Bi-LSTM and LSTM:

Model | Dropout | Validation loss | Time taken
Bidirectional LSTM | recurrent_dropout=0.2, dropout=0.2 | 0.44385 | 90 min
LSTM | recurrent_dropout=0.2, dropout=0.2 | 0.43454 | 25 min
Observations:
1. Bi-LSTM performs better than LSTM for most tasks.
2. However, it suffered from overfitting: the Bi-LSTM stopped after 7 epochs, and its validation loss started to increase after 5 epochs.
3. Training was implemented using Keras ModelCheckpoint and EarlyStopping (see the sketch after this list).
4. The LSTM model also suffered from overfitting (it stopped after 12 epochs).
Solution: tune recurrent_dropout and dropout for the Bi-LSTM (not implemented because of time constraints).
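A minimal sketch of the early stopping and checkpointing setup, using the standard Keras callbacks; the patience value, batch size, and file name are assumptions, and the model and padded inputs (q1_train, q2_train, y_train) come from the earlier sketches:

```python
from keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop once validation loss stops improving (patience is an assumed value).
    EarlyStopping(monitor="val_loss", patience=2),
    # Keep the weights from the best epoch seen so far.
    ModelCheckpoint("best_lstm.h5", monitor="val_loss", save_best_only=True),
]

model.fit([q1_train, q2_train], y_train,
          validation_split=0.1, epochs=20, batch_size=1024,
          callbacks=callbacks)
```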
NOTE: We finalized on LSTM + GloVe vectors for further analysis, due to comparable performance and faster training.
LSTM with varied dropout:

Model | Dropout | Validation loss | Training loss
LSTM | recurrent_dropout=0.2, dropout=0.2 | 0.4345 | 0.4301
LSTM | recurrent_dropout=0.5, dropout=0.5 | 0.4532 | 0.4476
LSTM | recurrent_dropout=0.0, dropout=0.0 | 0.4632 | 0.4375
Observations:
1. The LSTM with dropout of 0.5/0.5 suffers from reduced performance on both the training and validation sets.
2. The LSTM without dropout was also implemented; as expected, it suffers from overfitting.
3. We could also have tried randomized dropout (not implemented due to time constraints).
Further steps to improve log-loss:
1) Preprocessing text
a) Replace numbers by 'n'
b) Handle special words and abbreviations (e.g., I'd)
c) Stop-word removal
2) Handling unbalanced classes
3) Question symmetry
Preprocessing Text:
● Preprocessing the input text is an important step for any NLP task.
● Different preprocessing steps analyzed:

Preprocessing step | Original text | Modified text
Number replacement | I have 100 apples. | I have n apples.
Special words | I'd run a marathon. | I would run a marathon.
Stop-word removal | I have 100 apples. | I 100 apples
Importance of Text Preprocessing:
Number replacement:
● Word2Vec and GloVe vectors do not recognize numbers.
● Numbers therefore have to be replaced by 'n'.
Special words:
● Words like I'd and I'm cause problems when tokenized:
○ I'd - I + d
○ I'm - I + m
● The meaning is lost.
● We need to handle these words separately:
○ I'd - I would
○ I'm - I am
Stop-word removal:
● Stop words increase the complexity of the system; they may or may not provide any additional information.
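A minimal sketch of these three preprocessing steps. The regexes, the small contraction map, and the stop-word list are assumptions for illustration; the report only describes the intended behaviour:

```python
import re

# Small illustrative contraction map; the real list would be longer.
SPECIAL_WORDS = {"i'd": "i would", "i'm": "i am", "can't": "cannot", "won't": "will not"}

# Assumed stop-word list; NLTK's English list could be used instead.
STOP_WORDS = {"i", "a", "an", "the", "have", "is", "are", "of", "to"}

def preprocess(text, remove_stop_words=False):
    text = text.lower()
    # Handle special words / abbreviations before tokenization.
    for word, replacement in SPECIAL_WORDS.items():
        text = text.replace(word, replacement)
    # Replace any number by the placeholder token 'n'.
    text = re.sub(r"\d+", "n", text)
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess("I'd buy 100 apples."))                         # i would buy n apples.
print(preprocess("I have 100 apples.", remove_stop_words=True))  # n apples.
```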
Handling Imbalanced Classes:
● The training data and test data might have different distributions of positive and negative examples.
● If the training data has considerably more positive examples, the model tends to predict positive more often, and vice versa.
● In our data:
○ Train set: 36.92% positive examples
○ Test set: 17.46% positive examples
● We need to map the share of positive examples so that it is the same across train and test.
● One positive example in the train set counts for 0.1746 / 0.3692 = 0.472 positive examples in the test set.
● Similarly, the weight of a negative example in the train set is (1 - 0.1746) / (1 - 0.3692) = 1.309.
● Weighted log-loss = -(0.472001959 * t * log(y) + 1.309028344 * (1 - t) * log(1 - y)), where t is the target value and y is the predicted value.
● Class weights can be passed as a parameter in Keras (see the sketch after this list).
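A minimal sketch of passing these class weights to Keras via the standard class_weight argument of model.fit; the weights are taken directly from the ratios above, and the model, inputs, and callbacks come from the earlier sketches:

```python
# Re-weight train-set examples so their effective class balance matches the test set.
class_weight = {
    1: 0.1746 / 0.3692,              # positive examples down-weighted to ~0.472
    0: (1 - 0.1746) / (1 - 0.3692),  # negative examples up-weighted to ~1.309
}

model.fit([q1_train, q2_train], y_train,
          validation_split=0.1, epochs=20, batch_size=1024,
          class_weight=class_weight, callbacks=callbacks)
```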
Note:
● This is a workaround to achieve better performance on the test set, since in this case the class distribution of the test set is heavily skewed.
● Ideally, we would split the train, validation, and test sets while maintaining the class balance.
Question Pair Symmetry:
● We tried interchanging Q1 and Q2 to see whether it affects the model.
● The model might otherwise learn features that are related to the question order.
● We interchanged Q1 and Q2 for half of the training pairs and retrained the model (a sketch follows).
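A minimal sketch of the swap, assuming the questions live in a pandas DataFrame with question1 / question2 columns; the choice of which half to swap (a random 50%) is an assumption:

```python
import numpy as np
import pandas as pd

def swap_half(df, seed=42):
    """Interchange question1 and question2 for a random half of the pairs."""
    df = df.copy()
    rng = np.random.RandomState(seed)
    mask = rng.rand(len(df)) < 0.5
    df.loc[mask, ["question1", "question2"]] = df.loc[mask, ["question2", "question1"]].values
    return df

train_df = pd.read_csv("train.csv").fillna("")
train_df = swap_half(train_df)  # labels (is_duplicate) are unchanged by the swap
```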
Analysis: different preprocessing steps and their effect on log-loss
Model parameters:
● Length of each question: 30
● Recurrent dropout: 0.2
● Dropout: 0.2
● Size of the LSTM layer: 100
● Train : validation split: 9:1
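A minimal sketch of preparing the inputs with these parameters: questions padded to length 30 and a 9:1 train/validation split. The tokenizer and train_df come from the earlier sketches, and the use of sklearn's train_test_split is an assumption:

```python
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

MAX_LEN = 30  # length of each question

# Convert preprocessed questions to padded integer sequences of fixed length.
q1_seq = pad_sequences(tokenizer.texts_to_sequences(train_df["question1"]), maxlen=MAX_LEN)
q2_seq = pad_sequences(tokenizer.texts_to_sequences(train_df["question2"]), maxlen=MAX_LEN)
labels = train_df["is_duplicate"].values

# 9:1 train/validation split.
(q1_train, q1_val, q2_train, q2_val,
 y_train, y_val) = train_test_split(q1_seq, q2_seq, labels, test_size=0.1, random_state=42)
```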
Model Comparison:

Steps | Train loss | Validation loss | Public test loss | Private test loss
Symmetry: No, Simple text preprocessing: Yes, Class weighting: No | 0.4034 | 0.4056 | 0.4041 | 0.4065
Symmetry: No, Simple text preprocessing: Yes, Class weighting: Yes | 0.3045 | 0.2931 | 0.3104 | 0.3144
Symmetry: Yes, Simple text preprocessing: Yes, Class weighting: Yes | 0.3033 | 0.2925 | 0.3079 | 0.3109
Symmetry: Yes, Simple text preprocessing: Yes, Class weighting: Yes, Stop-word removal: Yes | 0.3512 | 0.3567 | 0.3626 | 0.3617
Observations:
● Some models were underfitting: training was restricted to 20 epochs due to time constraints, and the log-loss was still decreasing.
● Stop-word removal decreased the performance; the stop words carried contextual information that was important.
Note: "Simple text preprocessing" means number replacement and special-word handling.
Further steps that could be explored:
● Parameter tuning, especially of dropout, to attain a better log-loss.
● Analysis with an increased number of epochs.
● Ensembles of models:
○ Bagging - avoids overfitting
○ Boosting - helps improve weak (underfitting) models
● We observed a lot of nouns in the data, so part-of-speech features:
○ Replacing proper nouns by a generic noun token
● More text processing: lemmatization, stemming
● Character-level LSTMs
● Word-attention models
References:
1. Contextual Bidirectional Long Short-Term Memory Recurrent Neural Network Language Models: A
Generative Approach to Sentiment Analysis - http://www.aclweb.org/anthology/E17-1096
2. A Hierarchical Neural Autoencoder for Paragraphs and Documents - https://arxiv.org/pdf/1506.01057v2.pdf
3. Siamese Recurrent Architectures for Learning Sentence Similarity -
http://www.mit.edu/~jonasm/info/MuellerThyagarajan_AAAI16.pdf
4. Efficient Estimation of Word Representations in Vector Space - https://arxiv.org/pdf/1301.3781.pdf
5. GloVe: Global Vectors for Word Representation - https://nlp.stanford.edu/pubs/glove.pdf
6. HDLTex: Hierarchical Deep Learning for Text Classification - https://arxiv.org/pdf/1709.08267.pdf
7. Signature Verification using a "Siamese" Time Delay Neural Network -
https://papers.nips.cc/paper/769-signature-verification-using-a-siamese-time-delay-neural-network.pdf
8. Dropout: A Simple Way to Prevent Neural Networks from Overfitting -
http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
9. Understanding CNNs for NLP -
http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
10. Comparison study of CNNs and RNNs for NLP - https://arxiv.org/pdf/1702.01923.pdf
11. Understanding LSTM Networks - http://colah.github.io/posts/2015-08-Understanding-LSTMs/
12. Term Frequency Inverse Document Frequency - http://www.tfidf.com/