The datasets needed for this project were obtained from Kaggle.
The files are:
1. Train Set:
a. Size - 21.17 MB
b. Number of Question Pairs: 404290
2. Test Set:
a. Size - 112.47 MB
b. Number of Question Pairs: 2345796
1. Word2Vec Vector Representation:
○ Dimension: 300d
○ Number of words: 3 million words
○ Training set: 100 billion words from the Google News dataset
2. GloVe Vector Representation:
○ Dimension: 50d, 100d, 200d, 300d
○ Number of words: 400k words
○ Training set: 6 billion tokens from Wikipedia 2014 + Gigaword 5
○ We used the 100d GloVe vector representation.
5. Plotly, Seaborn - Visualizations
Implementation of simpler models
1. TF-IDF was used as a pre-processing technique to turn the question text into features.
2. The following models were implemented using the sklearn library:
a. Random forest
b. Logistic regression
c. Naive Bayes
d. Decision tree
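The TF-IDF weighting mentioned above can be sketched in plain Python. This is a minimal illustration, not the project's code: it uses the textbook tf * log(N / df) formula, whereas sklearn's TfidfVectorizer applies a smoothed idf and normalization by default, so its numbers will differ. The function names, the whitespace tokenizer, the example questions and the cosine-similarity check are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: tf-idf weight} dict per document (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up questions, for illustration only.
questions = [
    "how do i learn python",
    "how can i learn python quickly",
    "what is the capital of france",
]
vectors = tfidf_vectors(questions)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

In the project, vectors of this kind would be the input to the classifiers listed above; the similarity check here only shows that questions sharing words end up closer together than questions that share none.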
Embedding Layer (First Layer): This is the first layer in deep NLP architectures. It maps each input word to its
corresponding word embedding. The output of the embedding layer is then fed into the convolution layer.
Conv1D layer: Conv1D performs a 1D convolution over the word embeddings produced by the embedding layer. The
parameters to specify when implementing the convolutional layer are the number of filters, the kernel size, the
stride length and the shape of the input.
Max-pooling layer: Max pooling is a sample-based discretization process. The objective is to down-sample an input
representation, reducing its dimensionality and allowing for assumptions to be made about features contained in the
Dropout: Dropout is carried out to prevent overfitting.
Flatten: The flatten layer is used to flatten the output of the CNN layer.
Merge Layer: The merge layer combines the vector outputs of the two CNNs. Each CNN produces a sentence
vector, and the merge layer combines them into a single output vector.
Dense Layer: A regular fully connected layer. It converts the vector output of the merge layer into a number between
0 and 1, which is the similarity predicted by the network.
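The data flow through the layers described above can be traced with a small pure-Python sketch. It is an illustration under stated assumptions, not the project's Keras model: a single filter, a ReLU activation, concatenation as the merge operation and hand-picked toy weights are all choices made for the example. Dropout is omitted because it only acts during training, and no flatten step is needed because a single filter already yields a flat list.

```python
import math


def conv1d(sequence, kernel, stride=1):
    """Valid 1D convolution of a list of word vectors with one filter, followed by ReLU.

    sequence: list of word embeddings (each a list of floats)
    kernel:   list of weight vectors, one per position in the window
    """
    width = len(kernel)
    outputs = []
    for start in range(0, len(sequence) - width + 1, stride):
        window = sequence[start:start + width]
        activation = sum(
            w * x
            for weights, vector in zip(kernel, window)
            for w, x in zip(weights, vector)
        )
        outputs.append(max(0.0, activation))  # ReLU (an assumption)
    return outputs


def max_pool1d(values, pool_size):
    """Keep the maximum of each non-overlapping window of pool_size values."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def encode(sequence, kernel):
    """Convolution followed by max pooling gives a small sentence vector."""
    return max_pool1d(conv1d(sequence, kernel), pool_size=2)


def predict(q1, q2, kernel, dense_weights, bias):
    """Merge (concatenate) both sentence vectors and map them to a score in (0, 1)."""
    merged = encode(q1, kernel) + encode(q2, kernel)
    return sigmoid(sum(w * v for w, v in zip(dense_weights, merged)) + bias)


# Toy 2-dimensional "embeddings" for two five-word questions (made up).
q1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
q2 = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
kernel = [[1.0, -1.0], [0.5, 0.5]]  # one filter with kernel size 2
score = predict(q1, q2, kernel, dense_weights=[0.3, -0.2, 0.3, -0.2], bias=0.1)
```

Both questions pass through the same filter, which is what makes the two branches comparable before they are merged.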
Parameters within the CNN:
Model   Number of layers   Dropout   Validation loss   Training loss
CNN     2                  0.5       0.54606734        0.544829
CNN     3                  0.4       0.6147397         0.61473978
Observations from several CNN models with varying dropout and number of layers,
after training for 20 epochs:
Unlike in images, where more layers lead to more abstraction, in text two layers gave the
best results.
A likely reason is that most of the features are learnt in the first layer itself, so its
activations are simply passed through and deeper layers learn little that is new.
Comparison Word2Vec vs GloVe:
Word representation   Model          Validation log-loss (20 epochs)   Time per epoch
Word2Vec              Siamese LSTM   0.423                             80 mins
GloVe                 Siamese LSTM   0.434                             25 mins
1. The log-loss is comparable.
2. GloVe is considerably faster than Word2Vec.
● We used the 100-dimensional GloVe vectors for the rest of the analysis (fewer dimensions,
lower complexity, faster implementation).
Comparison between Bi-LSTM and LSTM:
Model                Dropout                                  Validation loss   Time taken
Bidirectional LSTM   Recurrent_Dropout=0.2, Dropout=0.        0.44385           90 mins
LSTM                 Recurrent_Dropout=0.2, Dropout=0.2       0.43454           25 mins
1. Bi-LSTM performs better than LSTM for most tasks.
2. Here, however, the Bi-LSTM overfit: training stopped after 7 epochs, and validation loss started to increase after 5 epochs.
3. Stopping was implemented using the Keras Model Checkpoint and Early Stopping callbacks.
4. The LSTM model also suffered from overfitting (stopped after 12 epochs).
Solution: Tuning Recurrent_Dropout and Dropout for the Bi-LSTM (not implemented because of time constraints).
NOTE: We settled on LSTM + GloVe vectors for further analysis, because of the comparable performance
and faster implementation.
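The stopping behaviour described above can be sketched as a small loop. This is a simplified illustration of patience-based early stopping, not Keras's implementation and not the project's configuration: the patience value and the loss sequence below are made up so that the best epoch is 5 and training stops at epoch 7.

```python
def early_stopping_epochs(validation_losses, patience=2):
    """Return (best_epoch, stopped_epoch) for per-epoch validation losses.

    Epochs are 1-indexed. Training stops once `patience` epochs have passed
    without improving on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            # A model checkpoint would be saved at this point.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(validation_losses)


# Hypothetical validation losses: improving until epoch 5, then rising.
example_losses = [0.52, 0.48, 0.46, 0.45, 0.44, 0.45, 0.46, 0.47]
best_epoch, stopped_epoch = early_stopping_epochs(example_losses, patience=2)
```

Keeping the checkpoint from the best epoch rather than the last one means the epochs trained after validation loss turned upward do not affect the model that is finally used.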
LSTM Varied Dropout:
Model   Dropout                                Validation loss   Training loss
LSTM    Recurrent_Dropout=0.2, Dropout=0.2     0.4345            0.4301
LSTM    Recurrent_Dropout=0.5, Dropout=0.5     0.4532            0.4476
LSTM    Recurrent_Dropout=0.0, Dropout=0.0     0.4632            0.4375
1. The LSTM with dropout of 0.5, 0.5 shows reduced performance on both the training and validation sets.
2. An LSTM without dropout was also implemented. As expected, it overfits (validation loss 0.4632 against training loss 0.4375).
3. We could also have tried randomized dropout (not implemented because of the time required).
Further Steps to Improve Log-loss:
1) Preprocessing text
a) Replace numbers by ‘n’
b) Handle special words and abbreviations (I’d)
c) Stop-word removal
2) Handling unbalanced classes
3) Question symmetry
● Preprocessing input text is an important step for any NLP task. Why?
● Different preprocessing steps analyzed:
Preprocessing step    Original text          Modified text
Number replacement    I have 100 apples.     I have n apples.
Special words         I’d run a marathon.    I would run a marathon.
Stop word removal     I have 100 apples.     I 100 apples.
Importance of Text Preprocessing:
● Word2Vec and GloVe vectors do not recognize numbers.
● Numbers therefore have to be replaced by n.
● Words like I’d and I’m cause problems when tokenized:
○ I’d - I + d
○ I’m - I + m
● The meaning is lost.
● We need to handle these words separately:
○ I’d - I would
○ I’m - I am
Stop Words Removal:
● Stop words increase the complexity of the system, and they might or might not provide
any additional information.
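The three preprocessing steps above can be sketched with the standard library. This is a minimal sketch, not the project's code: the contraction table covers only the two examples from the text, the stop-word list is a tiny hypothetical one, and the tokenizer is a simple regular expression.

```python
import re

# Only the two contractions discussed above; a real table would be longer.
CONTRACTIONS = {"i'd": "i would", "i'm": "i am"}
# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"have", "a", "the", "is", "of"}


def normalize(text):
    """Lower-case, expand contractions, replace numbers by 'n', then tokenize."""
    text = text.lower().replace("’", "'")
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"\d+(?:\.\d+)?", "n", text)
    return re.findall(r"[a-z']+", text)


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


number_example = normalize("I have 100 apples.")
special_example = normalize("I’d run a marathon.")
stop_example = remove_stop_words(number_example)
```

Contractions are expanded before tokenizing so that I’d is never split into I + d.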
Handling Imbalanced Classes:
● Training data and testing data might have different distributions of positive and negative examples.
● If the training data has considerably more positive examples, the model leans towards positive
predictions, and vice versa.
● In our data,
○ Train set: 36.92% positive examples
○ Test set: 17.46% positive examples
● We need to map the share of positive examples so that it is the same in the train and test sets.
● One positive example in the train set counts for about 0.472 (0.1746 / 0.3692)
positive examples in the test set.
● Similarly, the weight of a negative example in the train set is (1 - 0.1746) / (1 - 0.3692) = 1.309.
● Log loss function = -( 0.472001959 * t * log(y) + 1.309028344 * (1.0 - t) * log(1.0 - y) )
where t: target value, y: predicted value
● Class weights can be provided as a parameter in Keras.
● This is a workaround to achieve better performance on the test set, because in this case the class
distribution of the test set is heavily skewed relative to the train set.
● Ideally, we would split the train, validation and test sets so that class balance is maintained.
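The reweighting above can be checked with a short sketch. It is an illustration, not the project's code: the function names are hypothetical, the loss is averaged over examples (the formula above is per example), and predictions are clipped to avoid log(0). With the rounded shares quoted above the weights come out at roughly 0.47 and 1.31, close to the coefficients in the formula; the small differences are presumably due to rounding of the percentages.

```python
import math

TRAIN_POSITIVE_SHARE = 0.3692  # from the text above
TEST_POSITIVE_SHARE = 0.1746   # from the text above


def class_weights(train_share, test_share):
    """Weights that rescale train-set examples to the test-set class shares."""
    positive = test_share / train_share
    negative = (1.0 - test_share) / (1.0 - train_share)
    return positive, negative


def effective_positive_share(train_share, w_pos, w_neg):
    """Share of positive weight in the train set after reweighting."""
    weighted_pos = w_pos * train_share
    weighted_neg = w_neg * (1.0 - train_share)
    return weighted_pos / (weighted_pos + weighted_neg)


def weighted_log_loss(targets, predictions, w_pos, w_neg, eps=1e-15):
    """Mean class-weighted log loss; t is the target and y the prediction."""
    total = 0.0
    for t, y in zip(targets, predictions):
        y = min(max(y, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(w_pos * t * math.log(y) + w_neg * (1.0 - t) * math.log(1.0 - y))
    return total / len(targets)


w_pos, w_neg = class_weights(TRAIN_POSITIVE_SHARE, TEST_POSITIVE_SHARE)
reweighted_share = effective_positive_share(TRAIN_POSITIVE_SHARE, w_pos, w_neg)
example_loss = weighted_log_loss([1, 0, 0], [0.8, 0.3, 0.1], w_pos, w_neg)
```

After reweighting, the effective share of positive examples in the train set equals the test-set share, which is the purpose of the two weights.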
Question Pair Symmetry:
● We interchanged Q1 and Q2 to see whether question order affects the model.
● A model might sometimes learn features that are related to the order of the questions.
● We swapped Q1 and Q2 in half of the pairs and trained the model.
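The swap described above can be sketched as follows. This is a minimal sketch under assumptions: the text does not say how the half was chosen, so a random half with a fixed seed is used here, and the function name and example pairs are made up.

```python
import random


def swap_half(pairs, seed=0):
    """Swap Q1 and Q2 in a randomly chosen half of (q1, q2, label) pairs."""
    rng = random.Random(seed)
    to_swap = set(rng.sample(range(len(pairs)), len(pairs) // 2))
    return [
        (q2, q1, label) if index in to_swap else (q1, q2, label)
        for index, (q1, q2, label) in enumerate(pairs)
    ]


# Made-up question pairs; label 1 means duplicate.
example_pairs = [
    ("how do i learn python", "what is the best way to learn python", 1),
    ("what is the capital of france", "where is paris", 0),
    ("how old is the sun", "what is the age of the sun", 1),
    ("who wrote hamlet", "who painted the mona lisa", 0),
]
swapped_pairs = swap_half(example_pairs)
```

The labels are untouched because whether two questions are duplicates does not depend on their order.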
Analysis: Different preprocessing steps and their effect on log-loss
● Length of each question - 30
● Recurrent dropout - 0.2
● Dropout - 0.2
● Length of LSTM layer - 100
● Train : Validation - 9:1
Configuration                                                                                       Log-loss values
Symmetry - No, Simple text preprocessing - Yes, Class weighting - No                                0.4034  0.4056  0.4041  0.4065
Symmetry - No, Simple text preprocessing - Yes, Class weighting - Yes                               0.3045  0.2931  0.3104  0.3144
Simple text preprocessing - Yes, Class weighting - Yes                                              0.3033  0.2925  0.3079  0.3109
Symmetry - Yes, Simple text preprocessing - Yes, Class weighting - Yes, Stop word removal - Yes     0.3512  0.3567  0.3626  0.3617
● Underfitting for some models - Restricted to 20 epochs due to time-constraint( The log-loss was still
● Stop word removal decreased the performance. Stop words had contextual information which was lost.
Note: Simple text preprocessing - numbers and special words
Further steps to explore:
● Parameter tuning: dropout in particular can be tuned to attain a better log-loss.
● Analysis using an increased number of epochs.
● Ensemble of models
○ Bagging - avoids overfitting
○ Boosting - helps improve weak models (underfitting)
● Many nouns were observed in the data, which suggests part-of-speech features
○ Replacing proper nouns by nouns
● More text processing: lemmatization, stemming
● Character-level LSTMs
● Word attention models