  1. Quora Question Pairs By: Shradha Sunil Peddamail Jayavardhan Reddy
  2. Dataset: The datasets for this project were obtained from Kaggle. The files are:
      1. Train set: size 21.17 MB; 404,290 question pairs
      2. Test set: size 112.47 MB; 2,345,796 question pairs
  4. Word Embeddings:
      1. Word2Vec:
        ○ Dimension: 300d
        ○ Number of words: 3 million
        ○ Training set: 100 billion words from the Google News dataset
      2. GloVe vector representation:
        ○ Dimensions: 50d, 100d, 200d, 300d
        ○ Number of words: 400k
        ○ Training set: 6 billion tokens from Wikipedia 2014 + Gigaword 5
        ○ We used the 100d GloVe vectors.
      APIs used: 1. Keras 2. TensorFlow 3. Gensim 4. scikit-learn 5. Plotly, Seaborn (visualizations)
  5. Implementation of simpler models:
      1. TF-IDF features were computed from the question text as a pre-processing step.
      2. The following models were implemented using the sklearn library: 1. Random forest 2. Logistic regression 3. Naive Bayes 4. Decision tree 5. SVM
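To make the TF-IDF step concrete, here is a minimal standard-library sketch of the weighting itself. It is not the authors' code: the whitespace tokenizer, the unsmoothed `log(N / df)` formula and the example questions are all assumptions for illustration. sklearn's `TfidfVectorizer` applies a smoothed IDF and L2 normalisation by default, so its numbers will differ.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using plain TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical example questions (not taken from the dataset).
questions = [
    "how do i learn python",
    "how do i learn java",
    "what is the capital of france",
]
vectors = tfidf_vectors(questions)
```

Words shared by several questions ("how", "learn") get a lower weight than words that single out one question ("python"). The resulting vectors are the kind of feature a classifier such as logistic regression is trained on.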
  6. Results:
      Model | Training accuracy | Validation accuracy | Training loss | Validation loss
      Random forest | 0.775 | 0.721 | 0.685315 | 0.697242
      Logistic regression | 0.685 | 0.669 | 0.72543 | 0.741672
      Decision tree | 0.672 | 0.695 | 0.75232 | 0.74327
      Naive Bayes | 0.574 | 0.633 | 0.8472 | 0.8253
      SVM | 0.429 | 0.553 | 0.8723 | 0.8244
  7. Neural Network Architectures: 1. Convolutional Neural Network (CNN) 2. Long Short-Term Memory Network (LSTM) 3. Bidirectional LSTM
  8. Simple Siamese Architecture: [Diagram: Input 1 and Input 2 each pass through a neural network, and the two networks share their weights. Output 1 and Output 2 are fed to a similarity measure, which produces the final output.]
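The Siamese idea, one encoder with shared weights applied to both inputs followed by a similarity measure, can be sketched without any deep learning library. Everything specific here is an assumption for illustration: the encoder simply averages word vectors, the similarity measure is cosine similarity, and `toy_embeddings` is a made-up two-dimensional lookup table. The models in these slides use CNN or LSTM encoders with a merge and dense layer instead.

```python
import math

def encode(tokens, embeddings):
    """Shared encoder: average the word vectors of the known tokens."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def siamese_score(question1, question2, embeddings):
    # The same encoder (same weights) processes both inputs.
    out1 = encode(question1.lower().split(), embeddings)
    out2 = encode(question2.lower().split(), embeddings)
    return cosine_similarity(out1, out2)

# Hypothetical 2-d embeddings, chosen by hand for the example.
toy_embeddings = {
    "learn": [1.0, 0.0], "study": [0.9, 0.1],
    "python": [0.0, 1.0], "cooking": [-1.0, 0.2],
}
similar = siamese_score("learn python", "study python", toy_embeddings)
different = siamese_score("learn python", "study cooking", toy_embeddings)
```

Because both questions go through the same `encode` function, swapping them leaves this particular score unchanged. The learned merge and dense layers used in the slides do not necessarily have that property.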
  9. Siamese Architecture Using CNN:
  10. Siamese Architecture with CNN with GloVe Representation:
  11. Embedding layer (first layer): The first layer in deep NLP architectures. It maps each input word to its corresponding word embedding. Its output is fed into the convolution layer.
      Conv1D layer: Performs a 1D convolution over the word embeddings produced by the embedding layer. Its parameters are the number of filters, the kernel size, the stride length and the input shape.
      Max-pooling layer: Max pooling is a sample-based discretization process. It down-samples the input representation, reducing its dimensionality and allowing assumptions to be made about the features contained in each sub-region.
      Dropout: Applied to prevent overfitting.
      Flatten: The flatten layer flattens the output of the CNN layers into a single vector.
      Merge layer: Combines the vector outputs of the two CNNs. Each CNN produces a sentence vector, and the merge layer combines them into a single output vector.
      Dense layer: A regular fully connected layer. It converts the merged vector into a number between 0 and 1, which is the similarity predicted by the network.
      Parameters within the CNN:
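As a rough illustration of what the Conv1D and max-pooling layers compute, the sketch below applies a single filter to a short sequence of word vectors and then pools the result. It is a plain-Python toy, not the Keras layers used in the project. The ReLU activation, the "valid" padding, the pool size and all numbers are assumptions.

```python
def conv1d(sequence, kernel, stride=1):
    """Slide one filter over a sequence of word vectors ("valid" padding).

    sequence: list of word vectors, each a list of floats.
    kernel:   list of weight vectors covering kernel_size consecutive words.
    """
    kernel_size = len(kernel)
    outputs = []
    for start in range(0, len(sequence) - kernel_size + 1, stride):
        window = sequence[start:start + kernel_size]
        activation = sum(
            w * x
            for weights, vector in zip(kernel, window)
            for w, x in zip(weights, vector)
        )
        outputs.append(max(0.0, activation))  # ReLU (assumed activation)
    return outputs

def max_pool1d(values, pool_size=2):
    """Keep the largest value in each non-overlapping window."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]

# Five "words" with 2-d embeddings and one filter of kernel size 2.
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 0.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]
feature_map = conv1d(embedded, kernel)   # [2.0, 1.0, 1.0, 0.0]
pooled = max_pool1d(feature_map)         # [2.0, 1.0]
```

A real Conv1D layer learns many such filters at once. The number of filters, kernel size and stride mentioned on the slide correspond to how many kernels there are, `len(kernel)` and `stride` here.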
  12. Model | Number of layers | Dropout | Validation loss | Training loss
      CNN | 2 | 0.5 | 0.54606734 | 0.544829
      CNN | 3 | 0.4 | 0.6147397 | 0.61473978
      Observations of several CNN models with varying dropout and number of layers after training for 20 epochs: In images, more layers lead to more abstraction, but for text we found that two layers gave the best results. This is because most of the features are learnt in the first layer itself, so its activations are simply passed through and deeper layers learn nothing new.
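  13. Siamese Architecture Using LSTM: [Diagram: Question 1 and Question 2 each pass through an embedding layer followed by an LSTM, with weights shared between the two branches. The two outputs are merged and fed to a dense layer.]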
  14. Siamese Architecture with LSTM with Word2Vec 300d Representation:
  15. Siamese Architecture with LSTM with Glove 100d Representation:
  16. Comparison of Word2Vec vs GloVe:
      Word representation | Model | Validation log-loss (20 epochs) | Time taken
      Word2Vec | Siamese LSTM | 0.423 | 80 mins per epoch
      GloVe | Siamese LSTM | 0.434 | 25 mins per epoch
      Observations: 1. The log-loss is comparable. 2. GloVe is considerably faster than Word2Vec.
      ● We used 100-dimensional GloVe vectors for the rest of the analysis (fewer dimensions, lower complexity, faster training).
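  17. Bidirectional LSTM: [Diagrams: In the bidirectional LSTM, the embeddings of Word 1, Word 2 and Word 3 feed two LSTM chains, the outputs of which are concatenated per word and then aggregated. In the plain LSTM, the embeddings feed a single LSTM chain whose outputs are aggregated.]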
  18. Comparison between Bi-LSTM and LSTM:
      Model | Dropout | Validation loss | Time taken
      Bidirectional LSTM | Recurrent_Dropout=0.2, Dropout=0.2 | 0.44385 | 90 mins
      LSTM | Recurrent_Dropout=0.2, Dropout=0.2 | 0.43454 | 25 mins
      Observations:
      1. Bi-LSTM performs better than LSTM for most tasks.
      2. Here, however, it overfit: training stopped after 7 epochs, because the validation loss started to increase after 5 epochs.
      3. Stopping was implemented using the Keras ModelCheckpoint and EarlyStopping callbacks.
      4. The LSTM model also suffered from overfitting (stopped after 12 epochs).
      Solution: Tune Recurrent_Dropout and Dropout for the Bi-LSTM (not implemented because of time constraints).
      NOTE: We settled on LSTM + GloVe vectors for further analysis, because of the comparable performance and faster training.
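The early-stopping behaviour described on this slide can be sketched as a small loop. This is only a model of the logic, not the Keras callbacks themselves. The `patience` value of 2 is an assumption (the slides do not state it), and the loss values are made up so that the minimum falls at epoch 5.

```python
def train_with_early_stopping(val_losses, patience=2):
    """Return (epochs_run, best_epoch) for a stream of validation losses.

    Mirrors the idea of checkpointing the best model and stopping once the
    validation loss has not improved for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses, start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # "checkpoint" this epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return epochs_run, best_epoch

# Made-up validation losses that bottom out at epoch 5 (illustrative only).
illustrative_losses = [0.52, 0.49, 0.47, 0.455, 0.444, 0.450, 0.458, 0.47, 0.48]
epochs_run, best_epoch = train_with_early_stopping(illustrative_losses)
```

With these hypothetical numbers the loop keeps the epoch-5 weights and stops at epoch 7. In Keras the same effect is usually obtained by passing `EarlyStopping(monitor="val_loss", patience=...)` and `ModelCheckpoint(..., save_best_only=True)` in the `callbacks` list of `model.fit`.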
  19. LSTM with Varied Dropout:
      Model | Dropout | Validation loss | Training loss
      LSTM | Recurrent_Dropout=0.2, Dropout=0.2 | 0.4345 | 0.4301
      LSTM | Recurrent_Dropout=0.5, Dropout=0.5 | 0.4532 | 0.4476
      LSTM | Recurrent_Dropout=0.0, Dropout=0.0 | 0.4632 | 0.4375
      Observations:
      1. The LSTM with dropout of 0.5, 0.5 performs worse on both the training and the validation set.
      2. The LSTM without dropout overfits, as expected: its validation loss is well above its training loss.
      3. We could also have tried randomized dropout (not implemented because of the time required).
  20. Further Steps to Improve Log-loss:
      1) Preprocessing text
         a) Replace numbers by ‘n’
         b) Handle special words and abbreviations (e.g. I’d)
         c) Stop-word removal
      2) Handling unbalanced classes
      3) Question symmetry
  21. Preprocessing Text:
      ● Preprocessing input text is an important step for any NLP task. Why?
      ● Different preprocessing steps analyzed:
      Preprocessing step | Original text | Modified text
      Number replacement | I have 100 apples. | I have n apples.
      Special words | I’d run a marathon. | I would run a marathon.
      Stop word removal | I have 100 apples. | I 100 apples
  22. Importance of Text Preprocessing:
      Number replacement:
      ● Word2Vec and GloVe vectors don’t recognize numbers.
      ● Numbers have to be replaced by ‘n’.
      Special words:
      ● Words like I’d and I’m cause problems when tokenized:
        ○ I’d → I + d
        ○ I’m → I + m
      ● The meaning is lost.
      ● We need to handle these words separately:
        ○ I’d → I would
        ○ I’m → I am
      Stop word removal:
      ● Stop words increase the complexity of the system, and they may or may not provide additional information.
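The three preprocessing steps above can be sketched with the standard library as follows. The `CONTRACTIONS` and `STOP_WORDS` tables are deliberately tiny, hypothetical stand-ins. The lowercasing and the tokenizing regular expression are also assumptions, not the authors' exact pipeline.

```python
import re

# Hypothetical, deliberately short lookup tables for illustration.
CONTRACTIONS = {"i'd": "i would", "i'm": "i am"}
STOP_WORDS = {"a", "an", "the", "have", "is", "of"}

def preprocess(text, remove_stop_words=False):
    """Lowercase, expand contractions, replace numbers by 'n', tokenize."""
    text = text.lower().replace("\u2019", "'")   # normalise curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = re.sub(r"\d+(?:\.\d+)?", "n", text)   # replace numbers by 'n'
    tokens = re.findall(r"[a-z']+", text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess("I have 100 apples."))     # i have n apples
print(preprocess("I\u2019d run a marathon."))  # i would run a marathon
```

Expanding the contraction before tokenizing is what stops "I’d" from being split into "I" and "d". Stop-word removal is kept optional here.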
  23. Handling Imbalanced Classes:
      ● Training and test data may have different distributions of positive and negative examples.
      ● If the training data has considerably more positive examples, the model leans towards predicting positive, and vice versa.
      ● In our data:
        ○ Train set: 36.92% positive examples
        ○ Test set: 17.46% positive examples
      ● We need to reweight the training examples so that the share of positive examples matches the test set.
      ● Each positive example in the train set counts as roughly 0.47 (0.1746 / 0.3692) positive examples in the test set.
      ● Similarly, each negative example in the train set has weight (1 - 0.1746) / (1 - 0.3692), roughly 1.31.
      ● Log-loss function = -(0.472001959 * t * log(y) + 1.309028344 * (1.0 - t) * log(1.0 - y)), where t is the target value and y is the predicted value.
      ● Class weights can be passed as a parameter in Keras.
      Note:
      ● This is a workaround to achieve better performance on the test set, whose class distribution is heavily skewed relative to the train set.
      ● Ideally, the train, validation and test sets would be split so that they keep the same class balance.
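The class-weight arithmetic and the weighted log-loss from this slide can be checked with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, the loss is averaged over examples, and the targets and predictions are made-up values. The shares 0.3692 and 0.1746 are the ones quoted on the slide.

```python
import math

def class_weights(train_pos_share, test_pos_share):
    """Weights that rescale the training class mix to the test class mix."""
    return {
        1: test_pos_share / train_pos_share,
        0: (1.0 - test_pos_share) / (1.0 - train_pos_share),
    }

def weighted_log_loss(targets, predictions, weights, eps=1e-15):
    """Mean of -(w1 * t * log(y) + w0 * (1 - t) * log(1 - y))."""
    total = 0.0
    for t, y in zip(targets, predictions):
        y = min(max(y, eps), 1.0 - eps)  # avoid log(0)
        total -= (weights[1] * t * math.log(y)
                  + weights[0] * (1 - t) * math.log(1 - y))
    return total / len(targets)

weights = class_weights(train_pos_share=0.3692, test_pos_share=0.1746)
# Made-up targets and predictions, just to exercise the function.
loss = weighted_log_loss([1, 0, 0], [0.8, 0.3, 0.1], weights)
```

With the quoted shares the weights come out at roughly 0.47 for positives and 1.31 for negatives, and applying them to the training set brings its effective positive share down to 17.46%. In Keras such a dictionary can be passed as `class_weight={0: ..., 1: ...}` to `model.fit`.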
  24. Question Pair Symmetry:
      ● We interchanged Q1 and Q2 to see whether question order affects the model.
      ● A model can sometimes learn features that depend on the question order.
      ● We swapped Q1 and Q2 in half of the pairs and trained the model.
      Analysis: Different preprocessing steps and their effect on log-loss
      Model parameters:
      ● Length of each question: 30
      ● Recurrent dropout: 0.2
      ● Dropout: 0.2
      ● Length of LSTM layer: 100
      ● Train : validation split: 9:1
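The question-swapping step described on this slide could look like the sketch below. How the authors chose which half to swap is not stated, so the random selection, the `seed` parameter and the example pairs are assumptions.

```python
import random

def swap_half(pairs, seed=0):
    """Return a copy of (q1, q2, label) rows with half of them order-swapped."""
    rng = random.Random(seed)
    swap_indices = set(rng.sample(range(len(pairs)), len(pairs) // 2))
    return [
        (q2, q1, label) if i in swap_indices else (q1, q2, label)
        for i, (q1, q2, label) in enumerate(pairs)
    ]

# Hypothetical question pairs with duplicate labels (1 = duplicate).
pairs = [
    ("how do i learn python", "what is the best way to learn python", 1),
    ("what is the capital of france", "how old is the eiffel tower", 0),
    ("how can i lose weight", "what is a good way to lose weight", 1),
    ("who wrote hamlet", "who painted the mona lisa", 0),
]
swapped = swap_half(pairs)
```

The labels are untouched, because whether two questions are duplicates does not depend on their order. Only the position in which each question reaches the network changes.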
  25. Model Comparison:
      Steps | Train loss | Validation loss | Public test loss | Private test loss
      Symmetry: No, Simple text preprocessing: Yes, Class weighting: No | 0.4034 | 0.4056 | 0.4041 | 0.4065
      Symmetry: No, Simple text preprocessing: Yes, Class weighting: Yes | 0.3045 | 0.2931 | 0.3104 | 0.3144
      Symmetry: Yes, Simple text preprocessing: Yes, Class weighting: Yes | 0.3033 | 0.2925 | 0.3079 | 0.3109
      Symmetry: Yes, Simple text preprocessing: Yes, Class weighting: Yes, Stop word removal: Yes | 0.3512 | 0.3567 | 0.3626 | 0.3617
      Observations:
      ● Some models are underfitted: training was restricted to 20 epochs due to time constraints, and the log-loss was still decreasing.
      ● Stop word removal decreased performance. The stop words carried contextual information that was important.
      Note: Simple text preprocessing = number replacement and special-word handling.
  26. Further Steps to Explore:
      ● Parameter tuning, especially of dropout, to attain a better log-loss.
      ● Analysis with an increased number of epochs.
      ● Ensembles of models
        ○ Bagging: helps avoid overfitting
        ○ Boosting: helps improve weak (underfitting) models
      ● Part-of-speech features, since we observed a lot of nouns in the data
        ○ Replacing proper nouns by a generic noun
      ● More text processing: lemmatization, stemming
      ● Character-level LSTMs
      ● Word attention models
  28. Thank you. Any questions?