4. Self-introduction ● Artsem Zhyvalkouski
● From: Minsk, Belarus 🇧🇾
● CS student @ Tokyo City University 🇯🇵
● Did R&D in CV / NLP at a startup
● Love reading NLP papers
● Enjoy learning languages
○ Fluent in 🇧🇾🇷🇺🇬🇧🇯🇵
○ Interested in 🇫🇷🇰🇷
6. Motivation Why did I start working on this competition?
● To familiarize myself with the current SOTA in NLP: Transformer-based models
● To learn the PyTorch🔥 and HuggingFace🤗 libraries
● The dataset is rather small, so there was no need for powerful machines: Colab Pro was enough
● The task seemed fun / unusual and I had time for it
7. Teammates ● Théo
○ Student in applied Mathematics & Machine Learning
○ 10th in Google Quest Q&A
● Anton
○ Works in IT
○ 10th in Google Quest Q&A
● Hikkiiii
○ NLP / QA background
○ 10th in TensorFlow 2.0 Question Answering
9. Task & data ● Task: for a given tweet, predict the word or phrase that best supports the labeled sentiment
● Application example: a business may want to know exactly why people feel a certain way about their product
● Data
○ Train: 27k tweets
○ Public / private test: 4k / 8k tweets
text (given) | sentiment (given) | selected_text (target)
I really really like the song Love Story by Taylor Swift | positive | like
i need to get my computer fixed | neutral | i need to get my computer fixed
Sooo SAD I will miss you here in San Diego!!! | negative | Sooo SAD
11. Problem with labels ● Some labels are noisy
text (given) | sentiment (given) | selected_text (target)
hey mia! totally adore your music. when will your cd be out? | positive | y adore
I know It was worth a shot, though! | positive | as wort
the exact one i was thinking of the bestttt. | positive | e bestttt
13. Transformers ● Transformers like BERT, RoBERTa, BART, etc. have become the default in SOTA NLP
● Pretrained on a huge amount of text
● Somewhat heavy and slow to train
● Can be used in either a NER or a QA setup for this task
● QA worked best
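For instance, a minimal illustration of the QA-style input, with the sentiment acting as the "question" and the tweet as the "context" (an assumed formatting, close to the diagram on slide 16):

```python
from transformers import AutoTokenizer

# The sentiment plays the role of the question, the tweet the context.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
enc = tokenizer("positive",
                "I really really like the song Love Story by Taylor Swift",
                return_offsets_mapping=True)
# e.g. ['<s>', 'positive', '</s>', '</s>', 'I', 'Ġreally', ..., '</s>']
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
```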
15. My models: summary
● RoBERTa-base-squad2, RoBERTa-large-squad2, DistilRoBERTa-base, XLNet-base-cased
● Pretraining on SQuAD 2.0
○ Task pretraining works
● Avg / Max pooling over all layers except the embedding layer
● Multi-Sample Dropout
● AdamW with a linear warmup schedule
● Custom loss: Jaccard-based soft labels (sketch below)
● Best single model: RoBERTa-base-squad2, 5-fold stratified CV: 0.715
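The slides don't spell out the loss, so this is only a sketch of the idea behind Jaccard-based soft labels: each candidate index is weighted by the Jaccard overlap of the span it would produce with the true span, and the model is trained with a soft cross-entropy. The temperature `alpha` is an assumption:

```python
import torch
import torch.nn.functional as F

def jaccard_soft_labels(true_start, true_end, seq_len, alpha=0.3):
    # Weight each candidate index by the Jaccard overlap between the span it
    # would produce (paired with the true boundary) and the true span, then
    # normalize into a distribution. `alpha` is an assumed temperature.
    true_span = set(range(true_start, true_end + 1))
    target = torch.zeros(seq_len)
    for i in range(seq_len):
        cand = set(range(min(i, true_end), max(i, true_end) + 1))
        target[i] = len(cand & true_span) / len(cand | true_span)
    target = target ** (1.0 / alpha)  # sharpen around the true index
    return target / target.sum()

def soft_cross_entropy(logits, soft_target):
    # Cross-entropy against soft labels: -sum(p * log q)
    return -(soft_target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```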
16. My models: architecture
[Architecture diagram: the input "<s> Sentiment </s> </s> Sentence Tokens </s>" goes through the transformer; the outputs of layers 1..n (embeddings excluded) are AvgPool / MaxPool-ed, and an MSD + Dense head produces, for each token, the probability of being the start and the end of the selected text.]
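A rough PyTorch sketch of this setup; the SQuAD2 checkpoint name is real, but the pooling and dropout choices here are assumptions:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SpanModel(nn.Module):
    # Rough sketch of the slide's architecture, not the exact competition code.
    def __init__(self, name="deepset/roberta-base-squad2", n_drops=5):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name, output_hidden_states=True)
        hidden = self.backbone.config.hidden_size
        self.dropouts = nn.ModuleList(nn.Dropout(0.1) for _ in range(n_drops))
        self.fc = nn.Linear(hidden, 2)  # start & end logits per token

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # hidden_states = (embeddings, layer_1, ..., layer_n): drop embeddings
        layers = torch.stack(out.hidden_states[1:], dim=0)
        pooled = layers.mean(dim=0)  # AvgPool over layers (MaxPool: .max(0).values)
        # Multi-Sample Dropout head: average logits over several dropout masks
        logits = torch.stack([self.fc(d(pooled)) for d in self.dropouts]).mean(dim=0)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)
```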
17. Multi-Sample Dropout
● Multi-Sample Dropout for Accelerated Training and Better Generalization (https://arxiv.org/pdf/1905.09788.pdf)
*Image from Jigsaw Unintended Bias in Toxicity Classification 8th place solution
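A minimal sketch of the idea (sample count and dropout rate assumed): run several dropout masks over the same features and average the resulting logits; the paper equivalently averages the per-sample losses.

```python
import torch
import torch.nn as nn

class MultiSampleDropoutHead(nn.Module):
    def __init__(self, in_features, out_features, n_samples=5, p=0.5):
        super().__init__()
        self.dropouts = nn.ModuleList(nn.Dropout(p) for _ in range(n_samples))
        self.fc = nn.Linear(in_features, out_features)

    def forward(self, x):
        # Average logits over several independent dropout samples of x
        return torch.stack([self.fc(d(x)) for d in self.dropouts], dim=0).mean(dim=0)
```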
18. Optimizer and schedule
● AdamW optimizer: Decoupled Weight Decay Regularization (https://arxiv.org/pdf/1711.05101.pdf)
● Linear warmup schedule
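A short sketch wiring these together with HuggingFace's scheduler helper; the learning rate, weight decay, and warmup fraction are illustrative, not the values used:

```python
import torch
from torch import nn
from transformers import get_linear_schedule_with_warmup

# Stand-in model and sizes so the snippet runs; swap in the real ones.
model = nn.Linear(10, 2)
num_epochs, steps_per_epoch = 3, 100
num_training_steps = num_epochs * steps_per_epoch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # linear warmup...
    num_training_steps=num_training_steps,           # ...then linear decay
)
```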
20. Théo’s models ● Transformers
○ BERT-base-uncased
○ BERT-large-uncased-wwm
○ DistilBERT
○ ALBERT-large-v2
● Architecture
○ MSD on the concatenation of the last 8 hidden states
● Training
○ Smoothed categorical cross-entropy
○ Discriminative learning rate (https://arxiv.org/pdf/1801.06146.pdf); see the sketch below
○ Sequence bucketing to speed up training
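A sketch of discriminative learning rates via parameter groups, in the spirit of ULMFiT; the base LR and decay factor are assumptions:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
base_lr, decay = 3e-5, 0.95  # assumed values

layers = [model.embeddings] + list(model.encoder.layer)
param_groups = [
    {"params": layer.parameters(), "lr": base_lr * decay ** depth}
    for depth, layer in enumerate(reversed(layers))  # depth 0 = last layer
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)
```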
21. Anton’s models
● Transformers
○ RoBERTa-base
○ BERTweet: A pre-trained language model for English Tweets (https://arxiv.org/pdf/2005.10200.pdf)
● Architecture
○ Same as Théo’s
● Training
○ Smoothed categorical cross-entropy
○ Discriminative learning rate
○ Custom merges.txt file for RoBERTa
22. Hikkiiii’s models
● Transformers
○ RoBERTa-base
○ RoBERTa-large
● Architecture
○ Append the sentiment token to the end of the text
○ CNN + Linear layer on the concatenation of the last 3 hidden states (sketch below)
● Training
○ Standard cross-entropy loss
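A rough sketch of such a head; the channel count and kernel size are assumptions:

```python
import torch
import torch.nn as nn

class CNNSpanHead(nn.Module):
    def __init__(self, hidden_size=768):
        super().__init__()
        self.conv = nn.Conv1d(hidden_size * 3, 128, kernel_size=3, padding=1)
        self.fc = nn.Linear(128, 2)

    def forward(self, hidden_states):  # tuple from output_hidden_states=True
        x = torch.cat(hidden_states[-3:], dim=-1)         # (batch, seq, 3*hidden)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)  # conv over the token axis
        x = torch.relu(x)
        start_logits, end_logits = self.fc(x).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)
```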
23. Problems ● Are we done? 🤗
○ Transformers are token-level, hence we can’t capture the noisy pattern
○ There is no obvious way to make transformers character-level
○ Character-level RNNs are nowhere near as powerful as transformers
○ We can’t simply blend models with different tokenizations
24. Stacking ● Solution: stacking to the rescue!
○ Convert token probabilities from transformers to char-level by assigning each char the probability of its token (sketch below)
○ Feed the OOF char-level probabilities from several transformers into a char-level NN (stacking)
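The token-to-char conversion is a few lines given the tokenizer's character offsets:

```python
import numpy as np

def token_to_char_probas(token_probas, offsets, text_len):
    # offsets: per-token (start, end) character spans, e.g. from a HuggingFace
    # fast tokenizer called with return_offsets_mapping=True
    char_probas = np.zeros(text_len)
    for p, (start, end) in zip(token_probas, offsets):
        char_probas[start:end] = p  # every char inherits its token's probability
    return char_probas
```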
25. Stacking
[Pipeline diagram: each of the n transformers takes "<sos> Sentiment <sep> Tokens <sep>" and predicts token-level start & end probabilities (target: start / end tokens); using offsets, these are converted to char-level start & end probabilities. The char-level probabilities from all n models are concatenated with the characters, the sentiment, and start & end features, and fed into a Char NN that predicts char-level start & end probabilities (target: start / end chars).]
26. Char-level NN: RNN
[Diagram: transformer-derived start & end probas, character embedding, and sentiment embedding → Bidirectional LSTM ×2 with a skip connection → MSD + Linear → Softmax → start & end probas.]
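A condensed sketch of this stacker; all sizes here are assumptions:

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, n_chars=200, n_models=5, dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        self.sent_emb = nn.Embedding(3, dim)  # positive / neutral / negative
        in_dim = 2 * dim + 2 * n_models       # + start/end probas per model
        self.lstm1 = nn.LSTM(in_dim, dim, bidirectional=True, batch_first=True)
        self.lstm2 = nn.LSTM(2 * dim, dim, bidirectional=True, batch_first=True)
        self.dropouts = nn.ModuleList(nn.Dropout(0.3) for _ in range(5))
        self.fc = nn.Linear(2 * dim, 2)

    def forward(self, chars, sentiment, probas):
        # chars: (B, L) ints, sentiment: (B,) ints, probas: (B, L, 2*n_models)
        s = self.sent_emb(sentiment).unsqueeze(1).expand(-1, chars.size(1), -1)
        x = torch.cat([self.char_emb(chars), s, probas], dim=-1)
        h1, _ = self.lstm1(x)
        h2, _ = self.lstm2(h1)
        h = h1 + h2  # skip connection
        # MSD head over the per-character features
        logits = torch.stack([self.fc(d(h)) for d in self.dropouts]).mean(0)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)
```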
27. Char-level NN: CNN
[Diagram: transformer-derived start & end probas, character embedding, and sentiment embedding → (Conv1D + BatchNorm) ×4 → MSD + Linear → Softmax → start & end probas.]
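The CNN variant in the same sketch style; channel sizes are assumed:

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, in_dim=64, channels=128):
        super().__init__()
        blocks = []
        for i in range(4):  # four Conv1D + BatchNorm blocks
            blocks += [nn.Conv1d(in_dim if i == 0 else channels, channels, 3, padding=1),
                       nn.BatchNorm1d(channels), nn.ReLU()]
        self.convs = nn.Sequential(*blocks)
        self.dropouts = nn.ModuleList(nn.Dropout(0.3) for _ in range(5))
        self.fc = nn.Linear(channels, 2)

    def forward(self, x):  # x: (batch, length, in_dim) per-character features
        h = self.convs(x.transpose(1, 2)).transpose(1, 2)
        logits = torch.stack([self.fc(d(h)) for d in self.dropouts]).mean(0)
        return logits.split(1, dim=-1)  # start & end logits
```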
28. Char-level NN: WaveNet
[Diagram: transformer-derived start & end probas, character embedding, and sentiment embedding → Conv1D + BatchNorm → (WaveBlock + BatchNorm) ×3 → MSD + Linear → Softmax → start & end probas.]
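A sketch of a WaveNet-style block with gated, dilated convolutions and residual connections; the dilation pattern is an assumption:

```python
import torch
import torch.nn as nn

class WaveBlock(nn.Module):
    def __init__(self, channels, kernel_size=3, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.filters, self.gates = nn.ModuleList(), nn.ModuleList()
        for d in dilations:
            pad = (kernel_size - 1) // 2 * d  # keep the sequence length
            self.filters.append(nn.Conv1d(channels, channels, kernel_size,
                                          padding=pad, dilation=d))
            self.gates.append(nn.Conv1d(channels, channels, kernel_size,
                                        padding=pad, dilation=d))

    def forward(self, x):  # (batch, channels, length)
        for f, g in zip(self.filters, self.gates):
            x = x + torch.tanh(f(x)) * torch.sigmoid(g(x))  # gated residual
        return x
```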
29. Char-level NNs: details ● Adam optimizer
● Linear learning rate decay without warmup
● Smoothed cross-entropy loss
● Stochastic Weight Averaging (SWA): Averaging Weights Leads to Wider Optima and Better Generalization (https://arxiv.org/pdf/1803.05407.pdf); see the sketch below
● Select the whole text if predicted start_idx > end_idx
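For SWA, PyTorch ships helpers in torch.optim.swa_utils; a minimal sketch (the SWA start epoch and learning rate here are assumptions):

```python
import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel, SWALR

# Stand-ins so the snippet runs; swap in the real char-level NN and loop.
model = nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
swa_model = AveragedModel(model)               # keeps the running weight average
swa_scheduler = SWALR(optimizer, swa_lr=5e-4)  # assumed SWA learning rate

for epoch in range(10):
    # ... usual training steps on `model` go here ...
    if epoch >= 5:  # start averaging after a few warm epochs (assumption)
        swa_model.update_parameters(model)
        swa_scheduler.step()
# For nets with BatchNorm, torch.optim.swa_utils.update_bn(loader, swa_model)
# recomputes BN statistics before evaluating swa_model.
```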
30. An obvious step ● So now we have a lot of different 1st-level models and different 2nd-level architectures
● If you have ever participated in a tabular data competition, the obvious next step is...
33. Pseudo-labeling ● We used one of our CV 0.7354 blends to pseudo-label the public test data
● Approach from the Google Quest Q&A 1st place solution: “leakless” pseudo-labels
● Confidence score: (start_probas.max() + end_probas.max()) / 2
● Threshold = 0.35 to cut off low-confidence samples
● This gave a fairly robust boost of 0.001-0.002 for many models; see the sketch below
*Image from https://datawhatnow.com/pseudo-labeling-semi-supervised-learning/
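The confidence filter itself is tiny:

```python
import numpy as np

def confident_pseudo_label_ids(start_probas, end_probas, threshold=0.35):
    # start_probas / end_probas: (n_samples, seq_len) arrays of predicted
    # probabilities on the unlabeled test tweets
    scores = (start_probas.max(axis=1) + end_probas.max(axis=1)) / 2
    return np.where(scores > threshold)[0]  # keep only confident samples
```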
37. Finding the “Magic”
● The noise in the labels comes from consecutive spaces (underscores below depict spaces)
Selected text:
onna
Original text:
is _ back _ home _ now _ _ _ _ _ _ gonna _ miss _ every _ one
● We assumed the extra spaces were removed during annotation
Text with spaces cleaned:
is _ back _ home _ now _ gonna _ miss _ every _ one
Annotation, on the cleaned text:
is _ back _ home _ now _ gonna _ miss _ every _ one
➔ Stores the start and end indices (?)
● This causes problems when retrieving the labels on the original text
Retrieved label, on the original text:
is _ back _ home _ now _ _ _ _ _ _ gonna _ miss _ every _ one
➔ The 5 removed spaces offset the label by 5 characters
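A tiny reproduction of this effect, using the example from the slide:

```python
import re

# Reproduce the annotation bug: the span was selected on a text with
# consecutive spaces collapsed, but the indices were applied to the
# original text, shifting the label by the number of removed spaces.
original = "is back home now      gonna miss every one"  # 6 spaces before "gonna"
cleaned = re.sub(r" +", " ", original)

start = cleaned.find("miss")
end = start + len("miss")
print(cleaned[start:end])   # 'miss' -- the intended label on the cleaned text
print(original[start:end])  # 'onna' -- the noisy label from the slide
```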
38. Using the “Magic”
● Then, we can post-process our predictions to retrieve the noise
Selected text:
onna
Original text:
is _ back _ home _ now _ _ _ _ _ _ gonna _ miss _ every _ one
● We use a transformer to get the start/end tokens, and we run it on the cleaned text
Assuming the model perfectly predicts “miss”, the start and end predictions would look like this:
is _ back _ home _ now _ gonna _ miss _ every _ one
0 … 0 1 1 1 1 0 ... 0
0 … 0 1 1 1 1 0 ... 0
Because transformers work at the token level, the whole word gets selected
● Finally, we align those predictions with the original text
Prediction on the noisy data:
is _ back _ home _ now _ _ _ _ _ _ gonna _ miss _ every _ one
0 … 0 1 1 1 1 0 ... 0
0 … 0 1 1 1 1 0 ... 0
● This matches the noisy label!
39. Why we didn’t use the “Magic” ● We found the pattern pretty late and didn’t have enough time to leverage it directly
● Eventually, our 2nd-level models learnt the pattern even better than simple pre/post-processing
41. 2nd place solution
● Pre/post-process following the “Magic”
● Sample tweets equally according to their sentiment to mitigate the imbalance within batches
● Reranking model (sketch below)
1. Store the top n candidates from the base model and assign a “step_1_score” accordingly
2. Train a RoBERTa to predict the Jaccard of each candidate: “step_2_score”
3. Choose the best candidate by: step_2_score + step_1_score * 0.5
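The final selection rule as a sketch; the candidate dicts here are illustrative:

```python
def rerank(candidates):
    # candidates: list of dicts with "text", "step_1_score" (base model) and
    # "step_2_score" (predicted Jaccard from the reranking RoBERTa)
    return max(candidates, key=lambda c: c["step_2_score"] + 0.5 * c["step_1_score"])

best = rerank([
    {"text": "Sooo SAD", "step_1_score": 0.9, "step_2_score": 0.8},
    {"text": "SAD", "step_1_score": 0.6, "step_2_score": 0.9},
])
print(best["text"])  # "Sooo SAD"
```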
45. Conclusion
● Transformers are perfect for QA
● PyTorch🔥 & HuggingFace🤗 are awesome
● Transformers can be extended to char-level
● Diversity rules
● Annotation process is crucial
● Teaming up is rewarding
● Kaggle is great for learning, though sometimes impractical
● Kaggle community is amazing
46. My links
● Follow me on
○ LinkedIn: https://www.linkedin.com/in/zhyvalkouski
○ Kaggle: https://www.kaggle.com/aruchomu
○ Twitter: https://twitter.com/artem_aruchomu
○ GitHub: https://github.com/heartkilla
● I’m open to new opportunities!