NAVER NLP Challenge Retrospective
이신의, 박장원, 박종성
Data Engineering Lab, Yonsei University
2018.12.28
Contents
1. Model
2. Hyperparameter
3. To Be Considered
MODEL: Architecture for Named Entity Recognition
Model: Model Architecture (architecture diagram slide)
Model: 1. Input Layer (diagram slide)
Model: 1. Input Layer
1. Eojeol (word)
• Dense word vectors are built with a pretrained fastText word embedding
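A minimal sketch of this step, assuming a gensim-loaded Korean fastText binary; the file name and helper below are illustrative, not the authors' code:

```python
# Sketch: map each word (eojeol) to a dense vector with pretrained fastText.
# The model file name and the use of gensim are assumptions for illustration.
import numpy as np
from gensim.models.fasttext import load_facebook_model

ft = load_facebook_model("cc.ko.300.bin")  # hypothetical pretrained Korean fastText

def word_vector(word: str) -> np.ndarray:
    # fastText composes subword n-grams, so even unseen words get a vector.
    return ft.wv[word]

sentence = ["네이버", "NLP", "Challenge", "후기"]
word_vectors = np.stack([word_vector(w) for w in sentence])  # (seq_len, 300)
```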
2. Syllable (character)
• Most syllables, such as “을” and “과”, also exist in the fastText word embedding
• The same fastText model turns each syllable of a word into a dense character vector
• The syllables are encoded with a Bi-LSTM, and its last hidden state is used as the word representation
• The syllable-level pretrained embedding contributed more to the performance gain than the word-level one!
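A sketch (PyTorch assumed) of encoding a word's syllable vectors with a Bi-LSTM and taking the final hidden states as the word representation; all dimensions are illustrative:

```python
# Sketch: encode the syllables of each word with a Bi-LSTM and use the
# concatenated final hidden states as an additional word representation.
import torch
import torch.nn as nn

class SyllableEncoder(nn.Module):
    def __init__(self, emb_dim=300, hidden=100):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, syllable_vecs):            # (n_words, n_syllables, emb_dim)
        _, (h_n, _) = self.lstm(syllable_vecs)   # h_n: (2, n_words, hidden)
        # Concatenate the last forward and last backward hidden states.
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (n_words, 2 * hidden)

enc = SyllableEncoder()
word_repr = enc(torch.randn(8, 4, 300))  # e.g. 8 words of 4 syllables each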
3. Word-level named-entity one-hot encoding
• A dictionary built from the training data records which entity class each word belongs to
• Words that never appear in the training data fall back to the “O” class
• The performance gap with and without this encoding is substantial
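A sketch of how such a word-to-entity dictionary could be built; the tag set and the B-/I- prefix handling below are assumptions, not the exact challenge label scheme:

```python
# Sketch: build a word -> entity-class dictionary from the training data and
# emit a one-hot feature per word; unseen words fall back to "O".
TAGS = ["O", "PER", "LOC", "ORG", "DAT"]       # assumed tag set for illustration
TAG2IDX = {t: i for i, t in enumerate(TAGS)}

def build_gazetteer(train_sentences):
    # train_sentences: iterable of (words, tags) pairs from the training set
    gazetteer = {}
    for words, tags in train_sentences:
        for w, t in zip(words, tags):
            if t != "O":
                gazetteer[w] = t.split("-")[-1]  # keep the entity type, drop B/I
    return gazetteer

def one_hot(word, gazetteer):
    vec = [0.0] * len(TAGS)
    vec[TAG2IDX[gazetteer.get(word, "O")]] = 1.0
    return vec

train = [(["김연아", "선수", "가"], ["B-PER", "O", "O"])]   # toy example
gaz = build_gazetteer(train)
print(one_hot("김연아", gaz), one_hot("미등록어절", gaz))
```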
Model: 2. Bi-LSTM
Bi-directional LSTM
• Vanilla LSTM, LayerNorm LSTM, and Coupled-Input-Forget-Gate LSTM were compared
• Layer normalization at every time step slowed training down
• Coupled Input Forget Gate (CIFG) LSTM
- Fewer parameters, faster, and higher performance
$f_t = \sigma(W_{xh\_f}\, x_t + W_{hh\_f}\, h_{t-1} + b_{h\_f})$
$o_t = \sigma(W_{xh\_o}\, x_t + W_{hh\_o}\, h_{t-1} + b_{h\_o})$
$g_t = \tanh(W_{xh\_g}\, x_t + W_{hh\_g}\, h_{t-1} + b_{h\_g})$
$c_t = f_t \odot c_{t-1} + (1 - f_t) \odot g_t$
$h_t = o_t \odot \tanh(c_t)$
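A minimal PyTorch sketch of one CIFG step following the equations above; the fused projections are an implementation choice, and a batched bidirectional version would wrap this cell:

```python
# Sketch of a single coupled input-forget-gate (CIFG) LSTM step:
# the input gate is replaced by (1 - f_t), saving one gate's parameters.
import torch
import torch.nn as nn

class CIFGCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One fused projection each for the f, o and g pre-activations.
        self.x2h = nn.Linear(input_size, 3 * hidden_size)
        self.h2h = nn.Linear(hidden_size, 3 * hidden_size)

    def forward(self, x_t, state):
        h_prev, c_prev = state
        f, o, g = (self.x2h(x_t) + self.h2h(h_prev)).chunk(3, dim=-1)
        f, o, g = torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c_t = f * c_prev + (1.0 - f) * g       # coupled gates: i_t = 1 - f_t
        h_t = o * torch.tanh(c_t)
        return h_t, c_t

cell = CIFGCell(input_size=300, hidden_size=128)
h, c = cell(torch.randn(4, 300), (torch.zeros(4, 128), torch.zeros(4, 128)))
```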
Model: 3. Self-Attention
Attention
• Self-alignment among the hidden states of the Bi-LSTM
• The attention score of a position with itself is set to -∞
• Multi-head dot-product attention as in the Transformer
- $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$, with $Q_i^T K_j := -\infty$ if $i = j$
- $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_8)\, W^O$
- $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
- $Q, K, V :=$ hidden states of the Bi-LSTM
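A sketch of this masked multi-head self-attention, assuming PyTorch and 8 heads; the fused QKV projection is an implementation choice, not necessarily the authors':

```python
# Sketch: multi-head dot-product self-attention over the Bi-LSTM hidden states,
# with the diagonal of the score matrix set to -inf so each position cannot
# attend to itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)            # W^O projection

    def forward(self, h):                                  # h: (B, T, d_model)
        B, T, _ = h.shape
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        def split(x):                                      # -> (B, heads, T, d_head)
            return x.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # (B, heads, T, T)
        diag = torch.eye(T, dtype=torch.bool, device=h.device)
        scores = scores.masked_fill(diag, float("-inf"))   # no self-alignment
        ctx = F.softmax(scores, dim=-1) @ v                # (B, heads, T, d_head)
        ctx = ctx.transpose(1, 2).reshape(B, T, -1)
        return self.out(ctx)

attn = MaskedSelfAttention(d_model=256)
out = attn(torch.randn(2, 20, 256))                        # (2, 20, 256)
```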
Model: 4. Masked Separable Conv1d (diagram slides)
Model: 4. Masked Separable Conv1d for a richer representation
• Masking for autoregressive behavior
- Zero padding of (kernel size - 1) on the left so that the convolution at the current time step cannot see tokens beyond it
• Depth-wise convolution
- A separate convolution for each channel
• Pointwise convolution
- 1x1 convolutions provide the interaction across channels
• Positional embedding
- After the convolution, a sinusoid embedding (as in the Transformer) is added
• The feature maps from all kernel sizes are concatenated (see the sketch after this slide)
Pictures from https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728
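A sketch of one masked separable Conv1d branch under these assumptions (PyTorch, a single kernel size, channel count chosen for illustration, without the sinusoid positional embedding or the concatenation over kernel sizes):

```python
# Sketch: causal ("masked") depthwise-separable Conv1d. Left zero-padding of
# (kernel_size - 1) keeps each output from seeing future time steps;
# depthwise + pointwise convolutions replace one full convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSeparableConv1d(nn.Module):
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.pad = kernel_size - 1
        # Depthwise: one filter per channel (groups=channels).
        self.depthwise = nn.Conv1d(channels, channels, kernel_size, groups=channels)
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):                       # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))             # pad on the left only => causal
        return self.pointwise(self.depthwise(x))

conv = MaskedSeparableConv1d(channels=256, kernel_size=3)
out = conv(torch.randn(8, 256, 50))             # (8, 256, 50), no future leakage
```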
What is a Hyperparameter?
“In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training. Given these hyperparameters, the training algorithm learns the parameters from the data.”
Hyperparameter: Alchemy? Black Magic?
• The model alone reached F1 88.92%
• The remaining ~0.8% up to F1 89.72% came from hyperparameter tuning (a.k.a. alchemy)
• Building the model took 2 days; the alchemy took 7 days...
“There’s a place for alchemy. Alchemy worked.” - Ali Rahimi
Hyperparameter: 1. Train-Dev Split / Learning Rate Decay
Train-Dev Split
• Train : Dev = 86,000 : 4,000
• The Dev F1 score tracks the Test F1 score (with a gap of about 0.5%)
Learning Rate Scheduling
• Even with Adam, the learning rate needed adjustment
- Measured by the Dev F1 score, performance dropped from the second epoch
• Learning rate decay was used (see the sketch below)
- The learning rate is multiplied by 0.1 after every epoch
• From the third epoch on, lowering the learning rate further did not improve F1...
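A sketch of this schedule with PyTorch's ExponentialLR; the stand-in model and initial learning rate are illustrative, not the authors' settings:

```python
# Sketch: multiply the Adam learning rate by 0.1 at the end of every epoch.
import torch
import torch.nn as nn

model = nn.Linear(10, 5)                      # stand-in for the full NER model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.1)

for epoch in range(3):
    # ... run one training epoch here ...
    scheduler.step()                          # lr <- lr * 0.1
    print(epoch, scheduler.get_last_lr())     # roughly [1e-4], [1e-5], [1e-6]
```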
Hyperparameter: 2. Auxiliary Loss
Language Model Loss
• Representation learning with a language model improves downstream task performance (BERT, ELMo)
• The goal was a richer representation without using any external data
• The convolution feature maps are shared between NER and the language model
• The LM predicts the next word given the preceding m words
• Language model loss $= \sum_{k=1}^{n} -\log p(t_k \mid t_{k-m}, \ldots, t_{k-1}; \theta)$
• Loss = original loss + 0.5 * language model loss (see the sketch below)
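A sketch of how the combined loss could look; the tensor names, shapes, and the lm_head projection are assumptions for illustration, not the authors' code:

```python
# Sketch: combine the NER loss with a next-word language-model loss computed
# from the same convolutional feature maps, weighted by 0.5.
import torch
import torch.nn as nn
import torch.nn.functional as F

def total_loss(features, ner_logits, ner_labels, lm_head, next_word_ids):
    # features:      (batch, seq_len, dim)    shared conv feature maps
    # ner_logits:    (batch, seq_len, n_tags)
    # next_word_ids: (batch, seq_len)         id of the following word at each step
    ner_loss = F.cross_entropy(ner_logits.transpose(1, 2), ner_labels)
    lm_logits = lm_head(features)                          # (batch, seq_len, vocab)
    lm_loss = F.cross_entropy(lm_logits.transpose(1, 2), next_word_ids)
    return ner_loss + 0.5 * lm_loss

# Toy shapes for illustration
B, T, D, V, C = 2, 7, 32, 100, 5
lm_head = nn.Linear(D, V)
loss = total_loss(torch.randn(B, T, D), torch.randn(B, T, C),
                  torch.randint(C, (B, T)), lm_head, torch.randint(V, (B, T)))
```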
Hyperparameter: 3. Dropout
Dropout
• The default dropout rate is 25%
• The LSTM uses variational dropout at only 10% (see the sketch below)
- The dropout mask is shared across all time steps
• Dropout is applied only in the “right” places
- Dropout right after the embedding or an FC layer was counterproductive
- It is applied just before the multi-head attention and the FC layer
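A sketch of variational ("locked") dropout in the sense of Gal [4]: one mask is sampled per sequence and reused at every time step. The (batch, time, features) layout is an assumption:

```python
# Sketch: variational dropout -- one dropout mask per example, broadcast
# over the time dimension so every time step sees the same mask.
import torch
import torch.nn as nn

class VariationalDropout(nn.Module):
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p

    def forward(self, x):                       # x: (batch, time, features)
        if not self.training or self.p == 0.0:
            return x
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - self.p)
        return x * mask / (1 - self.p)          # inverted-dropout scaling

vd = VariationalDropout(p=0.1)
vd.train()
out = vd(torch.randn(4, 20, 300))   # the same mask is applied at all 20 steps
```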
Hyperparameter: 4. Other Options
Exponential Moving Average
• $W_{t+1} \leftarrow 0.999\, W_t + 0.001\, W_{t+1}$ (see the sketch below)
Regularization
• L2 regularization applied to the weights
• L2 weight decay = 1e-7
Batch Size
• Batch sizes of 8, 16, 32, and 64 were tried
• A batch size of 8 gave a small but measurable improvement
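A sketch of the weight EMA with decay 0.999; keeping a deep copy of the model for the averaged weights is one common way to implement it, not necessarily the authors':

```python
# Sketch: exponential moving average of the weights,
# W_ema <- 0.999 * W_ema + 0.001 * W_current, updated after every step.
import copy
import torch
import torch.nn as nn

model = nn.Linear(10, 5)              # stand-in for the full NER model
ema_model = copy.deepcopy(model)      # holds the averaged weights for evaluation

def update_ema(decay=0.999):
    with torch.no_grad():
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1 - decay)

# call update_ema() after each optimizer step; evaluate with ema_model
update_ema()
```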
To Be Considered: Things that we couldn't use…
To Be Considered: Things we wish we could have used
1. Multi-lingual BERT
- It does not tokenize at the word (eojeol) level
- Features were extracted from the last layer with attention
→ Performance actually dropped; alignment between words and sub-words would be needed
2. ELMo
- No model trained on a Korean dataset could be found
- There was not enough time to train ELMo embeddings from scratch
3. Ensemble
- Could not be tried due to a lack of credits
Thank You
Reference
[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.
[2] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL, 2018.
[3] A. G. Howard et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861, 2017.
[4] Y. Gal. A theoretically grounded application of dropout in recurrent neural networks. arXiv preprint arXiv:1512.05287, 2015.
[5] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. CoRR, abs/1503.04069, 2015.
[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Neural Information Processing Systems (NIPS), 2017.
[7] https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728