10. WordNet & Thesaurus
• Dictionary
– Vocabulary entries in a dictionary exist as independent units
• Thesaurus
– A dictionary of synonyms
• A vocabulary for an indexing language that systematically specifies
relationships between concepts
– WordNet
• Word-word semantic relationships presented as a network (see the
NLTK sketch below)
• Princeton University, 1985
– Problems of a Thesaurus
• Hard to keep up with changes over time (e.g., new words and senses)
• High human labeling and maintenance cost
• Cannot capture subtle differences in word meaning
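A minimal sketch of querying WordNet through NLTK (assumes the nltk package is installed and the wordnet corpus can be downloaded; the word "car" is purely illustrative):

```python
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

# Each synset groups synonyms that express one concept.
for syn in wn.synsets("car")[:3]:
    print(syn.name(), "-", syn.definition())

# Concept-to-concept relations are explicit links in the network,
# e.g. the hypernym ("is-a" parent) of the first sense of "car".
print(wn.synsets("car")[0].hypernyms())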
11. Statistical Methods
• Probabilistic Graph Models
– HMM, CRF (conditional random field); SVM is also used, though it is
not a graphical model
• TF-IDF (see the sketch after this list)
• Vector space model
• Zipf’s Law, Topic Modeling
• Distributed representations of words
– Vector representation of words → semantics (dense vector)
• Word vectors obtained using an ANN = dense vector representations
– Distributional hypothesis
• "Meaning is formed not by the word itself but by the context in which
the word is used"
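A minimal TF-IDF sketch with scikit-learn (the two-sentence corpus is purely illustrative): words that are frequent in one document but rare across the corpus get high weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "you say goodbye and i say hello",
    "i say python and you say java",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)      # sparse document-term matrix

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))               # per-document TF-IDF weights
```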
13. Semantic Analysis
– LSA (Latent Semantic Analysis)
• Reveals the meaning of word combinations and computes vectors to
represent this meaning
– PCA (Principal Components Analysis)
• Dimensionality Reduction
– SVD
– LDA (Latent Dirichlet Allocation)
• Assumes that each document is a mixture of some arbitrary number of
topics that you select when you begin training the LDA model
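A minimal LSA sketch under the same assumptions (illustrative corpus, scikit-learn available): truncated SVD over a TF-IDF matrix yields low-dimensional vectors that represent document meaning.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "dogs and cats are pets",
    "cats chase mice",
    "stocks and bonds are investments",
]
X = TfidfVectorizer().fit_transform(corpus)

lsa = TruncatedSVD(n_components=2)        # keep 2 latent dimensions
doc_vecs = lsa.fit_transform(X)           # documents as dense topic vectors
print(doc_vecs.round(2))
```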
15. Inference methods and ANN
• Limitations of Statistical Methods
• A lot of manual feature engineering
• Handling big sparse matrices (e.g., with 1M words, a 1M x 1M matrix)
• Inference-based methods
16. Representing words in ANN
• One-hot encoding (OHE)
• Transform into a low-dimensional vector using a fully connected
network (FCN)
Represented as a matrix: multiplying c (context, one-hot) by W (weight
matrix) extracts the corresponding row vector
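A minimal numpy sketch of that point (sizes are illustrative): multiplying a one-hot context vector by the weight matrix W is exactly a row lookup.

```python
import numpy as np

vocab_size, hidden = 7, 3
W = np.random.randn(vocab_size, hidden)   # input-side weight matrix

c = np.zeros(vocab_size)
c[1] = 1.0                                # one-hot vector for word id 1

h = c @ W                                 # the matrix product ...
assert np.allclose(h, W[1])               # ... equals the row lookup W[1]
```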
20. Enhancing word2vec
• Store word embeddings in an Embedding layer
• Negative sampling (see the sketch below)
• Avoids updating the weights of every vocabulary word when learning a
center word
"say"에 해당하는 열벡터와 은닉층 neuron의 내적을 계산
22. RNN
• Recurrence
– ‘layers having states’ or ‘layers having memory’
• Unfolding & (truncated) BPTT
▪ Wx transforms the input x into the output h
▪ Wh transforms one time step's output into the next time step's output
Unrolled RNN
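A minimal numpy sketch of one recurrence step in the slide's notation, h_t = tanh(h_{t-1} @ Wh + x_t @ Wx + b), with illustrative sizes:

```python
import numpy as np

input_dim, hidden = 4, 3
Wx = np.random.randn(input_dim, hidden)   # maps input x to output h
Wh = np.random.randn(hidden, hidden)      # carries h to the next time step
b = np.zeros(hidden)

h = np.zeros(hidden)                      # initial state
for x_t in np.random.randn(5, input_dim): # a sequence of 5 inputs
    h = np.tanh(h @ Wh + x_t @ Wx + b)    # recurrence: h feeds back in
print(h)
```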
23. RNNLM
• Extending the CBOW model into a LM (language model)
'You say goodbye and I say hello'
24. LSTM
• Overview
– State in each layer
• The memory state's attributes are updated with each training example
LSTM network and its memory: the rules that govern the information
stored in the state (memory) are trained neural nets themselves
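For reference, a sketch of the standard LSTM cell equations (the usual textbook formulation, not transcribed from this slide's figure):

```latex
\begin{align}
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{forget gate}\\
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{input gate}\\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{output gate}\\
\tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1}, x_t] + b_c\right)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{memory update}\\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```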
27. Stacked LSTM
Each LSTM layer is a cell with its own gates and state vector
28. Seq2Seq
• Encoder + Decoder
– Encoder: input sequence → context vector / thought vector
• (i.e., the encoder RNN's final hidden state)
• If the encoder is a bi-RNN, it could be the concatenation of both
directions' final hidden states
– Decoder: context vector → outputs a different sequence
• (e.g., the translation of or reply to the input text)
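A minimal PyTorch sketch of that split (layer sizes and tensor shapes are illustrative, not from the slides): the encoder's final hidden state acts as the context vector and seeds the decoder.

```python
import torch
import torch.nn as nn

emb, hidden = 8, 16
encoder = nn.GRU(emb, hidden, batch_first=True)
decoder = nn.GRU(emb, hidden, batch_first=True)

src = torch.randn(1, 5, emb)              # embedded input sequence
tgt = torch.randn(1, 7, emb)              # embedded target sequence

_, context = encoder(src)                 # (1, 1, hidden): thought vector
dec_out, _ = decoder(tgt, context)        # context initializes the decoder
print(dec_out.shape)                      # torch.Size([1, 7, 16])
```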
29. – Problem (1)
• Degraded performance on long sentences (signals get diluted)
• Various efforts to preserve earlier signals (e.g., skip-connections)
• How to combine multiple hidden states into a single context vector?
– Simple concatenation, sum/average/max/min, ...
30. – Problem (2)
• Unrealistic assumption: "the same hidden states (and inputs) will be
more/less important to each output identically"
– Problem (3)
• The encoder compresses the whole sentence into a fixed-length vector
33. Attention
• Overview
– Early attention – attention as a complement to RNNs
– Bahdanau et al. – adds a neural network connecting each input word to
the output → expresses every word's contribution to the current output
as a weight
– Luong
– "Attention Is All You Need" by Vaswani et al.
• The Transformer uses only (scaled dot-product) attention
• Core concepts: Dynamic weighting (= global alignment weights)
– An attention calculates dynamic (alignment) weights representing
the relative importance of the inputs in the sequence (the keys) for
that particular output (the query).
– Multiplying the dynamic weights (the alignment scores) with the
input sequence (the values) will then weight the sequence.
34. Calculating the attended context vector
– Various methods are possible
– Weighted summation of input vectors
• Dot-product (MatMul) of the dynamically attended weights with
the input sequence (V).
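A minimal numpy sketch of the weighted-summation option (weights and sizes are illustrative): given alignment weights over the inputs, the attended context vector is just their matrix product with V.

```python
import numpy as np

V = np.random.randn(4, 8)                 # 4 input vectors of dimension 8
weights = np.array([0.1, 0.6, 0.2, 0.1])  # dynamic alignment weights (sum 1)

context = weights @ V                     # weighted sum of the inputs
print(context.shape)                      # (8,)
```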
36. Scaled Dot-Product Attention
• Normalization such as softmax non-linearly scales the weight values
between 0 and 1. Because the dot product can produce very large
magnitudes with large vector dimensions (d) → and hence very small
gradients when passed into softmax, we scale the values beforehand
(scale = 1 / √d).
• Therefore, the normalized scaled-dot-product attention weights
= softmax(a * scale) = softmax(Q @ K.T / √d)
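A minimal numpy sketch of the full formula with illustrative shapes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 8
Q = np.random.randn(3, d)                 # 3 queries
K = np.random.randn(4, d)                 # 4 keys
V = np.random.randn(4, d)                 # 4 values

scores = Q @ K.T / np.sqrt(d)             # scaling keeps gradients healthy
attn = softmax(scores)                    # each row sums to 1
out = attn @ V                            # attended outputs, shape (3, d)
print(out.shape)
```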
46. Decoder
[Figure: decoder predictions at time steps t = 1 through t = 4]
47. Decoder block
• Feedforward network
• Add and norm component
• Linear and softmax layers
[Figures: the decoder block with an add & norm component; the linear and
softmax layers]
51. Multi-head Attention
• Ideas
• Runs the attention mechanism several times in parallel
• The independent attention outputs are then concatenated and linearly
transformed into the expected dimension
MHA allows attending to parts of the sequence differently (e.g.,
longer-term versus shorter-term dependencies). The W's are all
learnable parameter matrices.
Source: https://paperswithcode.com/method/multi-head-attention
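A minimal numpy sketch of the idea (head counts and sizes are illustrative): per-head projections of Q/K/V run through attention in parallel, are concatenated, and pass through a final learnable map W_o.

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ V

d_model, heads = 16, 4
d_k = d_model // heads
X = np.random.randn(5, d_model)           # a sequence of 5 tokens

# One (W_q, W_k, W_v) triple per head; W_o merges the heads again.
Ws = [np.random.randn(3, d_model, d_k) for _ in range(heads)]
W_o = np.random.randn(d_model, d_model)

head_outs = [attention(X @ Wq, X @ Wk, X @ Wv) for Wq, Wk, Wv in Ws]
out = np.concatenate(head_outs, axis=-1) @ W_o
print(out.shape)                          # (5, 16)
```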
54. Overview
• BERT Ideas
• Relating the word 'Python' to all other words
• BERT generating the representation of each word in the sentence
https://github.com/google-research/bert
55. Characteristics
• Applies bi-directional attention in all MHA sublayers
• Has only an encoder stack
• Bigger scale than the original Transformer
– Original Transformer: N = 6 stacks, d_model = 512, A = 8 attention
heads → d_k = d_model/A = 64 dimensions per head
– BERT-base: N = 12, d_model = 768, A = 12
• Uses WordPiece tokenization
• Uses learned positional encodings instead of sine-cosine
• Unsupervised embedding: models pre-trained on unlabeled text
(combined with supervised fine-tuning)
– Pre-training + Fine-tuning
60. Pre-training Strategy
– Language Modeling
• Auto-regressive language modeling
– Is unidirectional: reads the sentence in only one direction
– In forward prediction (left-to-right), the model reads all the words
from left to right up to the blank in order to make a prediction:
Paris is a beautiful __.
– In backward prediction (right-to-left), the model reads all the words
from right to left up to the blank in order to make a prediction:
__. I love Paris
• Auto-encoding language modeling
– Takes advantage of both forward and backward prediction
– Is bidirectional: reads the sentence in both directions while
predicting: Paris is a beautiful __. I love Paris
– 'Paris is a beautiful city. I love Paris' → Paris is a beautiful __.
I love Paris
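A minimal sketch of auto-encoding (masked) prediction in practice, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT reads the whole sentence in both directions to fill the blank.
for pred in fill("Paris is a beautiful [MASK]. I love Paris")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```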
69. BERT vs. GPT

                     BERT by Google                GPT by OpenAI
Ideas                A pre-trained NLP model       Autoregressive in nature
                     developed by Google in
                     2018; unsupervised ML
Attention direction  Bidirectional in nature       Uni-directional in nature
Applications         Voice assistants with an      Writing news, generating
                     enhanced customer             articles, resumes, and
                     experience; enhanced search   code
Size                 BERT-base, BERT-large         470 times bigger than BERT
Requirements in      Fine-tuning needed: train     Few-shot learning;
service model        the model on a separate       provides a text-in,
                     layer on sentence encodings   text-out API
Transformer block    Encoder block                 Decoder block
Sentence generation  No direct generation          Direct generation
70. Trends
– Meta model
– Teacher-Student Architecture
– BERT-as-a-Service
73. Multilingual BERT
• Issues with multilingual/Korean BERT
– Multilingual BERT (M-BERT)
• Learns text representations from the wiki text of 104 different
languages
• 110k shared WordPiece vocabulary
• Understands context across languages without any paired or
language-aligned training data
• Zero-shot knowledge transfer in M-BERT without any cross-lingual
objective
– Cross-lingual language model (XLM)
• Higher performance than M-BERT
– Causal language modeling (CLM)
– Masked language modeling (MLM)
– Translation language modeling (TLM)
75. Issues
– Long inference time when preparing a service
– The model is sometimes underfit
• Research Directions
– Enhancing the training performance of the model (a matter of
modifying the model)
• XLNet
• RoBERTa
– Dynamic masking instead of static masking in the MLM task
– Trains with the MLM task only, dropping the NSP task (large batch
size)
– Uses byte-level BPE (BBPE) as the tokenizer
• MT-DNN
• T5 (Text-to-Text Transfer Transformer)
– Scaling down the Transformer (while maintaining performance)
• TinyBERT
• DistilBERT (Knowledge Distillation)
• ALBERT (A Lite BERT)
• Transformer.zip (Quantization, Pruning)