10. WordNet & Thesaurus
• Dictionary
– Vocabulary entries in a dictionary exist as independent units
• Thesaurus
– A dictionary of synonyms
• A vocabulary for an indexing language that systematically specifies
relationships between concepts
– WordNet
• Word-word semantic relationships presented as a network (see the
NLTK sketch below)
• Princeton University, 1985
– Problems of a Thesaurus
• Hard to keep up with changes over time (e.g., new words and senses)
• High human labeling and maintenance cost
• Cannot capture subtle differences in word meaning
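A minimal sketch of querying WordNet through NLTK (assumes the nltk package is installed and the wordnet corpus can be downloaded; the word "car" is purely illustrative):

```python
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

# Each synset groups synonyms that express one concept.
for syn in wn.synsets("car")[:3]:
    print(syn.name(), "-", syn.definition())

# Concept-to-concept relations are explicit links in the network,
# e.g. the hypernym ("is-a" parent) of the first sense of "car".
print(wn.synsets("car")[0].hypernyms())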
11. Statistical Methods
• Probabilistic Graph Models
– HMM, CRF (conditional random field); SVM is also used, though it is
not a graphical model
• TF-IDF (see the sketch after this list)
• Vector space model
• Zipf’s Law, Topic Modeling
• Distributed representations of words
– Vector representation of words → semantics (dense vector)
• Word vectors obtained using an ANN = dense vector representations
– Distributional hypothesis
• "Meaning is formed not by the word itself but by the context in which
the word is used"
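A minimal TF-IDF sketch with scikit-learn (the two-sentence corpus is purely illustrative): words that are frequent in one document but rare across the corpus get high weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "you say goodbye and i say hello",
    "i say python and you say java",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)      # sparse document-term matrix

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))               # per-document TF-IDF weights
```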
13. Semantic Analysis
– LSA (Latent Semantic Analysis)
• Reveals the meaning of word combinations and computes vectors to
represent this meaning
– PCA (Principal Components Analysis)
• Dimensionality Reduction
– SVD
– LDA (Latent Dirichlet Allocation)
• Assumes that each document is a mixture of some arbitrary number of
topics that you select when you begin training the LDA model
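A minimal LSA sketch under the same assumptions (illustrative corpus, scikit-learn available): truncated SVD over a TF-IDF matrix yields low-dimensional vectors that represent document meaning.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "dogs and cats are pets",
    "cats chase mice",
    "stocks and bonds are investments",
]
X = TfidfVectorizer().fit_transform(corpus)

lsa = TruncatedSVD(n_components=2)        # keep 2 latent dimensions
doc_vecs = lsa.fit_transform(X)           # documents as dense topic vectors
print(doc_vecs.round(2))
```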
15. Inference methods and ANN
• Limitations of Statistical Methods
• A lot of manual feature engineering
• Handling big sparse matrices (e.g., with 1M words, a 1M x 1M matrix)
• Inference-based methods
16. Representing words in ANN
• One-hot encoding (OHE)
• Transform into a low-dimensional vector using a fully connected
network (FCN)
Represented as a matrix: multiplying c (context, one-hot) by W (weight
matrix) extracts the corresponding row vector
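A minimal numpy sketch of that point (sizes are illustrative): multiplying a one-hot context vector by the weight matrix W is exactly a row lookup.

```python
import numpy as np

vocab_size, hidden = 7, 3
W = np.random.randn(vocab_size, hidden)   # input-side weight matrix

c = np.zeros(vocab_size)
c[1] = 1.0                                # one-hot vector for word id 1

h = c @ W                                 # the matrix product ...
assert np.allclose(h, W[1])               # ... equals the row lookup W[1]
```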
20. Enhancing word2vec
• Store word embeddings in an Embedding layer
• Negative sampling (see the sketch below)
• Avoids updating the weights of every vocabulary word when learning a
center word
"say"에 해당하는 열벡터와 은닉층 neuron의 내적을 계산
22. RNN
• Recurrence
– ‘layers having states’ or ‘layers having memory’
• Unfolding & (truncated) BPTT
▪ Wx transforms the input x into the output h
▪ Wh transforms one time step's output into the next time step's output
Unrolled RNN
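A minimal numpy sketch of one recurrence step in the slide's notation, h_t = tanh(h_{t-1} @ Wh + x_t @ Wx + b), with illustrative sizes:

```python
import numpy as np

input_dim, hidden = 4, 3
Wx = np.random.randn(input_dim, hidden)   # maps input x to output h
Wh = np.random.randn(hidden, hidden)      # carries h to the next time step
b = np.zeros(hidden)

h = np.zeros(hidden)                      # initial state
for x_t in np.random.randn(5, input_dim): # a sequence of 5 inputs
    h = np.tanh(h @ Wh + x_t @ Wx + b)    # recurrence: h feeds back in
print(h)
```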
23. RNNLM
• Extending the CBOW model into a LM (language model)
'You say goodbye and I say hello'
24. LSTM
• Overview
– State in each layer
• The memory state's attributes are updated with each training example
LSTM network and its memory: the rules that govern the information
stored in the state (memory) are trained neural nets themselves
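For reference, a sketch of the standard LSTM cell equations (the usual textbook formulation, not transcribed from this slide's figure):

```latex
\begin{align}
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{forget gate}\\
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{input gate}\\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{output gate}\\
\tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1}, x_t] + b_c\right)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{memory update}\\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```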
27. Stacked LSTM
Each LSTM layer is a cell with its own gates and state vector
28. Seq2Seq
• Encoder + Decoder
– Encoder: input sequence → context vector / thought vector
• (i.e., the encoder RNN's final hidden state)
• If the encoder is a bi-RNN, it could be the concatenation of both
directions' final hidden states
– Decoder: context vector → outputs a different sequence
• (e.g., the translation of or reply to the input text)
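A minimal PyTorch sketch of that split (layer sizes and tensor shapes are illustrative, not from the slides): the encoder's final hidden state acts as the context vector and seeds the decoder.

```python
import torch
import torch.nn as nn

emb, hidden = 8, 16
encoder = nn.GRU(emb, hidden, batch_first=True)
decoder = nn.GRU(emb, hidden, batch_first=True)

src = torch.randn(1, 5, emb)              # embedded input sequence
tgt = torch.randn(1, 7, emb)              # embedded target sequence

_, context = encoder(src)                 # (1, 1, hidden): thought vector
dec_out, _ = decoder(tgt, context)        # context initializes the decoder
print(dec_out.shape)                      # torch.Size([1, 7, 16])
```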
29. – Problem (1)
• Degraded performance on long sentences (signals get diluted)
• Various efforts to preserve earlier signals (e.g., skip-connections)
• How to combine multiple hidden states into a single context vector?
– Simple concatenation, sum/average/max/min, ...
30. – Problem (2)
• Unrealistic assumption: "the same hidden states (and inputs) will be
more/less important to each output identically"
– Problem (3)
• The encoder compresses the whole sentence into a fixed-length vector
33. Attention
• Overview
– Early attention – attention as a complement to RNNs
– Bahdanau et al. – adds a neural network connecting each input word to
the output → expresses every word's contribution to the current output
as a weight
– Luong
– "Attention Is All You Need" by Vaswani et al.
• The Transformer uses only (scaled dot-product) attention
• Core concepts: Dynamic weighting (= global alignment weights)
– An attention calculates dynamic (alignment) weights representing
the relative importance of the inputs in the sequence (the keys) for
that particular output (the query).
– Multiplying the dynamic weights (the alignment scores) with the
input sequence (the values) will then weight the sequence.
34. Calculating the attended context vector
– Various methods are possible
– Weighted summation of input vectors
• Dot-product (MatMul) of the dynamically attended weights with
the input sequence (V).
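A minimal numpy sketch of the weighted-summation option (weights and sizes are illustrative): given alignment weights over the inputs, the attended context vector is just their matrix product with V.

```python
import numpy as np

V = np.random.randn(4, 8)                 # 4 input vectors of dimension 8
weights = np.array([0.1, 0.6, 0.2, 0.1])  # dynamic alignment weights (sum 1)

context = weights @ V                     # weighted sum of the inputs
print(context.shape)                      # (8,)
```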
36. Scaled Dot-Product Attention
• Normalization such as softmax non-linearly scales the weight values
between 0 and 1. Because the dot product can produce very large
magnitudes with large vector dimensions (d) → and hence very small
gradients when passed into softmax, we scale the values beforehand
(scale = 1 / √d).
• Therefore, the normalized scaled-dot-product attention weights
= softmax(a * scale) = softmax(Q @ K.T / √d)
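A minimal numpy sketch of the full formula with illustrative shapes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 8
Q = np.random.randn(3, d)                 # 3 queries
K = np.random.randn(4, d)                 # 4 keys
V = np.random.randn(4, d)                 # 4 values

scores = Q @ K.T / np.sqrt(d)             # scaling keeps gradients healthy
attn = softmax(scores)                    # each row sums to 1
out = attn @ V                            # attended outputs, shape (3, d)
print(out.shape)
```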
46. Decoder
[Figure: decoder predictions at time steps t = 1 through t = 4]
47. Decoder block
• Feedforward network
• Add and norm component
• Linear and softmax layers
[Figures: the decoder block with an add & norm component; the linear and
softmax layers]
51. Multi-head Attention
• Ideas
• Runs the attention mechanism several times in parallel
• The independent attention outputs are then concatenated and linearly
transformed into the expected dimension
MHA allows attending to parts of the sequence differently (e.g.,
longer-term versus shorter-term dependencies). The W's are all
learnable parameter matrices.
Source: https://paperswithcode.com/method/multi-head-attention
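A minimal numpy sketch of the idea (head counts and sizes are illustrative): per-head projections of Q/K/V run through attention in parallel, are concatenated, and pass through a final learnable map W_o.

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ V

d_model, heads = 16, 4
d_k = d_model // heads
X = np.random.randn(5, d_model)           # a sequence of 5 tokens

# One (W_q, W_k, W_v) triple per head; W_o merges the heads again.
Ws = [np.random.randn(3, d_model, d_k) for _ in range(heads)]
W_o = np.random.randn(d_model, d_model)

head_outs = [attention(X @ Wq, X @ Wk, X @ Wv) for Wq, Wk, Wv in Ws]
out = np.concatenate(head_outs, axis=-1) @ W_o
print(out.shape)                          # (5, 16)
```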
54. Overview
• BERT Ideas
• Relating the word 'Python' to all other words
• BERT generating the representation of each word in the sentence
https://github.com/google-research/bert
55. Characteristics
• Applies bi-directional attention in all MHA sublayers
• Has only an encoder stack
• Bigger scale than the original Transformer
– Original Transformer: N = 6 stacks, d_model = 512, A = 8 attention
heads → d_k = d_model/A = 64 dimensions per head
– BERT-base: N = 12, d_model = 768, A = 12
• Uses WordPiece tokenization
• Uses learned positional encodings instead of sine-cosine
• Unsupervised embedding: models pre-trained on unlabeled text
(combined with supervised fine-tuning)
– Pre-training + Fine-tuning
60. Pre-training Strategy
– Language Modeling
• Auto-regressive language modeling
– Is unidirectional: reads the sentence in only one direction
– In forward prediction (left-to-right), the model reads all the words
from left to right up to the blank in order to make a prediction:
Paris is a beautiful __.
– In backward prediction (right-to-left), the model reads all the words
from right to left up to the blank in order to make a prediction:
__. I love Paris
• Auto-encoding language modeling
– Takes advantage of both forward and backward prediction
– Is bidirectional: reads the sentence in both directions while
predicting: Paris is a beautiful __. I love Paris
– 'Paris is a beautiful city. I love Paris' → Paris is a beautiful __.
I love Paris
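A minimal sketch of auto-encoding (masked) prediction in practice, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT reads the whole sentence in both directions to fill the blank.
for pred in fill("Paris is a beautiful [MASK]. I love Paris")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```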
69. BERT vs. GPT

                     BERT by Google                GPT by OpenAI
Ideas                A pre-trained NLP model       Autoregressive in nature
                     developed by Google in
                     2018; unsupervised ML
Attention direction  Bidirectional in nature       Uni-directional in nature
Applications         Voice assistants with an      Writing news, generating
                     enhanced customer             articles, resumes, and
                     experience; enhanced search   code
Size                 BERT-base, BERT-large         470 times bigger than BERT
Requirements in      Fine-tuning needed: train     Few-shot learning;
service model        the model on a separate       provides a text-in,
                     layer on sentence encodings   text-out API
Transformer block    Encoder block                 Decoder block
Sentence generation  No direct generation          Direct generation
70. Trends
– Meta model
– Teacher-Student Architecture
– BERT-as-a-Service
73. Multilingual BERT
• Issues with multilingual/Korean BERT
– Multilingual BERT (M-BERT)
• Learns text representations from the wiki text of 104 different
languages
• 110k shared WordPiece vocabulary
• Understands context across languages without any paired or
language-aligned training data
• Zero-shot knowledge transfer in M-BERT without any cross-lingual
objective
– Cross-lingual language model (XLM)
• Higher performance than M-BERT
– Causal language modeling (CLM)
– Masked language modeling (MLM)
– Translation language modeling (TLM)
75. Issues
– Long inference time when preparing a service
– The model is sometimes underfit
• Research Directions
– Enhancing the training performance of the model (a matter of
modifying the model)
• XLNet
• RoBERTa
– Dynamic masking instead of static masking in the MLM task
– Trains with the MLM task only, dropping the NSP task (large batch
size)
– Uses byte-level BPE (BBPE) as the tokenizer
• MT-DNN
• T5 (Text-to-Text Transfer Transformer)
– Scaling down the Transformer (while maintaining performance)
• TinyBERT
• DistilBERT (Knowledge Distillation)
• ALBERT (A Lite BERT)
• Transformer.zip (Quantization, Pruning)