Transformer in NLP
2021.6.2~3
윤형기 (hky@openwith.net)
English version
• Intro
– AI & NLP
– NLP Overview
• Traditional methods
• Word Embedding
• RNN/LSTM
• Transformer Revolution – BERT & GPT
– BERT
– GPT
• Wrap-up
2
INTRO
AI & Deep Learning
NLP Overview
3
AI and Deep Learning
4
• AI, ML/DL & NLP
https://artificialintelligence.oodles.io/blogs/ai-applications-for-business/
▪ Problem Solving
▪ CSP
▪ Knowledge, Inference, Planning
▪ FOL/Inference/Knowledge
Representation
▪ Learning
▪ ML/RL
▪ Communication, perceiving, acting
▪ NLP/NLU
▪ Perception & Robotics
Source: AI by Russell & Norvig (modified)
5
• ML and DL
• Representation Learning
6
NLP Overview
7
NLP
• Traditional Methods
– WordNet & Thesaurus
– Statistical Methods
• Distributional Hypothesis
• Co-occurrence matrix
• Vector Similarity
• Mutual Information
• Dimensionality Reduction & SVD
• Word2Vec
– Inference & Neural Nets
• Representing words in NN
– Simple word2vec
• CBOW and Skip-Gram
• Weights in Word2Vec
– Enhancing Word2Vec
• CBOW model and Probability
• Enhancing word2vec
• Negative Sampling
• RNN
– RNN?
– Language Model and RNNLM
– LSTM-based LM
– Sentence-generation using RNN
– seq2seq
• Encoder & Decoder class
• NLP today
– Attention and Transformer
8
TRADITIONAL METHODS
9
WordNet & Thesaurus
• Dictionary
– Words in a dictionary exist as independent units
• Thesaurus
– A dictionary of synonyms
• A vocabulary for an indexing language that systematically specifies the relationships between concepts
– WordNet
• Word-word semantic relationships
presented as a network
• Princeton university, 1985
– Problems of Thesaurus
• Hard to keep up with changes over time
• High cost of human labor
• Cannot express subtle nuances between words.
10
Statistical Methods
• Probabilistic Graph Model
– HMM, SVM, CRF (conditional random field)
• TF-IDF
• Vector space model
• Zipf’s Law, Topic Modeling
• Distributed representation of words
– Vector representation of words → semantics (dense vector)
• Word vectors obtained with an ANN = dense vector representation
– Distributional hypothesis
• "Meaning is formed not by the word itself but by the context in which the word is used"
11
• Co-occurrence matrix
• Frequency-centered analysis
• Vector-to-vector Similarity
– Displaying a ranking of similar words
• Relative ranking
• Enhancing Statistical Methods – Mutual Information (see the sketch below)
12
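The statistical pipeline above (co-occurrence counts, mutual information, vector similarity, SVD) can be illustrated in a few lines of NumPy. This is a minimal sketch on a hypothetical toy corpus, not the exact procedure behind the slides.

```python
import numpy as np

corpus = "you say goodbye and i say hello".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence matrix with a context window of 1
C = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            C[idx[w], idx[corpus[j]]] += 1

# Positive pointwise mutual information (PPMI)
total = C.sum()
denom = C.sum(axis=1, keepdims=True) * C.sum(axis=0, keepdims=True) + 1e-8
ppmi = np.maximum(0, np.log2(C * total / denom + 1e-8))

# Dimensionality reduction: keep the first k columns of U * S
U, S, Vt = np.linalg.svd(ppmi)

# Cosine similarity between two word vectors
def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

print(cos(ppmi[idx["say"]], ppmi[idx["hello"]]))
```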
• Semantic Analysis
– LSA (Latent Semantic Analysis)
• Reveals the meaning of word combinations and computes vectors to represent this meaning.
– PCA (Principal Components Analysis)
• Dimensionality Reduction
– SVD
– LDA (Latent Dirichlet Allocation)
– Assumes that each document is a mixture of some arbitrary number of topics that you select when you begin training the LDA model.
13
WORD2VEC
14
Inference methods and ANN
• Limitations of Statistical Methods
• Requires a lot of human feature engineering
• Handling a big sparse matrix: e.g., for 1M words, a 1M x 1M matrix
• Inference-based methods
15
Representing words in ANN
• OHE
• Transform into low-dimensional vector using FCN
16
Represented as matrices
Multiplying c (context) by W (weights) extracts the corresponding row vector
Computing a word vector from a one-hot vector (see the sketch below)
17
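The point that multiplying a one-hot context vector c by the weight matrix W simply extracts the corresponding row can be verified directly. A minimal NumPy sketch with toy sizes (all values hypothetical):

```python
import numpy as np

V, H = 7, 3                      # vocabulary size, hidden size
W = np.random.randn(V, H)        # input-side weight matrix (the word embeddings)

c = np.zeros(V)                  # one-hot vector for word id 2
c[2] = 1

h = c @ W                        # the matrix product ...
assert np.allclose(h, W[2])      # ... is just row 2 of W, i.e. an embedding lookup
```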
Simple word2vec
• CBOW model
– Context → target word
18
• Skip-Gram model
19
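A minimal sketch of the CBOW forward pass (average the context embeddings, score every vocabulary word, apply softmax); skip-gram is the mirror image, predicting each context word from the target. Layer sizes, word ids, and the corpus are illustrative only.

```python
import numpy as np

V, H = 7, 5                           # vocabulary size, hidden size
W_in = np.random.randn(V, H) * 0.01   # input embeddings
W_out = np.random.randn(H, V) * 0.01  # output weights

def cbow_forward(context_ids, target_id):
    h = W_in[context_ids].mean(axis=0)        # average the context word vectors
    scores = h @ W_out                        # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # softmax over the full vocabulary
    loss = -np.log(probs[target_id] + 1e-8)   # cross-entropy against the target word
    return loss, probs

# context = ["you", "goodbye"], target = "say" (hypothetical word ids)
loss, probs = cbow_forward(np.array([0, 2]), target_id=1)
```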
Enhancing word2vec
• Store word embeddings in an Embedding layer
• Negative sampling
• Avoids updating the weights of every vocabulary word when training on a target word (see the sketch below)
20
"say"에 해당하는 열벡터와 은닉층 neuron의 내적을 계산
RNN
21
RNN
• Recurrence
– ‘layers having states’ or ‘layers having memory’
• Unfolding and (truncated) BPTT
▪ Wx = transforms the input x into the output h
▪ Wh = transforms one output into the output at the next time step
22
Unrolled RNN
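The recurrence with Wx (input → hidden) and Wh (previous hidden → next hidden) can be written directly. A minimal unrolled forward pass; sizes and inputs are placeholders.

```python
import numpy as np

D, H, T = 4, 3, 5                   # input dim, hidden dim, sequence length
Wx = np.random.randn(D, H) * 0.1    # transforms the input x_t
Wh = np.random.randn(H, H) * 0.1    # transforms the previous hidden state h_{t-1}
b = np.zeros(H)

xs = np.random.randn(T, D)          # an input sequence
h = np.zeros(H)
states = []
for x in xs:                        # "unrolling" the RNN over time
    h = np.tanh(x @ Wx + h @ Wh + b)
    states.append(h)
# Truncated BPTT would backpropagate through only the last k of these steps.
```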
RNNLM
• CBOW model into a LM (language model)
'You say goodbye and I say hello'
23
LSTM
• Overview
– State in each layer
• Memory state’s attributes are updated with each training example.
LSTM network and its memory: the rules that govern the information stored in the state (memory) are themselves trained neural nets.
24
Language Model using LSTM
• Sentence Generation using RNN
25
Sample one word according to the probability distribution
Repeat: output a probability distribution, then sample (see the sketch below)
26
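Sentence generation as described above is just a loop: feed the sequence so far, get a probability distribution over the vocabulary, sample one word, append it, and repeat. A schematic sketch in which `model.predict` and `vocab` are hypothetical placeholders for a trained RNN/LSTM language model.

```python
import numpy as np

def generate(model, start_id, vocab, max_len=20, eos_id=None):
    ids = [start_id]
    for _ in range(max_len):
        probs = model.predict(ids)                       # distribution over the next word
        next_id = np.random.choice(len(vocab), p=probs)  # sample according to that distribution
        ids.append(next_id)
        if next_id == eos_id:                            # stop at end-of-sentence
            break
    return [vocab[i] for i in ids]
```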
• Stacked LSTM
Each LSTM layer is a cell with its own gates and state vector
27
Seq2Seq
• Encoder + Decoder
– Encoder: Input sequence
→ Output context vector/thought vector
• (i.e. encoder RNN’s final hidden state).
• if encoder is a bi-RNN it could be the concatenation of both
directions’ final hidden states.
– Decoder: context vector → outputs a different sequence
• (e.g. the translation or reply of the input text, etc.).
28
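A compact PyTorch-style sketch of the encoder-decoder idea: the encoder's final hidden state becomes the context (thought) vector that initializes the decoder. Shapes, hyperparameters, and the GRU choice are illustrative assumptions, not the slides' exact model.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, emb=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, context = self.encoder(self.emb(src_ids))   # final hidden state = context vector
        dec_out, _ = self.decoder(self.emb(tgt_ids), context)
        return self.out(dec_out)                       # logits for each target position
```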
– Problem (1)
• Degraded performance in a long sentence (signals get diluted)
• Various attempts to preserve earlier signals (e.g., skip-connections)
• How should multiple hidden states be combined into a single context vector?
– Simple concatenation, sum/average/max/min, …
29
– Problem (2)
• Unrealistic assumption: “same hidden states (and inputs) will be
more/less important to each output identically”
– Problem (3)
• The encoder compresses a sentence into a fixed-length vector
30
TRANSFORMER REVOLUTION
– BERT & GPT
BERT
GPT
31
ATTENTION
32
Attention
• Overview
– Early attention – attention as a complement to RNNs
– Bahdanau et al. - adds a neural network connecting each input word to each output
→ expresses the contribution of every word to the current output as a weight.
– Luong
– "Attention Is All You Need" by Vaswani et al.
• The Transformer uses only (scaled dot-product) attention
• Core concepts: Dynamic weighting (= global alignment weights)
– An attention calculates dynamic (alignment) weights representing
the relative importance of the inputs in the sequence (the keys) for
that particular output (the query).
– Multiplying the dynamic weights (the alignment scores) with the
input sequence (the values) will then weight the sequence.
33
• Calculating Attended context vector
– Various methods are possible
– Weighted summation of input vectors
• Dot-product (MatMul) of the dynamically attended weights with
the input sequence (V).
34
• Heat map as a byproduct
35
– Scaled Dot-Product Attention
• Normalization such as softmax non-linearly scales the weight values to between 0 and 1. Because the dot product can produce very large magnitudes when the vector dimension (d) is large,
• → which yields very small gradients when passed into softmax, we scale the values beforehand (scale = 1 / √d).
Therefore, the normalized, scaled dot-product attention
= softmax(a * scale) = softmax(Q @ Kᵀ / √d) (see the sketch below)
36
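The formula softmax(QKᵀ/√d)·V translates directly into code. A minimal NumPy version with random placeholder inputs; the returned weight matrix is exactly the heat map shown earlier.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d), K: (n_k, d), V: (n_k, d_v)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # scale before the softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the keys
    return weights @ V, weights                         # attended output + alignment weights

Q = np.random.randn(4, 8)    # 4 queries of dimension d = 8
K = np.random.randn(6, 8)    # 6 keys
V = np.random.randn(6, 8)    # 6 values
out, weights = scaled_dot_product_attention(Q, K, V)    # weights is the heat-map matrix
```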
37
TRANSFORMER
38
Transformer Overview
• Ideas
• An architecture for solving seq2seq tasks while handling long-range dependencies with ease.
https://arxiv.org/pdf/1706.03762.pdf
39
40
Encoder
• Self-attention mechanism
– Example
41
Creating query, key, and value matrices
42
• Multi-head attention
– Positional encoding
Self-attention of the word well
Self-attention of the word it
43
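For reference, the sine-cosine positional encoding from the original Transformer (which BERT later replaces with learned position embeddings) can be computed as follows; a minimal sketch.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                         # (max_len, 1)
    i = np.arange(d_model)[None, :]                           # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                     # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                     # odd dimensions: cosine
    return pe                                                 # added to the token embeddings
```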
– A single encoder block
• Feedforward network
44
• Combining the encoder components
A stack of encoders with encoder 1 expanded
45
Decoder
Decoder prediction at time step t = 1 Decoder prediction at time step t = 2
Decoder prediction at time step t = 3 Decoder prediction at time step t = 4 46
• Feedforward network
• Add and norm component
• Linear and softmax layers
decoder block with an add & norm component
Linear and softmax layers
47
Decoder block
• Completing Decoder
A stack of two decoders with decoder 1 expanded 48
Encoder + decoder
49
Self-Attention
• Self-Attention at a High Level
”The animal didn't cross the street because it was too tired”
50
Multi-head Attention
• Ideas
• Runs the attention mechanism several times in parallel.
• The independent attention outputs are then concatenated and linearly transformed into the expected dimension.
MHA allows for attending to parts of the sequence differently (e.g. longer-term dependencies versus shorter-term dependencies). The W matrices are all learnable parameters.
Source: https://paperswithcode.com/method/multi-head-attention
51
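A self-contained sketch of multi-head attention: project the input into h heads, attend in parallel, concatenate, and apply the final learned projection. The W matrices here are random placeholders for the learnable parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads, Wq, Wk, Wv, Wo):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model) learnable projections."""
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(num_heads):                          # each head attends independently
        s = slice(h * d_head, (h + 1) * d_head)
        w = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_head))
        heads.append(w @ V[:, s])
    return np.concatenate(heads, axis=-1) @ Wo          # concatenate, then final projection

# usage with random placeholder parameters
seq_len, d_model, h = 6, 16, 4
X = np.random.randn(seq_len, d_model)
Wq, Wk, Wv, Wo = (np.random.randn(d_model, d_model) * 0.1 for _ in range(4))
out = multi_head_attention(X, h, Wq, Wk, Wv, Wo)        # shape (6, 16)
```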
BERT & GPT
BERT
GPT
52
BERT
53
Overview
• BERT Ideas
• Relating the word 'Python' to all other words
• BERT generating the representation of each word in the sentence
54
https://github.com/google-research/bert
• Characteristics
• Apply bi-directional attention to all MHA sublayers
• Only has encoder stack
• Bigger scale than the original transformer
– (original transformer: N = 6 stacks, dmodel = 512, A = 8 attention heads
– → dk = dmodel/A = 64 dimensions per head)
• Uses WordPiece tokenization
• Uses learned positional encoding instead of sine-cosine
– (Together with supervised learning)
– Unsupervised embedding: models pre-trained on unlabelled text
– Pre-training + Fine-tuning
55
(added)
Configurations of BERT
• BERT-base • BERT-large
56
57
BERT framework
(added)
Pre-training the BERT model
• Input data representation
– Token Embedding
– Segment Embedding
58
– Position Embedding
– Final Representation
59
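The final input representation is simply the element-wise sum of the three embeddings. A schematic sketch with toy lookup-table sizes (BERT-base actually uses a 30,522-word vocabulary, 768 dimensions, and 512 positions); all values here are random placeholders.

```python
import numpy as np

V, d_model, max_len = 100, 16, 32                        # toy sizes for illustration
token_emb = np.random.randn(V, d_model) * 0.02           # token embedding table
segment_emb = np.random.randn(2, d_model) * 0.02         # sentence A / sentence B
position_emb = np.random.randn(max_len, d_model) * 0.02  # learned position embeddings

def bert_input(token_ids, segment_ids):
    n = len(token_ids)
    return (token_emb[token_ids]
            + segment_emb[segment_ids]
            + position_emb[np.arange(n)])                # sum of the three embeddings
```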
• Pre-training Strategy
– Language Modeling
• Auto-regressive language modeling
– Unidirectional = reads the sentence in only one direction.
– In forward prediction (left-to-right), the model reads all the words from left to right up to the blank in order to make a prediction, as:
Paris is a beautiful __.
– In backward prediction (right-to-left), the model reads all the words from right to left in order to make a prediction, as: __. I love Paris
• Auto-encoding language modeling
– Takes advantage of both forward and backward prediction.
– Is bidirectional – reads the sentence in both directions while predicting.
– 'Paris is a beautiful city. I love Paris.' → Paris is a beautiful __. I love Paris
60
– Masked Language Modeling
61
Predicting the masked token
62
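If the Hugging Face transformers library is available, masked-token prediction with a pre-trained BERT can be tried directly (assuming the model weights can be downloaded):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for candidate in unmasker("Paris is a beautiful [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```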
– Next Sentence Prediction
Sample dataset
63
• Subword tokenization
– Byte pair encoding
Character sequence with count
Creating a vocabulary with all unique characters 64
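One round of byte pair encoding counts adjacent symbol pairs over the corpus and merges the most frequent pair into a new vocabulary symbol. A minimal sketch in the spirit of the original Sennrich et al. pseudocode; the toy word counts are the paper's illustrative example, not data from these slides.

```python
import re
from collections import Counter

def most_frequent_pair(word_counts):
    pairs = Counter()
    for word, freq in word_counts.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(pair, word_counts):
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in word_counts.items()}

# words stored as space-separated symbol sequences with their counts
word_counts = {"l o w </w>": 5, "l o w e r </w>": 2,
               "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):                      # perform 10 merge operations
    word_counts = merge_pair(most_frequent_pair(word_counts), word_counts)
print(word_counts)                       # frequent subwords such as "est</w>" emerge
```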
• BERT Variants I
– ALBERT
• A Lite version of BERT – Cross-layer parameter sharing, Factorized
embedding parameterization
– RoBERTa
• Robustly Optimized BERT pre-training Approach
• Using dynamic masking instead of static masking, …
– ELECTRA
– SpanBERT
• BERT Variants II (Knowledge Distillation)
– DistilBERT
– TinyBERT
65
GPT
https://medium.com/walmartglobaltech/the-journey-of-open-ai-gpt-models-32d95b7b7fb2
66
GPT Overview
• GPT
– 2018: OpenAI introduces the Generative Pre-trained Transformer (GPT) in "Improving Language Understanding by Generative Pre-Training"
– 2019 GPT-2
– 2020 GPT-3
– auto-regressive in nature.
• Characteristics (vs. BERT)
– Masks future tokens in the self-attention layer (vs. BERT, which replaces the token with [MASK])
– Uses masked self-attention (vs. BERT, which uses full self-attention); see the sketch below.
• Open sourcing
– GPT-2 is publicly released
– GPT-3 is not publicly released
67
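The masked (causal) self-attention that distinguishes GPT from BERT is implemented by adding a mask to the attention scores so that position i cannot attend to positions after it. A minimal NumPy sketch:

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Masked (causal) self-attention: each position attends only to itself and the past."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future tokens
    scores = np.where(mask, -1e9, scores)              # block attention to future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```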
GPT architecture
Architecture of GPT (Radford et al., 2018) 68
BERT vs. GPT
(BERT by Google vs. GPT by OpenAI)
– Ideas: BERT – a pre-trained NLP model developed by Google in 2018; GPT – unsupervised ML, autoregressive in nature
– Attention direction: BERT – bidirectional in nature; GPT – unidirectional in nature
– Applications: BERT – voice assistants with enhanced customer experience, enhanced search; GPT – writing news, generating articles, resumes, and even code
– Size: BERT – BERT-base, BERT-large; GPT – 470 times bigger than BERT
– Requirements in service model: BERT – requires a fine-tuning process (train the model with a separate layer on top of the sentence encodings); GPT – few-shot learning, provides a text-in, text-out API
– Transformer block: BERT – encoder block; GPT – decoder block
– Sentence generation: BERT – no direct generation; GPT – direct generation
69
• Trends
– Meta model
– Teacher-Student Architecture
– BERT-as-a-Service
70
(added)
WRAP-UP
71
72
multilingual BERT
• Multilingual / Korean BERT issues
– multilingual BERT
• Per-language text representations learned from the wiki text of 104 different languages
• 110k shared WordPiece vocabulary
• Understands context across languages without any paired or language-aligned training data
• Zero-shot knowledge transfer in M-BERT without any cross-lingual objective
– Cross-lingual language model (XLM)
• Higher performance than M-BERT
– causal language modeling (CLM)
– masked language modeling (MLM)
– translation language modeling (TLM)
73
(added)
74
Source: https://github.com/snunlp/KR-BERT
(added)
Reference: https://arxiv.org/pdf/2008.03979.pdf
• Issues
– Long inference time when preparing a service
– The model is sometimes underfit
• Research Directions
– Enhancing the training performance of the model (a matter of modifying the model)
• XLNet
• RoBERTa
– dynamic masking instead of static masking in MLM task
– Trains with only the MLM task, without the NSP task (large batch size)
– Uses BBPE (byte-level BPE) as the tokenizer
• MT-DNN
• T5 (Text-to-Text Transfer Transformer)
– Scaling down the Transformer (while maintaining performance)
• TinyBERT
• DistilBERT (Knowledge Distillation)
• ALBERT (A Lite BERT)
• Transformer.zip (Quantization, Pruning)
75
(added)
76