Big Bird: Transformers for Longer Sequences
Deep Learning Paper Reading Group
NLP Team: 문의현, 백지윤, 조진욱, 황경진
Presenter: 백지윤
Manzil Zaheer, Guru Guruganesh, Avinava Dubey
NeurIPS 2020
Contents
• 1 Introduction - Full-attention in Transformer
• 2 BIGBIRD Architecture
• 3 Theoretical Results about Sparse Attention Mechanism
• 4 Experiments & Results
• 5 Conclusion
Introduction
1. Introduction - Transformer
Transformer's key principle - Self-Attention
[Figure: self-attention computes softmax-normalized weights α1, α2, α3 over the input tokens.]
1. Introduction - Transformer
[Figure: a self-attention "layer" produces each output (e.g. a1) by attending over all input tokens.]
Full-attention
• Drawbacks:
1) O(n^2) time and space complexity
2) As a consequence of 1), the sequence length the model can handle at a given model size (Dmodel) is also limited, to around 512 tokens
Layer Type     | Complexity per Layer | Sequential Operations | Maximum Path Length
Self-Attention | O(n^2 * d)           | O(1)                  | O(1)
[Figure: full-attention matrix for the example sentence "나는 건물주가 되고 싶다" ("I want to become a landlord"): every query Q1-Q4 is compared against every key K1-K4, so the number of dot products grows quadratically with the sequence length. Quadratic!]
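A minimal sketch in plain NumPy (not from the slides) of full self-attention, just to make the quadratic cost explicit: the score matrix has shape (n, n), so both time and memory scale as O(n^2) in the sequence length n.

```python
import numpy as np

def full_attention(Q, K, V):
    """Naive full self-attention; `scores` has shape (n, n), hence the O(n^2) cost."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # every query vs. every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # (n, d)

n, d = 4096, 64                                          # 4096 tokens -> a 4096 x 4096 score matrix
Q = K = V = np.random.randn(n, d)
print(full_attention(Q, K, V).shape)                     # (4096, 64)
```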
• 1) Can we reduce the dot-product computation while keeping the advantages of full attention?
• 2) Can such a sparse attention still retain the advantages of full attention (expressivity, flexibility)?
Related Work
• Methods that accept the 512-token length limit and instead select smaller, relevant contexts to process iteratively:
SpanBERT, ORQA, REALM, RAG
• Methods that reduce the quadratic computation itself:
- Random attention
- Window attention
- Global attention
- BIGBIRD
Reformer, Longformer
BIGBIRD Architecture
Big Bird
• σ (activation) : softmax
• N(i) : the set of neighboring tokens that query i attends to (attention is computed only over these)
• H : # of heads
(These are the symbols of BIGBIRD's generalized sparse attention, ATTN_D(X)_i = x_i + Σ_{h=1}^{H} σ( Q_h(x_i) K_h(X_{N(i)})^T ) V_h(X_{N(i)}).)
Layer Type     | Complexity per Layer | Sequential Operations | Maximum Path Length
Self-Attention | O(n^2 * d)           | O(1)                  | O(1)
[Figure: sparse attention matrix for "나는 건물주가 되고 싶다": most query-key entries are zeroed out, and only a handful (window, random, and global positions) are kept.]
Condition 1: the average path length between any two nodes (tokens) must be short
Condition 2: Notion of locality
<graph sparsification problem>
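A minimal sketch, not from the slides, of how a BIGBIRD-style attention mask could be assembled from its three components (sliding-window, random, and global attention); the window size, number of random keys, and number of global tokens below are illustrative choices, not the paper's settings.

```python
import numpy as np

def bigbird_mask(n, window=3, n_random=2, n_global=1, seed=0):
    """Boolean (n, n) mask; True means query i may attend to key j.
    Combines the three BIGBIRD components: window + random + global attention."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    half = window // 2
    for i in range(n):
        mask[i, max(0, i - half):min(n, i + half + 1)] = True            # sliding window
        mask[i, rng.choice(n, size=n_random, replace=False)] = True      # random keys
    mask[:n_global, :] = True                                            # global tokens attend everywhere
    mask[:, :n_global] = True                                            # and are attended to by everyone
    return mask

m = bigbird_mask(8)
print(m.astype(int))                 # rough picture of the sparse pattern
print(int(m.sum()), "of", m.size)    # far fewer than n^2 allowed pairs as n grows
```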
1. small average path length
• p : each edge is included independently with probability p
Layer Type     | Complexity per Layer | Sequential Operations | Maximum Path Length
Self-Attention | O(n^2 * d)           | O(1)                  | O(1)
Erdos-Renyi model
Evidence 1: average shortest-path length bounded by O(log n)
“The average distances in random graphs with given expected degrees”
“Distribution of shortest path lengths in subcritical Erdos-Renyi networks”
Evidence 2: rapid mixing time
A significant gap between the first and second eigenvalues -> rapid mixing time for random walks -> information spreads quickly between every pair of nodes
“Extremal eigenvalues of critical Erdos-Renyi graphs”
“Spectral radii of sparse random matrices”
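A minimal sketch, not part of the slides, that checks the short-average-path-length claim empirically on an Erdős–Rényi graph; it assumes the networkx library is available, and the values of n and p are arbitrary illustrative choices.

```python
import math
import networkx as nx

n, p = 1000, 0.01                       # expected degree n*p = 10, well above the connectivity threshold ln(n)/n
G = nx.erdos_renyi_graph(n, p, seed=0)

if nx.is_connected(G):
    avg = nx.average_shortest_path_length(G)
    print(f"average shortest path: {avg:.2f}  (log n = {math.log(n):.2f})")
    # For p above the threshold, average distances stay on the order of log(n) --
    # the property BIGBIRD's random-attention component relies on.
```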
2. Notion of locality
• NLP, computational biology, etc. work with 'sequential' data, so preserving the adjacency between neighboring tokens is very important
• A way to measure this adjacency: the clustering coefficient (the probability that the neighbors of a given node are themselves connected to each other)
clustering coefficient
• v : a given node
• Kv : degree of v
• Nv : number of edges among v's neighbors
• cc(v) = 2 * Nv / ( Kv * (Kv - 1) )
Example: ( 2 * 0 ) / ( 4 * 3 ) = 0
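A minimal sketch, following the definitions above, that computes cc(v) from an adjacency-set representation; the tiny star-shaped example graph reproduces the (2 * 0) / (4 * 3) = 0 calculation on the slide (node names are made up).

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """cc(v) = 2 * Nv / (Kv * (Kv - 1)), where Nv counts edges among v's neighbors."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    n_v = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2 * n_v / (k * (k - 1))

# Star example: v has 4 neighbors, none of which are connected to each other.
adj = {"v": {"a", "b", "c", "d"},
       "a": {"v"}, "b": {"v"}, "c": {"v"}, "d": {"v"}}
print(clustering_coefficient(adj, "v"))   # (2 * 0) / (4 * 3) = 0.0
```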
small-world graphs (Watts and Strogatz)
https://uoguelph-engg3130.github.io/engg3130/lectures/lecture08.html
small-world graphs
https://uoguelph-engg3130.github.io/engg3130/lectures/lecture08.html
1. Build a ring of N nodes in which each node is connected to w neighbors in total, w/2 on each side
ex. 10 nodes, 4 neighbors
2. Add a small amount of randomness (rewire a few edges at random)
=> Big Bird
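A minimal sketch, not from the slides, of the Watts–Strogatz construction just described, using networkx (assumed available); it uses the slide's 10-node, 4-neighbor example and a small rewiring probability.

```python
import networkx as nx

# Ring lattice of n nodes with k neighbors each (k/2 per side), then rewire edges with probability p.
n, k, p = 10, 4, 0.1
G = nx.connected_watts_strogatz_graph(n, k, p, seed=0)

print("average clustering coefficient:", nx.average_clustering(G))
print("average shortest path length:", nx.average_shortest_path_length(G))
# A little randomness shortens the average path length while keeping clustering high --
# the same trade-off BIGBIRD makes by combining window (local) and random attention.
```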
Theoretical Results about Sparse Attention
Mechanism
1. Universal Approximators
Universal Approximation Theorem
https://www.youtube.com/watch?v=vnkGn4r62Q8
: the theorem that a neural network with a single hidden layer can approximate any continuous function to arbitrary accuracy
Ex. [Figure: a bump of height 3 over the interval [0, 0.1], built as 3 * ( f(x) - f(x - 0.1) ): two hidden units with biases b=0 and b=0.1 and output weights +3 and -3.]
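A minimal sketch, assuming steep sigmoid units, of the bump construction in the figure: two hidden neurons with biases 0 and 0.1 and output weights +3 and -3 approximate a step of height 3 over the interval [0, 0.1].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, height=3.0, width=0.1, steepness=1000.0):
    """One-hidden-layer network: height * (f(x) - f(x - width)) with steep sigmoids f."""
    h1 = sigmoid(steepness * (x - 0.0))     # hidden unit with bias 0
    h2 = sigmoid(steepness * (x - width))   # hidden unit with bias shifted by `width`
    return height * h1 - height * h2        # output weights +3 and -3

x = np.array([-0.1, 0.0, 0.05, 0.1, 0.2])
print(np.round(bump(x), 3))   # ~0 outside, ~3 in the middle of [0, 0.1], half-height at the edges
```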
1. Universal Approximators
“Any sparse attention mechanism that contains a star graph is a universal approximator.”
• Fcd : the space of permutation-equivariant, bounded functions f: [0,1]^{n×d} -> ℝ^{n×d} (n : # of tokens, d : embedding dimension)
• T_D^{H,m,q} : the class of sparse Transformers with attention graph D, where H = # of heads, m = head size, q = hidden-layer dimension
• d_p(f, g) : the L^p distance used to measure the approximation error between f and g
https://www.youtube.com/watch?v=sfy6qJIRyvg&t=1551s
<Are Transformers universal approximators of sequence-to-sequence functions?>
1. Universal Approximators
“Any sparse attention mechanism that contains a star graph is a universal approximator.”
Approximate Fcd by piece-wise constant functions using Feed Forward
https://www.youtube.com/watch?v=sfy6qJIRyvg&t=1551s
<Are Transformers universal approximators of sequence-to-sequence functions?>
STEP 1.
• Approximate each f in Fcd (the permutation-equivariant, bounded function space) by a function that is piece-wise constant on the grid G = {0, δ, 2δ, …, 1-δ}^{n×d}, obtained by discretizing the domain [0,1]^{n×d}
Delta cubes
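A minimal sketch, not from the paper, of the discretization in STEP 1: each entry of an input in [0,1]^{n×d} is rounded down onto the δ-grid G, so that f can be approximated by a function that is constant on each δ-cube.

```python
import numpy as np

def quantize_to_grid(X, delta=0.1):
    """Map X in [0,1]^{n x d} to the grid G = {0, delta, 2*delta, ..., 1 - delta}^{n x d}."""
    idx = np.floor(X / delta)
    idx = np.clip(idx, 0, round(1.0 / delta) - 1)   # keep values equal to 1.0 in the last cube
    return idx * delta

X = np.random.rand(4, 3)       # n = 4 tokens, d = 3 dimensions
print(quantize_to_grid(X))     # every entry now lies on the delta-grid
```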
1. Universal Approximators
“Any sparse attention mechanism that contains a star graph is a universal approximator.”
Approximate piece-wise constant functions by modified transformers using sparse attention
https://www.youtube.com/watch?v=sfy6qJIRyvg&t=1551s
<Are Transformers universal approximators of sequence-to-sequence functions?>
STEP 2.
• What is a contextual mapping? Prior work argued that the Transformer is a universal approximator because its attention implements a 'contextual mapping' (roughly, each token's output depends on, and uniquely encodes, the entire input sequence)
1. Universal Approximators
“Any sparse attention mechanism that contains a star graph is a universal approximator.”
Approximate piece-wise constant functions by modified transformers using sparse attention
https://www.youtube.com/watch?v=sfy6qJIRyvg&t=1551s
<Are Transformers universal approximators of sequence-to-sequence functions?>
STEP 2.
• The prior work proved this using the permutation-equivariant property of the Transformer's (full) attention.
1. Universal Approximators
“Any sparse attention mechanism that contains a star graph is a universal approximator.”
Approximate piece-wise constant functions by modified transformers using sparse attention
https://www.youtube.com/watch?v=sfy6qJIRyvg&t=1551s
<Are Transformers universal approximators of sequence-to-sequence functions?>
STEP 2.
• Because sparse attention is not full attention, the same proof cannot be reused directly! Two new ingredients are introduced:
- a sparse shift operator: shifts entries that lie in a given range (the directed sparse attention graph D determines by how much)
- an additional global token
1. Universal Approximators
“Any sparse attention mechanism that contains a star graph is a universal approximator.”
Approximate modified transformers by original transformers using sparse attention
https://www.youtube.com/watch?v=sfy6qJIRyvg&t=1551s
<Are Transformers universal approximators of sequence-to-sequence functions?>
STEP 3.
2.Turing Completeness
• Turing complete: a programming language or abstract machine has the same computational power as a Turing machine
(a Turing machine is an abstract machine that operates on a special tape)
- Condition 1: it must support conditional branching (i.e. "if X, then do Y")
- Condition 2: it must have access to an arbitrarily large amount of memory
https://www.youtube.com/watch?v=RPQD7-AOjMI
=> Prior work shows that the Transformer with full attention is Turing complete; building on this, the paper proves that the modified (sparse) Transformer is also Turing complete.
3. Limitations
In the worst case, a number of layers on the order of the input sequence length is required.
Layer Type     | Complexity per Layer | Sequential Operations | Maximum Path Length
Self-Attention | O(n^2 * d)           | O(1)                  | O(1)
Experiments & Results
NLP - Pretraining and MLM
• Build the ITC/ETC variants of BIGBIRD and pre-train them (the objective includes predicting a random subset of masked tokens)
• Four standard datasets are used for pre-training
• Batch size of 32-64
• Maximum document length increased from 512 to 4096 tokens
NLP - QA
NLP - QA
NLP - Classification
NLP - Summarization
NLP - Summarization
Genomics - pretraining and MLM
Genomics - promoter region prediction &
Chromatin-Profile prediction
Conclusion
Conclusion
• BIGBIRD achieves SOTA on many QA and classification tasks
• The genomics results suggest it can also be applied to domains beyond NLP