WHAT DO VISION TRANSFORMERS LEARN?
A VISUAL EXPLORATION
2023. 2. 5
김준철, 이해원, 류채은, 조경진, 이찬혁
Amin Ghiasi, Hamid Kazemi, Eitan Borgnia, Steven Reich, Manli Shu, Micah Goldblum, Andrew Gordon Wilson, Tom Goldstein
Preprint 2022
Contents
• Background
• Introduction
• ViT Feature Visualization
• Last-Layer Token Mixing
• Comparison of ViTs and CNNs
• ViTs with Language Model Supervision
• Discussion
Background: ViT
• ViT Architecture
• Feed-forward network
• CLS Token
[Figure: an image split into patches; the patch tokens plus a CLS token feed the attention block, ( b, n, d ) → ( b, n+1, d )]
- When an image is split into 4 patches, the CLS token is concatenated to the patch tokens and then the attention computation is performed.
- Ultimately, only the CLS token is used for classification.
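The token shapes above can be sketched in a few lines (a minimal illustration, not the authors' code; the sizes b, n, d are toy values, and a real ViT learns the CLS token rather than using zeros):

```python
import numpy as np

b, n, d = 2, 4, 8                      # batch, patches, embedding dim (toy)
patches = np.random.randn(b, n, d)     # patch embeddings, shape (b, n, d)
cls_token = np.zeros((1, 1, d))        # learnable in practice; zeros here
# Prepend the CLS token to every sequence: (b, n, d) -> (b, n+1, d)
tokens = np.concatenate([np.repeat(cls_token, b, axis=0), patches], axis=1)
# Attention runs over `tokens`; after the last block, only tokens[:, 0]
# (the CLS token) is fed to the classifier.
```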
Background: CLIP
• Contrastive learning
Given a batch of N text-image pairs, pre-training uses a cross-entropy (CE) loss that pulls each positive pair together and pushes the negative pairs apart (blue blocks: positive pairs; white blocks: the N−1 negative pairs).
• Zero-shot learning
A prompt of the form "A photo of [object]" is fed in for each candidate class, and the best-matching class is predicted.
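The contrastive objective above can be sketched as a symmetric cross-entropy over an N×N similarity matrix (a hedged toy sketch of CLIP-style training, not CLIP's actual code; `temperature` and the embedding sizes are illustrative assumptions):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize, then compare every image against every caption.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (N, N); positives on the diagonal
    labels = np.arange(len(logits))

    def ce(l):  # cross-entropy with the diagonal entry as the target class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Symmetric loss: image->text over rows, text->image over columns.
    return 0.5 * (ce(logits) + ce(logits.T))

loss = clip_loss(np.random.randn(8, 16), np.random.randn(8, 16))
```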
Background: CNN Feature Visualization
• From random noise to the goal image
• Optimization objectives
Starting from input noise, the image is updated by gradient descent; any of several objectives can be chosen, and the input image is optimized so that the chosen objective value is maximized.
Source: https://distill.pub/2017/feature-visualization/
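The optimization loop above can be illustrated with a toy stand-in for a network activation (the linear "neuron" `w` is hypothetical; real feature visualization backpropagates through a trained CNN instead):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(64)          # toy "neuron" weights over a 64-pixel image
x = rng.standard_normal(64) * 0.01   # start from small random noise
initial = float(w @ x)               # the neuron's response before optimizing

for _ in range(100):
    # Gradient of the toy objective w·x with respect to x is simply w;
    # ascend so the response keeps increasing.
    x += 0.1 * w

final = float(w @ x)                 # response after 100 gradient steps
```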
Background: CNN Feature Visualization
• The Enemy of Feature Visualization
- CNNs respond strongly to nonsensical high-frequency patterns.
• Regularization – Frequency Regularization
- Regularize with total variation, or use augmentations such as blur or a bilateral filter. Source: https://distill.pub/2017/feature-visualization/
Introduction
• Problem definition
Although transformer-based models are now used for a wide range of tasks, little is known about their inductive biases or about what transformers tend to learn.
1. Applying the authors' tools to the relatively high-dimensional features of the position-wise feed-forward layer results in successful and informative visualizations.
(Existing methods applied to the low-dimensional key, query, and value features produced uninterpretable results, so a new method is proposed.)
2. The authors show that patch-wise image activation patterns for ViT features essentially behave like saliency maps, highlighting the regions of the image a given feature attends to.
(Each patch does not store global information; instead, the spatial relationships between patches are preserved.)
3. The authors compare the behavior of ViTs and CNNs, finding that ViTs make better use of background information and rely less on high-frequency, textural attributes.
(An analysis of the similarities and differences between ViTs and CNNs.)
4. The authors investigate the effect of natural language supervision with CLIP on the types of features extracted by ViTs.
(An analysis of the role of natural language supervision using CLIP with ViTs.)
ViT Feature Visualization
Keywords
- Start from random noise (gradient steps)
- Penalize total variation
- Image augmentation (Jitter, ColorShift, Gaussian smoothing)
ViT Feature Visualization
• Feature Vector Definition
When each of the 4 patches consists of 4 vectors, the i-th vector of every patch is concatenated to form a feature vector; as many feature vectors can be formed as there are patches.
• Optimization objective
The optimization direction is chosen so that the norm of the feature vector at a chosen layer is maximized.
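The construction can be sketched as follows (toy sizes; `i = 2` is an arbitrary index chosen purely for illustration, and the real objective is driven up by gradient ascent rather than computed once):

```python
import numpy as np

n, d = 4, 6                           # 4 patches, 6 entries per patch (toy)
acts = np.random.randn(n, d)          # activations at some layer, one row per patch
i = 2                                 # pick the i-th entry of every patch...
feature_i = acts[:, i]                # ...and stack them into one feature vector
objective = float(np.linalg.norm(feature_i))   # quantity to maximize over the input
```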
ViT Feature Visualization
• Variation regularization
x: input, a: augmentation, TV: total variation, λ: regularization parameter, k: number of samples
Over augmented samples generated from the input image, the optimization maximizes the feature-vector norm while penalizing total variation, converging to the input image that most strongly activates the chosen layer.
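A plausible reconstruction of the objective described above, assembled from the slide's variable legend rather than copied from the paper (the squared norm and the exact form of the penalty term are assumptions):

```latex
x^{*} = \arg\max_{x} \; \sum_{k=1}^{K} \left\| f\!\left(a_k(x)\right) \right\|_{2}^{2} \;-\; \lambda \,\mathrm{TV}(x)
```

Here f(·) is the feature vector at the chosen layer and a_k(·) is the k-th random augmentation applied to the input x.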
• Best augmentation combination
GS: Gaussian Smoothing, CS: Color Shift
Jitter: random brightness, contrast, saturation
ViT Feature Visualization
• Multi-headed attention layer visualization • ViT feed-forward layer
FC1: 768 → 3072 (visualization target)
FC2: 3072 → 768
Because the features of the low-dimensional key, query, and value were not interpretable, visualization is performed on the 3072-dimensional vectors after FC1.
ViT Feature Visualization
• Examples
Q & A
Last-Layer Token Mixing
• Feature activation map
- One can infer that a ViT, like a saliency map, conveys information based on the overall features of each object.
* Saliency map: a visualization that highlights points of large pixel change, separating the object of interest from irrelevant regions.
Up to the second-to-last layer, each patch learns to preserve spatial information much as a CNN does, so the globalization performed by the last layer deserves a closer look.
Last-Layer Token Mixing
• Globalization
- The figure shows the combination most strongly activated at the last layer for a shopping cart.
- The activation pattern on the right is nearly uniform across the whole image.
- ViT uses only the CLS token to produce the classification result, and the process above shows that the CLS token performs globalization over all tokens.
Left: the authors' method | Middle: most activated image | Right: activation map
Two follow-up experiments examine the last layer's globalization.
Last-Layer Token Mixing
• Isolating CLS
When attention in layers 1-11 is computed without the CLS token, and attention involving the CLS token is applied only in layer 12 (the last layer), classifying with the CLS token alone still causes only a small drop in accuracy.
→ The CLS token's globalization over all tokens takes place in the last layer.
• Experiment 1
The CLS token plays a relatively minor role throughout the network and is not used for
globalization until the last layer
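The setup above can be sketched with a toy attention routine (a hedged, single-head, unparameterized stand-in for the real ViT blocks; only the token routing matters here):

```python
import numpy as np

def attend(q_tokens, kv_tokens):
    # Plain scaled dot-product attention with no learned projections.
    scores = q_tokens @ kv_tokens.T / np.sqrt(kv_tokens.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ kv_tokens

rng = np.random.default_rng(0)
n, d = 4, 8
patches = rng.standard_normal((n, d))
cls = rng.standard_normal((1, d))

for _ in range(11):                    # layers 1-11: CLS token excluded
    patches = attend(patches, patches)

# Layer 12 only: the CLS token attends over all tokens, then feeds the classifier.
cls_out = attend(cls, np.vstack([cls, patches]))
```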
Last-Layer Token Mixing
• Patch Average & Patch Maximum
Even without fine-tuning, applying patch-average or patch-maximum pooling causes only a small drop in accuracy, which shows that most patches carry roughly the same information.
• Experiment 2
The last-layer globalization is not exclusive to the CLS token, but actually occurs across every patch
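Experiment 2 can be sketched as swapping pooled patch tokens into the classification head in place of the CLS token (toy shapes; `head_w` is a hypothetical stand-in for the trained head):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, num_classes = 4, 8, 3
tokens = rng.standard_normal((n + 1, d))        # [CLS] + n patch tokens, last layer
head_w = rng.standard_normal((d, num_classes))  # toy classification head

logits_cls = tokens[0] @ head_w                 # standard route: CLS token only
logits_avg = tokens[1:].mean(axis=0) @ head_w   # patch-average pooling, no fine-tuning
logits_max = tokens[1:].max(axis=0) @ head_w    # patch-maximum pooling, no fine-tuning
# The finding: either pooled route loses little accuracy versus the CLS route.
```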
Comparison of ViTs and CNNs
• Similarity
CNNs detect color, edges, and texture in their early layers and recognize more complex object structure in deeper layers; ViTs behave similarly.
(Top row: CNNs; bottom row: ViTs.)
Comparison of ViTs and CNNs
• Difference
• Foreground & Background masking test
Unlike CNNs, ViTs were found to perform well when only the foreground, or only the background, of the image was kept.
Comparison of ViTs and CNNs
• Difference
• Low-pass filtering test
Unlike CNNs, ViTs maintain their accuracy comparatively well even when high-frequency information is removed by low-pass filtering.
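The low-pass test can be sketched with an FFT mask (a generic implementation under assumed parameters such as `keep_frac`; the paper's exact filtering procedure may differ):

```python
import numpy as np

def low_pass(img, keep_frac=0.25):
    # Keep only frequencies within a small radius of DC, zero the rest.
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    radius = keep_frac * min(h, w) / 2
    mask = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= radius ** 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))

img = np.random.rand(32, 32)
filtered = low_pass(img)   # high-frequency detail removed; feed this to the model
```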
ViTs with Language Model Supervision
• Visualizing language model supervision
While a ViT model learns the features of specific objects, the CLIP model, trained with language model supervision, is seen to visualize not only nouns but also the meanings of prepositions, adjectives, and adverbs.
Discussion
1. The authors introduce a framework for optimization-based feature visualization. They then identify which components of a ViT are most amenable to producing interpretable images, finding that the high-dimensional inner projection of the feed-forward layer is suitable while the key, query, and value features of self-attention are not.
2. The authors observe that ViTs preserve the spatial information of the patches, even for individual channels, across all layers except the last, indicating that the networks learn spatial relationships from scratch. They further show that the sudden disappearance of localization information in the last attention layer results from a learned token-mixing behavior that resembles average pooling.
3. The authors find that ViTs make better use of background information and make vastly superior predictions relative to CNNs when exposed only to image backgrounds, despite the seemingly counter-intuitive property that ViTs are less sensitive than CNNs to the loss of high-frequency information, which one might expect to be critical for making effective use of background.
4. The authors show that ViTs trained with language model supervision learn more semantic and conceptual features, rather than the object-specific visual features typical of classifiers.
Q & A
References
- Saliency map: https://junstar92.tistory.com/153
- Feature visualization: https://distill.pub/2017/feature-visualization/

Contenu connexe

Tendances

Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Amy W. Tang
 
Architecting Security and Governance Across Multi Accounts
Architecting Security and Governance Across Multi AccountsArchitecting Security and Governance Across Multi Accounts
Architecting Security and Governance Across Multi AccountsAmazon Web Services
 
Recommending What Video to Watch Next: A Multitask Ranking System
Recommending What Video to Watch Next: A Multitask Ranking SystemRecommending What Video to Watch Next: A Multitask Ranking System
Recommending What Video to Watch Next: A Multitask Ranking Systemivaderivader
 
Event Streaming in the Telco Industry with Apache Kafka® and Confluent
Event Streaming in the Telco Industry with Apache Kafka® and ConfluentEvent Streaming in the Telco Industry with Apache Kafka® and Confluent
Event Streaming in the Telco Industry with Apache Kafka® and Confluentconfluent
 
Efficiency and Cost Optimization: Rightsizing Cloud Resources with Datadog
Efficiency and Cost Optimization: Rightsizing Cloud Resources with DatadogEfficiency and Cost Optimization: Rightsizing Cloud Resources with Datadog
Efficiency and Cost Optimization: Rightsizing Cloud Resources with DatadogCloudability
 
Accelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflowAccelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflowDatabricks
 
Deep dive ECS & Fargate Deep Dive
Deep dive ECS & Fargate Deep DiveDeep dive ECS & Fargate Deep Dive
Deep dive ECS & Fargate Deep DiveAmazon Web Services
 
Grafana Mimir and VictoriaMetrics_ Performance Tests.pptx
Grafana Mimir and VictoriaMetrics_ Performance Tests.pptxGrafana Mimir and VictoriaMetrics_ Performance Tests.pptx
Grafana Mimir and VictoriaMetrics_ Performance Tests.pptxRomanKhavronenko
 
Architecting for High Availability
Architecting for High AvailabilityArchitecting for High Availability
Architecting for High AvailabilityAmazon Web Services
 
Robust MLOps with Open-Source: ModelDB, Docker, Jenkins, and Prometheus
Robust MLOps with Open-Source: ModelDB, Docker, Jenkins, and PrometheusRobust MLOps with Open-Source: ModelDB, Docker, Jenkins, and Prometheus
Robust MLOps with Open-Source: ModelDB, Docker, Jenkins, and PrometheusManasi Vartak
 
Kyma: Extending Business systems with Kubernetes, Istio and <fill the blank>.
Kyma: Extending Business systems with Kubernetes, Istio and <fill the blank>.Kyma: Extending Business systems with Kubernetes, Istio and <fill the blank>.
Kyma: Extending Business systems with Kubernetes, Istio and <fill the blank>.SAP HANA Cloud Platform
 
Collaborative Filtering at Spotify
Collaborative Filtering at SpotifyCollaborative Filtering at Spotify
Collaborative Filtering at SpotifyErik Bernhardsson
 
Introduction to Time Series Analytics with Microsoft Azure
Introduction to Time Series Analytics with Microsoft AzureIntroduction to Time Series Analytics with Microsoft Azure
Introduction to Time Series Analytics with Microsoft AzureCodit
 
Netflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and OpsNetflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and OpsAdrian Cockcroft
 
Best Practices in Planning a Large-Scale Migration to AWS - AWS Online Tech T...
Best Practices in Planning a Large-Scale Migration to AWS - AWS Online Tech T...Best Practices in Planning a Large-Scale Migration to AWS - AWS Online Tech T...
Best Practices in Planning a Large-Scale Migration to AWS - AWS Online Tech T...Amazon Web Services
 
“MLOps: Managing Data and Workflows for Efficient Model Development and Deplo...
“MLOps: Managing Data and Workflows for Efficient Model Development and Deplo...“MLOps: Managing Data and Workflows for Efficient Model Development and Deplo...
“MLOps: Managing Data and Workflows for Efficient Model Development and Deplo...Edge AI and Vision Alliance
 
How HSBC Uses Serverless to Process Millions of Transactions in Real Time (FS...
How HSBC Uses Serverless to Process Millions of Transactions in Real Time (FS...How HSBC Uses Serverless to Process Millions of Transactions in Real Time (FS...
How HSBC Uses Serverless to Process Millions of Transactions in Real Time (FS...Amazon Web Services
 
The Rise Of Event Streaming – Why Apache Kafka Changes Everything
The Rise Of Event Streaming – Why Apache Kafka Changes EverythingThe Rise Of Event Streaming – Why Apache Kafka Changes Everything
The Rise Of Event Streaming – Why Apache Kafka Changes EverythingKai Wähner
 

Tendances (20)

Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
Espresso: LinkedIn's Distributed Data Serving Platform (Talk)
 
Architecting Security and Governance Across Multi Accounts
Architecting Security and Governance Across Multi AccountsArchitecting Security and Governance Across Multi Accounts
Architecting Security and Governance Across Multi Accounts
 
Recommending What Video to Watch Next: A Multitask Ranking System
Recommending What Video to Watch Next: A Multitask Ranking SystemRecommending What Video to Watch Next: A Multitask Ranking System
Recommending What Video to Watch Next: A Multitask Ranking System
 
Event Streaming in the Telco Industry with Apache Kafka® and Confluent
Event Streaming in the Telco Industry with Apache Kafka® and ConfluentEvent Streaming in the Telco Industry with Apache Kafka® and Confluent
Event Streaming in the Telco Industry with Apache Kafka® and Confluent
 
Efficiency and Cost Optimization: Rightsizing Cloud Resources with Datadog
Efficiency and Cost Optimization: Rightsizing Cloud Resources with DatadogEfficiency and Cost Optimization: Rightsizing Cloud Resources with Datadog
Efficiency and Cost Optimization: Rightsizing Cloud Resources with Datadog
 
Deep Dive Amazon EC2
Deep Dive Amazon EC2Deep Dive Amazon EC2
Deep Dive Amazon EC2
 
Accelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflowAccelerate Your ML Pipeline with AutoML and MLflow
Accelerate Your ML Pipeline with AutoML and MLflow
 
Deep dive ECS & Fargate Deep Dive
Deep dive ECS & Fargate Deep DiveDeep dive ECS & Fargate Deep Dive
Deep dive ECS & Fargate Deep Dive
 
Intro to SageMaker
Intro to SageMakerIntro to SageMaker
Intro to SageMaker
 
Grafana Mimir and VictoriaMetrics_ Performance Tests.pptx
Grafana Mimir and VictoriaMetrics_ Performance Tests.pptxGrafana Mimir and VictoriaMetrics_ Performance Tests.pptx
Grafana Mimir and VictoriaMetrics_ Performance Tests.pptx
 
Architecting for High Availability
Architecting for High AvailabilityArchitecting for High Availability
Architecting for High Availability
 
Robust MLOps with Open-Source: ModelDB, Docker, Jenkins, and Prometheus
Robust MLOps with Open-Source: ModelDB, Docker, Jenkins, and PrometheusRobust MLOps with Open-Source: ModelDB, Docker, Jenkins, and Prometheus
Robust MLOps with Open-Source: ModelDB, Docker, Jenkins, and Prometheus
 
Kyma: Extending Business systems with Kubernetes, Istio and <fill the blank>.
Kyma: Extending Business systems with Kubernetes, Istio and <fill the blank>.Kyma: Extending Business systems with Kubernetes, Istio and <fill the blank>.
Kyma: Extending Business systems with Kubernetes, Istio and <fill the blank>.
 
Collaborative Filtering at Spotify
Collaborative Filtering at SpotifyCollaborative Filtering at Spotify
Collaborative Filtering at Spotify
 
Introduction to Time Series Analytics with Microsoft Azure
Introduction to Time Series Analytics with Microsoft AzureIntroduction to Time Series Analytics with Microsoft Azure
Introduction to Time Series Analytics with Microsoft Azure
 
Netflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and OpsNetflix on Cloud - combined slides for Dev and Ops
Netflix on Cloud - combined slides for Dev and Ops
 
Best Practices in Planning a Large-Scale Migration to AWS - AWS Online Tech T...
Best Practices in Planning a Large-Scale Migration to AWS - AWS Online Tech T...Best Practices in Planning a Large-Scale Migration to AWS - AWS Online Tech T...
Best Practices in Planning a Large-Scale Migration to AWS - AWS Online Tech T...
 
“MLOps: Managing Data and Workflows for Efficient Model Development and Deplo...
“MLOps: Managing Data and Workflows for Efficient Model Development and Deplo...“MLOps: Managing Data and Workflows for Efficient Model Development and Deplo...
“MLOps: Managing Data and Workflows for Efficient Model Development and Deplo...
 
How HSBC Uses Serverless to Process Millions of Transactions in Real Time (FS...
How HSBC Uses Serverless to Process Millions of Transactions in Real Time (FS...How HSBC Uses Serverless to Process Millions of Transactions in Real Time (FS...
How HSBC Uses Serverless to Process Millions of Transactions in Real Time (FS...
 
The Rise Of Event Streaming – Why Apache Kafka Changes Everything
The Rise Of Event Streaming – Why Apache Kafka Changes EverythingThe Rise Of Event Streaming – Why Apache Kafka Changes Everything
The Rise Of Event Streaming – Why Apache Kafka Changes Everything
 

Similaire à WHAT DO VISION TRANSFORMERS LEARN A VISUAL EXPLORATION.pdf

네이버 NLP Challenge 후기
네이버 NLP Challenge 후기네이버 NLP Challenge 후기
네이버 NLP Challenge 후기Jangwon Park
 
Convolutional Neural Networks(CNN) / Stanford cs231n 2017 lecture 5 / MLAI@UO...
Convolutional Neural Networks(CNN) / Stanford cs231n 2017 lecture 5 / MLAI@UO...Convolutional Neural Networks(CNN) / Stanford cs231n 2017 lecture 5 / MLAI@UO...
Convolutional Neural Networks(CNN) / Stanford cs231n 2017 lecture 5 / MLAI@UO...changedaeoh
 
[한국어] Neural Architecture Search with Reinforcement Learning
[한국어] Neural Architecture Search with Reinforcement Learning[한국어] Neural Architecture Search with Reinforcement Learning
[한국어] Neural Architecture Search with Reinforcement LearningKiho Suh
 
180624 mobile visionnet_baeksucon_jwkang_pub
180624 mobile visionnet_baeksucon_jwkang_pub180624 mobile visionnet_baeksucon_jwkang_pub
180624 mobile visionnet_baeksucon_jwkang_pubJaewook. Kang
 
History of Vision AI
History of Vision AIHistory of Vision AI
History of Vision AITae Young Lee
 
1. klaytn intro
1. klaytn intro1. klaytn intro
1. klaytn intro전 민규
 
[Paper Review] Visualizing and understanding convolutional networks
[Paper Review] Visualizing and understanding convolutional networks[Paper Review] Visualizing and understanding convolutional networks
[Paper Review] Visualizing and understanding convolutional networksKorea, Sejong University.
 
Intriguing properties of contrastive losses
Intriguing properties of contrastive lossesIntriguing properties of contrastive losses
Intriguing properties of contrastive lossestaeseon ryu
 
네트워크 경량화 이모저모 @ 2020 DLD
네트워크 경량화 이모저모 @ 2020 DLD네트워크 경량화 이모저모 @ 2020 DLD
네트워크 경량화 이모저모 @ 2020 DLDKim Junghoon
 
딥러닝 논문읽기 efficient netv2 논문리뷰
딥러닝 논문읽기 efficient netv2  논문리뷰딥러닝 논문읽기 efficient netv2  논문리뷰
딥러닝 논문읽기 efficient netv2 논문리뷰taeseon ryu
 
Image Deep Learning 실무적용
Image Deep Learning 실무적용Image Deep Learning 실무적용
Image Deep Learning 실무적용Youngjae Kim
 
A Beginner's guide to understanding Autoencoder
A Beginner's guide to understanding AutoencoderA Beginner's guide to understanding Autoencoder
A Beginner's guide to understanding AutoencoderLee Seungeun
 
CNN Architecture A to Z
CNN Architecture A to ZCNN Architecture A to Z
CNN Architecture A to ZLEE HOSEONG
 
Convolutional neural networks
Convolutional neural networksConvolutional neural networks
Convolutional neural networksHyunjinBae3
 
Summary in recent advances in deep learning for object detection
Summary in recent advances in deep learning for object detectionSummary in recent advances in deep learning for object detection
Summary in recent advances in deep learning for object detection창기 문
 
Summary in recent advances in deep learning for object detection
Summary in recent advances in deep learning for object detectionSummary in recent advances in deep learning for object detection
Summary in recent advances in deep learning for object detection창기 문
 
180525 mobile visionnet_hanlim_extended
180525 mobile visionnet_hanlim_extended180525 mobile visionnet_hanlim_extended
180525 mobile visionnet_hanlim_extendedJaewook. Kang
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural NetworksSanghoon Yoon
 
Deep Learning Into Advance - 1. Image, ConvNet
Deep Learning Into Advance - 1. Image, ConvNetDeep Learning Into Advance - 1. Image, ConvNet
Deep Learning Into Advance - 1. Image, ConvNetHyojun Kim
 

Similaire à WHAT DO VISION TRANSFORMERS LEARN A VISUAL EXPLORATION.pdf (20)

Review MLP Mixer
Review MLP MixerReview MLP Mixer
Review MLP Mixer
 
네이버 NLP Challenge 후기
네이버 NLP Challenge 후기네이버 NLP Challenge 후기
네이버 NLP Challenge 후기
 
Convolutional Neural Networks(CNN) / Stanford cs231n 2017 lecture 5 / MLAI@UO...
Convolutional Neural Networks(CNN) / Stanford cs231n 2017 lecture 5 / MLAI@UO...Convolutional Neural Networks(CNN) / Stanford cs231n 2017 lecture 5 / MLAI@UO...
Convolutional Neural Networks(CNN) / Stanford cs231n 2017 lecture 5 / MLAI@UO...
 
[한국어] Neural Architecture Search with Reinforcement Learning
[한국어] Neural Architecture Search with Reinforcement Learning[한국어] Neural Architecture Search with Reinforcement Learning
[한국어] Neural Architecture Search with Reinforcement Learning
 
180624 mobile visionnet_baeksucon_jwkang_pub
180624 mobile visionnet_baeksucon_jwkang_pub180624 mobile visionnet_baeksucon_jwkang_pub
180624 mobile visionnet_baeksucon_jwkang_pub
 
History of Vision AI
History of Vision AIHistory of Vision AI
History of Vision AI
 
1. klaytn intro
1. klaytn intro1. klaytn intro
1. klaytn intro
 
[Paper Review] Visualizing and understanding convolutional networks
[Paper Review] Visualizing and understanding convolutional networks[Paper Review] Visualizing and understanding convolutional networks
[Paper Review] Visualizing and understanding convolutional networks
 
Intriguing properties of contrastive losses
Intriguing properties of contrastive lossesIntriguing properties of contrastive losses
Intriguing properties of contrastive losses
 
네트워크 경량화 이모저모 @ 2020 DLD
네트워크 경량화 이모저모 @ 2020 DLD네트워크 경량화 이모저모 @ 2020 DLD
네트워크 경량화 이모저모 @ 2020 DLD
 
딥러닝 논문읽기 efficient netv2 논문리뷰
딥러닝 논문읽기 efficient netv2  논문리뷰딥러닝 논문읽기 efficient netv2  논문리뷰
딥러닝 논문읽기 efficient netv2 논문리뷰
 
Image Deep Learning 실무적용
Image Deep Learning 실무적용Image Deep Learning 실무적용
Image Deep Learning 실무적용
 
A Beginner's guide to understanding Autoencoder
A Beginner's guide to understanding AutoencoderA Beginner's guide to understanding Autoencoder
A Beginner's guide to understanding Autoencoder
 
CNN Architecture A to Z
CNN Architecture A to ZCNN Architecture A to Z
CNN Architecture A to Z
 
Convolutional neural networks
Convolutional neural networksConvolutional neural networks
Convolutional neural networks
 
Summary in recent advances in deep learning for object detection
Summary in recent advances in deep learning for object detectionSummary in recent advances in deep learning for object detection
Summary in recent advances in deep learning for object detection
 
Summary in recent advances in deep learning for object detection
Summary in recent advances in deep learning for object detectionSummary in recent advances in deep learning for object detection
Summary in recent advances in deep learning for object detection
 
180525 mobile visionnet_hanlim_extended
180525 mobile visionnet_hanlim_extended180525 mobile visionnet_hanlim_extended
180525 mobile visionnet_hanlim_extended
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural Networks
 
Deep Learning Into Advance - 1. Image, ConvNet
Deep Learning Into Advance - 1. Image, ConvNetDeep Learning Into Advance - 1. Image, ConvNet
Deep Learning Into Advance - 1. Image, ConvNet
 

Plus de taeseon ryu

OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...taeseon ryu
 
3D Gaussian Splatting
3D Gaussian Splatting3D Gaussian Splatting
3D Gaussian Splattingtaeseon ryu
 
Hyperbolic Image Embedding.pptx
Hyperbolic  Image Embedding.pptxHyperbolic  Image Embedding.pptx
Hyperbolic Image Embedding.pptxtaeseon ryu
 
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정taeseon ryu
 
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdfLLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdftaeseon ryu
 
Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories taeseon ryu
 
Packed Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation ExtractionPacked Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation Extractiontaeseon ryu
 
MOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement LearningMOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement Learningtaeseon ryu
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Modelstaeseon ryu
 
Visual prompt tuning
Visual prompt tuningVisual prompt tuning
Visual prompt tuningtaeseon ryu
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdfvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdftaeseon ryu
 
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdfReinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdftaeseon ryu
 
The Forward-Forward Algorithm
The Forward-Forward AlgorithmThe Forward-Forward Algorithm
The Forward-Forward Algorithmtaeseon ryu
 
Towards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural NetworksTowards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural Networkstaeseon ryu
 
BRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive SummarizationBRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive Summarizationtaeseon ryu
 

Plus de taeseon ryu (20)

VoxelNet
VoxelNetVoxelNet
VoxelNet
 
OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...
 
3D Gaussian Splatting
3D Gaussian Splatting3D Gaussian Splatting
3D Gaussian Splatting
 
JetsonTX2 Python
 JetsonTX2 Python  JetsonTX2 Python
JetsonTX2 Python
 
Hyperbolic Image Embedding.pptx
Hyperbolic  Image Embedding.pptxHyperbolic  Image Embedding.pptx
Hyperbolic Image Embedding.pptx
 
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
 
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdfLLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
 
YOLO V6
YOLO V6YOLO V6
YOLO V6
 
Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories
 
RL_UpsideDown
RL_UpsideDownRL_UpsideDown
RL_UpsideDown
 
Packed Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation ExtractionPacked Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation Extraction
 
MOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement LearningMOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement Learning
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Models
 
Visual prompt tuning
Visual prompt tuningVisual prompt tuning
Visual prompt tuning
 
mPLUG
mPLUGmPLUG
mPLUG
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdfvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
 
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdfReinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
 
The Forward-Forward Algorithm
The Forward-Forward AlgorithmThe Forward-Forward Algorithm
The Forward-Forward Algorithm
 
Towards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural NetworksTowards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural Networks
 
BRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive SummarizationBRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive Summarization
 

WHAT DO VISION TRANSFORMERS LEARN A VISUAL EXPLORATION.pdf

  • 1. WHAT DO VISION TRANSFORMERS LEARN? A VISUAL EXPLORATION 2023. 2. 5 김준철, 이해원, 류채은, 조경진, 이찬혁 Amin Ghiasi Hamid Kazemi Eitan Borgnia Steven Reich Manli Shu1 Micah Goldblum Andrew Gordon Wilson Tom Goldstein Preprint 2022
  • 2. 2 • Background • Introduction • ViT Feature Visualization • Last-Layer Token Mixing • Comparison of ViTs and CNNs • ViTs with Language Model Supervision • Discussion Contents
  • 3. 3 Background ViT • Feed forward network • CLS Token patch patch patch patch patch patch patch patch CLS - 이미지를 4개 패치로 나누는 경우 CLS 토큰을 concat 후 Attention 연산 - 최종적으로 CLS 토큰만 분류에 쓰인다. Attention • ViT Architecture ( b, n, d ) ( b, n+1, d )
  • 4. 4 Background CLIP • Contrastive learning N 개의 배치데이터 text-image pairs 를 바탕으로 Positive pair는 가까워지도록 negative pair 는 멀어지도록 CE Loss 를 활용하여 Pre-train 진행 (파란색 블록 : positive pair, 흰색 블록 : n-1 개의 negative pair) • Zero-shot learning 평가하고자 하는 클래스 개수만큼의 A photo of [object] 를 넣어 class 를 예측한다.
  • 5. 5 Background CNN Feature Visualization • From noise scratch to goal • Optimization Objectives Gradient descent를 활용하여 입력 노이즈 변환 다양한 목표 중 원하는 방향을 정할 수 있으며 목표값이 최대화 되도록 입력 이미지 Gradient descent 출처 : https://distill.pub/2017/feature-visualization/
  • 6. Background CNN Feature Visualization • The Enemy of Feature Visualization - CNN responds strongly to nonsensical high-frequency patterns • Regularization – Frequency Regularization - Regularization total variation or Augmentation like blur or Bilateral Filter 출처 : https://distill.pub/2017/feature-visualization/
  • 7. 7 Introduction 1. Applying our tools to the relatively high-dimensional features of the position-wise feed forward layer results in successful and informative visualizations. (기존의 방법론을 low-dimensional feature 인 key, query, value에 적용했을 때 해석 불가능한 문제가 있었기 때문에 새로운 방 법론 제시) 2. Authors show that patch-wise image activation patterns for ViT features essentially behave like s aliency maps, highlighting the regions of the image a given feature attends to. (각 패치는 global information 을 저장하는 방식이 아니라 패치간의 위치관계를 가지고 있음을 확인) 3. Authors compare the behavior of ViTs and CNNs, finding that ViTs make better use of backgrou nd information and rely less on high-frequency, textural attributes. (ViTs 와 CNNs 의 공통점과 차이점에 대한 분석) 4. Authors investigate the effect of natural language supervision with CLIP on the types of feature s extracted by ViTs. (CLIP, ViT 를 활용한 natural language supervision의 역할 분석) • Definition problem 트랜스포머 기반의 모델이 다양한 Task에 많이 쓰이고 있음에도 불구하고 inductive biases 혹은 transformer 가 배우는 경향성에 대해서 거의 모르고 있음
• 8. 8 ViT Feature Visualization
Keywords
- From random noise (gradient steps)
- Penalize total variation
- Image augmentation (Jitter, ColorShift, Gaussian smoothing)
• 9. 9 ViT Feature Visualization
• Feature Vector Definition
When each of the 4 patches is represented by an activation vector, a feature vector is formed by concatenating the i-th entry from every patch's vector, giving one entry per patch; one such feature vector can be built for every feature index.
• Optimization objective
The optimization direction is set so that the activation of the chosen feature vector in a given layer is maximized.
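The feature-vector definition above can be sketched directly (illustrative names; the entries come from a real ViT layer in practice):

```python
import numpy as np

n_patches, d = 4, 8
# ( patches, d ) activations; arange makes the indexing easy to follow.
activations = np.arange(n_patches * d).reshape(n_patches, d)

i = 3
feature_vector = activations[:, i]     # i-th entry collected from each patch
# feature_vector has one entry per patch, so it can be reshaped into a
# patch-wise activation map over the image grid.
```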
• 10. 10 ViT Feature Visualization
• Variation regularization
x: input, a: augmentation, TV: total variation, λ: regularization parameter, k: number of samples
Over k augmented samples of the input image, the feature vector's activation is maximized while the total variation is penalized (weighted by λ), so the optimization converges to the input image that most strongly activates the chosen layer.
• Best Augmentation combination
GS: Gaussian Smoothing, CS: Color Shift, Jitter: random brightness, contrast, saturation
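A hedged sketch of this objective: over augmented copies a(x) of the input, maximize the chosen feature's activation minus λ times the total variation. `feature` and the augmentations below are placeholders, not the authors' actual choices:

```python
import numpy as np

def tv(img):
    # Anisotropic total variation: sum of absolute neighbor differences.
    return np.abs(np.diff(img, axis=0)).sum() + np.abs(np.diff(img, axis=1)).sum()

def visualization_loss(x, feature, augmentations, lam=1e-4):
    """Negated objective, so gradient *descent* on this loss performs
    ascent on the feature activation across all k augmented samples."""
    total = 0.0
    for a in augmentations:
        xa = a(x)
        total += feature(xa) - lam * tv(xa)
    return -total
```

An input that activates the feature strongly (here, simply the pixel sum) yields a lower loss than one that does not.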
• 11. 11 ViT Feature Visualization
• Multi-headed attention layer visualization
• ViT feed-forward layer: 768 -> 3072 -> 768
Because features in the low-dimensional key, query, and value projections were not interpretable, visualization targets the 3072-dimensional activations after FC1.
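A shape-level sketch of the ViT feed-forward block, marking the tensor that gets visualized (the 3072-dim hidden activations after FC1; weights here are random stand-ins):

```python
import numpy as np

d_model, d_hidden = 768, 3072
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_hidden)) * 0.02   # FC1: 768 -> 3072
W2 = rng.standard_normal((d_hidden, d_model)) * 0.02   # FC2: 3072 -> 768

def gelu(x):
    # tanh approximation of GELU, as commonly used in ViT MLP blocks
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

tokens = rng.standard_normal((197, d_model))           # CLS + 196 patch tokens
hidden = gelu(tokens @ W1)                             # ( 197, 3072 )  <- visualized
out = hidden @ W2                                      # ( 197, 768 )
```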
• 15. Last-Layer Token Mixing
• Feature activation map
- The activation maps suggest that ViT features convey information based on the overall characteristics of each object, much like a saliency map.
* Saliency map: visualizes the points of large pixel change, separating objects of interest from regions of no interest.
Up to the second-to-last layer, each patch learns to preserve spatial information much as a CNN does; the globalization performed by the last layer therefore needs a closer look.
• 16. Last-Layer Token Mixing
• Globalization
- The figure shows the inputs that most strongly activate a "shopping cart" feature in the last layer.
- The activation pattern on the right is uniformly activated across the whole image.
- ViT uses only the CLS token to produce the classification result; this shows that the CLS token performs globalization over all tokens.
Left: our method, Middle: most activated image, Right: activation map
Two follow-up experiments examine last-layer globalization.
• 17. Last-Layer Token Mixing
• Isolating CLS
In layers 1-11, attention is computed with the CLS token excluded; attention involving the CLS token is applied only in the final layer (12). Even when classification then uses only the CLS token, the performance drop is small.
-> The CLS token's globalization over all tokens takes place in the last layer.
• Experiment 1
The CLS token plays a relatively minor role throughout the network and is not used for globalization until the last layer.
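One way the CLS-isolation experiment can be implemented (an assumed sketch, not the authors' code) is with a per-layer attention mask that cuts the CLS token off from the patches in all but the last layer:

```python
import numpy as np

n_tokens = 5                                  # 1 CLS token + 4 patches (toy size)

def attn_mask(layer, n_layers=12):
    """Boolean attention mask for the given layer (True = attention allowed).
    In layers 0..n_layers-2, the CLS token (index 0) neither attends to the
    patches nor is attended to; in the final layer, full mixing is restored."""
    mask = np.ones((n_tokens, n_tokens), dtype=bool)
    if layer < n_layers - 1:
        mask[0, 1:] = False                   # CLS attends only to itself
        mask[1:, 0] = False                   # patches ignore CLS
    return mask
```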
• 18. Last-Layer Token Mixing
• Patch Average & Patch Maximum
Without any fine-tuning, replacing the CLS token with the average or maximum over the patch tokens causes only a small performance drop.
-> Most patches carry roughly the same information.
• Experiment 2
The last-layer globalization is not exclusive to the CLS token, but actually occurs across every patch.
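The pooling variants above are a one-liner each: pool the final patch tokens and feed the unchanged classifier head (random stand-in weights; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, d, n_classes = 196, 768, 1000
patch_tokens = rng.standard_normal((n_patches, d))     # last-layer patch tokens
W_head = rng.standard_normal((d, n_classes)) * 0.02    # frozen classifier head

logits_avg = patch_tokens.mean(axis=0) @ W_head        # Patch Average
logits_max = patch_tokens.max(axis=0) @ W_head         # Patch Maximum
```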
• 20. Comparison of ViTs and CNNs
• Similarity
CNNs detect color, edges, and texture in early layers and progressively more complex object structure in deeper layers; ViTs behave similarly.
(Top images: CNNs, bottom images: ViTs)
• 21. Comparison of ViTs and CNNs
• Difference
• Foreground & Background masking test
Unlike CNNs, ViTs perform markedly better when only the foreground, or only the background, is kept.
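The masking test can be sketched with a binary segmentation mask (an assumed setup; the mask and image here are toys): keep one region and zero out the other before evaluating the classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((4, 4, 3))                    # toy H x W x C image
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                          # toy foreground region

foreground_only = img * mask[..., None]        # background pixels zeroed
background_only = img * (~mask)[..., None]     # foreground pixels zeroed
```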
• 22. Comparison of ViTs and CNNs
• Difference
• Low-pass filtering test
Unlike CNNs, ViTs maintain their performance comparatively well even when high-frequency information is removed.
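The low-pass test can be sketched with an FFT-based filter (my own minimal version, not the authors' exact filter): frequencies above a cutoff radius are zeroed before inverting the transform.

```python
import numpy as np

def low_pass(img, cutoff):
    """Zero out spatial frequencies whose normalized radius exceeds `cutoff`
    (1.0 is the Nyquist frequency along each axis), removing the
    high-frequency detail the robustness test deprives the model of."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
    f[radius > cutoff] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))
```

With a generous cutoff the image passes through unchanged; with a small one, fine detail (and hence pixel variance) is stripped away.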
  • 23. 23 ViTs with Language Model Supervision
• 24. ViTs with Language Model Supervision
• Visualization with Language Model Supervision
While a plain ViT learns features of specific objects, a CLIP model trained with language model supervision visualizes not only nouns but also the meanings of prepositions, adjectives, and adverbs.
• 25. 25 Discussion
1. The authors introduce a framework for optimization-based feature visualization and identify which components of a ViT are most amenable to producing interpretable images, finding that the high-dimensional inner projection of the feed-forward layer is suitable while the key, query, and value features of self-attention are not.
2. The authors observe that ViTs preserve spatial information of the patches, even for individual channels, across all layers except the last, indicating that the networks learn spatial relationships from scratch. They further show that the sudden disappearance of localization information in the last attention layer results from a learned token-mixing behavior that resembles average pooling.
3. The authors find that ViTs make better use of background information and make vastly superior predictions relative to CNNs when exposed only to image backgrounds. This holds despite the seemingly counter-intuitive property that ViTs are less sensitive than CNNs to the loss of high-frequency information, which one might expect to be critical for making effective use of backgrounds.
4. The authors show that ViTs trained with language model supervision learn more semantic and conceptual features, rather than the object-specific visual features typical of classifiers.
• 27. 27 Reference
- Saliency map: https://junstar92.tistory.com/153
- Feature visualization: https://distill.pub/2017/feature-visualization/