DaViT.pdf


This paper proposes DaViT, a vision transformer architecture that uses both spatial and channel attention to efficiently capture global context. Spatial attention performs local interactions across spatial locations while channel attention captures global representations by attending to all spatial positions across channels. Together, they complement each other to achieve state-of-the-art performance on image classification, semantic segmentation, and object detection tasks, with linear computational complexity scaling to high-resolution inputs.

Paper review: 2023.01.19
Uploaded on arXiv: April 2022

Background
• Transformers model global context well, but the computational complexity of self-attention grows quadratically with the number of tokens.
• This limits their ability to scale to high-resolution inputs.
• Attention restricted to spatially local windows achieves linear complexity, but loses global contextual information.
• The goal is an architecture that captures global context while remaining efficient.
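
To make the scaling argument concrete, here is a minimal sketch that counts pairwise attention scores for global and windowed attention as the input resolution doubles. The patch size of 4 and the 7x7 window are illustrative assumptions, not values taken from the slides, and the helper name `attention_score_counts` is hypothetical.

```python
def attention_score_counts(height, width, patch=4, window=7):
    """Count pairwise attention scores for global vs. windowed attention.

    patch and window are illustrative assumptions, not values from the slides.
    """
    rows, cols = height // patch, width // patch
    num_tokens = rows * cols                        # P: number of spatial tokens
    tokens_per_window = window * window             # Pw: tokens inside one window
    global_scores = num_tokens * num_tokens         # every token attends to every token
    window_scores = num_tokens * tokens_per_window  # each token attends only within its window
    return num_tokens, global_scores, window_scores


for side in (224, 448):
    p, g, w = attention_score_counts(side, side)
    print(f"{side}x{side}: P={p}, global scores={g}, windowed scores={w}")
```

Doubling the side length quadruples the token count, so the global count grows by a factor of 16 while the windowed count grows only by a factor of 4.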

Introduction
• DaViT (Dual Attention Vision Transformer) is a vision transformer architecture that captures global context while maintaining computational efficiency.
• It applies self-attention to both “spatial tokens” and “channel tokens”.
• With spatial tokens, the spatial dimension defines the token scope and the channel dimension defines the token feature dimension.
• With channel tokens, the roles are reversed: the channel dimension defines the token scope and the spatial dimension defines the token feature dimension.
• For both spatial and channel tokens, the tokens are further grouped along the sequence direction to keep the complexity of the entire model linear.
• The two self-attentions complement each other.
• Each channel token contains an abstract representation of the entire image, so channel attention naturally captures global interactions and representations: it takes all spatial positions into account when computing attention scores between channels.
• Spatial attention refines the local representations through fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention.
• DaViT achieved state-of-the-art performance on four different tasks with efficient computation.
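
The difference between spatial and channel tokens can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: it omits the learned query/key/value projections and multi-head splitting, treats windows as contiguous chunks of a flattened token sequence rather than 2-D image windows, and uses a `1/sqrt(feature_dim)` scaling in both cases. All function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention; tokens lie along the second-to-last axis."""
    scale = q.shape[-1] ** -0.5  # assumption: scale by the token feature dimension
    weights = softmax(q @ np.swapaxes(k, -1, -2) * scale, axis=-1)
    return weights @ v


def spatial_window_attention(x, tokens_per_window):
    """x: (P, C). Tokens are spatial positions, grouped into windows of Pw tokens."""
    P, C = x.shape
    windows = x.reshape(P // tokens_per_window, tokens_per_window, C)  # (Nw, Pw, C)
    out = attention(windows, windows, windows)  # one (Pw x Pw) score matrix per window
    return out.reshape(P, C)


def channel_group_attention(x, num_groups):
    """x: (P, C). Tokens are channels, grouped; each token's features are all P positions."""
    P, C = x.shape
    groups = x.T.reshape(num_groups, C // num_groups, P)  # (Ng, Cg, P)
    out = attention(groups, groups, groups)  # one (Cg x Cg) score matrix per group
    return out.reshape(C, P).T


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))  # P = 16 spatial tokens, C = 8 channels
y = x.copy()
y[0] += 1.0                       # perturb a single spatial position (inside window 0)

d_spatial = np.abs(spatial_window_attention(y, 4) - spatial_window_attention(x, 4))
d_channel = np.abs(channel_group_attention(y, 2) - channel_group_attention(x, 2))
print("spatial: max change outside the perturbed window:", d_spatial[4:].max())
print("channel: max change outside the perturbed window:", d_channel[4:].max())
```

Perturbing one position leaves the spatial-window output untouched outside that window, whereas the channel-attention output changes at every position because the channel-to-channel scores are computed over all spatial positions.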

Attention
• Standard global self-attention: complexity of $O(2P^2C + 4PC^2)$, quadratic in the spatial size $P$
• Spatial window-based self-attention: complexity of $O(2PP_wC + 4PC^2)$, linear in the spatial size $P$
• Channel group attention: complexity of $O(6PC^2)$, linear in the spatial size $P$
$P$: number of spatial tokens (patches), $C$: number of channels, $P_w$: tokens per window, $N_w$: number of windows, $N_g$: number of channel groups, $C_g$: channels per group, $C_h$: channels per head
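
The three complexity expressions can be evaluated directly to see how each scales with the spatial size. The sketch below encodes the formulas as listed on the slide; the function names and the sample values of P, C, and Pw are illustrative assumptions.

```python
def global_attention_cost(P, C):
    """Standard global self-attention: 2*P^2*C + 4*P*C^2."""
    return 2 * P**2 * C + 4 * P * C**2


def window_attention_cost(P, C, Pw):
    """Spatial window-based self-attention: 2*P*Pw*C + 4*P*C^2."""
    return 2 * P * Pw * C + 4 * P * C**2


def channel_attention_cost(P, C):
    """Channel attention: 6*P*C^2."""
    return 6 * P * C**2


C, Pw = 96, 49  # hypothetical channel count and tokens per window
for P in (3136, 12544):  # hypothetical token counts; the second is 4x the first
    print(
        f"P={P}: global={global_attention_cost(P, C):.3e}, "
        f"window={window_attention_cost(P, C, Pw):.3e}, "
        f"channel={channel_attention_cost(P, C):.3e}"
    )
```

With C and Pw held fixed, the window and channel costs scale in direct proportion to P, while the global cost is dominated by its P² term at large P.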

Comparisons of efficiency vs. performance

Results – Image Classification and Semantic Segmentation

1. Paper review
2. Background
3. Introduction
4. Spatial and Channel Dual Attention
5. Attention
6. Dual Attention Block Architecture
7. Comparisons of efficiency vs. performance
8. Results – Image Classification and Semantic Segmentation
9. Results – Object Detection