In-Datacenter Performance Analysis of a
Tensor Processing Unit™
6th May, 2018
PR12 Paper Review
Jinwon Lee
Samsung Electronics
References
Most figures and slides are from
 Norman P. Jouppi, et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", 44th IEEE/ACM International Symposium on Computer Architecture (ISCA-44), Toronto, Canada, June 2017.
https://arxiv.org/abs/1704.04760
 David Patterson, "Evaluation of the Tensor Processing Unit: A Deep Neural Network Accelerator for the Datacenter", NAE Regional Meeting, April 2017.
https://sites.google.com/view/naeregionalsymposium
 Kaz Sato, "An in-depth look at Google's first Tensor Processing Unit (TPU)",
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
Authors
A Golden Age in Microprocessor Design
• Stunning progress in microprocessor design: 40 years ≈ 10^6x faster!
• Three architectural innovations (~1000x)
 Width: 8163264 bit (~8x)
 Instruction-level parallelism:
from 4–10 clock cycles per instruction to 4+ instructions per clock cycle (~10–20x)
 Multicore: 1 processor to 16 cores (~16x)
• Clock rate: 3 to 4000 MHz (~1000x thru technology & architecture)
• Made possible by IC technology:
 Moore’s Law: growth in transistor count (2X every 1.5 years)
 Dennard Scaling: power/transistor shrinks at same rate as transistors are
added (constant per mm2 of silicon)
End of Growth of Performance?
What’s Left?
• Since
 Transistors not getting much better
 Power budget not getting much higher
 Already switched from 1 inefficient processor/chip to N efficient
processors/chip
• Only path left is Domain Specific Architectures
 Just do a few tasks, but extremely well
TPU Origin
• Starting as far back as 2006, Google engineers had discussions about
deploying GPUs, FPGAs, or custom ASICs in their data centers. They
concluded that they could use the excess capacity of their large data
centers.
• The conversation changed in 2013 when it was projected that if
people used voice search for 3 minutes a day using speech
recognition DNNs, it would have required Google’s data centers to
double in order to meet computation demands.
• Google then started a high-priority project to quickly produce a
custom ASIC for inference.
• The goal was to improve cost-performance by 10x over GPUs.
• Given this mandate, the TPU was designed, verified, built, and
deployed in data centers in just 15 months.
TPU
• Built on a 28 nm process
• Runs at 700 MHz
• Consumes 40 W when
running
• Connected to its host via a
PCIe Gen3 x16 bus
• TPU card fits in a disk drive slot
• Up to 4 cards / server
3 Kinds of Popular NNs
• Multi-Layer Perceptrons (MLP)
 Each new layer is a set of nonlinear functions of weighted sums of all outputs
(fully connected) from the prior layer
• Convolutional Neural Networks (CNN)
 Each ensuing layer is a set of nonlinear functions of weighted sums of
spatially nearby subsets of outputs from the prior layer, which also reuses the
weights
• Recurrent Neural Networks (RNN)
 Each subsequent layer is a collection of nonlinear functions of weighted sums
of outputs and the previous state. The most popular RNN is Long Short-Term
Memory (LSTM).
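The MLP case above can be sketched in a few lines: one fully connected layer is just a nonlinearity applied to a weighted sum of all prior-layer outputs. ReLU is used here as an illustrative choice; the slide does not specify a particular nonlinearity.

```python
import numpy as np

def mlp_layer(x, W, b):
    # One fully connected MLP layer: a nonlinearity (ReLU here) applied
    # to a weighted sum of ALL outputs of the prior layer.
    return np.maximum(0.0, W @ x + b)

# Example: 2 inputs -> 2 outputs
y = mlp_layer(np.array([1.0, -1.0]),
              np.array([[1.0, 1.0],
                        [2.0, 0.0]]),
              np.zeros(2))
```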
Inference Datacenter Workload (95%)
TPU Architecture and Implementation
• Add as accelerators to existing servers
 So connect over I/O bus ("PCIe")
 TPU ≈ matrix accelerator on I/O bus
• Host server sends it instructions like a Floating Point Unit
 Unlike a GPU, which fetches and executes its own instructions
• The goal was to run whole inference models in the TPU to reduce
interactions with the host CPU and to be flexible enough to match
the NN needs of 2015 and beyond
TPU Block Diagram
TPU High Level Architecture
• Matrix Multiply Unit is the heart of the TPU
 65,536 (256×256) 8-bit MAC units
 The matrix unit holds one 64 KiB tile of weights
plus one for double-buffering
 >25x as many MACs vs GPU, >100x as many MACs vs CPU
• Peak performance: 92TOPS = 65,536 x 2 x 700M
• The 16-bit products are collected in the 4 MiB of 32-bit Accumulators below
the matrix unit.
 The 4 MiB holds 4096 256-element, 32-bit accumulators
 operations/byte @ peak performance: 1350 → round up to 2048 →
double buffering → 4096
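The arithmetic behind these figures checks out directly (all numbers are taken from the slide):

```python
# Peak throughput: 65,536 MACs x 2 ops (multiply + add) x 700 MHz.
MACS = 256 * 256                  # 65,536 8-bit MAC units
CLOCK_HZ = 700e6                  # 700 MHz
PEAK_OPS = MACS * 2 * CLOCK_HZ    # ~92 TOPS

# Accumulator sizing: ops/byte at peak (~1350), rounded up to a power
# of two (2048), then doubled for double-buffering -> 4096 accumulators,
# each 256 elements of 32 bits -> 4 MiB total.
ACCUMULATORS = 4096
ACC_BYTES = ACCUMULATORS * 256 * 4
```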
TPU High Level Architecture
• The weights for the matrix unit are staged
through an on-chip Weight FIFO that reads
from an off-chip 8 GiB DRAM called Weight Memory
 Two 2133MHz DDR3 DRAM channels
 for inference, weights are read-only
 8 GiB supports many simultaneously active models
• The intermediate results are held in the 24 MiB on-chip Unified Buffer,
which can serve as inputs to the Matrix Unit
 The 24 MiB size was picked in part to match the pitch of the Matrix Unit on the die
and, given the short development schedule, in part to simplify the compiler
Floorplan of TPU Die
• The Unified Buffer is
almost a third of the die
• Matrix Multiply Unit is a
quarter
• Control is just 2%
RISC, CISC and the TPU Instruction Set
• Most modern CPUs are heavily influenced by the Reduced Instruction
Set Computer (RISC) design style
 With RISC, the focus is to define simple instructions (e.g., load, store, add
and multiply) that are commonly used by the majority of applications and
then to execute those instructions as fast as possible.
• A Complex Instruction Set Computer (CISC) design focuses on
implementing high-level instructions that run more complex tasks
(such as calculating multiply-and-add many times) with each
instruction.
 The average clock cycles per instruction (CPI) of these CISC instructions is
typically 10 to 20
• The TPU chose the CISC style
TPU Instructions
• It has about a dozen instructions overall, but the five below are the key ones
TPU Instructions
• The CISC MatrixMultiply instruction is 12 bytes
 3 are Unified Buffer address; 2 are accumulator address; 4 are length
(sometimes 2 dimensions for convolutions); and the rest are opcode and
flags.
• Average clock cycles per instruction : > 10
• 4-stage overlapped execution, 1 instruction type / stage
 Execute other instructions while matrix multiplier busy
• Complexity in SW
 No branches, in-order issue, SW controlled buffers, SW controlled pipeline
synchronization
Systolic Execution in Matrix Array
• Problem : Reading a large SRAM uses much more power than
arithmetic
• Solution : Using “Systolic Execution” to save energy by reducing
reads and writes of the Unified Buffer
• A systolic array is a two-dimensional collection of arithmetic units
that each independently compute a partial result as a function of
inputs from other arithmetic units considered upstream of them
• It is similar to blood being pumped through the human circulatory
system by the heart, which is the origin of the name "systolic"
Systolic Array (Example – vector input)
Systolic Array (Example – matrix input)
TPU Systolic Array
• In the TPU, the systolic array is
rotated
• Weights are loaded from the top
and the input data flows into the
array in from the left
• Weights are preloaded and take
effect with the advancing wave
alongside the first data of a new
block
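The dataflow above can be simulated cycle by cycle. The sketch below is an illustrative model of a weight-stationary systolic array (not the real microarchitecture): weights sit in place, activations enter from the left skewed one cycle per row, and partial sums ripple down each column.

```python
import numpy as np

def systolic_matvec(W, x):
    """Cycle-stepped sketch of a weight-stationary systolic array.

    W[i, j] is preloaded into PE (i, j). Activations enter from the
    left with a one-cycle skew per row; partial sums flow downward,
    so column j emits y[j] = sum_i x[i] * W[i, j], i.e. y = x @ W.
    """
    n, m = W.shape
    x_reg = np.zeros((n, m))   # activation registers (shift right each cycle)
    psum = np.zeros((n, m))    # partial-sum registers (shift down each cycle)
    y = np.zeros(m)
    for t in range(n + m - 1):
        # Activations advance one PE to the right; row i is fed x[i] at cycle i.
        x_reg = np.roll(x_reg, 1, axis=1)
        x_reg[:, 0] = 0.0
        if t < n:
            x_reg[t, 0] = x[t]
        # Each PE adds its MAC result to the partial sum arriving from above.
        from_above = np.vstack([np.zeros(m), psum[:-1]])
        psum = from_above + W * x_reg
        # PE (n-1, j) finishes column j's sum at cycle (n-1) + j.
        for j in range(m):
            if t == n - 1 + j:
                y[j] = psum[n - 1, j]
    return y
```

Note that no Unified Buffer read is needed per MAC: each activation is read once and reused as it marches across the row, which is the energy argument on the previous slide.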
Software Stack
• Software stack is split into a User Space
Driver and a Kernel Driver.
• The Kernel Driver is lightweight and
handles only memory management
and interrupts.
• The User Space driver changes
frequently. It sets up and controls TPU
execution, reformats data into TPU
order, translates API calls into TPU
instructions, and turns them into an
application binary.
Relative Performance: 3 Contemporary Chips
*TPU is less than half the die size of the Intel Haswell processor
• K80 and TPU are built in a 28 nm process; Haswell is fabbed in Intel's 22 nm process
• These chips and platforms were chosen for comparison because they are widely deployed in
Google data centers
Relative Performance: 3 Platforms
• These chips and platforms chosen for comparison because widely
deployed in Google data centers
Performance Comparison
• Roofline Performance model
 This simple visual model is not perfect, yet
it offers insights on the causes of
performance bottlenecks.
 The Y-axis is performance in floating-point
operations per second, thus the peak
computation rate forms the “flat” part of
the roofline.
 The X-axis is operational intensity,
measured as floating-point operations per
DRAM byte accessed.
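The model reduces to one formula: attainable throughput is the lesser of the compute ceiling and bandwidth times operational intensity. The numbers below are the slide's TPU figures, used here purely for illustration.

```python
def roofline(peak_ops, mem_bw, intensity):
    # Roofline model: attainable ops/s is capped by either the compute
    # peak (the "flat" part) or bandwidth * intensity (the "slanted" part).
    return min(peak_ops, mem_bw * intensity)

PEAK = 92e12   # ~92 TOPS (TPU peak from the earlier slide)
BW = 34e9      # 34 GB/s weight-memory bandwidth

low = roofline(PEAK, BW, 10)      # low intensity: memory bound
high = roofline(PEAK, BW, 1e5)    # high intensity: compute bound
```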
TPU Die Roofline
• The TPU has a long "slanted" part of
its roofline, where low operational
intensity means that performance is
limited by memory bandwidth.
• Five of the six applications are
happily bumping their heads against
the ceiling
• MLPs and LSTMs are memory bound,
and CNNs are computation bound.
CPU & GPU Rooflines
Log Rooflines for CPU, GPU and TPU
Linear Rooflines for CPU, GPU and TPU
Why So Far Below Rooflines? (MLP0)
• Response time is the reason
• Researchers have demonstrated that small increases in response
time cause customers to use a service less
• Inference prefers latency over throughput
TPU & GPU Relative Performance to CPU
• GM: Geometric Mean
• WM: Weighted Mean
Performance / Watt
Improving TPU: Move the "Ridge Point" to the Left
• Current DRAM
 2× DDR3 2133 MHz → 34 GB/s
• Replace with GDDR5 like in the K80
 BW: 34 GB/s → 180 GB/s
 Moves the ridge point from 1350 to 250
 This improvement would expand die size by about 10%. However, higher
memory bandwidth reduces pressure on the Unified Buffer, so reducing the
Unified Buffer to 14 MiB could gain back 10% in area.
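The ridge-point shift follows from the roofline model: at a fixed compute peak, the ridge point (the intensity where the roofline flattens) scales inversely with memory bandwidth. Plugging in the slide's figures:

```python
# Ridge point scales as 1/bandwidth at a fixed compute peak.
OLD_BW, NEW_BW = 34e9, 180e9      # DDR3 -> GDDR5, from the slide
OLD_RIDGE = 1350                   # ops/byte
NEW_RIDGE = OLD_RIDGE * OLD_BW / NEW_BW   # ~255, i.e. roughly the 250 quoted
```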
Maximum MiB of the 24 MiB Unified Buffer used per NN app
Revised TPU Raised Roofline
Performance / Watt: Original & Revised TPU
Overall Performance / Watt
Energy Proportionality
Evaluation of TPU Designs
• The table below shows the differences between the model results and
the hardware performance counters, which average below 10%.
Weighted Mean TPU Relative Performance
Weighted Mean TPU Relative Performance
• First, increasing memory bandwidth (memory) has the biggest
impact: performance improves 3X on average when memory bandwidth
increases 4X
• Second, clock rate has little benefit on average, with or without more
accumulators. The reason is that MLPs and LSTMs are memory bound;
only the CNNs are compute bound
 Increasing the clock rate by 4X has almost no impact on MLPs and LSTMs
but improves performance of CNNs by about 2X
• Third, average performance slightly degrades when the matrix
unit expands from 256×256 to 512×512 for all apps
 The issue is analogous to internal fragmentation of large pages

PR-132: SSD: Single Shot MultiBox Detector
 
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
PR-120: ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture De...
 

  • 1. In-Datacenter Performance Analysis of a Tensor Processing Unit™
    6th May 2018, PR12 Paper Review
    Jinwon Lee, Samsung Electronics
  • 2. References — most figures and slides are from:
     Norman P. Jouppi, et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", 44th IEEE/ACM International Symposium on Computer Architecture (ISCA-44), Toronto, Canada, June 2017. https://arxiv.org/abs/1704.04760
     David Patterson, "Evaluation of the Tensor Processing Unit: A Deep Neural Network Accelerator for the Datacenter", NAE Regional Meeting, April 2017. https://sites.google.com/view/naeregionalsymposium
     Kaz Sato, "An in-depth look at Google's first Tensor Processing Unit (TPU)", https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
  • 4. A Golden Age in Microprocessor Design
    • Stunning progress in microprocessor design: 40 years ≈ 10⁶x faster!
    • Three architectural innovations (~1000x)
       Width: 8→16→32→64 bit (~8x)
       Instruction-level parallelism: 4–10 clock cycles per instruction to 4+ instructions per clock cycle (~10–20x)
       Multicore: 1 processor to 16 cores (~16x)
    • Clock rate: 3 to 4000 MHz (~1000x through technology & architecture)
    • Made possible by IC technology:
       Moore's Law: growth in transistor count (2x every 1.5 years)
       Dennard Scaling: power per transistor shrinks at the same rate as transistors are added (constant power per mm² of silicon)
  • 5. End of Growth of Performance?
  • 6. What's Left?
    • Since
       transistors are not getting much better,
       the power budget is not getting much higher, and
       we have already switched from 1 inefficient processor/chip to N efficient processors/chip,
    • the only path left is Domain-Specific Architectures
       Just do a few tasks, but extremely well
  • 7. TPU Origin
    • Starting as far back as 2006, Google engineers discussed deploying GPUs, FPGAs, or custom ASICs in their data centers. They concluded that they could use the excess capacity of the large data centers.
    • The conversation changed in 2013, when it was projected that if people used voice search for 3 minutes a day via speech-recognition DNNs, Google's data centers would have to double to meet the computation demand.
    • Google then started a high-priority project to quickly produce a custom ASIC for inference.
    • The goal was to improve cost-performance by 10x over GPUs.
    • Given this mandate, the TPU was designed, verified, built, and deployed in data centers in just 15 months.
  • 8. TPU
    • Built on a 28 nm process
    • Runs @ 700 MHz
    • Consumes 40 W when running
    • Connected to its host via a PCIe Gen3 x16 bus
    • The TPU card is sized to replace a disk
    • Up to 4 cards per server
  • 9. 3 Kinds of Popular NNs
    • Multi-Layer Perceptrons (MLP)
       Each new layer is a set of nonlinear functions of a weighted sum of all outputs (fully connected) from the prior layer
    • Convolutional Neural Networks (CNN)
       Each ensuing layer is a set of nonlinear functions of weighted sums of spatially nearby subsets of outputs from the prior layer, which also reuses the weights
    • Recurrent Neural Networks (RNN)
       Each subsequent layer is a collection of nonlinear functions of weighted sums of outputs and the previous state. The most popular RNN is Long Short-Term Memory (LSTM)
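The three weighted-sum patterns above can be contrasted in a few lines of NumPy. This is a didactic sketch — the function names and shapes are my own, not the benchmarked models:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

# MLP layer: nonlinear function of a weighted sum of ALL prior outputs.
def mlp_layer(W, x):                 # W: (out, in), x: (in,)
    return relu(W @ x)

# CNN layer (1-D sketch): the SAME small weight kernel is reused
# across spatially nearby subsets of the input.
def conv_layer(w, x):                # w: (k,), x: (n,)
    k = len(w)
    return relu(np.array([w @ x[i:i + k] for i in range(len(x) - k + 1)]))

# RNN step: weighted sum of the current input AND the previous state.
def rnn_step(Wx, Wh, x, h_prev):
    return np.tanh(Wx @ x + Wh @ h_prev)
```

The key distinction for the hardware is weight reuse: the MLP reads each weight once per input, while the CNN slides one kernel over the whole input, which is why their operational intensities differ so much later in the talk.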
  • 11. TPU Architecture and Implementation
    • Added as an accelerator to existing servers
       Connects over the I/O bus ("PCIe")
       TPU ≈ matrix accelerator on the I/O bus
    • The host server sends it instructions, like a Floating Point Unit
       Unlike a GPU, which fetches and executes its own instructions
    • The goal was to run whole inference models in the TPU, to reduce interactions with the host CPU, and to be flexible enough to match the NN needs of 2015 and beyond
  • 13. TPU High-Level Architecture
    • The Matrix Multiply Unit is the heart of the TPU
       65,536 (256×256) 8-bit MAC units
       The matrix unit holds one 64 KiB tile of weights, plus one for double-buffering
       >25x as many MACs as a GPU, >100x as many MACs as a CPU
    • Peak performance: 92 TOPS = 65,536 × 2 × 700 MHz
    • The 16-bit products are collected in the 4 MiB of 32-bit Accumulators below the matrix unit
       4 MiB = 4096 accumulators × 256 elements × 32 bits
       Operations/byte at peak performance ≈ 1350 → rounded up to 2048 → doubled for double-buffering → 4096
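The peak-rate arithmetic on this slide can be checked directly (the factor of 2 counts each MAC as one multiply plus one add):

```python
# Peak throughput of the TPU matrix unit, per the slide's numbers.
macs = 256 * 256        # 65,536 8-bit MAC units
ops_per_mac = 2         # one multiply + one add per MAC per cycle
clock_hz = 700e6        # 700 MHz

peak_tops = macs * ops_per_mac * clock_hz / 1e12
print(round(peak_tops, 1))   # 91.8, quoted as 92 TOPS
```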
  • 14. TPU High-Level Architecture
    • The weights for the matrix unit are staged through an on-chip Weight FIFO that reads from an off-chip 8 GiB DRAM called Weight Memory
       Two 2133 MHz DDR3 DRAM channels
       For inference, weights are read-only
       8 GiB supports many simultaneously active models
    • Intermediate results are held in the 24 MiB on-chip Unified Buffer, which can serve as inputs to the Matrix Unit
       The 24 MiB size was picked in part to match the pitch of the Matrix Unit on the die and, given the short development schedule, in part to simplify the compiler
  • 15. Floorplan of the TPU Die
    • The Unified Buffer is almost a third of the die
    • The Matrix Multiply Unit is a quarter
    • Control is just 2%
  • 16. RISC, CISC, and the TPU Instruction Set
    • Most modern CPUs are heavily influenced by the Reduced Instruction Set Computer (RISC) design style
       With RISC, the focus is on defining simple instructions (e.g., load, store, add, and multiply) that are commonly used by the majority of applications, then executing those instructions as fast as possible
    • A Complex Instruction Set Computer (CISC) design focuses on implementing high-level instructions that run more complex tasks (such as calculating multiply-and-add many times) with each instruction
       The average clock cycles per instruction (CPI) of these CISC instructions is typically 10 to 20
    • The TPU chose the CISC style
  • 17. TPU Instructions
    • It has about a dozen instructions overall, but the five below are the key ones
  • 18. TPU Instructions
    • The CISC MatrixMultiply instruction is 12 bytes
       3 bytes of Unified Buffer address; 2 of accumulator address; 4 of length (sometimes 2 dimensions for convolutions); and the rest opcode and flags
    • Average clock cycles per instruction: >10
    • 4-stage overlapped execution, 1 instruction type per stage
       Executes other instructions while the matrix multiplier is busy
    • Complexity lives in software
       No branches, in-order issue, SW-controlled buffers, SW-controlled pipeline synchronization
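The 12-byte field budget can be made concrete with a packing sketch. The field sizes are from the slide, but the field order and bit layout below are invented for illustration — the real encoding is not public:

```python
import struct

def encode_matmul(opcode, flags, ub_addr, acc_addr, length):
    """Pack a 12-byte MatrixMultiply-style instruction: 3 bytes of
    Unified Buffer address, 2 of accumulator address, 4 of length,
    and 3 of opcode/flags. Layout here is hypothetical; only the
    field sizes come from the slide."""
    return (bytes([opcode]) +
            struct.pack("<H", flags) +           # 2 flag bytes
            ub_addr.to_bytes(3, "little") +      # 3-byte UB address
            struct.pack("<H", acc_addr) +        # 2-byte accumulator address
            struct.pack("<I", length))           # 4-byte length

insn = encode_matmul(opcode=0x3, flags=0, ub_addr=0x001000, acc_addr=0x20, length=256)
print(len(insn))    # 12
```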
  • 19. Systolic Execution in the Matrix Array
    • Problem: reading a large SRAM uses much more power than arithmetic
    • Solution: use "systolic execution" to save energy by reducing reads and writes of the Unified Buffer
    • A systolic array is a two-dimensional collection of arithmetic units, each of which independently computes a partial result as a function of inputs from the arithmetic units upstream of it
    • It is similar to blood being pumped through the human circulatory system by the heart, which is the origin of the name "systolic"
  • 20. Systolic Array (Example: Vector Input)
  • 21. Systolic Array (Example: Matrix Input)
  • 22. TPU Systolic Array
    • In the TPU, the systolic array is rotated
    • Weights are loaded from the top, and the input data flows into the array from the left
    • Weights are preloaded and take effect with the advancing wave alongside the first data of a new block
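The weight-stationary dataflow sketched on slides 19–22 can be emulated in a few lines. This toy model collapses the cycle-by-cycle skew — each loop iteration plays the role of one diagonal wavefront — and is not the real microarchitecture:

```python
import numpy as np

def systolic_matmul(W, X):
    """Toy weight-stationary systolic multiply: weights stay fixed in
    the PEs, activations stream in from one side, and partial sums
    accumulate as they flow through the array."""
    n, k = W.shape
    k2, m = X.shape
    assert k == k2, "inner dimensions must match"
    acc = np.zeros((n, m))
    for t in range(k):
        # PE column t holds weights W[:, t]; as activation row X[t, :]
        # passes through, each PE adds its product to the running sum.
        acc += np.outer(W[:, t], X[t, :])
    return acc
```

Summing one rank-1 update per wavefront reproduces W @ X exactly, which is why each operand only needs to be read from the Unified Buffer once — the point of the energy argument on slide 19.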
  • 23. Software Stack
    • The software stack is split into a User Space Driver and a Kernel Driver
    • The Kernel Driver is lightweight and handles only memory management and interrupts
    • The User Space Driver changes frequently. It sets up and controls TPU execution, reformats data into TPU order, translates API calls into TPU instructions, and turns them into an application binary
  • 24. Relative Performance: 3 Contemporary Chips
    • The TPU is less than half the die size of the Intel Haswell processor
    • The K80 and TPU are built in a 28 nm process; Haswell is fabbed in Intel's 22 nm process
    • These chips and platforms were chosen for comparison because they were widely deployed in Google data centers
  • 25. Relative Performance: 3 Platforms
    • These chips and platforms were chosen for comparison because they were widely deployed in Google data centers
  • 26. Performance Comparison
    • Roofline performance model
       This simple visual model is not perfect, yet it offers insight into the causes of performance bottlenecks
       The Y-axis is performance in floating-point operations per second, so the peak computation rate forms the "flat" part of the roofline
       The X-axis is operational intensity, measured as floating-point operations per DRAM byte accessed
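The roofline is just a min of two lines, which a small sketch makes explicit (the 92 TOPS and 34 GB/s figures are the TPU numbers from the paper; the intensity values are arbitrary sample points):

```python
def roofline(peak_ops, mem_bw, intensity):
    """Attainable performance under the roofline model:
    min(peak compute, memory bandwidth * operational intensity)."""
    return min(peak_ops, mem_bw * intensity)

# TPU-like numbers: 92 TOPS peak, 34 GB/s DRAM bandwidth.
peak, bw = 92e12, 34e9
print(roofline(peak, bw, 100))     # memory-bound: 3.4e12 ops/s (the "slant")
print(roofline(peak, bw, 3000))    # compute-bound: 9.2e13 ops/s (the "flat")
```

The intensity where the two lines cross is the "ridge point" discussed later: to the left of it an app is bandwidth-limited, to the right it is compute-limited.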
  • 27. TPU Die Roofline
    • The TPU has a long "slanted" part of its roofline, where low operational intensity means performance is limited by memory bandwidth
    • Five of the six applications are happily bumping their heads against the ceiling
    • MLPs and LSTMs are memory-bound, while CNNs are compute-bound
  • 28. CPU & GPU Rooflines
  • 29. Log Rooflines for CPU, GPU, and TPU
  • 30. Linear Rooflines for CPU, GPU, and TPU
  • 31. Why So Far Below the Rooflines? (MLP0)
    • Response time is the reason
    • Researchers have demonstrated that small increases in response time cause customers to use a service less
    • Inference prefers latency over throughput
  • 32. TPU & GPU Performance Relative to CPU
    • GM: Geometric Mean
    • WM: Weighted Mean
  • 34. Improving the TPU: Move the "Ridge Point" to the Left
    • Current DRAM: two DDR3-2133 channels → 34 GB/s
    • Replace with GDDR5, as in the K80
       Bandwidth: 34 GB/s → 180 GB/s
       Moves the ridge point from 1350 to 250 ops/byte
       This improvement would expand die size by about 10%. However, higher memory bandwidth reduces pressure on the Unified Buffer, so reducing the Unified Buffer to 14 MiB could gain back 10% in area
    • [Figure: maximum MiB of the 24 MiB Unified Buffer used per NN app]
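The ridge-point figures on this slide follow from peak MAC rate divided by memory bandwidth; counting MACs (rather than multiply+add pairs) reproduces the slide's numbers:

```python
# Ridge point = peak MAC rate / memory bandwidth (ops per DRAM byte).
mac_rate = 256 * 256 * 700e6     # ~45.9 T MACs/s
ddr3_bw = 34e9                   # two DDR3-2133 channels
gddr5_bw = 180e9                 # K80-style GDDR5

print(round(mac_rate / ddr3_bw))     # ~1349, the slide's 1350
print(round(mac_rate / gddr5_bw))    # ~255, the slide's ~250
```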
  • 39. Evaluation of TPU Designs
    • The table below shows the differences between the model results and the hardware performance counters, which average below 10%
  • 41. Weighted-Mean TPU Relative Performance
    • First, increasing memory bandwidth has the biggest impact: performance improves 3x on average when memory bandwidth increases 4x
    • Second, clock rate has little benefit on average, with or without more accumulators. The reason is that MLPs and LSTMs are memory-bound; only the CNNs are compute-bound
       Increasing the clock rate by 4x has almost no impact on MLPs and LSTMs, but improves CNN performance by about 2x
    • Third, average performance slightly degrades when the matrix unit expands from 256×256 to 512×512 for all apps
       The issue is analogous to internal fragmentation of large pages