2. References
Most figures and slides are from
Norman P. Jouppi, et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", 44th IEEE/ACM International Symposium on Computer Architecture (ISCA-44), Toronto, Canada, June 2017.
https://arxiv.org/abs/1704.04760
David Patterson, "Evaluation of the Tensor Processing Unit: A Deep Neural Network Accelerator for the Datacenter", NAE Regional Meeting, April 2017.
https://sites.google.com/view/naeregionalsymposium
Kaz Sato, "An in-depth look at Google’s first Tensor Processing Unit (TPU)",
https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
4. A Golden Age in Microprocessor Design
• Stunning progress in microprocessor design: 40 years ≈ 10^6x faster!
• Three architectural innovations (~1000x)
Width: 8 → 16 → 32 → 64 bit (~8x)
Instruction level parallelism:
4-10 clock cycles per instruction to 4+ instructions per clock cycle (~10-20x)
Multicore: 1 processor to 16 cores (~16x)
• Clock rate: 3 to 4000 MHz (~1000x thru technology & architecture)
• Made possible by IC technology:
Moore’s Law: growth in transistor count (2X every 1.5 years)
Dennard Scaling: power/transistor shrinks at the same rate as transistors are added (power per mm² of silicon stays constant)
6. What’s Left?
• Since
Transistors not getting much better
Power budget not getting much higher
Already switched from 1 inefficient processor/chip to N efficient
processors/chip
• Only path left is Domain-Specific Architectures
Just do a few tasks, but extremely well
7. TPU Origin
• Starting as far back as 2006, Google engineers had discussions about deploying GPUs, FPGAs, or custom ASICs in their data centers. They concluded that they could instead use the excess capacity of their large data centers.
• The conversation changed in 2013 when it was projected that if
people used voice search for 3 minutes a day using speech
recognition DNNs, it would have required Google’s data centers to
double in order to meet computation demands.
• Google then started a high-priority project to quickly produce a
custom ASIC for inference.
• The goal was to improve cost-performance by 10x over GPUs.
• Given this mandate, the TPU was designed, verified, built, and deployed in data centers in just 15 months.
8. TPU
• Built on a 28 nm process
• Runs @ 700 MHz
• Consumes 40 W when running
• Connected to its host via a PCIe Gen3 x16 bus
• The TPU card fits in a disk slot in the server
• Up to 4 cards / server
9. 3 Kinds of Popular NNs
• Multi-Layer Perceptrons (MLP)
Each new layer is a set of nonlinear functions of weighted sums of all outputs (fully connected) from the prior layer (see the sketch after this slide)
• Convolutional Neural Networks (CNN)
Each ensuing layer is a set of nonlinear functions of weighted sums of spatially nearby subsets of outputs from the prior layer, and the weights are reused
• Recurrent Neural Networks (RNN)
Each subsequent layer is a collection of nonlinear functions of weighted sums of outputs and the previous state. The most popular RNN is Long Short-Term Memory (LSTM).
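To make "nonlinear functions of weighted sums" concrete, here is a minimal NumPy sketch of a fully connected (MLP) layer; the layer sizes and the ReLU nonlinearity are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mlp_layer(x, W, b):
    """One fully connected layer: a nonlinear function of the weighted
    sums of all outputs of the prior layer (ReLU chosen for illustration)."""
    z = x @ W + b            # weighted sums over every prior-layer output
    return np.maximum(z, 0)  # elementwise nonlinearity

# Stack three layers with hypothetical sizes on a batch of 2 inputs.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))
for n_in, n_out in [(8, 16), (16, 16), (16, 4)]:
    x = mlp_layer(x, rng.standard_normal((n_in, n_out)), np.zeros(n_out))
print(x.shape)  # (2, 4)
```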
11. TPU Architecture and Implementation
• Add as accelerators to existing servers
So connect over the I/O bus (“PCIe”)
TPU ≈ matrix accelerator on I/O bus
• Host server sends it instructions like a Floating Point Unit
Unlike a GPU, which fetches and executes its own instructions
• The goal was to run whole inference models in the TPU to reduce interactions with the host CPU and to be flexible enough to match the NN needs of 2015 and beyond
13. TPU High Level Architecture
• Matrix Multiply Unit is the heart of the TPU
65,536 (256x256) 8-bit MAC units
The matrix unit holds one 64 KiB tile of weights
plus one for double-buffering
>25x as many MACs vs GPU, >100x as many MACs vs CPU
• Peak performance: 92 TOPS = 65,536 x 2 x 700 MHz
• The 16-bit products are collected in the 4 MiB of 32-bit Accumulators below
the matrix unit.
The 4 MiB holds 4096 256-element, 32-bit accumulators
Why 4096? Operations/byte needed to reach peak performance ≈ 1350; round up to 2048, then double for double buffering → 4096 (worked out in the sketch below)
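The arithmetic behind these figures can be checked directly; this sketch assumes the 34 GB/s weight-memory bandwidth quoted later in the deck.

```python
macs = 256 * 256                      # 65,536 8-bit MAC units
clock_hz = 700e6                      # 700 MHz
peak_ops = macs * 2 * clock_hz        # 2 ops (multiply + add) per MAC per cycle
print(peak_ops / 1e12)                # ~91.8 TeraOps/s, rounded to "92 TOPS"

# Accumulator depth: ops/byte needed to reach peak performance is ~1350,
# rounded up to a power of two (2048), then doubled for double buffering.
mem_bw = 34e9                         # weight-memory bandwidth, bytes/s (slide 34)
print(macs * clock_hz / mem_bw)       # ~1350 MACs per weight byte
print(2 * 2048)                       # -> 4096 256-element accumulators = 4 MiB
```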
14. TPU High Level Architecture
• The weights for the matrix unit are staged
through an on-chip Weight FIFO that reads
from an off-chip 8 GiB DRAM called Weight Memory
Two 2133 MHz DDR3 DRAM channels (bandwidth checked in the sketch after this slide)
For inference, weights are read-only
8 GiB supports many simultaneously active models
• The intermediate results are held in the 24 MiB on-chip Unified Buffer, which can serve as inputs to the Matrix Unit
The 24 MiB size was picked in part to match the pitch of the Matrix Unit on the die and, given the short development schedule, in part to simplify the compiler
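A rough cross-check of the Weight Memory bandwidth, assuming the standard 64-bit (8-byte) DDR3 data bus per channel:

```python
channels = 2
transfers_per_s = 2133e6      # DDR3-2133: 2133 mega-transfers/s per channel
bytes_per_transfer = 8        # 64-bit channel width
bw = channels * transfers_per_s * bytes_per_transfer
print(bw / 1e9)               # ~34 GB/s, the figure used for the roofline later
```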
15. Floorplan of TPU Die
• The Unified Buffer is
almost a third of the die
• Matrix Multiply Unit is a
quarter
• Control is just 2%
16. RISC, CISC and the TPU Instruction Set
• Most modern CPUs are heavily influenced by the Reduced Instruction
Set Computer (RISC) design style
With RISC, the focus is to define simple instructions (e.g., load, store, add
and multiply) that are commonly used by the majority of applications and
then to execute those instructions as fast as possible.
• A Complex Instruction Set Computer (CISC) design focuses on
implementing high-level instructions that run more complex tasks
(such as calculating multiply-and-add many times) with each
instruction.
The average clock cycles per instruction (CPI) of these CISC instructions is
typically 10 to 20
• The TPU chose the CISC style
17. TPU Instructions
• The TPU has about a dozen instructions overall; the five key ones are Read_Host_Memory, Read_Weights, MatrixMultiply/Convolve, Activate, and Write_Host_Memory
18. TPU Instructions
• The CISC MatrixMultiply instruction is 12 bytes
3 are the Unified Buffer address; 2 are the accumulator address; 4 are the length (sometimes 2 dimensions for convolutions); and the rest are opcode and flags (a packing sketch follows this slide)
• Average clock cycles per instruction : > 10
• 4-stage overlapped execution, 1 instruction type / stage
Execute other instructions while matrix multiplier busy
• Complexity in SW
No branches, in-order issue, SW controlled buffers, SW controlled pipeline
synchronization
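A hypothetical packing of that 12-byte MatrixMultiply instruction is sketched below; only the field widths (3 + 2 + 4 bytes plus 3 bytes of opcode/flags) come from the slide, while the field order, endianness, and flag meanings are assumptions for illustration.

```python
def encode_matrix_multiply(opcode_flags, ub_addr, acc_addr, length):
    """Pack a 12-byte MatrixMultiply instruction (illustrative layout only)."""
    insn = (opcode_flags.to_bytes(3, "little")   # 3 bytes: opcode and flags
            + ub_addr.to_bytes(3, "little")      # 3 bytes: Unified Buffer address
            + acc_addr.to_bytes(2, "little")     # 2 bytes: accumulator address
            + length.to_bytes(4, "little"))      # 4 bytes: length (or 2 conv dims)
    assert len(insn) == 12
    return insn

print(encode_matrix_multiply(opcode_flags=0x1, ub_addr=0x100,
                             acc_addr=0x20, length=256).hex())
```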
19. Systolic Execution in Matrix Array
• Problem : Reading a large SRAM uses much more power than
arithmetic
• Solution : Using “Systolic Execution” to save energy by reducing
reads and writes of the Unified Buffer
• A systolic array is a two-dimensional collection of arithmetic units that each independently compute a partial result as a function of inputs from other arithmetic units that are considered upstream to each unit
• It is similar to blood being pumped through the human circulatory system by the heart, which is the origin of the systolic name
22. TPU Systolic Array
• In the TPU, the systolic array is rotated
• Weights are loaded from the top and the input data flows into the array from the left
• Weights are preloaded and take effect with the advancing wave alongside the first data of a new block (a functional sketch follows this slide)
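Here is a minimal functional model of one pass through a weight-stationary systolic array of the kind described above; it captures the dataflow (weights held in place, activations streaming across rows, partial sums flowing down columns into 32-bit accumulators) but not the cycle-by-cycle wave timing.

```python
import numpy as np

def systolic_matvec(weights, x):
    """weights[r][c] is held by the PE at row r, column c.
    Activation x[r] enters row r from the left; partial sums flow down each
    column and drop into the accumulators. In the hardware each activation is
    read from the Unified Buffer once and forwarded PE to PE across its row."""
    rows, cols = weights.shape
    acc = np.zeros(cols, dtype=np.int32)               # 32-bit accumulators
    for c in range(cols):                               # one column of PEs
        psum = 0
        for r in range(rows):                           # partial sum flows down
            psum += int(x[r]) * int(weights[r, c])      # 8-bit MAC at each PE
        acc[c] = psum
    return acc

rng = np.random.default_rng(0)
W = rng.integers(-128, 127, size=(4, 4), dtype=np.int8)
x = rng.integers(-128, 127, size=4, dtype=np.int8)
assert np.array_equal(systolic_matvec(W, x),
                      x.astype(np.int32) @ W.astype(np.int32))
```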
23. Software Stack
• Software stack is split into a User Space
Driver and a Kernel Driver.
• The Kernel Driver is lightweight and
handles only memory management
and interrupts.
• The User Space driver changes frequently. It sets up and controls TPU execution, reformats data into TPU order, translates API calls into TPU instructions, and turns them into an application binary.
24. Relative Performances : 3 Contemporary Chips
*TPU is less than half the die size of the Intel Haswell processor
• K80 and TPU use a 28 nm process; Haswell is fabbed in Intel's 22 nm process
• These chips and platforms were chosen for comparison because they are widely deployed in Google data centers
25. Relative Performance : 3 Platforms
• These chips and platforms were chosen for comparison because they are widely deployed in Google data centers
26. Performance Comparison
• Roofline Performance model
This simple visual model is not perfect, yet it offers insights into the causes of performance bottlenecks
The Y-axis is performance in floating-point operations per second, so the peak computation rate forms the “flat” part of the roofline
The X-axis is operational intensity, measured as floating-point operations per DRAM byte accessed (a minimal sketch of the bound follows this slide)
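A minimal sketch of the roofline bound itself, using the TPU's 92 TOPS peak and 34 GB/s bandwidth from earlier slides as example numbers:

```python
def roofline(peak_ops_per_s, mem_bw_bytes_per_s, ops_per_byte):
    """Attainable performance: the slanted part (bandwidth x intensity)
    or the flat part (peak compute), whichever is lower."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * ops_per_byte)

peak, bw = 92e12, 34e9
for oi in (10, 100, 1000, 10000):          # operational intensity, ops/byte
    print(oi, roofline(peak, bw, oi) / 1e12, "TOPS")
```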
27. TPU Die Roofline
• The TPU has a long “slanted” part of its roofline, where low operational intensity means that performance is limited by memory bandwidth.
• Five of the six applications are
happily bumping their heads against
the ceiling
• MLPs and LSTMs are memory bound,
and CNNs are computation bound.
31. Why So Far Below Rooflines? (MLP0)
• Response time is the reason
• Researchers have demonstrated that small increases in response
time cause customers to use a service less
• Inference prefers latency over throughput
32. TPU & GPU Relative Performance to CPU
• GM : Geometric Mean
• WM : Weighted Mean
34. Improving TPU : Move “Ridge Point” to the Left
• Current DRAM
Two DDR3 channels @ 2133 MHz : 34 GB/s
• Replace with GDDR5 like in K80
BW : 34 GB/s → 180 GB/s
Moves the ridge point from 1350 to 250 (worked out in the sketch after this slide)
This improvement would expand die size by about 10%. However, higher
memory bandwidth reduces pressure on the Unified Buffer, so reducing the
Unified Buffer to 14 MiB could gain back 10% in area.
[Figure: Maximum MiB of the 24 MiB Unified Buffer used per NN app]
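The 1350 → 250 ridge-point shift can be reproduced from the numbers above, assuming the ridge point here is expressed in MAC operations per weight byte:

```python
macs_per_s = 65536 * 700e6                   # peak MAC rate of the matrix unit
for name, bw in [("2x DDR3-2133", 34e9), ("GDDR5", 180e9)]:
    print(name, round(macs_per_s / bw))      # ridge point: ~1350 -> ~250
```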
39. Evaluation of TPU Designs
• The table below shows the differences between the model results and the hardware performance counters, which average below 10%.
41. Weighted Mean TPU Relative Performance
• First, increasing memory bandwidth (memory) has the biggest impact: performance improves 3X on average when memory bandwidth increases 4X
• Second, clock rate has little benefit on average, with or without more accumulators. The reason is that MLPs and LSTMs are memory bound but only the CNNs are compute bound
Increasing the clock rate by 4X has almost no impact on MLPs and LSTMs but improves performance of CNNs by about 2X (see the roofline sketch below)
• Third, the average performance slightly degrades when the matrix unit expands from 256x256 to 512x512 for all apps
The issue is analogous to internal fragmentation of large pages
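The memory-vs-clock observation follows directly from the roofline bound; the sketch below uses illustrative operational intensities for a memory-bound app (MLP/LSTM-like) and a compute-bound app (CNN-like), not the paper's detailed performance model.

```python
def attainable(peak_ops_per_s, mem_bw_bytes_per_s, ops_per_byte):
    # Roofline bound: memory-limited slope capped by peak compute.
    return min(peak_ops_per_s, mem_bw_bytes_per_s * ops_per_byte)

peak, bw = 92e12, 34e9
workloads = {"memory-bound (MLP/LSTM-like)": 200,     # ops/byte, illustrative
             "compute-bound (CNN-like)": 5400}

for label, p, b in [("baseline", peak, bw),
                    ("4x memory bandwidth", peak, 4 * bw),
                    ("4x clock rate", 4 * peak, bw)]:
    for name, oi in workloads.items():
        print(f"{label:20s} {name}: {attainable(p, b, oi) / 1e12:.1f} TOPS")
```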