Customization of a Deep Learning Accelerator, Based on NVDLA
1. D15-3 Customization of a
Deep Learning Accelerator
Shien-Chun Luo
Industrial Technology Research Institute
25 April 2019
2. Agenda
Object Detection Demonstration
Designing a Highly Efficient Accelerator
Our Solutions and Some Results
3. Demonstration of Object Detection
• 256-MAC DLA @ 150 MHz
• ZCU102 FPGA (uses ~40% of its 600k logic cells)
• Ubuntu on ARM A53, 1.2 GHz
• USB camera input, DisplayPort output
• Tiny YOLO v1, 448 x 448 RGB input
• 8 CONV + 1 FC layers, 3.2 GOPs per inference
• Detection layer runs on the CPU
• VOC dataset, 20 categories
• Original FP32 model, mAP = 40%
• Retrained INT8 with TensorFlow, mAP = 35%
• Avg 8 FPS
• Execution time: CONV ~79 ms, FC ~48 ms
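As a quick cross-check (not part of the original slides), the figures above are self-consistent; this small C sketch recomputes peak throughput, frame time, and MAC utilization from the numbers listed on this slide.

```c
#include <stdio.h>

/* Sanity check of the demo figures on this slide (not the demo code). */
int main(void) {
    const double macs     = 256;        /* MACs in the DLA              */
    const double clock_hz = 150e6;      /* FPGA clock                   */
    const double work_gop = 3.2;        /* GOPs per Tiny YOLO v1 frame  */
    const double conv_ms  = 79, fc_ms = 48;  /* measured layer times    */

    double peak_gops = macs * 2 * clock_hz / 1e9;   /* 1 MAC = 2 ops    */
    double frame_ms  = conv_ms + fc_ms;             /* ~127 ms          */
    double fps       = 1000.0 / frame_ms;           /* ~7.9 FPS (avg 8) */
    double eff_gops  = work_gop * fps;              /* ~25 GOPS         */

    printf("peak %.1f GOPS, %.1f FPS, sustained %.1f GOPS (%.0f%% util)\n",
           peak_gops, fps, eff_gops, 100.0 * eff_gops / peak_gops);
    return 0;
}
```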
4. FPGA Object Detection Setup
[Block diagram] Processing system: ARM CPU with USB (camera) and DP (display) interfaces; DLA in the FPGA fabric; both share 1 GB of DRAM through the DRAM controller. The DRAM holds the input image, model weights, temporary activations, and output data; part of it is OS-controlled, and 64~256 MB is reserved for the DLA.
[Software flow] Program INIT → set parameters → load weights → image capture (YUV) → re-format to RGB → activate DLA → DLA finished → post-processing → display.
6. 3 Steps to Achieve Our Goal
1. Increase the number of MAC PEs while keeping utilization high
2. Increase the data supply to those PEs
3. Improve energy efficiency, adapting to the models
[Chart] Concepts of steps 1~3, taking AlexNet as an example: throughput curves versus computation power, given various DRAM bandwidths.
11. Let’s Use a Customizable Architecture
1. Variable CONV processing resources
• 64-MAC PE cluster to 2048-MAC PE cluster
for a single convolution processor
• Variable volume of convolutional buffer
2. Configurable NN operator processors
• Options for batch normalization, PReLU,
scale, bias, quantization, element-wise
operators
• Options for down-sampling (e.g., pooling) operators
• Options for nonlinear LUTs
3. Custom memories and host CPUs
• Can be driven by MCU or CPU
• Shared or private DRAM/SRAM
Architecture revised based on NVDLA
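To make the customization points above concrete, a minimal sketch of how such build-time parameters might be captured is shown below; the struct, field names, and example values are illustrative assumptions, not the actual configuration interface.

```c
/* Hypothetical build-time configuration of the customizable DLA
 * (names and values are illustrative, not the real interface). */
typedef struct {
    unsigned conv_macs;        /* 64 .. 2048 MACs per CONV processor   */
    unsigned conv_buf_kbytes;  /* convolutional buffer volume          */
    int has_bn, has_prelu, has_scale_bias, has_quant, has_eltwise;
    int has_pool;              /* down-sampling (pooling) operators    */
    int has_nonlinear_lut;     /* LUT-based nonlinear activations      */
    enum { HOST_MCU, HOST_CPU } host;
    enum { MEM_SHARED_DRAM, MEM_PRIVATE_SRAM } memory;
} dla_config_t;

/* Example: a small 64-MAC / 128-KB point like the one used later. */
static const dla_config_t cfg_small = {
    .conv_macs = 64, .conv_buf_kbytes = 128,
    .has_bn = 1, .has_prelu = 1, .has_scale_bias = 1,
    .has_quant = 1, .has_eltwise = 1,
    .has_pool = 1, .has_nonlinear_lut = 0,
    .host = HOST_CPU, .memory = MEM_SHARED_DRAM,
};
```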
12. DLA Features – Inheritance and Our Changes
1. [Inherited] Channel-first CONV strategy
• Relaxes data dependencies; the input feature cube is shared
• Any kernel size (n x m) reaches ~100% utilization when channels are deep
2. [Add tool to verify] Layer fusion to save memory access
• Fuses the popular layer stack [ CONV – BN – PReLU – Pool ]
• Verified reduction in activation access
3. [Add tool to verify] Programming-time hiding
• Verified programming of the (N+1)th layer while the Nth layer is running
4. [Revised HW] Depth-wise CONV support
• Revised HW from DMA to ACC
5. [Future work] DMA for fast data-dimension changes
• Adding fast up-sampling algorithms and data-dimension reordering
[Figure] Input feature cube (width x height x channels): channel-first traversal vs. plane-first traversal.
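A rough scalar model of the channel-first idea, assuming a 64-MAC cluster that consumes 64 input channels per cycle; it only illustrates why deep channels keep utilization near 100% for any n x m kernel and is not the RTL behavior.

```c
/* Toy model of the channel-first CONV schedule (not the RTL).
 * Each cycle the MAC cluster consumes ATOMIC_C input channels of one
 * kernel tap, so utilization depends on channel depth, not on the
 * n x m kernel shape.  Output-channel looping is left out. */
#define ATOMIC_C 64   /* MACs working along the channel dimension */

static unsigned long conv_cycles(unsigned kh, unsigned kw, unsigned in_c,
                                 unsigned out_w, unsigned out_h) {
    unsigned c_steps = (in_c + ATOMIC_C - 1) / ATOMIC_C;  /* ceil */
    return (unsigned long)kh * kw * c_steps * out_w * out_h;
}
/* in_c = 256: every cycle keeps all 64 MACs busy (~100% utilization);
 * in_c = 3 (first RGB layer): only 3 of 64 MACs are busy. */
```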
13. Standard Inference Collaborating with ONNC
on Linux Machines
[Software stack]
• Offline: the user's framework model (graph + weights) goes through a framework converter to an ONNX graph plus model weights with quantization information (extracted via TensorFlow); the parser and compiler (ONNC) then emit loadable files containing CPU tasks and DLA tasks.
• Online: the hardware API and driver for Linux, i.e. the User Mode Driver (UMD) and Kernel Mode Driver (KMD), pass the loadables to the flow controller (MCU or CPU), which drives the DLA HW.
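The offline/online split above essentially hands a "loadable" from the compiler to the runtime. The sketch below is a purely hypothetical, simplified data layout that mirrors the diagram (CPU tasks vs. DLA tasks); the real NVDLA/ONNC loadable format is different and much more detailed.

```c
/* Hypothetical, simplified view of what the compiler emits and the
 * runtime (UMD -> KMD -> flow controller) consumes.  The actual
 * NVDLA/ONNC loadable format differs; this only mirrors the diagram. */
typedef enum { TASK_CPU, TASK_DLA } task_kind_t;

typedef struct {
    task_kind_t kind;        /* e.g. the detection layer runs on CPU  */
    const void *reg_cfg;     /* DLA register-configuration blob       */
    const void *weights;     /* quantized weights for this task       */
    unsigned    num_deps;    /* tasks that must finish first          */
} task_t;

typedef struct {
    task_t  *tasks;          /* ordered task list from the compiler   */
    unsigned num_tasks;
} loadable_t;

/* Online side: the flow controller walks the list, submitting DLA
 * tasks through the KMD and running CPU tasks locally. */
```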
14. Bottom-up (Baremetal) Verification Flow
[Flow]
• Model prototxt → model parser → layer fusion → layer partition → DLA register CFGs (wrapped as an API)
• Model weights → HW-aware quantization insertion (QAT or PTQ) → weight conversion & partition → quantized weights
Simple API example:
• Use "YOLO" or "RESNET-50" as a single function call, if there is no breakdown into sub-tasks
• Use { RW REG, INTR, POLL } inside the API, fit for a general C compiler
Two packages are inserted into main (see the sketch below):
1. Load quantized weights
2. Call the API (NN functions)
(next slide)
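A minimal baremetal sketch of the "NN function as one API call" idea: the register offsets, names, and weight tables are assumptions for illustration, but the structure (write per-macro-layer registers, start the DLA, then poll or wait for the interrupt) follows the bullets above.

```c
/* Baremetal-style sketch of "call YOLO as one function".
 * Register names/offsets and extern tables are illustrative assumptions. */
#include <stdint.h>

#define DLA_BASE        0x40000000u                 /* assumed MMIO base */
#define REG(off)        (*(volatile uint32_t *)(uintptr_t)(DLA_BASE + (off)))
#define DLA_REG_START   0x000
#define DLA_REG_STATUS  0x004
#define DLA_DONE_BIT    0x1u

extern const uint32_t yolo_reg_cfgs[][64];   /* per-macro-layer REG CFGs */
extern const unsigned yolo_num_macro_layers;

static void run_macro_layer(const uint32_t *cfg) {
    for (unsigned i = 0; i < 64; i++)        /* RW REG                   */
        REG(0x100 + 4 * i) = cfg[i];
    REG(DLA_REG_START) = 1;                  /* kick the DLA             */
    while (!(REG(DLA_REG_STATUS) & DLA_DONE_BIT))
        ;                                    /* POLL (or wait for INTR)  */
}

/* One call runs the whole network: "use YOLO as a function call". */
void yolo_inference(void) {
    for (unsigned l = 0; l < yolo_num_macro_layers; l++)
        run_macro_layer(yolo_reg_cfgs[l]);
}

int main(void) {
    /* 1. load quantized weights into the DLA-reserved DRAM (omitted) */
    /* 2. call the NN function */
    yolo_inference();
    return 0;
}
```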
15. Integer Model Quantization Flows
[Flow] Native training graph (Caffe prototxt, Darknet CFG, ONNX, ...) → network converter → compiler (baremetal / ONNC) → DLA driver → DLA HW.
Native training weights (TF/Caffe/Darknet/...) take one of two paths:
• Post-training quantization (PTQ), more accuracy loss: weight converter → weights with quantize info (TensorFlow)
• Quantization-aware training (QAT), less accuracy loss: retrained NN graph → retrained weights (TensorFlow)
Without HW or compiler results, PTQ is already available
■ Requires some test data ■ Tiny YOLO v1 mAP: 40% → 15%
With basic HW inference and fusion details, QAT becomes available
■ Requires training and test data sets ■ Tiny YOLO v1 mAP: 40% → 35%
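As a concrete illustration of the PTQ path, a symmetric per-tensor INT8 quantization step might look like the sketch below; the simple max-abs calibration is an assumption, and shortcuts like it are exactly why PTQ loses more accuracy than QAT on this model.

```c
/* Minimal symmetric per-tensor INT8 PTQ sketch (max-abs calibration).
 * Real flows use better calibration; shown only to make the
 * "weights with quantize info" box concrete. */
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { int8_t *q; float scale; } qtensor_t;

static qtensor_t quantize_int8(const float *w, size_t n) {
    float max_abs = 0.f;
    for (size_t i = 0; i < n; i++)
        if (fabsf(w[i]) > max_abs) max_abs = fabsf(w[i]);

    qtensor_t t;
    t.scale = (max_abs > 0.f) ? max_abs / 127.f : 1.f;  /* quantize info */
    t.q = malloc(n);
    for (size_t i = 0; i < n; i++) {
        long v = lroundf(w[i] / t.scale);
        if (v > 127)  v = 127;
        if (v < -128) v = -128;
        t.q[i] = (int8_t)v;
    }
    return t;          /* dequantize later as w ~= q * scale */
}
```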
16. Tiny YOLO v1 Inference Example
Tiny YOLO v1 (39 DNN layers) → HW inference queue (9 macro layers):
• Macro layers 1~6: [ CONV – BN – Scale – ReLU – Pool ] (layers 1~30)
• Macro layers 7~8: [ CONV – BN – Scale – ReLU ] (layers 31~38)
• FC9: [ FC ] (layer 39)
• Originally (8-bit data), the minimal feature-map DRAM access is 27.7 MB
• With fusion, total feature-map DRAM access is 6.2 MB
• Total weights remain 27 MB
Fusing 5 layers into 1 macro layer [CONV–BN–Scale–PReLU–Pool] reduces intermediate activation access.
* The detection layer is done by the CPU
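One way to derive the 9-entry HW inference queue from the 39-layer list is a greedy pass that keeps extending the current macro layer while the next layer fits the [CONV–BN–Scale–Act–Pool] pattern; the sketch below illustrates that idea and is not the actual fusion tool.

```c
/* Greedy layer-fusion sketch: group a linear layer list into macro
 * layers matching [CONV - BN - Scale - Act - Pool].  Illustrative
 * only; the real fusion tool handles more cases. */
typedef enum { L_CONV, L_BN, L_SCALE, L_ACT, L_POOL, L_FC } ltype_t;

/* A layer may extend the current macro layer only in this order. */
static int can_follow(ltype_t prev, ltype_t next) {
    return next > prev && next != L_FC;   /* CONV < BN < SCALE < ACT < POOL */
}

static unsigned fuse(const ltype_t *layers, unsigned n,
                     unsigned *macro_of /* out: macro index per layer */) {
    unsigned macros = 0;
    for (unsigned i = 0; i < n; i++) {
        if (i == 0 || layers[i] == L_CONV || layers[i] == L_FC ||
            !can_follow(layers[i - 1], layers[i]))
            macros++;                      /* start a new macro layer */
        macro_of[i] = macros - 1;
    }
    return macros;   /* Tiny YOLO v1's 39 layers collapse to 9 */
}
```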
17. A Quick Glance at RTL Results
[RTL verification setup] Layer-queue CFGs and a weight generator (Caffe-format model, HEX) drive the DLA RTL together with a DRAM model; a checker/calculator compares the RTL results against reference results via VPI.
Layer   | Data DIM   | OPs    | 64M-DLA Cycles | 256M-DLA Cycles
Hybrid1 | 448x448x3  | 193 M  | 5.8 M          | 4.39 M
Hybrid2 | 224x224x16 | 472 M  | 4.25 M         | 2.23 M
Hybrid3 | 112x112x32 | 467 M  | 3.94 M         | 1.12 M
Hybrid4 | 56x56x64   | 465 M  | 3.82 M         | 1.04 M
Hybrid5 | 28x28x128  | 464 M  | 3.71 M         | 0.97 M
Hybrid6 | 14x14x256  | 463 M  | 3.69 M         | 0.95 M
Hybrid7 | 7x7x512    | 463 M  | 3.66 M         | 2.41 M
Hybrid8 | 7x7x1024   | 231 M  | 3.52 M         | 1.6 M
FC9     | 12540      | 37 M   | 14.19 M        | 9.23 M
Summary |            | 3250 M | 46.57 M        | 23.9 M
18. Equation-based Profiler using
64 MAC / 128 KB configuration
Network      | Total Cycles | Clock Rate | Run Time per Frame | FPS
AlexNet      | 61 M         | 400 MHz    | 152.20 ms          | 6.57
GoogLeNet    | 27 M         | 400 MHz    | 67.83 ms           | 14.74
ResNet50     | 111 M        | 400 MHz    | 278.65 ms          | 3.59
VGG16        | 395 M        | 400 MHz    | 987.55 ms          | 1.01
Tiny YOLO v1 | 45 M         | 400 MHz    | 112.67 ms          | 8.88
Tiny YOLO v2 | 83 M         | 400 MHz    | 208.47 ms          | 4.80
Tiny YOLO v3 | 55 M         | 400 MHz    | 136.21 ms          | 7.34
• Profiler accuracy keeps improving as more RTL simulations of models are completed
• The same DRAM BW model (~0.5 GB/s) is used
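The conversion from profiler cycles to the run-time and FPS columns is plain arithmetic (cycles divided by clock rate); the sketch below reproduces the AlexNet row under that assumption. The analytical cycle model itself is not shown here.

```c
#include <stdio.h>

/* Turn a profiler cycle estimate into run time and FPS.
 * (The analytical cycle model itself is not reproduced here.) */
static void report(const char *net, double cycles, double clk_hz) {
    double ms  = cycles / clk_hz * 1e3;
    double fps = 1e3 / ms;
    printf("%-12s %7.2f ms  %5.2f FPS\n", net, ms, fps);
}

int main(void) {
    /* AlexNet row of the 64-MAC / 128-KB table: ~61 M cycles @ 400 MHz
     * -> ~152 ms -> ~6.6 FPS, consistent with the table above. */
    report("AlexNet", 61e6, 400e6);
    return 0;
}
```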
19. Equation-based Profiler using
256 MAC / 128 KB configuration
Network      | Total Cycles | Clock Rate | Run Time per Frame | FPS
AlexNet      | 49 M         | 400 MHz    | 122.17 ms          | 8.19
GoogLeNet    | 11 M         | 400 MHz    | 28.40 ms           | 35.22
ResNet50     | 76 M         | 400 MHz    | 189.74 ms          | 5.27
VGG16        | 214 M        | 400 MHz    | 535.99 ms          | 1.87
Tiny YOLO v1 | 26 M         | 400 MHz    | 65.30 ms           | 15.31
Tiny YOLO v2 | 48 M         | 400 MHz    | 121.07 ms          | 8.26
Tiny YOLO v3 | 24 M         | 400 MHz    | 61.55 ms           | 16.25
• Profiler accuracy keeps improving as more RTL simulations of models are completed
• The same DRAM BW model (~0.5 GB/s) is used
20. Use the Profiler to Find a Design Target
For example: I want 30 fps real-time Tiny YOLO inference; what is the required HW spec? (A back-of-the-envelope sketch follows below.)
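Working the profiler backwards gives a quick answer, under the simplifying assumption that the cycle count from the 256-MAC table stays fixed: a 30 fps target at 400 MHz allows ~13.3 M cycles per frame, while Tiny YOLO v1 needs ~26 M, so either the clock roughly doubles (~780 MHz) or the MAC count / DRAM bandwidth must grow.

```c
#include <stdio.h>

/* Back-of-the-envelope target search: given a cycle estimate from the
 * profiler tables, what clock rate hits the desired FPS?  (Assumes the
 * cycle count stays fixed, i.e. DRAM bandwidth scales accordingly.) */
int main(void) {
    const double target_fps   = 30.0;
    const double tinyyolo_cyc = 26e6;    /* 256-MAC / 128-KB table row */

    double budget_at_400MHz = 400e6 / target_fps;        /* ~13.3 M    */
    double needed_clock_hz  = tinyyolo_cyc * target_fps; /* ~780 MHz   */

    printf("cycle budget @400 MHz: %.1f M (need %.0f M)\n",
           budget_at_400MHz / 1e6, tinyyolo_cyc / 1e6);
    printf("or raise the clock to ~%.0f MHz, or add MACs / BW\n",
           needed_clock_hz / 1e6);
    return 0;
}
```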
21. ASIC Implementation
USB Accelerator for Legacy Machines
[Block diagram] A USB bridge and USB GPIF / parallel bus (SDK + API on Linux) connect the accelerator to the host; inside the SoC or FPGA prototype, a RISC-V core with cache, the DLA, the DRAM IF, and APB peripherals sit on an AXI bus, with external DRAM holding the fused-layer inputs and outputs.
[Layout view]
• TSMC 65 nm
• 3.2 x 3.2 mm
• 64 MACs, 128 KB
• nv_small
22. Chip and Board
• Technology: TSMC 65 nm, core: 1 V
• Performance: 64 MACs, 50 GOPS @ 400 MHz
• DLA average power: 60 mW
EVA Board & Die Photo … More information about this
chip will be published later
23. Conclusions
Adapt and customize HW resources if you already have some candidate models.
An end-to-end solution for edge-AI applications is presented here for your reference.
Integer DLAs require especially tight cooperation among HW, SW, and training.
~ Thank You for Your Attention ~