https://sites.google.com/view/itri-icl-dla/
(Public Information Share) This presentation covers our lightweight DNN inference processor, including the system solution (from Caffe prototxt to HW control files), the hardware features, and RTL simulation results for an object detection example (Tiny YOLO). We modified the open-source NVDLA (small configuration) and developed a RISC-V MCU for this acceleration system.
1. A Lightweight
DNN Inference Processor
design, system, tools, and applications
羅賢君 Shien-Chun Luo Oct. 2018
工業技術研究院 Industrial Technology Research Institute (ITRI)
資訊與通訊研究所 Information and Communication Research Lab (ICL)
2. Roofline Model
- The Key to Designing a DNN Inference Engine
1. More parallel PEs with high utilization
▪ Efficient parallel PE structure and interconnect
▪ Proper memory hierarchy
2. Increase the data supply
▪ High-bandwidth data access
▪ Reduce data movement or compress data
3. Improve energy efficiency
▪ Adapt resources to models
▪ Low-power techniques
[Roofline figure: Performance (operations) vs. Operational Intensity (operations/byte); annotations 1-3 mark where each design key above applies, between the memory-bound region and the computation-bound ceiling.]
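The roofline bound itself is one line of arithmetic: attainable performance is the minimum of the compute ceiling and bandwidth times intensity. A minimal Python sketch, using the 50 GOPS peak quoted for our DLA-Lite later in this deck and an assumed 1 GB/s DRAM bandwidth (illustrative, not a measured SPEC):

# Roofline model: performance is capped by either compute or memory bandwidth.
def attainable_gops(peak_gops, dram_gbps, intensity_ops_per_byte):
    """Return the roofline bound in GOPS for a given operational intensity."""
    return min(peak_gops, dram_gbps * intensity_ops_per_byte)

# Assumed numbers: 50 GOPS peak, 1 GB/s DRAM bandwidth.
for oi in (8, 32, 128):
    bound = attainable_gops(peak_gops=50.0, dram_gbps=1.0, intensity_ops_per_byte=oi)
    regime = "memory-bound" if bound < 50.0 else "compute-bound"
    print(f"intensity {oi:4d} OPs/byte -> {bound:5.1f} GOPS ({regime})")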
3. Segment & Position
ARM's Project Trillium
• Performance > 4.6 TOPS
• Efficiency > 3 TOPS/W (7nm process)
• On-chip SRAM size up to 1MB
Our target DNN acceleration solution
• Performance of 50 GOPS ~ 200 GOPS
• Efficiency of about 1 TOPS/W (65nm process)
• On-chip SRAM size ≤ 256KB
Figure sourced from: ARM Project Trillium
4. We Started from the NVIDIA Open-source
Deep Learning Accelerator (DLA)
What ITRI has done
1. A bug-fixed HW version, fully compatible with NVDLA (NVIDIA's tools can still be used)
2. A model translation tool - compiles a DNN model into DLA configuration files
3. An adaptive quantization flow - converts FP weights to HW-specific 8-bit precision (see the sketch after this list)
4. End-to-end verification - we show object detection (YOLO) in this presentation
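For item 3, a minimal sketch of the basic per-layer step (symmetric linear quantization to 8 bits). The scale search and accuracy retraining of our actual adaptive flow are not shown, and the tensor below is random, purely for illustration:

import numpy as np

def quantize_symmetric_8bit(weights):
    """Map FP weights to int8 with one per-layer scale (minimal illustration)."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

# Illustrative use: quantize a small CONV kernel stack, check round-trip error.
w = np.random.randn(64, 3, 3, 3).astype(np.float32)
q, s = quantize_symmetric_8bit(w)
err = np.abs(w - q.astype(np.float32) * s).max()
print(f"scale={s:.6f}, max round-trip error={err:.6f}")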
HW Overview
Features
1. Variable HW resources
2. Suited for 3D convolution
3. Buffer data reuse
4. Hetero-layer fusion
5. Ping-pong CFG registers
5. DLA Features - Overview
1. Variable HW resources - PE count, buffer size
• Search for an efficient resource configuration per model
• Adapts performance & power consumption
2. Suited for 3D convolution
• Relaxed data dependencies; the input feature cube is shared
• Output-pixel-first order shares inputs and avoids partial-sum storage
• Supports any kernel size (n x m) with the same data flow
• Close to 100% PE utilization
3. Buffer data reuse
• Reuses input or weight data in the next layer
• Benefits large-layer partitioning and batching
4. Hetero-layer fusion
• Fuses the popular layer stack [CONV - BN - PReLU - Pooling]
• Greatly reduces DRAM data access
5. Ping-pong CFG registers
• Configures layers N and N+1 simultaneously
• Hides the configuration time during layer changes
[Figure: 3D CONV example - input feature cubes (width, height), multiple kernels, stride 1, no padding; channel-first vs. plane-first data ordering.]
6. DLA Features - Why is a Configurable Resource Important?
AlexNet (~0.73 GOP, 61M weights)
• Huge fully-connected weights
• DRAM speed dominates
• More computation power cannot help
GoogLeNet (~3.2 GOP, 7M weights)
• Small filter sizes (1x1)
• Benefits from parallelism in CNN operations
• Computation power dominates
• Faster DRAM cannot help
ResNet50 (~7.8 GOP, 25M weights)
• Large CNN operations, large weights
• Residual connections directly add two data cubes; DRAM speed dominates these adds
• Computation power and DRAM speed are equally important
[Figure: performance gradient of the three models across HW resource configurations.]
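These bounds follow from each model's operations-to-weight-traffic ratio. A quick check using only the numbers on this slide, counting 8-bit weight traffic once per frame and ignoring feature maps (so the intensities are optimistic):

# (total OPs, weight bytes at 8-bit) from this slide.
models = {"AlexNet": (0.73e9, 61e6),
          "GoogLeNet": (3.2e9, 7e6),
          "ResNet50": (7.8e9, 25e6)}
for name, (ops, weight_bytes) in models.items():
    intensity = ops / weight_bytes          # OPs per byte of weight traffic
    print(f"{name:10s}: {intensity:6.0f} OPs/byte")
# AlexNet ~12 OPs/byte sits far left on the roofline (DRAM-bound);
# GoogLeNet ~457 sits far right (compute-bound); ResNet50 ~312 is in between.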
7. Original NVDLA Framework, DEV Flow
Offline - Input Compiler (binary version):
• Caffe Prototxt + Caffe Model (weights) → Parser → Wisdom DIR (layer details)
• Compiler (optimization), given the HW SPEC and Layer IDs → Loadable file (HW CONFIGs, layers' CONFIGs) + Formatted Weights
Online - Hardware API and Driver:
• User Mode Driver (UMD): allocates addresses; function calls for layer-by-layer inference
• Kernel Mode Driver (KMD): translates a layer into HW binary CFGs; handles IRQs
• Flow Controller (MCU or CPU): loads HW binary CONFIGs; handles IRQs
• DLA HW
8. ITRI DLA-Lite Simplified Flow - Overview
HW architecture: a Host System (ARM-based, x86, …) programs the GPIF; on the accelerator side sit an MCU, optional NVM, DRAM, DMA, the GPIF, and the DNN Accelerator.
DEV tools: DNN Model → Translate / Format Tools → HW resource allocation, quantized re-trained weights, performance estimation. The tools:
1. Find an efficient setup of HW resources
2. Set up the system address allocation
3. Generate "translated" inference commands
4. Generate "formatted" model parameters
Outputs: an inference command package (compiled for the MCU) and an inference weight package. (A sketch of the packaging step follows.)
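A minimal sketch of the tool output stage. The two-package container format here (a 4-byte magic word, a length field, a raw payload) is a hypothetical stand-in, not our actual binary layout, and the payloads are placeholders:

import struct

def write_package(path, magic, payload):
    """Write one binary package: 4-byte magic, length field, then the payload."""
    with open(path, "wb") as f:
        f.write(struct.pack("<4sI", magic, len(payload)))
        f.write(payload)

# Hypothetical outputs of the DEV tools (contents are placeholders).
write_package("inference_cmds.bin", b"ICMD", b"\x00" * 64)     # MCU command package
write_package("inference_weights.bin", b"IWGT", b"\x00" * 64)  # formatted weight package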
9. ITRI DLA-Lite Simplified Flow - DEV tools
Command path:
• Caffe Model / Prototxt + DNN model parameters → Model Parser → Layer Fusion → Layer Partition
• Checks: layer sequence, HW buffer size
• DLA CFG translator + memory allocator → DLA CFG Commands → MCU Instructions (via the MCU compiler)
Weight path:
• HW-aware Quantize Insertion (TF) → Accuracy Retrain (TF) → Parameter Partition → Weight format writer → Formatted Quantized Weights
Two binary packages (similar in role to the input.txn file in the NVDLA v1 testbench):
1. compiled MCU instructions
2. formatted weights
Usage:
• Before inference, initialize the 2 packages into memory
• Then load images and activate the MCU and DLA
• API example: "YOLO" or "RESNET-50" as a single function call, with no breakdown into sub-tasks (sketched below)
• Easy for predefined DNNs; vendors can ship future updates
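A sketch of what the function-call-style API could look like on the host side. The class name, method names, and transport are all hypothetical, shown only to illustrate the "one call per model" idea:

# Hypothetical host-side wrapper (names and transport are illustrative only).
class DlaLite:
    def __init__(self, cmd_pkg, weight_pkg):
        # Before inference: load the two binary packages into device memory.
        self._load(cmd_pkg)       # compiled MCU instructions
        self._load(weight_pkg)    # formatted weights

    def _load(self, path):
        pass                      # platform-specific GPIF/DMA transfer, not shown

    def YOLO(self, image):
        """Whole-model Tiny YOLO inference as one call - no sub-task breakdown."""
        pass                      # write image, start the MCU, wait for IRQ, read results

# usage: boxes = DlaLite("inference_cmds.bin", "inference_weights.bin").YOLO(frame)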
10. Popular NN Computer Vision Tasks
The "You Only Look Once" (YOLO) object detection (OD) application is verified and demonstrated.
Figure sourced from: Arthur Ouaknine's Medium blog
11. Object Detection Inference (1/2)
-- Layer Fusion
Tiny YOLO v1 (39 DNN layers) → HW Inference Queue (9 layers):
• Hybrid Layer 1 = layers 1-5 (CONV, BN, Scale, ReLU, Pool)
• Hybrid Layer 2 = layers 6-10 (CONV, BN, Scale, ReLU, Pool)
• Hybrid Layer 3 = layers 11-15 (CONV, BN, Scale, ReLU, Pool)
• Hybrid Layer 4 = layers 16-20 (CONV, BN, Scale, ReLU, Pool)
• Hybrid Layer 5 = layers 21-25 (CONV, BN, Scale, ReLU, Pool)
• Hybrid Layer 6 = layers 26-30 (CONV, BN, Scale, ReLU, Pool)
• Hybrid Layer 7 = layers 31-34 (CONV, BN, Scale, ReLU)
• Hybrid Layer 8 = layers 35-38 (CONV, BN, Scale, ReLU)
• FC9 = layer 39 (FC)
A hybrid layer supports up to the [CONV-BN-Scale-PReLU-Pool] 5-layer combination.
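A minimal sketch of the grouping rule behind this mapping: scan the layer list and cut a new hybrid layer whenever the fusible pattern can no longer be extended. Our actual tool also checks the layer sequence and HW buffer sizes (previous slide); this sketch skips those checks:

FUSIBLE_AFTER_CONV = ("BN", "Scale", "ReLU", "PReLU", "Pool")

def fuse_layers(layer_types):
    """Group a flat layer list into hybrid layers (minimal illustration)."""
    groups, current = [], []
    for t in layer_types:
        if t == "CONV" or (not current) or t not in FUSIBLE_AFTER_CONV:
            if current:
                groups.append(current)   # close the previous hybrid layer
            current = [t]
        else:
            current.append(t)            # extend the current hybrid layer
    if current:
        groups.append(current)
    return groups

# Tiny YOLO v1: 6 x [CONV,BN,Scale,ReLU,Pool] + 2 x [CONV,BN,Scale,ReLU] + FC
tiny_yolo = ["CONV", "BN", "Scale", "ReLU", "Pool"] * 6 \
          + ["CONV", "BN", "Scale", "ReLU"] * 2 + ["FC"]
print(len(fuse_layers(tiny_yolo)))   # -> 9, matching the queue above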
• Originally (8-bit data), the minimal feature-map DRAM access = 27.7MB
• With [CONV-BN-Scale-PReLU-Pool] fusion, total feature-map DRAM access = 6.2MB
Why reducing DRAM access matters (weights = 27MB):
• Originally, @30 FPS, DRAM BW = 1.64 GB/s
• After fusion, @30 FPS, DRAM BW = 996 MB/s
HW: 64 cores, 128KB SRAM
* The detection layer is done by the CPU
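The bandwidth numbers above follow from simple arithmetic: per-frame traffic is the feature-map bytes plus the 27MB of weights, times 30 frames per second:

# DRAM bandwidth at 30 FPS: (feature-map bytes + weight bytes) per frame x 30.
FPS, WEIGHT_BYTES = 30, 27e6
for label, fmap_bytes in (("original", 27.7e6), ("after fusion", 6.2e6)):
    bw_gbps = (fmap_bytes + WEIGHT_BYTES) * FPS / 1e9
    print(f"{label:12s}: {bw_gbps:.3f} GB/s")  # -> 1.641 and 0.996, as on the slide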
12. Object Detection Inference (2/2)
-- RTL Results
Conv. layer | Input data dimension | RTL cycle # | OPs | OPs/cycle | UTIL
Hybrid1 | 448x448x3 | 5.80M | 193M | 33 | 26.0%
Hybrid2 | 224x224x16 | 4.25M | 472M | 111 | 86.8%
Hybrid3 | 112x112x32 | 3.94M | 467M | 119 | 92.7%
Hybrid4 | 56x56x64 | 3.82M | 465M | 122 | 95.1%
Hybrid5 | 28x28x128 | 3.71M | 464M | 125 | 97.6%
Hybrid6 | 14x14x256 | 3.69M | 463M | 126 | 98.1%
Hybrid7 | 7x7x512 | 3.66M | 463M | 126 | 98.7%
Hybrid8 | 7x7x1024 | 3.52M | 231M | 66 | 51.3%
FC9 | 12540 | 14.19M | 37M | 2.6 | 2.0%
Summary | | 46.57M | 3.25G | 70 |
Note: MAC (CONV+FC) total OPs = 3.18G; total weights = 27M
Uses 64 cores, 128KB SRAM; peak performance = 128 OPs/cycle
Result analysis
• High utilization (86%~98%) in the CNN layers
• DRAM BW and SRAM size affect hybrid layers 1 and 8
• FC is heavily DRAM-BW dominated
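The UTIL column is simply (OPs / cycles) divided by the 128-OPs/cycle peak; a quick recomputation from a few of the table's own rows:

# (RTL cycles, OPs) per layer, copied from the table above.
rows = {"Hybrid1": (5.80e6, 193e6), "Hybrid2": (4.25e6, 472e6),
        "Hybrid8": (3.52e6, 231e6), "FC9":     (14.19e6, 37e6)}
PEAK_OPS_PER_CYCLE = 128                      # 64 MACs x 2 OPs each
for name, (cycles, ops) in rows.items():
    util = ops / cycles / PEAK_OPS_PER_CYCLE
    print(f"{name:8s}: {ops / cycles:5.1f} OPs/cycle, UTIL {util:6.1%}")
# Matches the table: e.g. Hybrid2 -> 111 OPs/cycle, ~86.8% utilization.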
Each hybrid layer has detailed partitions, generated by the DEV tool.
[RTL simulation setup: Caffe-format weights pass through a Trans step to hex; a Config file and a Weight Generator drive the DLA RTL and a DRAM model through VPI.]
13. DLA Product Prototypes (1/2)
Example 1 --- as a standalone ID Camera
• FPGA-based standalone product
• The CFG file is packed into a C function and compiled for the ARM CPU
• Runs a predefined DNN inference
• Vendors update the DNN CFG & models
[Block diagram: ARM CPU (Processing System) with USB and HDMI; DRAM CTRL; the DLA on the FPGA; DRAM holds DLA input data, model weights, activations, and the OS memory space.]
14. DLA Product Prototypes (2/2)
Example 2 --- as a Plug-and-Play Stick
• USB accelerating stick + SDK (USB - DLA on FPGA; USB - DLA in ASIC, dev board)
• Helps legacy facilities equip DNN acceleration
• Similar to the Movidius / Gyrfalcon sticks, but executes whole-model inference instead of the convolution function only
Example 3 --- as a SoC IP
• DNN accelerator IP: conventional IP business + DEV tool chains
• The main CPU runs: Video Capture( ) → DNN_CALL( ) → Data Fusion( ) → Decision( ) (sketched below)
[Block diagram: Main CPU with USB, HDMI, DRAM CTRL, and DMA on AXI/APB; the DLA, with its private memory and MCU, attaches over AXI.]
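The SoC use case in Example 3 reduces to a simple main-CPU loop. A sketch with the four calls from the diagram; all function bodies are placeholders, only the call sequence is from the slide:

# Main-CPU control loop for the SoC IP use case; bodies are placeholders.
def video_capture():      return None   # grab a frame from the camera pipeline
def dnn_call(frame):      return None   # one call: whole-model inference on the DLA
def data_fusion(result):  return None   # merge DNN output with other sensor data
def decision(fused):      pass          # act on the fused result

while True:                              # steady-state application loop
    frame = video_capture()              # Video Capture( )
    result = dnn_call(frame)             # DNN_CALL( )
    decision(data_fusion(result))        # Data Fusion( ) -> Decision( )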
15. USB Acceleration System & ASIC Design
[Block diagram: USB-to-GPIF bridge and parallel bus to the host SDK + API; GPIF data CTRL, DRAM CTRL/IF, RISC-V with cache, and the DLA (64 MAC) on AXI; peripherals on APB.]
DLA-Lite System SPEC
• 400MHz core, 100MHz board
• 64 CONV MACs, 128KB CONV SRAM
• 50 GOPS peak CNN performance
• Target power consumption: 50mW
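The 50 GOPS peak follows directly from the MAC count and the core clock (a MAC counts as 2 OPs, multiply plus accumulate):

# Peak CNN performance: 64 MACs x 2 OPs/MAC x 400 MHz core clock.
peak_gops = 64 * 2 * 400e6 / 1e9
print(f"{peak_gops:.1f} GOPS peak")   # -> 51.2 GOPS, quoted as ~50 GOPS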
ASIC Preliminary Info (floorplan view)
• TSMC 65nm
• Die size: 3,200 x 3,200 μm²
• Core: 2,500 x 2,500 μm²
[Floorplan: RISC-V, PLL, AXI DMA interface, data IO CTRL, CONV sequencer, CONV DMA, ACC, a BN/PReLU/Pool processor, two 32-MAC arrays, and two 64KB CONV buffers.]
16. THANK YOU!
QUESTIONS AND COMMENTS?
technical contact: scluo@itri.org.tw, yhchu@itri.org.tw
business contact: victor.wang@itri.org.tw