D15-3 Customization of a
Deep Learning Accelerator
Shien-Chun Luo
Industrial Technology Research Institute
25 April 2019
Agenda
• Object Detection Demonstration
• Design a Highly Efficient Accelerator
• Our Solutions and Some Results
2
Demonstration of Object Detection
3
• 256-MAC DLA @ 150 MHz
• ZCU102 FPGA (uses 40% of the 600k logic cells)
• Ubuntu on ARM Cortex-A53, 1.2 GHz
• USB camera input, DisplayPort output
• Tiny YOLO v1, 448 x 448 RGB input
• 8 CONV layers & 1 FC layer, 3.2 GOPs per inference
• Detection layer runs on the CPU
• VOC dataset, 20 categories
• Original FP32 model: mAP = 40%
• Retrained INT8 (TensorFlow): mAP = 35%
• Average 8 FPS
• Execution time: CONV ~79 ms, FC ~48 ms
FPGA Object Detection Setup
4
[System diagram] The ARM CPU (processing system, with DRAM controller, USB input and DisplayPort output) and the DLA (in the FPGA fabric) share a 1 GB DRAM. Most of the DRAM is OS-controlled space; 64~256 MB is reserved for the DLA and holds the input image, model weights, temporary activations, and output data.

Software flow: Program INIT → set parameters → load weights → image capture (YUV) → re-format to RGB → activate DLA → (DLA finished) → post-processing → display.
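The flow above can be read as a simple host-side control loop. A minimal sketch follows; every function name in it is a hypothetical placeholder for illustration, not part of the actual demo SDK or driver.

```c
/* Hypothetical host-side control loop matching the flow above.
 * All function names are illustrative placeholders, not the real driver API. */
#include <stdint.h>
#include <stdbool.h>

void     dla_set_parameters(void);            /* program DLA layer registers      */
void     dla_load_weights(const char *path);  /* copy INT8 weights into DLA DRAM  */
void     dla_start(void);                     /* kick off the layer queue         */
bool     dla_finished(void);                  /* poll completion flag / interrupt */
uint8_t *capture_frame_yuv(void);             /* USB camera frame (YUV)           */
uint8_t *yuv_to_rgb448(const uint8_t *yuv);   /* re-format to 448x448 RGB         */
void     write_input_image(const uint8_t *rgb);
void     postprocess_and_display(void);       /* detection layer on CPU, DP out   */

int main(void)
{
    dla_set_parameters();                     /* Program INIT / set parameters */
    dla_load_weights("tiny_yolo_v1.bin");     /* Load weights once at startup  */

    for (;;) {
        uint8_t *yuv = capture_frame_yuv();   /* Image capture (YUV)           */
        uint8_t *rgb = yuv_to_rgb448(yuv);    /* Re-format to RGB              */
        write_input_image(rgb);               /* into the DLA-reserved DRAM    */

        dla_start();                          /* Activate DLA                  */
        while (!dla_finished())               /* wait for "DLA Finished"       */
            ;

        postprocess_and_display();            /* Post-processing + display     */
    }
    return 0;
}
```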
5
Design a Highly Efficient
Accelerator
3 Steps to Achieve Our Goal
6
1. Increase the number of MAC PEs while keeping utilization high
2. Increase the data supply to those PEs
3. Improve energy efficiency, adapting to the target models
[Figure: concepts of steps 1~3, using AlexNet as an example; throughput vs. computation power curves plotted for various DRAM bandwidths]
FPS/Throughput of Various Models
-- profiled using a 256-MAC, 128 KB, INT8 DLA inference configuration
7
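The per-model profile numbers follow on the next two slides. As a rough mental model for how they behave, a roofline-style estimate (frame time bounded by the slower of compute and DRAM traffic) is sketched below; this is an illustration only, not the equation-based profiler used for the numbers in this deck, and the example workload figures are assumptions.

```c
/* Roofline-style FPS estimate: a frame is limited either by MAC throughput
 * or by DRAM bandwidth.  Illustrative only; the equation-based profiler
 * referenced later in this deck models layers and buffers in more detail. */
#include <stdio.h>

static double estimate_fps(double ops,     /* total MAC ops per frame (e.g. 3.2e9) */
                           double bytes,   /* DRAM traffic per frame in bytes      */
                           int    macs,    /* number of MAC units (e.g. 256)       */
                           double clk_hz,  /* accelerator clock (e.g. 400e6)       */
                           double bw_bps)  /* DRAM bandwidth in bytes per second   */
{
    double t_compute = ops   / (2.0 * macs * clk_hz); /* 1 MAC = 2 ops per cycle */
    double t_memory  = bytes / bw_bps;
    double t_frame   = (t_compute > t_memory) ? t_compute : t_memory;
    return 1.0 / t_frame;
}

int main(void)
{
    /* Example: ~3.2 GOP Tiny YOLO v1, roughly 27 MB weights + 6 MB activations */
    printf("est. FPS = %.1f\n", estimate_fps(3.2e9, 33e6, 256, 400e6, 1e9));
    return 0;
}
```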
Profiles of Classical Classification
Models (1)
8
AlexNet (224): FPS vs. memory bandwidth (cells are FPS)

| Clock (peak GOPs) | 1 GB/s | 2 GB/s | 4 GB/s | 8 GB/s | 12 GB/s | Mem BW sensitivity |
|---|---|---|---|---|---|---|
| 100 MHz (51 GOPs) | 7.1 | 9.0 | 10.4 | 11.2 | 11.5 | 1.6 |
| 200 MHz (102 GOPs) | 10.0 | 14.2 | 18.0 | 20.7 | 21.8 | 2.2 |
| 400 MHz (205 GOPs) | 12.6 | 20.0 | 28.4 | 35.9 | 39.4 | 3.1 |
| 800 MHz (410 GOPs) | 14.3 | 25.2 | 40.1 | 56.8 | 66.0 | 4.6 |
| 1000 MHz (512 GOPs) | 14.6 | 26.6 | 43.6 | 64.3 | 76.4 | 5.2 |
| Computational power sensitivity | 2.1 | 3.0 | 4.2 | 5.7 | 6.6 | |

Inception v1 (224): FPS vs. memory bandwidth (cells are FPS)

| Clock (peak GOPs) | 1 GB/s | 2 GB/s | 4 GB/s | 8 GB/s | 12 GB/s | Mem BW sensitivity |
|---|---|---|---|---|---|---|
| 100 MHz (51 GOPs) | 8.8 | 9.1 | 9.2 | 9.3 | 9.3 | 1.1 |
| 200 MHz (102 GOPs) | 16.6 | 17.6 | 18.1 | 18.4 | 18.5 | 1.1 |
| 400 MHz (205 GOPs) | 28.3 | 33.1 | 35.2 | 36.2 | 36.6 | 1.3 |
| 800 MHz (410 GOPs) | 41.2 | 56.6 | 66.2 | 70.4 | 71.8 | 1.7 |
| 1000 MHz (512 GOPs) | 44.7 | 65.2 | 79.6 | 86.8 | 88.9 | 2.0 |
| Computational power sensitivity | 5.1 | 7.2 | 8.7 | 9.4 | 9.6 | |

AlexNet prefers more memory bandwidth, because of its heavy-weight FC layers. Inception prefers more computation power, because CNN computation dominates.

↑ Edge devices may limit the DRAM bandwidth budget (profiled with the 256-MAC configuration).
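The two sensitivity figures in these tables appear to be the ratio of the best to the worst FPS along each axis (12 GB/s vs. 1 GB/s for memory bandwidth, 1000 MHz vs. 100 MHz for computational power). A small check of that reading, using the AlexNet rows above:

```c
/* Sensitivity as max/min FPS along one axis, e.g. AlexNet @ 100 MHz:
 * 11.5 / 7.1 ~= 1.6, which matches the "Mem BW sensitivity" column above. */
#include <stdio.h>

static double sensitivity(const double *fps, int n)
{
    double lo = fps[0], hi = fps[0];
    for (int i = 1; i < n; i++) {
        if (fps[i] < lo) lo = fps[i];
        if (fps[i] > hi) hi = fps[i];
    }
    return hi / lo;
}

int main(void)
{
    double alexnet_100mhz[] = {7.1, 9.0, 10.4, 11.2, 11.5};  /* 1..12 GB/s row   */
    double alexnet_1gbps[]  = {7.1, 10.0, 12.6, 14.3, 14.6}; /* 100..1000 MHz col */
    printf("Mem BW sensitivity  : %.1f\n", sensitivity(alexnet_100mhz, 5)); /* ~1.6 */
    printf("Compute sensitivity : %.1f\n", sensitivity(alexnet_1gbps, 5));  /* ~2.1 */
    return 0;
}
```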
Profiles of Classical Classification
Models (2)
9
ResNet50 (224): FPS vs. memory bandwidth (cells are FPS)

| Clock (peak GOPs) | 1 GB/s | 2 GB/s | 4 GB/s | 8 GB/s | 12 GB/s | Mem BW sensitivity |
|---|---|---|---|---|---|---|
| 100 MHz (51 GOPs) | 5.0 | 5.1 | 5.1 | 5.1 | 5.1 | 1.0 |
| 200 MHz (102 GOPs) | 9.2 | 10.0 | 10.1 | 10.1 | 10.2 | 1.1 |
| 400 MHz (205 GOPs) | 13.0 | 18.4 | 20.1 | 20.2 | 20.3 | 1.6 |
| 800 MHz (410 GOPs) | 15.6 | 26.0 | 36.9 | 40.1 | 40.4 | 2.6 |
| 1000 MHz (512 GOPs) | 16.1 | 28.0 | 42.3 | 49.5 | 50.4 | 3.1 |
| Computational power sensitivity | 3.2 | 5.5 | 8.3 | 9.7 | 9.9 | |

MobileNet v1 (224): FPS vs. memory bandwidth (cells are FPS)

| Clock (peak GOPs) | 1 GB/s | 2 GB/s | 4 GB/s | 8 GB/s | 12 GB/s | Mem BW sensitivity |
|---|---|---|---|---|---|---|
| 100 MHz (51 GOPs) | 31.1 | 33.0 | 33.5 | 33.6 | 33.7 | 1.1 |
| 200 MHz (102 GOPs) | 51.2 | 62.2 | 66.0 | 66.9 | 67.2 | 1.3 |
| 400 MHz (205 GOPs) | 62.9 | 102.5 | 124.3 | 131.9 | 133.4 | 2.1 |
| 800 MHz (410 GOPs) | 64.0 | 125.8 | 204.9 | 248.6 | 259.7 | 4.1 |
| 1000 MHz (512 GOPs) | 64.0 | 127.9 | 226.0 | 299.7 | 318.5 | 5.0 |
| Computational power sensitivity | 2.1 | 3.9 | 6.8 | 8.9 | 9.5 | |

ResNet prefers a balance of memory bandwidth and computation power. MobileNet prefers more memory bandwidth, because DW-CONV layers reduce computation but increase the activations read from and written to memory.

↑ Edge devices may limit the DRAM bandwidth budget (profiled with the 256-MAC configuration).
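The MobileNet note can be checked with rough arithmetic: a depthwise-separable 3x3 block needs roughly an order of magnitude fewer MACs than a standard 3x3 convolution at the same tensor size, while the activations it writes and re-reads do not shrink (and the extra intermediate between the DW and PW stages adds traffic). The tensor sizes below are illustrative assumptions, not layer shapes taken from the profiler.

```c
/* Rough MAC / activation-traffic comparison for one 3x3 layer at H x W x C in,
 * Cout channels out, stride 1.  Operation counts only, not a performance model. */
#include <stdio.h>

int main(void)
{
    long long H = 56, W = 56, C = 128, Cout = 128;  /* illustrative sizes */

    long long std_macs = H * W * Cout * C * 9;      /* standard 3x3 CONV        */
    long long dw_macs  = H * W * C * 9              /* depthwise 3x3            */
                       + H * W * Cout * C;          /* + pointwise 1x1          */
    long long act_out  = H * W * Cout;              /* INT8 output activations  */

    printf("standard CONV  : %lld MMACs\n", std_macs / 1000000); /* ~462 */
    printf("DW + PW CONV   : %lld MMACs\n", dw_macs / 1000000);  /* ~55  */
    printf("activations out: %lld KB\n",    act_out / 1024);     /* ~392 */
    /* MACs drop ~8x while the activation read/write volume stays the same
     * (and the DW->PW intermediate adds more), so the layer turns memory-bound. */
    return 0;
}
```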
10
Our Solutions and Some Results
Let’s Use a Customizable Architecture
11
1. Variable CONV processing resources
• 64-MAC to 2048-MAC PE clusters for a single convolution processor
• Variable convolutional-buffer volume
2. Configurable NN operator processors
• Options for batch normalization, PReLU, scale, bias, quantization, and element-wise operators
• Options for down-sampling (e.g., pooling) operators
• Options for nonlinear LUTs
3. Custom memories and host CPUs
• Can be driven by an MCU or a CPU
• Shared or private DRAM/SRAM
(Architecture revised based on NVDLA)
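One way to picture this customization space is as a small build-time configuration record. The sketch below is only an illustration of the knobs listed above; the field names and the example values are invented for this sketch and are not NVDLA parameter names.

```c
/* Illustrative hardware-configuration record for the customizable DLA.
 * Field names and example values are invented; they are not NVDLA parameters. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    /* 1. CONV processing resources */
    uint16_t mac_count;      /* 64 .. 2048 MACs per convolution processor */
    uint32_t conv_buf_kib;   /* convolutional buffer size, e.g. 128 KiB   */

    /* 2. Optional NN operator processors */
    bool has_bn_scale_bias;  /* batch-norm / scale / bias                 */
    bool has_prelu;          /* PReLU activation                          */
    bool has_elementwise;    /* element-wise add/mul                      */
    bool has_pooling;        /* down-sample (pool) unit                   */
    bool has_lut;            /* nonlinear look-up tables                  */

    /* 3. Memory and host options */
    bool host_is_mcu;        /* driven by an MCU (true) or a CPU (false)  */
    bool private_dram;       /* private vs. shared DRAM/SRAM              */
} dla_config_t;

/* Example values only (roughly the 256-MAC / 128 KB demo configuration). */
static const dla_config_t demo_cfg = {
    .mac_count = 256, .conv_buf_kib = 128,
    .has_bn_scale_bias = true, .has_prelu = true,
    .has_elementwise = false, .has_pooling = true, .has_lut = true,
    .host_is_mcu = false, .private_dram = true,
};
```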
DLA Features – Inheritance and Our Changes
12
1. [Inherited] Channel-first CONV strategy
• Relaxes data dependencies and shares the input feature cube
• Any kernel size (n x m) reaches ~100% utilization when channels are deep
2. [Added tool to verify] Layer fusion to save memory access
• Fuses the popular layer stack [ CONV – BN – PReLU – Pool ]
• Verified → reduces activation access
3. [Added tool to verify] Program-time hiding
• Verified → the (N+1)-th layer is programmed while the N-th layer is running
4. [Revised HW] Depth-wise CONV support
• Revised the HW path from DMA to ACC
5. [Future work] DMA with fast data-dimension changes
• Add fast up-sampling and data-dimension reordering
[Figure: channel-first vs. plane-first traversal of the input (IN) feature cube over width, height and channels]
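To make the channel-first idea concrete, here is a simplified scalar model of the loop order, assuming INT8 inputs and weights; it sketches only the traversal order (input channel as the innermost loop for any n x m kernel), not the parallel MAC array, buffering, or padding/stride handling of the real hardware.

```c
/* Simplified scalar model of "channel-first" accumulation order: the innermost
 * loop runs over input channels, so the MACs can be fed a deep channel vector
 * regardless of the n x m kernel size.  Ordering only, not the HW dataflow. */
void conv_channel_first(const signed char *in,  /* [H][W][C]         */
                        const signed char *wt,  /* [K][n][m][C]      */
                        int *out,               /* [H_out][W_out][K] */
                        int H, int W, int C, int K, int n, int m)
{
    int H_out = H - n + 1, W_out = W - m + 1;   /* valid convolution, stride 1 */
    for (int y = 0; y < H_out; y++)
      for (int x = 0; x < W_out; x++)
        for (int k = 0; k < K; k++) {
            int acc = 0;
            for (int dy = 0; dy < n; dy++)
              for (int dx = 0; dx < m; dx++)
                for (int c = 0; c < C; c++)     /* channel-first inner loop */
                    acc += in[((y + dy) * W + (x + dx)) * C + c]
                         * wt[((k * n + dy) * m + dx) * C + c];
            out[(y * W_out + x) * K + k] = acc;
        }
}
```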
Standard Inference Flow with ONNC
on Linux Machines
13
Offline path: the user's framework model (graph + weights) goes through a framework converter into an ONNX graph plus model weights with quantization information (quantize info extracted via TensorFlow). The ONNC compiler (parser + compiler) turns these into loadable files, split into CPU tasks and DLA tasks.

Online path (hardware, API and driver for Linux): the User Mode Driver (UMD) and Kernel Mode Driver (KMD) pass the loadables to the flow controller (MCU or CPU), which drives the DLA hardware.
Bottom-up (Baremetal) Verification Flow
14
Flow: model weights + model prototxt → model parser → layer fusion → layer partition → DLA register configurations (REG CFGs) → API. In parallel, HW-aware quantize insertion (QAT or PTQ) followed by weight conversion & partition produces the quantized weights.
Simple API example:
• Use "YOLO" or "RESNET-50" as a single function call if there is no breakdown into sub-tasks
• Use only { RW REG, INTR, POLL } inside the API, so it fits a general C compiler
Two packages are inserted into main():
1. Load the quantized weights
2. Call the API (NN functions); see the sketch below and the next slide
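A sketch of this calling style follows: the model is exposed as one function, and the function body only reads/writes registers, waits on an interrupt, or polls. The register offsets, bit fields, symbol names, and the weight-preload assumption are placeholders for illustration, not the real DLA register map or generated code.

```c
/* Bare-metal "model as a function call" sketch.  Register offsets, bit
 * positions and symbol names are placeholders, not the real DLA register map. */
#include <stdint.h>

#define DLA_BASE        ((uintptr_t)0x40000000u)   /* hypothetical MMIO base */
#define REG(off)        (*(volatile uint32_t *)(DLA_BASE + (off)))
#define REG_LAYER_CFG   0x000u
#define REG_START       0x004u
#define REG_STATUS      0x008u
#define STATUS_DONE     (1u << 0)

extern const uint32_t yolo_layer_cfgs[];   /* generated REG CFGs per macro layer */
extern const int      yolo_num_macro_layers;

static void run_macro_layer(uint32_t cfg)
{
    REG(REG_LAYER_CFG) = cfg;                  /* RW REG                   */
    REG(REG_START)     = 1;                    /* kick the macro layer     */
    while (!(REG(REG_STATUS) & STATUS_DONE))   /* POLL (or wait for INTR)  */
        ;
}

void YOLO(void)                                /* whole network as one call */
{
    for (int i = 0; i < yolo_num_macro_layers; i++)
        run_macro_layer(yolo_layer_cfgs[i]);
}

int main(void)
{
    /* 1. quantized weights are assumed to be preloaded into the DLA DRAM space */
    /* 2. call the NN function                                                  */
    YOLO();
    return 0;
}
```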
Integer Model Quantization Flows
15
Graph path: the native training graph (Caffe prototxt, Darknet CFG, ONNX, ...) goes through the network converter to the compiler (baremetal or ONNC), and on to the DLA driver and hardware.

Weight path: the native training weights (TF/Caffe/Darknet/...) either pass through the weight converter directly (post-training quantization, PTQ → more accuracy loss) or the NN graph is retrained (quantize-aware training, QAT → less accuracy loss) to produce retrained weights (TensorFlow). Both routes yield weights with quantize info (TensorFlow) for the DLA driver and hardware.
PTQ is available even without HW or compiler results. ■ Requires some test data ■ Tiny YOLO v1 mAP 40% → 15%
QAT is available once the basic HW inference-fusion details are known. ■ Requires training and test data sets ■ Tiny YOLO v1 mAP 40% → 35%
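For reference, the simpler PTQ path boils down to choosing a scale and rounding to INT8. The sketch below shows symmetric per-tensor quantization as a minimal example; a real flow calibrates the scales over a test data set (as noted above) and often quantizes per channel, and these helper names are not from the actual toolchain.

```c
/* Minimal symmetric per-tensor INT8 quantization, as in a simple PTQ pass.
 * A real flow calibrates over a test set and often quantizes per channel. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

static float compute_scale(const float *w, int n)
{
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(w[i]) > max_abs) max_abs = fabsf(w[i]);
    return max_abs / 127.0f;              /* map [-max, +max] onto [-127, 127] */
}

static void quantize_int8(const float *w, int8_t *q, int n, float scale)
{
    for (int i = 0; i < n; i++) {
        long v = lroundf(w[i] / scale);
        if (v >  127) v =  127;
        if (v < -127) v = -127;
        q[i] = (int8_t)v;
    }
}

int main(void)
{
    float  w[4] = {0.52f, -1.30f, 0.01f, 0.90f};   /* toy weight tensor */
    int8_t q[4];
    float  s = compute_scale(w, 4);
    quantize_int8(w, q, 4, s);
    printf("scale=%f  q=[%d %d %d %d]\n", s, q[0], q[1], q[2], q[3]);
    return 0;
}
```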
Tiny YOLO v1 Inference Example
16
Tiny YOLO v1 (39 DNN layers) → HW inference queue (9 macro layers):

| HW macro layer | Fused DNN layers |
|---|---|
| Macro layer 1 | 1 CONV, 2 BN, 3 Scale, 4 ReLU, 5 Pool |
| Macro layer 2 | 6 CONV, 7 BN, 8 Scale, 9 ReLU, 10 Pool |
| Macro layer 3 | 11 CONV, 12 BN, 13 Scale, 14 ReLU, 15 Pool |
| Macro layer 4 | 16 CONV, 17 BN, 18 Scale, 19 ReLU, 20 Pool |
| Macro layer 5 | 21 CONV, 22 BN, 23 Scale, 24 ReLU, 25 Pool |
| Macro layer 6 | 26 CONV, 27 BN, 28 Scale, 29 ReLU, 30 Pool |
| Macro layer 7 | 31 CONV, 32 BN, 33 Scale, 34 ReLU |
| Macro layer 8 | 35 CONV, 36 BN, 37 Scale, 38 ReLU |
| FC9 | 39 FC |
• Originally (8-bit data), the minimal feature-map DRAM access is 27.7 MB per frame
• With fusion, total feature-map DRAM access drops to 6.2 MB
• Total weights remain 27 MB
Fusing 5 layers into 1 macro layer [CONV–BN–Scale–PReLU–Pool] reduces intermediate activation access; a rough accounting is sketched below.
* The detection layer is done by the CPU
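The saving can be approximated by summing the intermediate tensors that no longer round-trip through DRAM once the stack is fused. The sketch below uses that accounting with illustrative tensor sizes for a single macro layer; it will not reproduce the 27.7 MB / 6.2 MB whole-network figures above.

```c
/* Rough accounting of intermediate-activation DRAM traffic for one fused
 * [CONV-BN-Scale-PReLU-Pool] stack.  Tensor sizes are illustrative; this is
 * not the exact model behind the 27.7 MB / 6.2 MB figures above. */
#include <stdio.h>

typedef struct { int h, w, c; } tensor_t;

static long long bytes(tensor_t t) { return (long long)t.h * t.w * t.c; } /* INT8 */

int main(void)
{
    tensor_t conv_out = {224, 224, 16};  /* CONV/BN/Scale/PReLU keep this size */
    tensor_t pool_out = {112, 112, 16};  /* macro-layer output after pooling   */

    /* unfused: every sub-layer result is written to DRAM and read back */
    long long unfused = 2 * (4 * bytes(conv_out) + bytes(pool_out));
    /* fused: only the macro-layer output leaves the accelerator */
    long long fused   = 2 * bytes(pool_out);

    printf("unfused ~ %.1f MB, fused ~ %.1f MB per macro layer\n",
           unfused / 1e6, fused / 1e6);
    return 0;
}
```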
A Quick Glance of RTL Results
17
[Testbench diagram: a layer-queue CFG and a weight generator drive the DLA RTL together with a DRAM model (HEX images, VPI); a Caffe-format calculator produces reference results, and a checker compares the two sets of results.]
| Layer | Data DIM | OPs | 64M-DLA Cycles | 256M-DLA Cycles |
|---|---|---|---|---|
| Hybrid1 | 448x448x3 | 193 M | 5.8 M | 4.39 M |
| Hybrid2 | 224x224x16 | 472 M | 4.25 M | 2.23 M |
| Hybrid3 | 112x112x32 | 467 M | 3.94 M | 1.12 M |
| Hybrid4 | 56x56x64 | 465 M | 3.82 M | 1.04 M |
| Hybrid5 | 28x28x128 | 464 M | 3.71 M | 0.97 M |
| Hybrid6 | 14x14x256 | 463 M | 3.69 M | 0.95 M |
| Hybrid7 | 7x7x512 | 463 M | 3.66 M | 2.41 M |
| Hybrid8 | 7x7x1024 | 231 M | 3.52 M | 1.6 M |
| FC9 | 12540 | 37 M | 14.19 M | 9.23 M |
| Summary | | 3250 M | 46.57 M | 23.9 M |
Equation-based Profiler using
64 MAC / 128 KB configuration
18
| Network | Total Cycles | Clock Rate | Run Time per Frame | FPS |
|---|---|---|---|---|
| AlexNet | 61 M | 400 MHz | 152.20 ms | 6.57 |
| GoogLeNet | 27 M | 400 MHz | 67.83 ms | 14.74 |
| ResNet50 | 111 M | 400 MHz | 278.65 ms | 3.59 |
| VGG16 | 395 M | 400 MHz | 987.55 ms | 1.01 |
| Tiny YOLO v1 | 45 M | 400 MHz | 112.67 ms | 8.88 |
| Tiny YOLO v2 | 83 M | 400 MHz | 208.47 ms | 4.80 |
| Tiny YOLO v3 | 55 M | 400 MHz | 136.21 ms | 7.34 |
• Profiler accuracy keeps improving as more models are simulated in RTL
• The same DRAM bandwidth model (~0.5 GB/s) is used
Equation-based Profiler using
256 MAC / 128 KB configuration
19
| Network | Total Cycles | Clock Rate | Run Time per Frame | FPS |
|---|---|---|---|---|
| AlexNet | 49 M | 400 MHz | 122.17 ms | 8.19 |
| GoogLeNet | 11 M | 400 MHz | 28.40 ms | 35.22 |
| ResNet50 | 76 M | 400 MHz | 189.74 ms | 5.27 |
| VGG16 | 214 M | 400 MHz | 535.99 ms | 1.87 |
| Tiny YOLO v1 | 26 M | 400 MHz | 65.30 ms | 15.31 |
| Tiny YOLO v2 | 48 M | 400 MHz | 121.07 ms | 8.26 |
| Tiny YOLO v3 | 24 M | 400 MHz | 61.55 ms | 16.25 |
• Profiler accuracy keeps improving as more models are simulated in RTL
• The same DRAM bandwidth model (~0.5 GB/s) is used
Use the Profiler to Find a Design Target
20
For example: "I want 30 fps real-time Tiny YOLO inference. What is the required HW spec?" An illustrative design-space sweep is sketched below.
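One way such a question could be answered is to sweep candidate MAC counts, clocks and DRAM bandwidths through a roofline-style estimate (as in the earlier estimate_fps sketch) and keep the configurations that reach the target frame rate. The workload numbers fed in below are illustrative, not the profiler's internal model.

```c
/* Design-space sweep sketch: find (MACs, clock, DRAM BW) combinations that
 * reach a target FPS with a simple roofline estimate.  Workload numbers are
 * illustrative assumptions, not the equation-based profiler's model. */
#include <stdio.h>

static double estimate_fps(double ops, double bytes, int macs,
                           double clk_hz, double bw_bps)
{
    double t_c = ops / (2.0 * macs * clk_hz);   /* compute-bound frame time */
    double t_m = bytes / bw_bps;                /* memory-bound frame time  */
    return 1.0 / (t_c > t_m ? t_c : t_m);
}

int main(void)
{
    const double target = 30.0;   /* 30 fps Tiny YOLO                         */
    const double ops    = 3.2e9;  /* ~3.2 GOPs per frame                      */
    const double bytes  = 33e6;   /* ~27 MB weights + ~6 MB fused activations */

    const int    macs[] = {64, 256, 1024};
    const double clks[] = {200e6, 400e6, 800e6};
    const double bws[]  = {1e9, 2e9, 4e9};

    for (int i = 0; i < 3; i++)
      for (int j = 0; j < 3; j++)
        for (int k = 0; k < 3; k++) {
            double fps = estimate_fps(ops, bytes, macs[i], clks[j], bws[k]);
            if (fps >= target)
                printf("%4d MAC @ %3.0f MHz, %2.0f GB/s -> %.1f fps\n",
                       macs[i], clks[j] / 1e6, bws[k] / 1e9, fps);
        }
    return 0;
}
```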
ASIC Implementation
USB Accelerator for Legacy Machines
21
[SoC block diagram] The accelerator SoC (or its FPGA prototype) integrates a RISC-V host with cache, the DLA, a DRAM interface and APB peripherals on an AXI interconnect; a USB GPIF / parallel-bus bridge connects it to legacy machines, which run the Linux SDK + API. External DRAM attaches through the DRAM interface.
ASIC layout view:
• TSMC 65 nm
• 3.2 x 3.2 mm die
• 64 MACs, 128 KB buffer
• nv_small configuration
Chip and Board
22
• Technology: TSMC 65 nm, 1 V core voltage
• Performance: 64 MACs, 50 GOPs @ 400 MHz
• DLA average power: 60 mW
EVA board & die photo. More information about this chip will be published later.
Conclusions
23
• Adapt and customize HW resources if you already have some candidate models
• An end-to-end edge-AI solution is presented here for your reference
• Integer DLAs require especially tight cooperation among HW, SW, and model training
~ Thanks for Your Attention ~