A Lightweight
DNN Inference Processor
design, system, tools, and applications
羅賢君 Shien-Chun Luo Oct. 2018
工業技術研究院 Industrial Technology Research Institute (ITRI)
資訊與通訊研究所 Information and Communication Research Lab (ICL)
Roofline Model
- Key to Design DNN Inference Engine
1. More parallel PEs with high utilization
▪ Efficient parallel PE structure, interconnect
▪ Proper memory hierarchy
2. Increase data supply
▪ High bandwidth data access
▪ Reduce data movement or compress data
3. Improve energy efficiency
▪ Adapt resources to models
▪ Low-power techniques
[Figure: roofline plot. Y axis: Performance (operations); X axis: Operational Intensity (operations/byte); curves: Computation Bound 1 and Computation Bound 2]
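The roofline relation behind this slide can be sketched in a few lines. This is a minimal illustration, not the authors' tool: the function names are hypothetical, and the peak and bandwidth numbers are placeholder values chosen only to show the two regimes.

```python
def attainable_ops_per_s(peak_ops_per_s, dram_bytes_per_s, ops_per_byte):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_ops_per_s, dram_bytes_per_s * ops_per_byte)


def ridge_point(peak_ops_per_s, dram_bytes_per_s):
    """Operational intensity (ops/byte) where the two roofs meet."""
    return peak_ops_per_s / dram_bytes_per_s


# Placeholder numbers (assumptions, not measurements from the slides).
PEAK = 50e9   # 50 GOP/s compute roof
BW = 1e9      # 1 GB/s DRAM bandwidth

for intensity in (2, 50, 200):
    perf = attainable_ops_per_s(PEAK, BW, intensity)
    regime = "memory-bound" if intensity < ridge_point(PEAK, BW) else "compute-bound"
    print(f"{intensity:4d} ops/byte -> {perf / 1e9:5.1f} GOP/s ({regime})")
```

Raising the compute roof (more PEs) only helps layers to the right of the ridge point; layers to the left need more bandwidth or less data movement.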
Segment & Position
ARM’s Project Trillium
• Performance of > 4.6 TOP/s
• Efficiency of > 3 TOPs/W (7nm process)
• On-chip SRAM size up-to 1MB
Our target DNN acceleration solution
• Performance of 50 GOP/s ~ 200 GOP/s
• Efficiency of about 1 TOPs/W (65nm process)
• On-chip SRAM size ≤ 256KB
Figure sourced from : ARM Project Trillium
We Started from NVIDIA's Open-source
Deep Learning Accelerator (DLA)
What ITRI has done
1. A bug-fixed HW version that is fully compatible with NVDLA (can use NVIDIA's tools)
2. A model translation tool – compiles a DNN model to DLA configuration files
3. An adaptive quantization flow – converts FP weights to HW-specific 8-bit precision
4. End-to-end verification – we show object detection (YOLO) in this presentation
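To make item 3 concrete, here is a minimal sketch of converting FP weights to 8-bit values. It assumes plain symmetric per-tensor quantization; the slides do not describe the actual adaptive, HW-specific scheme, so the function names and the scaling rule are illustrative assumptions only.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q,
    with q an integer in [-127, 127]. (Assumed scheme, for illustration.)"""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map 8-bit codes back to floating point."""
    return [scale * v for v in q]


weights = [0.50, -1.27, 0.02, 1.00]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding error per weight is bounded by half a quantization step.
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, worst)
```

A retraining step, as in the flow on the later DEV tools slide, would then recover accuracy lost to this rounding.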
HW Overview
Features
1. Variable HW resource
2. Suited to 3D convolution
3. Buffer data reuse
4. Hetero-layer fusion
5. Ping-pong CFG registers
DLA Features - Overview
1. Variable HW resource: PE#, buffer size
• Search for an efficient resource setup for each model
• Adaptive performance & power consumption
2. Suited to 3D convolution
• Relaxed data dependency, shared input feature cube
• Output pixel first, sharing IN and avoiding partial-sum storage
• Supports any kernel size (n x m) with the same data flow
• Close to 100% PE utilization
3. Buffer data reuse
• Reuse input or weights in the next layer
• Benefits large-layer partitioning and batching
4. Hetero-layer fusion
• Fuse the popular layer stack [ CONV – BN – PReLU – Pooling ]
• Greatly reduces DRAM access
5. Ping-pong CFG registers
• Configure layers N and N+1 simultaneously
• Hides the configuration time during layer changes
[Figure: 3D CONV example (stride 1, no pad). Labels: IN (width, height), kernels, OUT; channel first vs. plane first]
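The "output pixel first" data flow of feature 2 can be sketched as a loop nest. This is a software illustration under simple assumptions (stride 1, no padding, nested Python lists), not the hardware schedule: each output pixel is fully accumulated over channels and kernel taps before it is written, so no partial sums are stored, and every kernel reuses the same input window.

```python
def conv3d_output_first(inp, kernels):
    """inp: [C][H][W] feature cube; kernels: [K][C][R][S]; stride 1, no pad.
    Returns the [K][H-R+1][W-S+1] output cube."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    K, R, S = len(kernels), len(kernels[0][0]), len(kernels[0][0][0])
    out_h, out_w = H - R + 1, W - S + 1
    out = [[[0] * out_w for _ in range(out_h)] for _ in range(K)]
    for y in range(out_h):
        for x in range(out_w):
            for k in range(K):              # all kernels share this input window
                acc = 0                     # local accumulator, never spilled
                for c in range(C):
                    for r in range(R):
                        for s in range(S):
                            acc += inp[c][y + r][x + s] * kernels[k][c][r][s]
                out[k][y][x] = acc          # written exactly once
    return out


image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # one channel, 3x3
diag = [[[[1, 0], [0, 1]]]]                    # one 2x2 kernel
print(conv3d_output_first(image, diag))
```

The same loops handle any n x m kernel, since R and S are read from the kernel shape.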
DLA Features - Why Is a Configurable Resource Important?
AlexNet (~0.73 GOP, 61M weights)
• Huge fully connected weights
• DRAM speed dominates
• More computation power cannot help
GoogLeNet (~3.2 GOP, 7M weights)
• Small filter size (1x1)
• Benefits from parallelism in CNN operations
• Computation power dominates
• Faster DRAM cannot help
ResNet50 (~7.8 GOP, 25M weights)
• Large CNN operations, large weights
• Residual → directly adds two data cubes → DRAM speed dominates
• Computation power and DRAM speed are equally important
[Figure: Performance Gradient]
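A rough way to see why these three models behave differently is to compute operations per weight byte from the figures above. The sketch assumes 8-bit weights (1 byte each), batch size 1, and each weight fetched from DRAM once; it ignores feature-map traffic, which is exactly what makes ResNet50's residual additions more DRAM-sensitive than this ratio alone suggests.

```python
# (model, total operations, weight count), taken from the slide text
MODELS = [
    ("AlexNet", 0.73e9, 61e6),
    ("GoogLeNet", 3.2e9, 7e6),
    ("ResNet50", 7.8e9, 25e6),
]

BYTES_PER_WEIGHT = 1  # assumption: 8-bit quantized weights


def ops_per_weight_byte(total_ops, num_weights, bytes_per_weight=BYTES_PER_WEIGHT):
    """Operations performed per byte of weights read (weights-only intensity)."""
    return total_ops / (num_weights * bytes_per_weight)


for name, ops, weights in MODELS:
    print(f"{name:10s} {ops_per_weight_byte(ops, weights):7.1f} ops per weight byte")
```

AlexNet performs only about a dozen operations per weight byte, so DRAM speed limits it; GoogLeNet performs hundreds, so compute limits it.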
Original NVDLA Framework, DEV Flow
[Flow diagram, flattened in extraction; box labels grouped by column]
Input
• Caffe Prototxt
• Caffe Model (weights)
Compiler (binary version)
• Parser
• Compiler (Optimization)
• HW SPEC, Layer ID
• Wisdom DIR: layer details
• Loadable file: HW CONFIGs, layers’ CONFIGs
API and Driver
• User Mode Driver (UMD): allocate address; function call: layer-by-layer inference
• Kernel Mode Driver (KMD): translate a layer to HW binary CFGs; handle IRQ
Hardware
• Flow Controller (MCU or CPU): load HW binary CONFIGs; handle IRQ
• DLA HW
Other labels
• Formatted Weights
• online | offline
ITRI DLA-Lite Simplified Flow - Overview
[Block diagram, flattened in extraction; labels grouped by panel]
HW Architecture
• Host System (ARM-based, x86, …): Program GPIF
• GPIF, DMA, MCU, DRAM, NVM (optional)
• DNN Accelerator
DEV Tools
• DNN Model
• Translate / Format Tools
• HW resource allocation
• Quantized Re-train Weights
• Performance Estimation
1. Find an efficient setup of HW resources
2. Setup system address allocation
3. Generate “translated” inference commands
4. Generate “formatted” model parameters
→ Inference command package (to be compiled for the MCU)
→ Inference weight package
ITRI DLA-Lite Simplified Flow – DEV tools
[Tool-flow diagram, flattened in extraction; box labels grouped by path]
Input
• DNN Model Parameters
• Caffe Model, Prototxt
Command path
• Model Parser
• Layer Fusion
• Layer Partition
• Check Layer Sequence
• Check HW buffer size
• DLA CFG translator
• Memory allocator
• DLA CFG Commands
• MCU compiler
• MCU Instructions
Weight path
• HW-aware Quantize Insertion (TF)
• Accuracy Retrain (TF)
• Parameter Partition
• Weight format writer
• Formatted Quantized Weights
Two binary packages
1. compiled MCU instructions
2. formatted weights
• Before inference, initialize the 2 packages into memory
• For inference, load images and activate the MCU and DLA
• API example: “YOLO” or “RESNET-50” is a single function call, with no breakdown into sub-tasks
• Easy for predefined DNNs; future updates are provided by vendors
• Diagram annotation: similar to the input.txn file in the NVDLA v1 testbench
Popular NN Computer Vision Tasks
“You Only Look Once” (YOLO)
An object detection (OD) application is verified and demonstrated
Figure sourced from: Arthur Ouaknine’s Medium blog
Object Detection Inference (1/2)
-- Layer Fusion
Tiny YOLO v1 (39 DNN layers) → HW Inference Queue (9 layers)
Layer Number   Type                           HW inference layer
1-5            CONV, BN, Scale, ReLU, Pool    Hybrid Layer 1
6-10           CONV, BN, Scale, ReLU, Pool    Hybrid Layer 2
11-15          CONV, BN, Scale, ReLU, Pool    Hybrid Layer 3
16-20          CONV, BN, Scale, ReLU, Pool    Hybrid Layer 4
21-25          CONV, BN, Scale, ReLU, Pool    Hybrid Layer 5
26-30          CONV, BN, Scale, ReLU, Pool    Hybrid Layer 6
31-34          CONV, BN, Scale, ReLU          Hybrid Layer 7
35-38          CONV, BN, Scale, ReLU          Hybrid Layer 8
39             FC                             FC9
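The grouping of 39 layers into 9 hardware layers can be reproduced with a small greedy pass. This is an illustrative sketch, not the ITRI DEV tool: the function name and the rule (a CONV absorbs the BN, Scale, ReLU, and Pool layers that follow it in that order) are assumptions based on the layer list on this slide.

```python
# Position of each fusable layer type inside a hybrid layer.
ORDER = {"CONV": 0, "BN": 1, "Scale": 2, "ReLU": 3, "Pool": 4}


def fuse_layers(layer_types):
    """Greedily group a flat layer list into hybrid layers that start at CONV."""
    groups = []
    for t in layer_types:
        can_extend = bool(
            groups
            and groups[-1][0] == "CONV"
            and t in ORDER
            and ORDER[t] > ORDER[groups[-1][-1]]
        )
        if can_extend:
            groups[-1].append(t)   # absorbed into the current hybrid layer
        else:
            groups.append([t])     # CONV, FC, or anything unfusable starts a new layer
    return groups


# Tiny YOLO v1 layer sequence as listed on the slide.
tiny_yolo = (
    ["CONV", "BN", "Scale", "ReLU", "Pool"] * 6
    + ["CONV", "BN", "Scale", "ReLU"] * 2
    + ["FC"]
)

queue = fuse_layers(tiny_yolo)
print(len(tiny_yolo), "layers ->", len(queue), "HW layers")
```

Each hybrid layer then runs as one pass through the accelerator, so the intermediate BN, Scale, and ReLU outputs never travel to DRAM.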
▪ A hybrid layer supports the [CONV–BN–Scale–PReLU–Pool] 5-layer combination
• Originally, with 8-bit data, minimal feature-map DRAM access = 27.7MB
• With [CONV–BN–Scale–PReLU–Pool] fusion, total feature-map DRAM access = 6.2MB
▪ Why reducing DRAM access is important (weights = 27MB)
• Originally, @30 FPS, DRAM BW = 1.64 GB/s
• After fusion, @30 FPS, DRAM BW = 996 MB/s
▪ HW: 64 cores, 128KB SRAM
* The detection layer is done by the CPU
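The bandwidth figures on this slide follow from simple arithmetic, reproduced below. The sketch assumes the weights are read from DRAM once per frame and that 1 GB = 1000 MB, which is what the slide's numbers imply; the function name is illustrative.

```python
def dram_bw_mb_per_s(feature_map_mb, weight_mb, fps):
    """Per-frame DRAM traffic (feature maps + weights) times the frame rate."""
    return (feature_map_mb + weight_mb) * fps


WEIGHT_MB = 27.0   # from the slide
FPS = 30

before = dram_bw_mb_per_s(27.7, WEIGHT_MB, FPS)   # no fusion
after = dram_bw_mb_per_s(6.2, WEIGHT_MB, FPS)     # with layer fusion

print(f"before fusion: {before / 1000:.2f} GB/s")
print(f"after fusion:  {after:.0f} MB/s")
```

After fusion the 27MB of weights dominate the remaining traffic, which is consistent with the FC layer being DRAM-bound in the RTL results.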
Object Detection Inference (2/2)
– RTL Results
Conv. layer   Input Data Dimension   RTL Cycle #   OPs     OPs / cycle   UTIL
Hybrid1       448x448x3              5.80M         193M    33            26.0%
Hybrid2       224x224x16             4.25M         472M    111           86.8%
Hybrid3       112x112x32             3.94M         467M    119           92.7%
Hybrid4       56x56x64               3.82M         465M    122           95.1%
Hybrid5       28x28x128              3.71M         464M    125           97.6%
Hybrid6       14x14x256              3.69M         463M    126           98.1%
Hybrid7       7x7x512                3.66M         463M    126           98.7%
Hybrid8       7x7x1024               3.52M         231M    66            51.3%
FC9           12540                  14.19M        37M     2.6           2.0%
Summary                              46.57M        3.25G   70
Note: MAC (CONV+FC) total OPs = 3.18G
Total weights = 27M
▪ Uses 64 cores, 128KB SRAM
▪ Peak performance = 128 OPs/cycle
▪ Result analysis
• Utilization is 86%~98% in CNN layers
• DRAM BW and SRAM size affect hybrid layers 1 and 8
• FC is highly DRAM BW dominated
▪ Some layers have detailed partitions (by the DEV tool)
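The OPs/cycle and UTIL columns can be recomputed from the cycle and OP counts. The sketch below uses the rounded values printed in the table, so results may differ from the slide by about 0.1 percentage point; the function names are illustrative.

```python
PEAK_OPS_PER_CYCLE = 128  # from the slide: 64 cores, 128 OPs/cycle

# (layer, RTL cycles in millions, OPs in millions), copied from the table
ROWS = [
    ("Hybrid1", 5.80, 193),
    ("Hybrid2", 4.25, 472),
    ("Hybrid3", 3.94, 467),
    ("Hybrid4", 3.82, 465),
    ("Hybrid5", 3.71, 464),
    ("Hybrid6", 3.69, 463),
    ("Hybrid7", 3.66, 463),
    ("Hybrid8", 3.52, 231),
    ("FC9", 14.19, 37),
]


def ops_per_cycle(cycles_m, ops_m):
    """Average operations completed per clock cycle."""
    return ops_m / cycles_m


def utilization(cycles_m, ops_m, peak=PEAK_OPS_PER_CYCLE):
    """Fraction of the peak OPs/cycle actually achieved."""
    return ops_per_cycle(cycles_m, ops_m) / peak


for name, cycles_m, ops_m in ROWS:
    print(f"{name:8s} {ops_per_cycle(cycles_m, ops_m):6.1f} OPs/cycle "
          f"{utilization(cycles_m, ops_m):6.1%}")

total_cycles = sum(c for _, c, _ in ROWS)
total_ops = sum(o for _, _, o in ROWS)
print(f"overall  {ops_per_cycle(total_cycles, total_ops):6.1f} OPs/cycle")
```

The FC layer takes roughly 30% of all cycles while contributing about 1% of the operations, which is why it pulls the overall average down to about 70 OPs/cycle.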
[Figure: RTL simulation setup. Labels: Caffe format, Trans, Config file, Weight Generator, hex, DRAM Model, VPI, DLA RTL]
DLA Product Prototypes (1/2)
• FPGA-based standalone product
• CFG file is packed into a C function and compiled for ARM
• Runs a defined DNN inference
• DNN CFG & models are updated by vendors
Example 1 --- as a standalone ID Camera
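[Block diagrams, flattened in extraction]
• Example 1 (FPGA): ARM CPU (Processing System), DLA (FPGA), DRAM CTRL, USB, HDMI; DRAM holds DLA input data, model weights, activations, and OS memory space
• Example 3 (SoC): Main CPU, MCU, DLA, private memory, DMA, DRAM CTRL, DRAM, USB, HDMI; AXI and APB buses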
Example 3
--- as a SoC IP
Video Capture ( )
DNN_CALL( )
Data Fusion ( )
Decision ( )
USB – DLA on FPGA
USB - DLA in ASIC,
dev board
• USB accelerating stick + SDK
• Helps legacy facilities add DNN acceleration
• DNN accelerator IP
• Conventional IP business + DEV tool chains
DLA Product Prototypes (2/2)
Example 2
--- as a Plug and Play Stick
▪ Similar to the Movidius / Gyrfalcon sticks
▪ Executes whole-model inference instead of the convolution function only
USB acceleration system & ASIC Design
[System block diagram. Labels: USB to GPIF, GPIF, Data CTRL, DRAM CTRL / IF, DRAM, RISC-V, Cache, DLA (64 MAC), AXI, APB, Peripherals, Parallel bus, SDK + API]
DLA-Lite System SPEC
• 400MHz core, 100MHz board
• 64 CONV MACs, 128KB CONV SRAM
• 50 GOP/s peak CNN performance
• Target power consumption: 50mW
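The spec numbers are mutually consistent, as a short calculation shows. The sketch assumes one MAC counts as 2 operations (multiply and add), which matches the earlier statement that 64 cores give 128 OPs/cycle; the function names are illustrative.

```python
OPS_PER_MAC = 2  # assumption: one multiply plus one add per MAC


def peak_ops_per_s(num_macs, clock_hz, ops_per_mac=OPS_PER_MAC):
    """Peak throughput when every MAC is busy on every cycle."""
    return num_macs * ops_per_mac * clock_hz


def tops_per_watt(ops_per_s, power_w):
    """Energy efficiency in tera-operations per second per watt."""
    return ops_per_s / power_w / 1e12


peak = peak_ops_per_s(num_macs=64, clock_hz=400e6)
print(f"peak: {peak / 1e9:.1f} GOP/s")
print(f"efficiency at 50 mW: {tops_per_watt(peak, 0.05):.3f} TOPs/W")
```

The result, about 51 GOP/s and about 1 TOPs/W at the 50mW target, agrees with the "50 GOPs peak" figure here and the efficiency target on the positioning slide.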
ASIC Preliminary Info (floorplan view)
• TSMC 65nm
• Die size: 3,200 x 3,200 µm²
• Core: 2,500 x 2,500 µm²
[Floorplan labels: two 64KB CONV Buffers, two 32-MAC arrays, ACC, BN / PReLU / Pool Processor, CONV DMA, CONV Sequencer, Data IO, CTRL, RISC-V, PLL, AXI, DMA interface]
THANK YOU!
QUESTIONS AND COMMENTS?
technical contact : scluo@itri.org.tw , yhchu@itri.org.tw
business contact : victor.wang@itri.org.tw
Lightweight DNN Processor Design (based on NVDLA)

  • 1. A Lightweight DNN Inference Processor: design, system, tools, and applications
      羅賢君 Shien-Chun Luo, Oct. 2018
      工業技術研究院 Industrial Technology Research Institute (ITRI)
      資訊與通訊研究所 Information and Communication Research Lab (ICL)
  • 2. Roofline Model - Key to Designing a DNN Inference Engine
      1. More parallel PEs with high utilization
         ▪ Efficient parallel PE structure and interconnect
         ▪ Proper memory hierarchy
      2. Increase data supply
         ▪ High-bandwidth data access
         ▪ Reduce data movement or compress data
      3. Improve energy efficiency
         ▪ Adapt resources to models
         ▪ Low-power techniques
      [Figure: roofline plot of Performance (operations) versus Operational Intensity (operations/byte), showing Computation Bound 1 and Computation Bound 2, annotated with the numbered strategies above]
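The roofline relationship that this slide relies on can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function names and the peak and bandwidth figures in the loop are hypothetical values chosen only to show the two regimes.

```python
def attainable_ops_per_s(peak_ops_per_s, bandwidth_bytes_per_s, intensity_ops_per_byte):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    memory_roof = bandwidth_bytes_per_s * intensity_ops_per_byte
    return min(peak_ops_per_s, memory_roof)


def ridge_point(peak_ops_per_s, bandwidth_bytes_per_s):
    """Operational intensity (ops/byte) at which the two roofs meet."""
    return peak_ops_per_s / bandwidth_bytes_per_s


# Hypothetical machine: 50 GOP/s peak compute, 1 GB/s memory bandwidth.
PEAK = 50e9
BANDWIDTH = 1e9

for intensity in (1, 10, 50, 100):
    bound = attainable_ops_per_s(PEAK, BANDWIDTH, intensity)
    regime = "memory-bound" if intensity < ridge_point(PEAK, BANDWIDTH) else "compute-bound"
    print(f"{intensity:4d} ops/byte -> {bound / 1e9:5.1f} GOP/s ({regime})")
```

Read against the slide, strategy 1 raises the compute roof, strategy 2 raises the memory roof or moves a workload to the right by cutting bytes moved, and strategy 3 concerns the energy spent at a given operating point.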
  • 3. Segment & Position
      ARM's Project Trillium
         • Performance of > 4.6 TOP/s
         • Efficiency of > 3 TOPS/W (7nm process)
         • On-chip SRAM size up to 1MB
      Our target DNN acceleration solution
         • Performance of 50 GOP/s ~ 200 GOP/s
         • Efficiency of about 1 TOPS/W (65nm process)
         • On-chip SRAM size ≤ 256KB
      Figure sourced from: ARM Project Trillium
  • 4. We Started from the NVIDIA Open-source Deep Learning Accelerator (DLA)
      What ITRI has done
         1. A bug-fixed HW version that is fully compatible with NVDLA (can use NVIDIA's tools)
         2. A model translation tool: compiles a DNN model into DLA configuration files
         3. An adaptive quantization flow: converts FP weights to HW-specific 8-bit precision
         4. End-to-end verification: object detection (YOLO) is shown in this presentation
      HW Overview: Features
         1. Variable HW resource
         2. Suited to 3D convolution
         3. Buffer data reuse
         4. Hetero-layer fusion
         5. Ping-pong CFG registers
  • 5. DLA Features - Overview
      1. Variable HW resource: PE count, buffer size
         • Search for an efficient resource configuration for each model
         • Adaptive performance & power consumption
      2. Suited to 3D convolution
         • Relaxed data dependency; the input feature cube is shared
         • Output pixel first, shared input (IN), avoiding partial-sum storage
         • Supports any kernel size (n x m) with the same data flow
         • Close to 100% PE utilization
      3. Buffer data reuse
         • Reuse input or weights in the next layer
         • Benefits large-layer partitioning or batching
      4. Hetero-layer fusion
         • Fuses the popular layer stack [CONV - BN - PReLU - Pooling]
         • Greatly reduces DRAM access traffic
      5. Ping-pong CFG registers
         • Configure layers N and N+1 simultaneously
         • Hides the configuration time during layer changes
      [Figure: 3D CONV example (stride 1, no pad) showing input cubes (width, height), kernels, and output, with channel-first and plane-first orderings]
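The "output pixel first" ordering in item 2 can be illustrated with a small loop nest. This is a sketch of the loop order only, under the slide's stride-1, no-padding example; the function name and the nested-list data layout are assumptions for illustration, and it makes no claim about how the DLA hardware schedules its PEs.

```python
def conv3d_output_first(inp, kernels):
    """3D convolution with stride 1 and no padding.

    inp[c][y][x] is the input feature cube; kernels[k][c][i][j] holds K kernels
    of size n x m over C channels. Returns out[k][y][x].
    """
    channels, height, width = len(inp), len(inp[0]), len(inp[0][0])
    n, m = len(kernels[0][0]), len(kernels[0][0][0])
    out_h, out_w = height - n + 1, width - m + 1
    out = [[[0] * out_w for _ in range(out_h)] for _ in kernels]

    for y in range(out_h):
        for x in range(out_w):
            # One input window is fetched per output position and shared by all kernels.
            window = [[[inp[c][y + i][x + j] for j in range(m)]
                       for i in range(n)]
                      for c in range(channels)]
            for k, kernel in enumerate(kernels):
                # The sum over all channels and kernel taps finishes in a local
                # accumulator, so no partial sums are stored between passes.
                acc = 0
                for c in range(channels):
                    for i in range(n):
                        for j in range(m):
                            acc += window[c][i][j] * kernel[c][i][j]
                out[k][y][x] = acc  # each output pixel is written exactly once
    return out
```

Because the kernel dimensions are read from the data, the same loop nest handles any n x m kernel, which mirrors the "same data flow" bullet.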
  • 6. DLA Features - Why is a configurable resource important?
      AlexNet (~0.73 GOP, 61M weights)
         • Huge fully connected weights
         • DRAM speed dominates
         • Computation power cannot help
      GoogleNet (~3.2 GOP, 7M weights)
         • Small filter size (1x1)
         • Benefits from parallelism in CNN operations
         • Computation power dominates
         • DRAM speed cannot help
      ResNet50 (~7.8 GOP, 25M weights)
         • Large CNN operations, large weights
         • Residual → directly adds two data cubes → DRAM speed dominates
         • Computation power and DRAM speed are equally important
      [Figure: Performance Gradient]
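A rough way to see why these three networks stress the hardware differently is to divide each operation count by its weight footprint. The sketch below assumes 8-bit weights (one byte each), batch size 1 with every weight fetched once per inference, and ignores feature-map traffic entirely; those assumptions are mine, not the slide's, and the residual adds in ResNet50 are exactly the kind of activation traffic this estimate leaves out.

```python
# (operations per inference, number of weights), as quoted on the slide
MODELS = {
    "AlexNet": (0.73e9, 61e6),
    "GoogleNet": (3.2e9, 7e6),
    "ResNet50": (7.8e9, 25e6),
}

BYTES_PER_WEIGHT = 1  # assumption: 8-bit quantized weights


def weight_intensity(ops, weights, bytes_per_weight=BYTES_PER_WEIGHT):
    """Operations per byte of weight traffic (a weight-only operational intensity)."""
    return ops / (weights * bytes_per_weight)


for name, (ops, weights) in MODELS.items():
    print(f"{name:10s} {weight_intensity(ops, weights):7.1f} ops per weight byte")
```

Under these assumptions AlexNet performs roughly a dozen operations per weight byte while GoogleNet performs several hundred, which is consistent with the slide's statement that DRAM speed dominates the former and computation power the latter.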
  • 7. Original NVDLA Framework, DEV Flow
      Input
         • Caffe Prototxt
         • Caffe Model (weights)
      Compiler (binary version)
         • Parser
         • HW SPEC
         • Layer ID
         • Compiler (Optimization)
         • Wisdom DIR: layer details
         • Loadable file: HW CONFIGs, layers' CONFIGs
         • Formatted Weights
      API and Driver
         • User Mode Driver (UMD): allocates addresses; function call runs layer-by-layer inference
         • Kernel Mode Driver (KMD): translates a layer to HW binary CFGs; handles IRQ
      Hardware
         • Flow Controller (MCU or CPU): loads HW binary CONFIGs; handles IRQ
         • DLA HW
      [Diagram marks the boundary between the online and offline stages]
  • 8. ITRI DLA-Lite Simplified Flow - Overview
      HW Architecture (diagram blocks): MCU, NVM (optional), DRAM, DMA, GPIF, DNN Accelerator, Host System (ARM-based, x86, …), Program GPIF
      DEV Tools (diagram blocks): DNN Model, Translate / Format Tools, HW resource allocation, Quantized Re-train Weights, Performance Estimation
      DEV tool steps
         1. Find an efficient setup of HW resources
         2. Set up system address allocation
         3. Generate "translated" inference commands
         4. Generate "formatted" model parameters
      Outputs
         → Inference command package (to compile for MCU)
         → Inference weight package
  • 9. ITRI DLA-Lite Simplified Flow - DEV tools
      Diagram blocks: DNN Model Parameters, Caffe Model, Prototxt, Model Parser, Layer Fusion, Layer Partition, Check Layer Sequence, Check HW buffer size, DLA CFG Commands, MCU Instructions, MCU compiler, DLA CFG translator, Memory allocator, HW-aware Quantize Insertion (TF), Accuracy Retrain (TF), Parameter Partition, Formatted Quantized Weights, Weight format writer
      Two binary packages
         1. Compiled MCU instructions
         2. Formatted weights
      Usage
         • Before inference, initialize the 2 packages into memory
         • After inference, load images and activate MCU and DLA
         • API example: "YOLO" or "RESNET-50" as a single function call, with no breakdown into sub-tasks
         • Easy for predefined DNNs; future updates are provided by vendors. The package is similar to the input.txn file in the NVDLA v1 testbench
  • 10. Popular NN Computer Vision Tasks
      The "You Only Look Once" (YOLO) object detection (OD) application is verified and demonstrated
      Figure sourced from: Arthur Ouaknine's Medium blog
  • 11. Object Detection Inference (1/2) -- Layer Fusion
      Tiny YOLO v1 (39 DNN layers) mapped to the HW inference queue (9 layers):
         Layer IDs   Layer types                    HW inference queue
         1-5         CONV, BN, Scale, ReLU, Pool    Hybrid Layer 1
         6-10        CONV, BN, Scale, ReLU, Pool    Hybrid Layer 2
         11-15       CONV, BN, Scale, ReLU, Pool    Hybrid Layer 3
         16-20       CONV, BN, Scale, ReLU, Pool    Hybrid Layer 4
         21-25       CONV, BN, Scale, ReLU, Pool    Hybrid Layer 5
         26-30       CONV, BN, Scale, ReLU, Pool    Hybrid Layer 6
         31-34       CONV, BN, Scale, ReLU          Hybrid Layer 7
         35-38       CONV, BN, Scale, ReLU          Hybrid Layer 8
         39          FC                             FC9
      Hybrid layer supports the [CONV-BN-Scale-PReLU-Pool] 5-layer combination
         • Originally, with 8-bit data, minimal feature-map DRAM access = 27.7MB
         • With [CONV-BN-Scale-PReLU-Pool] fusion, total feature-map DRAM access = 6.2MB
      Why reducing DRAM access is important (weights = 27MB)
         • Originally, @30 FPS, DRAM BW = 1.64 GB/s
         • After fusion, @30 FPS, DRAM BW = 996 MB/s
      HW: 64 cores, 128KB SRAM
      * Detection layer is done by CPU
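Both the 39-to-9 layer mapping and the bandwidth figures on this slide can be reproduced with a short Python sketch. The greedy grouping rule and the function names are illustrative assumptions rather than the DEV tool's actual algorithm, and the bandwidth formula (feature-map traffic plus one full read of the weights per frame, times the frame rate) is inferred from the slide's numbers rather than stated on it.

```python
FUSABLE_ORDER = ["CONV", "BN", "Scale", "ReLU", "Pool"]


def fuse_layers(layer_types):
    """Greedily group layers into hybrid layers following FUSABLE_ORDER.

    A group starts at CONV and absorbs later layers while their position in
    FUSABLE_ORDER keeps increasing; any other layer (e.g. FC) stands alone.
    """
    groups = []
    for layer in layer_types:
        if groups and layer in FUSABLE_ORDER and groups[-1][0] == "CONV":
            previous = groups[-1][-1]
            if FUSABLE_ORDER.index(layer) > FUSABLE_ORDER.index(previous):
                groups[-1].append(layer)
                continue
        groups.append([layer])
    return groups


def dram_bw_mb_per_s(feature_map_mb, weight_mb, fps):
    """Assumed model: every frame moves the feature-map traffic plus all weights."""
    return (feature_map_mb + weight_mb) * fps


# Tiny YOLO v1 layer list as shown on the slide (39 layers).
TINY_YOLO_V1 = (["CONV", "BN", "Scale", "ReLU", "Pool"] * 6
                + ["CONV", "BN", "Scale", "ReLU"] * 2
                + ["FC"])

queue = fuse_layers(TINY_YOLO_V1)
print(len(TINY_YOLO_V1), "layers ->", len(queue), "queue entries")
print("before fusion:", dram_bw_mb_per_s(27.7, 27, 30), "MB/s")
print("after fusion: ", dram_bw_mb_per_s(6.2, 27, 30), "MB/s")
```

With the slide's inputs, the assumed formula gives about 1641 MB/s before fusion and 996 MB/s after, matching the quoted 1.64 GB/s and 996 MB/s. After fusion the 27MB of weights accounts for most of the remaining traffic.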
  • 12. Object Detection Inference (2/2) - RTL Results
         Conv. layer   Input data dimension   RTL cycles   OPs     OPs/cycle   UTIL
         Hybrid1       448x448x3              5.80M        193M    33          26.0%
         Hybrid2       224x224x16             4.25M        472M    111         86.8%
         Hybrid3       112x112x32             3.94M        467M    119         92.7%
         Hybrid4       56x56x64               3.82M        465M    122         95.1%
         Hybrid5       28x28x128              3.71M        464M    125         97.6%
         Hybrid6       14x14x256              3.69M        463M    126         98.1%
         Hybrid7       7x7x512                3.66M        463M    126         98.7%
         Hybrid8       7x7x1024               3.52M        231M    66          51.3%
         FC9           12540                  14.19M       37M     2.6         2.0%
         Summary                              46.57M       3.25G   70
      Note: MAC (CONV+FC) total OPs = 3.18G; total weights = 27M
      Setup
         • Uses 64 cores, 128KB SRAM
         • Peak performance = 128 OPs/cycle
      Result analysis
         • Utilization is 86%~98% in the CNN layers
         • DRAM BW and SRAM size affect hybrid layers 1 and 8
         • FC is highly DRAM-BW dominated
         • Some layers use detailed partitions (by the DEV tool)
      [Figure: RTL verification flow with blocks Config file, Weight Generator, DLA RTL, DRAM Model, VPI, hex, Caffe format, Trans]
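The OPs/cycle and UTIL columns follow from the cycle and operation counts together with the 128 OPs/cycle peak. The sketch below recomputes them from the rounded values printed in the table, so results can differ from the slide by about 0.1 percentage point; the function and variable names are illustrative.

```python
PEAK_OPS_PER_CYCLE = 128  # stated on the slide for the 64-core configuration

# (layer, RTL cycles, OPs, utilization reported on the slide in percent)
LAYERS = [
    ("Hybrid1", 5.80e6, 193e6, 26.0),
    ("Hybrid2", 4.25e6, 472e6, 86.8),
    ("Hybrid3", 3.94e6, 467e6, 92.7),
    ("Hybrid4", 3.82e6, 465e6, 95.1),
    ("Hybrid5", 3.71e6, 464e6, 97.6),
    ("Hybrid6", 3.69e6, 463e6, 98.1),
    ("Hybrid7", 3.66e6, 463e6, 98.7),
    ("Hybrid8", 3.52e6, 231e6, 51.3),
    ("FC9", 14.19e6, 37e6, 2.0),
]


def ops_per_cycle(ops, cycles):
    return ops / cycles


def utilization_pct(ops, cycles, peak=PEAK_OPS_PER_CYCLE):
    """Achieved OPs/cycle as a percentage of the peak OPs/cycle."""
    return 100.0 * ops_per_cycle(ops, cycles) / peak


for name, cycles, ops, reported in LAYERS:
    print(f"{name:8s} {ops_per_cycle(ops, cycles):6.1f} OPs/cycle "
          f"{utilization_pct(ops, cycles):5.1f}% (slide: {reported}%)")

total_cycles = sum(cycles for _, cycles, _, _ in LAYERS)
total_ops = sum(ops for _, _, ops, _ in LAYERS)
overall = ops_per_cycle(total_ops, total_cycles)
print(f"overall  {overall:6.1f} OPs/cycle")
```

The FC layer takes about 30% of all cycles while contributing about 1% of the operations, which is why the overall figure of roughly 70 OPs/cycle sits well below the 86%~98% seen in the middle convolution layers.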
  • 13. DLA Product Prototypes (1/2)
      Example 1 --- as a standalone ID Camera
         • FPGA-based standalone product
         • The CFG file is packed into a C function and compiled for ARM
         • Runs a predefined DNN inference
         • DNN CFG & models are updated by vendors
      [Diagram blocks: DRAM (DLA Input Data, Model Weights, Activations, OS Memory Space), DRAM CTRL, HDMI, USB, ARM CPU, DLA, FPGA, Processing System]
  • 14. DLA Product Prototypes (2/2)
      Example 2 --- as a Plug and Play Stick
         • USB accelerating stick + SDK
         • Helps legacy facilities add DNN acceleration
         → Similar to the Movidius / Gyrfalcon stick
         → Executes whole-model inference, instead of the convolution function only
         [Photos: USB - DLA on FPGA; USB - DLA in ASIC, dev board]
      Example 3 --- as a SoC IP
         • DNN accelerator IP
         • Conventional IP business + DEV tool chains
         [Diagram blocks: AXI, DLA, Private Memory, MCU, Main CPU, USB, HDMI, APB, DRAM CTRL, DMA, DRAM]
         [Software flow: Video Capture ( ), DNN_CALL( ), Data Fusion ( ), Decision ( )]
  • 15. USB acceleration system & ASIC Design
      DLA-Lite System SPEC
         • 400MHz core, 100MHz board
         • 64 CONV MACs, 128KB CONV SRAM
         • 50 GOP/s peak CNN performance
         • Target power consumption of 50mW
      ASIC Preliminary Info (floorplan view)
         • TSMC 65nm
         • Die size: 3,200 x 3,200 μm²
         • Core: 2,500 x 2,500 μm²
      [System diagram blocks: USB to GPIF, GPIF Data CTRL, DRAM CTRL / IF, DRAM, RISC-V, Cache, DLA (64 MAC), AXI parallel bus, APB peripherals, SDK + API]
      [Floorplan blocks: 64KB CONV Buffer (x2), 32 MAC (x2), BN PReLU Pool Processor, ACC, CONV DMA, CONV Sequencer, Data IO CTRL, RISC-V, PLL, AXI DMA interface]
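The headline numbers in this spec can be cross-checked with simple arithmetic. The sketch assumes each MAC counts as two operations per cycle (a multiply and an add), which matches the 128 OPs/cycle peak quoted for 64 cores in the RTL results; the efficiency figure uses the 50mW value, which the slide gives as a target rather than a measurement.

```python
OPS_PER_MAC = 2  # assumption: one multiply plus one accumulate per MAC per cycle


def peak_ops_per_s(num_macs, clock_hz, ops_per_mac=OPS_PER_MAC):
    """Peak throughput when every MAC is busy on every cycle."""
    return num_macs * ops_per_mac * clock_hz


def efficiency_ops_per_watt(ops_per_s, power_w):
    return ops_per_s / power_w


peak = peak_ops_per_s(num_macs=64, clock_hz=400e6)
efficiency = efficiency_ops_per_watt(peak, power_w=0.050)  # 50mW target

print(f"peak       = {peak / 1e9:.1f} GOP/s")
print(f"efficiency = {efficiency / 1e12:.3f} TOPS/W at the 50mW target")
```

The result, about 51 GOP/s and about 1 TOPS/W, lines up with the "50 GOP/s peak" bullet here and with the roughly 1 TOPS/W at 65nm positioning stated earlier in the deck.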
  • 16. THANK YOU! QUESTIONS AND COMMENTS?
      Technical contact: scluo@itri.org.tw, yhchu@itri.org.tw
      Business contact: victor.wang@itri.org.tw