SlideShare une entreprise Scribd logo
1  sur  27
Télécharger pour lire hors ligne
A High-speed Low-power Deep Neural
Network on an FPGA based on the Nested
RNS: Applied to an Object Detector
Hiroki Nakahara, Tokyo Institute of Technology, Japan
Tsutomu Sasao, Meiji University, Japan
ISCAS2018
@Florence
Outline
• Background
• YOLOv2
• Convolutional Neural Network (CNN)
• Nested RNS (NRNS) for YOLOv2
• Experimental Results
• Conclusion
2
Image Classification by NN
Input Neural Network (NN) Output
3
Cat
(92%)
Improved by AlexNet (Deep Learning)
Why?
4
Bigdata
High-Performance
Computing
Algorithm
&Data Structure
Object Detection
5
Son
Baby
Daughter
• Detect multiple objects at a time
• High performance-power is necessary
Define of Problem
• Detecting and classifying multiple objects at the same time
• Evaluation criteria (from Pascal VOC):
6
Ground truth
annotation
Detection results:
>50% overlap of
bounding box(BBox)
with ground truth
One BBox for each
object
Confidence value
for each object
Person (50%)
, ∈{ ,. ,…, }
Average Precision (AP):
YOLOv2
(You Only Look Once version 2)
7
Input
Image
(Frame)
Feature maps
CONV+Pooling
CNN
CONV+Pooling
Class score
Bounding Box
Detection
J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger," arXiv preprint arXiv:1612.08242, 2016.
• Single CNN (One-shot) object detector
• Both a classification and a BBox estimation for each grid
2D Convolutional Operation
8
Input feature map
Output feature map
Kernel
(Binary)
X0,0 x W0,0
X0,1 x W0,1
X0,2 x W0,2
X1,0 x W1,0
X1,1 x W1,1
X1,2 x W1,2
X2,0 x W2,0
X2,1 x W2,1
+) X2,2 x W2,2
y
• Computational intensive part of the YOLOv2
FPGA
Realization of 2D Convolutional Layer
• Requires more than billion MACs
• Our realization
• Time multiplexing
• Nested Residue Number System(NRNS)
9
Off-chip Memory
* *
+
* *
+
+
* *
+
* *
+
+
BRAM BRAM
FPGA
Off-chip Memory
* *
BRAM
*
*
* *
*
* *
BRAM
*
*
* *
*
* *
BRAM
*
*
* *
*
* *
BRAM
*
*
* *
*
➔
Converter Converter Converter Converter
Converter Converter Converter Converter
Fully parallelization
with RNS
Residue Number System (RNS)
• Defined by a set of L mutually prime integer constants
〈m1,m2,...,mL〉
• No pair modulus have a common factor with any other
• Typically, prime number is used as moduli set
• An arbitrary integer X can be uniquely represented by
a tuple of L integers (X1,X2,…,XL), where
• Dynamic range:
10
)(mod ii mXX 
M = mi
i=1
L
Õ
Parallel Multiplication
Multiplication on RNS
• Moduli set〈3,4,5〉, X=8, Y=2
• Z=X×Y=16=(1,0,1)
• X=(2,0,3), Y=(2,2,2)
Z=(4 mod 3,0 mod 4,6 mod 5)
=(1,0,1)=16
11
Binary2RNS Conversion
RNS2Binary Conversion
Binary2RNS Converter
12
X mod 2 mod 3 mod4
0 0 0 0
1 1 1 1
2 0 2 2
3 1 0 3
4 0 1 0
5 1 2 1
13
00 01 10 11
00
01
10
11
0
1
1
1
1
1
0
0
0
1
1
1
1
1
0
0
X1=(x1, x2)
X2=(x3, x4)
h(X1) 0 01 1
x1 0 0 1 1
x2 0 1 0 1
h(X1) 0 1 0 1
0 1
00 0 1
01 1 1
10 1 0
11 1 0
x3,x4
h(X1)
Functional Decomposition
24x1=16 [bit] 22x1+23x1=12 [bit]
Column multiplicity=2
Bound variables
Free
variables
Decomposition Chart for X mod 3
14
000 001 010 011
00
01
10
11
0
1
2
0
1
2
0
1
2
0
1
2
0
1
2
0
X2=(x3, x4, x5)
X1=(x1,x2)
100 101 110 111
1
2
0
1
2
0
1
2
0
1
2
0
1
2
0
1
0 mod 3 = 0
1 mod 3 = 1
2 mod 3 = 2
3 mod 3 = 0
4 mod 3 = 1
5 mod 3 = 2
6 mod 3 = 0
7 mod 3 = 1
8 mod 3 = 2
9 mod 3 = 0
10 mod 3 = 1
…
Free
variables
Bound variables
Decomposition Chart for X mod 3
15
0 1 2
00
01
10
11
0
1
2
0
1
2
0
1
2
0
1
2
X2=(x3,x4,x5)X1=(x1,x2)
FreeBound
x3 0 0 0 0 1 1 1 1
x4 0 0 1 1 0 0 1 1
x5 0 1 0 1 0 1 0 1
h(X2) 0 1 2 0 1 2 0 1
0 mod 3 = 0
1 mod 3 = 1
2 mod 3 = 2
3 mod 3 = 0
4 mod 3 = 1
5 mod 3 = 2
6 mod 3 = 0
7 mod 3 = 1
8 mod 3 = 2
9 mod 3 = 0
10 mod 3 = 1
…
Binary2RNS Converter
16
LUT cascade for X mod m1
LUT cascade for X mod m2
BRAM
BRAM
BRAM
RNS2Binary Converter (m=30)
17
x1 y1
0 0
1 15
x2 y2
0 0
1 10
2 20
x3 y3
0 0
1 6
2 12
3 18
4 24
Mod m
Adder
Mod m
Adder
carry
carry
Problem
• Moduli set of RNS consists of mutually prime numbers
• sizes of circuits are all different
• Example: <7,11,13>
18
6-input
LUT
8-input
LUT
8-input
LUT
3
4
4
4
4
3
3
4
4
Binary2RNS
Converter
by
BRAMs
RNS2Binary
Converter
by
DSP blocks
and BRAMs
Nested RNS
• (Z1,Z2,…,Zi,…, ZL) (Z1,Z2,…,(Zi1,Zi2,…,Zij),…, ZL)
• Ex: <7,11,13>×<7,11,13>
<7,<5,6,7>11,<5,6,7>13>×<7,<5,6,7>11,<5,6,7>13>
19
1. Reuse the same moduli set
2. Decompose a large modulo into smaller ones
Original modulus
➔
Example of Nested RNS
• 19x22(=418) on <7,<5,6,7>11,<5,6,7>13>
19×22
=<5,8,6>×<1,0,9>
=<5,<3,2,1>11,<1,0,6>13>×<1,<0,0,0>11,<4,3,2>13>
=<5,<0,0,0>11,<4,0,5>13>
=<5,0,2>
=418
20
Modulo Multiplication
Bin2RNS on NRNS
RNS2Bin
Binary2NRNS Conversion
Realization of Nested RNS
21
<5,6,7>
2Bin
Bin2
<7,11,13>
3
<7,11,13>
2Bin
<5,6,7>
2Bin
Bin2
<5,6,7>
Bin2
<5,6,7>
6-input
LUT
6-input
LUT
6-input
LUT
6-input
LUT
6-input
LUT
6-input
LUT
6-input
LUT
Bin2
<7,11,13>
Bin2
<5,6,7>
Bin2
<5,6,7>
4
4
3
4
4
3
3
3
3
3
3
Binary
2NRNS
NRNS2
Binary
Realized by BRAMs LUTs BRAMs and DSP blocks
Moduli Set for NRNS
• Conventional RNS (uses 9 moduli)
<3,5,7,11,13,16,17,19,23>
• Applied the NRNS to moduli that are greater than 16
<3,4,5,7,11,13,16,
<3,4,5,7,11,13>17,
<3,4,5,7,11,13>19,
<3,4,5,7,11,13,<3,4,5,7,11,13>17>23>
22
All the 30 bit MAC operations are decomposed into 4 bit ones
DCNN Architecture using the NRNS
23
...
Parallel modulo mi
2D convolutional units
...
...
...
BRAM BRAM BRAM...
BRAM BRAM BRAM...
BRAM BRAM BRAM...
...
Parallel Bin2NRNS
Converters
Tree-based NRNS2Bin
Converters
Sequencer
On-chip
Memory
RNS
2
Bin
RNS
2
Bin
RNS
2
Bin
RNS
2
Bin
RNS
2
Bin
.........
...
NRNS based YOLOv2
• Framework: Chainer 1.24.0
• CNN: Tiny YOLOv2
• Benchmark: KITTI
vision benchmark
• mAP: 69.1 %
24
Implementation
• FPGA board: NetFPGA-SUME
• FPGA: Virtex7 VC690T
• LUT: 427,014 / 433,200
• 18Kb BRAM: 1,235 / 2,940
• DSP48E: 0 / 3,600
• Realized the pre-trained
NRNS-based YOLOv2
• 9 bit fixed precision
(dynamic range: 30 bit)
• Synthesis tool: Xilinx Vivado2017.2
• Timing constrain: 300MHz
• 3.84 FPS@3.5W → 1.097 FPS/W
25
Comparison
26
NVivia Pascal
GTX1080Ti
NetFPGA-SUME
Speed [FPS] 20.64 3.84
Power [W] 60.0 3.5
Efficiency [FPS/W] 0.344 1.097
Conclusion
• Realized the DCNN on the FPGA
• Time multiplexing
• Nested RNS
• MAC operation is realized by small LUTs
• Functional decomposition are used as follows:
• Bin2NRNS converter is realized by BRAMs
• NRNS2Bin converter is realized by DSP blocks
and BRAMs
• Performance per power (FPS/W)
• 3.19 times better than Pascal GPU
27

Contenu connexe

Tendances

Lecture 7: Recurrent Neural Networks
Lecture 7: Recurrent Neural NetworksLecture 7: Recurrent Neural Networks
Lecture 7: Recurrent Neural NetworksSang Jun Lee
 
Batch normalization presentation
Batch normalization presentationBatch normalization presentation
Batch normalization presentationOwin Will
 
【DL輪読会】Incorporating group update for speech enhancement based on convolutio...
【DL輪読会】Incorporating group update for speech enhancement  based on convolutio...【DL輪読会】Incorporating group update for speech enhancement  based on convolutio...
【DL輪読会】Incorporating group update for speech enhancement based on convolutio...Deep Learning JP
 
Fast and Light-weight Binarized Neural Network Implemented in an FPGA using L...
Fast and Light-weight Binarized Neural Network Implemented in an FPGA using L...Fast and Light-weight Binarized Neural Network Implemented in an FPGA using L...
Fast and Light-weight Binarized Neural Network Implemented in an FPGA using L...ryuz88
 
Ted Willke, Senior Principal Engineer, Intel Labs at MLconf NYC
Ted Willke, Senior Principal Engineer, Intel Labs at MLconf NYCTed Willke, Senior Principal Engineer, Intel Labs at MLconf NYC
Ted Willke, Senior Principal Engineer, Intel Labs at MLconf NYCMLconf
 
Deep learning and its application
Deep learning and its applicationDeep learning and its application
Deep learning and its applicationSrishty Saha
 
LUT-Network Revision2 -English version-
LUT-Network Revision2 -English version-LUT-Network Revision2 -English version-
LUT-Network Revision2 -English version-ryuz88
 
Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)
Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)
Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)Universitat Politècnica de Catalunya
 
Deep Neural Networks (D1L2 Insight@DCU Machine Learning Workshop 2017)
Deep Neural Networks (D1L2 Insight@DCU Machine Learning Workshop 2017)Deep Neural Networks (D1L2 Insight@DCU Machine Learning Workshop 2017)
Deep Neural Networks (D1L2 Insight@DCU Machine Learning Workshop 2017)Universitat Politècnica de Catalunya
 
Electricity price forecasting with Recurrent Neural Networks
Electricity price forecasting with Recurrent Neural NetworksElectricity price forecasting with Recurrent Neural Networks
Electricity price forecasting with Recurrent Neural NetworksTaegyun Jeon
 
Applied Deep Learning 11/03 Convolutional Neural Networks
Applied Deep Learning 11/03 Convolutional Neural NetworksApplied Deep Learning 11/03 Convolutional Neural Networks
Applied Deep Learning 11/03 Convolutional Neural NetworksMark Chang
 
Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)
Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)
Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)Universitat Politècnica de Catalunya
 
Learning Financial Market Data with Recurrent Autoencoders and TensorFlow
Learning Financial Market Data with Recurrent Autoencoders and TensorFlowLearning Financial Market Data with Recurrent Autoencoders and TensorFlow
Learning Financial Market Data with Recurrent Autoencoders and TensorFlowAltoros
 
Artificial neural networks introduction
Artificial neural networks introductionArtificial neural networks introduction
Artificial neural networks introductionSungminYou
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural NetworksSharath TS
 
Jeff Johnson, Research Engineer, Facebook at MLconf NYC
Jeff Johnson, Research Engineer, Facebook at MLconf NYCJeff Johnson, Research Engineer, Facebook at MLconf NYC
Jeff Johnson, Research Engineer, Facebook at MLconf NYCMLconf
 
[PR12] PR-036 Learning to Remember Rare Events
[PR12] PR-036 Learning to Remember Rare Events[PR12] PR-036 Learning to Remember Rare Events
[PR12] PR-036 Learning to Remember Rare EventsTaegyun Jeon
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryAndrii Gakhov
 

Tendances (20)

Lecture 7: Recurrent Neural Networks
Lecture 7: Recurrent Neural NetworksLecture 7: Recurrent Neural Networks
Lecture 7: Recurrent Neural Networks
 
Batch normalization presentation
Batch normalization presentationBatch normalization presentation
Batch normalization presentation
 
【DL輪読会】Incorporating group update for speech enhancement based on convolutio...
【DL輪読会】Incorporating group update for speech enhancement  based on convolutio...【DL輪読会】Incorporating group update for speech enhancement  based on convolutio...
【DL輪読会】Incorporating group update for speech enhancement based on convolutio...
 
Fast and Light-weight Binarized Neural Network Implemented in an FPGA using L...
Fast and Light-weight Binarized Neural Network Implemented in an FPGA using L...Fast and Light-weight Binarized Neural Network Implemented in an FPGA using L...
Fast and Light-weight Binarized Neural Network Implemented in an FPGA using L...
 
Ted Willke, Senior Principal Engineer, Intel Labs at MLconf NYC
Ted Willke, Senior Principal Engineer, Intel Labs at MLconf NYCTed Willke, Senior Principal Engineer, Intel Labs at MLconf NYC
Ted Willke, Senior Principal Engineer, Intel Labs at MLconf NYC
 
Deep learning and its application
Deep learning and its applicationDeep learning and its application
Deep learning and its application
 
LUT-Network Revision2 -English version-
LUT-Network Revision2 -English version-LUT-Network Revision2 -English version-
LUT-Network Revision2 -English version-
 
Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)
Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)
Deep Learning for Computer Vision: Recurrent Neural Networks (UPC 2016)
 
The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)
The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)
The Perceptron (D1L1 Insight@DCU Machine Learning Workshop 2017)
 
Deep Neural Networks (D1L2 Insight@DCU Machine Learning Workshop 2017)
Deep Neural Networks (D1L2 Insight@DCU Machine Learning Workshop 2017)Deep Neural Networks (D1L2 Insight@DCU Machine Learning Workshop 2017)
Deep Neural Networks (D1L2 Insight@DCU Machine Learning Workshop 2017)
 
Electricity price forecasting with Recurrent Neural Networks
Electricity price forecasting with Recurrent Neural NetworksElectricity price forecasting with Recurrent Neural Networks
Electricity price forecasting with Recurrent Neural Networks
 
Applied Deep Learning 11/03 Convolutional Neural Networks
Applied Deep Learning 11/03 Convolutional Neural NetworksApplied Deep Learning 11/03 Convolutional Neural Networks
Applied Deep Learning 11/03 Convolutional Neural Networks
 
Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)
Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)
Recurrent Neural Networks (D2L2 2017 UPC Deep Learning for Computer Vision)
 
Learning Financial Market Data with Recurrent Autoencoders and TensorFlow
Learning Financial Market Data with Recurrent Autoencoders and TensorFlowLearning Financial Market Data with Recurrent Autoencoders and TensorFlow
Learning Financial Market Data with Recurrent Autoencoders and TensorFlow
 
Artificial neural networks introduction
Artificial neural networks introductionArtificial neural networks introduction
Artificial neural networks introduction
 
Multidimensional RNN
Multidimensional RNNMultidimensional RNN
Multidimensional RNN
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
Jeff Johnson, Research Engineer, Facebook at MLconf NYC
Jeff Johnson, Research Engineer, Facebook at MLconf NYCJeff Johnson, Research Engineer, Facebook at MLconf NYC
Jeff Johnson, Research Engineer, Facebook at MLconf NYC
 
[PR12] PR-036 Learning to Remember Rare Events
[PR12] PR-036 Learning to Remember Rare Events[PR12] PR-036 Learning to Remember Rare Events
[PR12] PR-036 Learning to Remember Rare Events
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: Theory
 

Similaire à ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied to YOLOv2

3D Brain Image Segmentation Model using Deep Learning and Hidden Markov Rando...
3D Brain Image Segmentation Model using Deep Learning and Hidden Markov Rando...3D Brain Image Segmentation Model using Deep Learning and Hidden Markov Rando...
3D Brain Image Segmentation Model using Deep Learning and Hidden Markov Rando...EL-Hachemi Guerrout
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningCastLabKAIST
 
Course-Notes__Advanced-DSP.pdf
Course-Notes__Advanced-DSP.pdfCourse-Notes__Advanced-DSP.pdf
Course-Notes__Advanced-DSP.pdfShreeDevi42
 
Advanced_DSP_J_G_Proakis.pdf
Advanced_DSP_J_G_Proakis.pdfAdvanced_DSP_J_G_Proakis.pdf
Advanced_DSP_J_G_Proakis.pdfHariPrasad314745
 
Solution of simplified neutron diffusion equation by FDM
Solution of simplified neutron diffusion equation by FDMSolution of simplified neutron diffusion equation by FDM
Solution of simplified neutron diffusion equation by FDMSyeilendra Pramuditya
 
ImageNet classification with deep convolutional neural networks(2012)
ImageNet classification with deep convolutional neural networks(2012)ImageNet classification with deep convolutional neural networks(2012)
ImageNet classification with deep convolutional neural networks(2012)WoochulShin10
 
Multilayer Neuronal network hardware implementation
Multilayer Neuronal network hardware implementation Multilayer Neuronal network hardware implementation
Multilayer Neuronal network hardware implementation Nabil Chouba
 
Sparse coding Super-Resolution を用いた核医学画像処理
Sparse coding Super-Resolution を用いた核医学画像処理Sparse coding Super-Resolution を用いた核医学画像処理
Sparse coding Super-Resolution を用いた核医学画像処理Yutaka KATAYAMA
 
Introduction to Neural Networks and Deep Learning
Introduction to Neural Networks and Deep LearningIntroduction to Neural Networks and Deep Learning
Introduction to Neural Networks and Deep LearningVahid Mirjalili
 
Cycle’s topological optimizations and the iterative decoding problem on gener...
Cycle’s topological optimizations and the iterative decoding problem on gener...Cycle’s topological optimizations and the iterative decoding problem on gener...
Cycle’s topological optimizations and the iterative decoding problem on gener...Usatyuk Vasiliy
 
2016 03-03 marchand
2016 03-03 marchand2016 03-03 marchand
2016 03-03 marchandSCEE Team
 
SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
SimCLR: A Simple Framework for Contrastive Learning of Visual RepresentationsSimCLR: A Simple Framework for Contrastive Learning of Visual Representations
SimCLR: A Simple Framework for Contrastive Learning of Visual Representationsynxm25hpxp
 
Speaker Diarization
Speaker DiarizationSpeaker Diarization
Speaker DiarizationHONGJOO LEE
 
fast-matmul-ppopp2015
fast-matmul-ppopp2015fast-matmul-ppopp2015
fast-matmul-ppopp2015Austin Benson
 
A framework for practical fast matrix multiplication
A framework for practical fast matrix multiplication�A framework for practical fast matrix multiplication�
A framework for practical fast matrix multiplicationAustin Benson
 
Fixed Point Realization of Iterative LR-Aided Soft MIMO Decoding Algorithm
Fixed Point Realization of Iterative LR-Aided Soft MIMO Decoding AlgorithmFixed Point Realization of Iterative LR-Aided Soft MIMO Decoding Algorithm
Fixed Point Realization of Iterative LR-Aided Soft MIMO Decoding AlgorithmCSCJournals
 

Similaire à ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied to YOLOv2 (20)

3D Brain Image Segmentation Model using Deep Learning and Hidden Markov Rando...
3D Brain Image Segmentation Model using Deep Learning and Hidden Markov Rando...3D Brain Image Segmentation Model using Deep Learning and Hidden Markov Rando...
3D Brain Image Segmentation Model using Deep Learning and Hidden Markov Rando...
 
Families of Triangular Norm Based Kernel Function and Its Application to Kern...
Families of Triangular Norm Based Kernel Function and Its Application to Kern...Families of Triangular Norm Based Kernel Function and Its Application to Kern...
Families of Triangular Norm Based Kernel Function and Its Application to Kern...
 
Hardware Acceleration for Machine Learning
Hardware Acceleration for Machine LearningHardware Acceleration for Machine Learning
Hardware Acceleration for Machine Learning
 
Course-Notes__Advanced-DSP.pdf
Course-Notes__Advanced-DSP.pdfCourse-Notes__Advanced-DSP.pdf
Course-Notes__Advanced-DSP.pdf
 
Advanced_DSP_J_G_Proakis.pdf
Advanced_DSP_J_G_Proakis.pdfAdvanced_DSP_J_G_Proakis.pdf
Advanced_DSP_J_G_Proakis.pdf
 
Solution of simplified neutron diffusion equation by FDM
Solution of simplified neutron diffusion equation by FDMSolution of simplified neutron diffusion equation by FDM
Solution of simplified neutron diffusion equation by FDM
 
ImageNet classification with deep convolutional neural networks(2012)
ImageNet classification with deep convolutional neural networks(2012)ImageNet classification with deep convolutional neural networks(2012)
ImageNet classification with deep convolutional neural networks(2012)
 
Multilayer Neuronal network hardware implementation
Multilayer Neuronal network hardware implementation Multilayer Neuronal network hardware implementation
Multilayer Neuronal network hardware implementation
 
Sparse coding Super-Resolution を用いた核医学画像処理
Sparse coding Super-Resolution を用いた核医学画像処理Sparse coding Super-Resolution を用いた核医学画像処理
Sparse coding Super-Resolution を用いた核医学画像処理
 
Dsp manual print
Dsp manual printDsp manual print
Dsp manual print
 
Introduction to Neural Networks and Deep Learning
Introduction to Neural Networks and Deep LearningIntroduction to Neural Networks and Deep Learning
Introduction to Neural Networks and Deep Learning
 
Cycle’s topological optimizations and the iterative decoding problem on gener...
Cycle’s topological optimizations and the iterative decoding problem on gener...Cycle’s topological optimizations and the iterative decoding problem on gener...
Cycle’s topological optimizations and the iterative decoding problem on gener...
 
2016 03-03 marchand
2016 03-03 marchand2016 03-03 marchand
2016 03-03 marchand
 
SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
SimCLR: A Simple Framework for Contrastive Learning of Visual RepresentationsSimCLR: A Simple Framework for Contrastive Learning of Visual Representations
SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
 
OFDM
OFDMOFDM
OFDM
 
Speaker Diarization
Speaker DiarizationSpeaker Diarization
Speaker Diarization
 
fast-matmul-ppopp2015
fast-matmul-ppopp2015fast-matmul-ppopp2015
fast-matmul-ppopp2015
 
A framework for practical fast matrix multiplication
A framework for practical fast matrix multiplication�A framework for practical fast matrix multiplication�
A framework for practical fast matrix multiplication
 
Extreme dxt compression
Extreme dxt compressionExtreme dxt compression
Extreme dxt compression
 
Fixed Point Realization of Iterative LR-Aided Soft MIMO Decoding Algorithm
Fixed Point Realization of Iterative LR-Aided Soft MIMO Decoding AlgorithmFixed Point Realization of Iterative LR-Aided Soft MIMO Decoding Algorithm
Fixed Point Realization of Iterative LR-Aided Soft MIMO Decoding Algorithm
 

Plus de Hiroki Nakahara

ROS User Group Meeting #28 マルチ深層学習とROS
ROS User Group Meeting #28 マルチ深層学習とROSROS User Group Meeting #28 マルチ深層学習とROS
ROS User Group Meeting #28 マルチ深層学習とROSHiroki Nakahara
 
DSF2018講演スライド
DSF2018講演スライドDSF2018講演スライド
DSF2018講演スライドHiroki Nakahara
 
(公開版)Reconf研2017GUINNESS
(公開版)Reconf研2017GUINNESS(公開版)Reconf研2017GUINNESS
(公開版)Reconf研2017GUINNESSHiroki Nakahara
 
(公開版)FPGAエクストリームコンピューティング2017
(公開版)FPGAエクストリームコンピューティング2017 (公開版)FPGAエクストリームコンピューティング2017
(公開版)FPGAエクストリームコンピューティング2017 Hiroki Nakahara
 
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介Hiroki Nakahara
 
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)Hiroki Nakahara
 
Tensor flow usergroup 2016 (公開版)
Tensor flow usergroup 2016 (公開版)Tensor flow usergroup 2016 (公開版)
Tensor flow usergroup 2016 (公開版)Hiroki Nakahara
 
FPGAX2016 ドキュンなFPGA
FPGAX2016 ドキュンなFPGAFPGAX2016 ドキュンなFPGA
FPGAX2016 ドキュンなFPGAHiroki Nakahara
 
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみたHiroki Nakahara
 
Altera sdk for open cl アンケート集計結果(公開版)
Altera sdk for open cl アンケート集計結果(公開版)Altera sdk for open cl アンケート集計結果(公開版)
Altera sdk for open cl アンケート集計結果(公開版)Hiroki Nakahara
 
Nested RNSを用いたディープニューラルネットワークのFPGA実装
Nested RNSを用いたディープニューラルネットワークのFPGA実装Nested RNSを用いたディープニューラルネットワークのFPGA実装
Nested RNSを用いたディープニューラルネットワークのFPGA実装Hiroki Nakahara
 
私のファミコンのfpsは530000です。もちろんフルパワーで(以下略
私のファミコンのfpsは530000です。もちろんフルパワーで(以下略私のファミコンのfpsは530000です。もちろんフルパワーで(以下略
私のファミコンのfpsは530000です。もちろんフルパワーで(以下略Hiroki Nakahara
 
Verilog-HDL Tutorial (15) software
Verilog-HDL Tutorial (15) softwareVerilog-HDL Tutorial (15) software
Verilog-HDL Tutorial (15) softwareHiroki Nakahara
 
Verilog-HDL Tutorial (15) hardware
Verilog-HDL Tutorial (15) hardwareVerilog-HDL Tutorial (15) hardware
Verilog-HDL Tutorial (15) hardwareHiroki Nakahara
 
Verilog-HDL Tutorial (14)
Verilog-HDL Tutorial (14)Verilog-HDL Tutorial (14)
Verilog-HDL Tutorial (14)Hiroki Nakahara
 
Verilog-HDL Tutorial (13)
Verilog-HDL Tutorial (13)Verilog-HDL Tutorial (13)
Verilog-HDL Tutorial (13)Hiroki Nakahara
 
Verilog-HDL Tutorial (12)
Verilog-HDL Tutorial (12)Verilog-HDL Tutorial (12)
Verilog-HDL Tutorial (12)Hiroki Nakahara
 
Verilog-HDL Tutorial (11)
Verilog-HDL Tutorial (11)Verilog-HDL Tutorial (11)
Verilog-HDL Tutorial (11)Hiroki Nakahara
 

Plus de Hiroki Nakahara (20)

ROS User Group Meeting #28 マルチ深層学習とROS
ROS User Group Meeting #28 マルチ深層学習とROSROS User Group Meeting #28 マルチ深層学習とROS
ROS User Group Meeting #28 マルチ深層学習とROS
 
FPGAX2019
FPGAX2019FPGAX2019
FPGAX2019
 
SBRA2018講演資料
SBRA2018講演資料SBRA2018講演資料
SBRA2018講演資料
 
DSF2018講演スライド
DSF2018講演スライドDSF2018講演スライド
DSF2018講演スライド
 
(公開版)Reconf研2017GUINNESS
(公開版)Reconf研2017GUINNESS(公開版)Reconf研2017GUINNESS
(公開版)Reconf研2017GUINNESS
 
(公開版)FPGAエクストリームコンピューティング2017
(公開版)FPGAエクストリームコンピューティング2017 (公開版)FPGAエクストリームコンピューティング2017
(公開版)FPGAエクストリームコンピューティング2017
 
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介
2値ディープニューラルネットワークと組込み機器への応用: 開発中のツール紹介
 
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)2値化CNN on FPGAでGPUとガチンコバトル(公開版)
2値化CNN on FPGAでGPUとガチンコバトル(公開版)
 
Tensor flow usergroup 2016 (公開版)
Tensor flow usergroup 2016 (公開版)Tensor flow usergroup 2016 (公開版)
Tensor flow usergroup 2016 (公開版)
 
FPGAX2016 ドキュンなFPGA
FPGAX2016 ドキュンなFPGAFPGAX2016 ドキュンなFPGA
FPGAX2016 ドキュンなFPGA
 
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた
電波望遠鏡用の分光器をAltera SDK for OpenCL使ってサクッと作ってみた
 
Altera sdk for open cl アンケート集計結果(公開版)
Altera sdk for open cl アンケート集計結果(公開版)Altera sdk for open cl アンケート集計結果(公開版)
Altera sdk for open cl アンケート集計結果(公開版)
 
Nested RNSを用いたディープニューラルネットワークのFPGA実装
Nested RNSを用いたディープニューラルネットワークのFPGA実装Nested RNSを用いたディープニューラルネットワークのFPGA実装
Nested RNSを用いたディープニューラルネットワークのFPGA実装
 
私のファミコンのfpsは530000です。もちろんフルパワーで(以下略
私のファミコンのfpsは530000です。もちろんフルパワーで(以下略私のファミコンのfpsは530000です。もちろんフルパワーで(以下略
私のファミコンのfpsは530000です。もちろんフルパワーで(以下略
 
Verilog-HDL Tutorial (15) software
Verilog-HDL Tutorial (15) softwareVerilog-HDL Tutorial (15) software
Verilog-HDL Tutorial (15) software
 
Verilog-HDL Tutorial (15) hardware
Verilog-HDL Tutorial (15) hardwareVerilog-HDL Tutorial (15) hardware
Verilog-HDL Tutorial (15) hardware
 
Verilog-HDL Tutorial (14)
Verilog-HDL Tutorial (14)Verilog-HDL Tutorial (14)
Verilog-HDL Tutorial (14)
 
Verilog-HDL Tutorial (13)
Verilog-HDL Tutorial (13)Verilog-HDL Tutorial (13)
Verilog-HDL Tutorial (13)
 
Verilog-HDL Tutorial (12)
Verilog-HDL Tutorial (12)Verilog-HDL Tutorial (12)
Verilog-HDL Tutorial (12)
 
Verilog-HDL Tutorial (11)
Verilog-HDL Tutorial (11)Verilog-HDL Tutorial (11)
Verilog-HDL Tutorial (11)
 

Dernier

(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGSIVASHANKAR N
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesPrabhanshu Chaturvedi
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 

Dernier (20)

(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 

ISCAS'18: A Deep Neural Network on the Nested RNS (NRNS) on an FPGA: Applied to YOLOv2

  • 1. A High-speed Low-power Deep Neural Network on an FPGA based on the Nested RNS: Applied to an Object Detector Hiroki Nakahara, Tokyo Institute of Technology, Japan Tsutomu Sasao, Meiji University, Japan ISCAS2018 @Florence
  • 2. Outline • Background • YOLOv2 • Convolutional Neural Network (CNN) • Nested RNS (NRNS) for YOLOv2 • Experimental Results • Conclusion 2
  • 3. Image Classification by NN Input Neural Network (NN) Output 3 Cat (92%) Improved by AlexNet (Deep Learning)
  • 5. Object Detection 5 Son Baby Daughter • Detect multiple objects at a time • High performance-power is necessary
  • 6. Define of Problem • Detecting and classifying multiple objects at the same time • Evaluation criteria (from Pascal VOC): 6 Ground truth annotation Detection results: >50% overlap of bounding box(BBox) with ground truth One BBox for each object Confidence value for each object Person (50%) , ∈{ ,. ,…, } Average Precision (AP):
  • 7. YOLOv2 (You Only Look Once version 2) 7 Input Image (Frame) Feature maps CONV+Pooling CNN CONV+Pooling Class score Bounding Box Detection J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger," arXiv preprint arXiv:1612.08242, 2016. • Single CNN (One-shot) object detector • Both a classification and a BBox estimation for each grid
  • 8. 2D Convolutional Operation 8 Input feature map Output feature map Kernel (Binary) X0,0 x W0,0 X0,1 x W0,1 X0,2 x W0,2 X1,0 x W1,0 X1,1 x W1,1 X1,2 x W1,2 X2,0 x W2,0 X2,1 x W2,1 +) X2,2 x W2,2 y • Computational intensive part of the YOLOv2
  • 9. FPGA Realization of 2D Convolutional Layer • Requires more than billion MACs • Our realization • Time multiplexing • Nested Residue Number System(NRNS) 9 Off-chip Memory * * + * * + + * * + * * + + BRAM BRAM FPGA Off-chip Memory * * BRAM * * * * * * * BRAM * * * * * * * BRAM * * * * * * * BRAM * * * * * ➔ Converter Converter Converter Converter Converter Converter Converter Converter Fully parallelization with RNS
  • 10. Residue Number System (RNS) • Defined by a set of L mutually prime integer constants 〈m1,m2,...,mL〉 • No pair modulus have a common factor with any other • Typically, prime number is used as moduli set • An arbitrary integer X can be uniquely represented by a tuple of L integers (X1,X2,…,XL), where • Dynamic range: 10 )(mod ii mXX  M = mi i=1 L Õ
  • 11. Parallel Multiplication Multiplication on RNS • Moduli set〈3,4,5〉, X=8, Y=2 • Z=X×Y=16=(1,0,1) • X=(2,0,3), Y=(2,2,2) Z=(4 mod 3,0 mod 4,6 mod 5) =(1,0,1)=16 11 Binary2RNS Conversion RNS2Binary Conversion
  • 12. Binary2RNS Converter 12 X mod 2 mod 3 mod4 0 0 0 0 1 1 1 1 2 0 2 2 3 1 0 3 4 0 1 0 5 1 2 1
  • 13. 13 00 01 10 11 00 01 10 11 0 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 X1=(x1, x2) X2=(x3, x4) h(X1) 0 01 1 x1 0 0 1 1 x2 0 1 0 1 h(X1) 0 1 0 1 0 1 00 0 1 01 1 1 10 1 0 11 1 0 x3,x4 h(X1) Functional Decomposition 24x1=16 [bit] 22x1+23x1=12 [bit] Column multiplicity=2 Bound variables Free variables
  • 14. Decomposition Chart for X mod 3 14 000 001 010 011 00 01 10 11 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 X2=(x3, x4, x5) X1=(x1,x2) 100 101 110 111 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 0 mod 3 = 0 1 mod 3 = 1 2 mod 3 = 2 3 mod 3 = 0 4 mod 3 = 1 5 mod 3 = 2 6 mod 3 = 0 7 mod 3 = 1 8 mod 3 = 2 9 mod 3 = 0 10 mod 3 = 1 … Free variables Bound variables
  • 15. Decomposition Chart for X mod 3 15 0 1 2 00 01 10 11 0 1 2 0 1 2 0 1 2 0 1 2 X2=(x3,x4,x5)X1=(x1,x2) FreeBound x3 0 0 0 0 1 1 1 1 x4 0 0 1 1 0 0 1 1 x5 0 1 0 1 0 1 0 1 h(X2) 0 1 2 0 1 2 0 1 0 mod 3 = 0 1 mod 3 = 1 2 mod 3 = 2 3 mod 3 = 0 4 mod 3 = 1 5 mod 3 = 2 6 mod 3 = 0 7 mod 3 = 1 8 mod 3 = 2 9 mod 3 = 0 10 mod 3 = 1 …
  • 16. Binary2RNS Converter 16 LUT cascade for X mod m1 LUT cascade for X mod m2 BRAM BRAM BRAM
  • 17. RNS2Binary Converter (m=30) 17 x1 y1 0 0 1 15 x2 y2 0 0 1 10 2 20 x3 y3 0 0 1 6 2 12 3 18 4 24 Mod m Adder Mod m Adder carry carry
  • 18. Problem • Moduli set of RNS consists of mutually prime numbers • sizes of circuits are all different • Example: <7,11,13> 18 6-input LUT 8-input LUT 8-input LUT 3 4 4 4 4 3 3 4 4 Binary2RNS Converter by BRAMs RNS2Binary Converter by DSP blocks and BRAMs
  • 19. Nested RNS • (Z1,Z2,…,Zi,…, ZL) (Z1,Z2,…,(Zi1,Zi2,…,Zij),…, ZL) • Ex: <7,11,13>×<7,11,13> <7,<5,6,7>11,<5,6,7>13>×<7,<5,6,7>11,<5,6,7>13> 19 1. Reuse the same moduli set 2. Decompose a large modulo into smaller ones Original modulus ➔
  • 20. Example of Nested RNS • 19x22(=418) on <7,<5,6,7>11,<5,6,7>13> 19×22 =<5,8,6>×<1,0,9> =<5,<3,2,1>11,<1,0,6>13>×<1,<0,0,0>11,<4,3,2>13> =<5,<0,0,0>11,<4,0,5>13> =<5,0,2> =418 20 Modulo Multiplication Bin2RNS on NRNS RNS2Bin Binary2NRNS Conversion
  • 21. Realization of Nested RNS 21 <5,6,7> 2Bin Bin2 <7,11,13> 3 <7,11,13> 2Bin <5,6,7> 2Bin Bin2 <5,6,7> Bin2 <5,6,7> 6-input LUT 6-input LUT 6-input LUT 6-input LUT 6-input LUT 6-input LUT 6-input LUT Bin2 <7,11,13> Bin2 <5,6,7> Bin2 <5,6,7> 4 4 3 4 4 3 3 3 3 3 3 Binary 2NRNS NRNS2 Binary Realized by BRAMs LUTs BRAMs and DSP blocks
  • 22. Moduli Set for NRNS • Conventional RNS (uses 9 moduli) <3,5,7,11,13,16,17,19,23> • Applied the NRNS to moduli that are greater than 16 <3,4,5,7,11,13,16, <3,4,5,7,11,13>17, <3,4,5,7,11,13>19, <3,4,5,7,11,13,<3,4,5,7,11,13>17>23> 22 All the 30 bit MAC operations are decomposed into 4 bit ones
  • 23. DCNN Architecture using the NRNS 23 ... Parallel modulo mi 2D convolutional units ... ... ... BRAM BRAM BRAM... BRAM BRAM BRAM... BRAM BRAM BRAM... ... Parallel Bin2NRNS Converters Tree-based NRNS2Bin Converters Sequencer On-chip Memory RNS 2 Bin RNS 2 Bin RNS 2 Bin RNS 2 Bin RNS 2 Bin ......... ...
  • 24. NRNS based YOLOv2 • Framework: Chainer 1.24.0 • CNN: Tiny YOLOv2 • Benchmark: KITTI vision benchmark • mAP: 69.1 % 24
  • 25. Implementation • FPGA board: NetFPGA-SUME • FPGA: Virtex7 VC690T • LUT: 427,014 / 433,200 • 18Kb BRAM: 1,235 / 2,940 • DSP48E: 0 / 3,600 • Realized the pre-trained NRNS-based YOLOv2 • 9 bit fixed precision (dynamic range: 30 bit) • Synthesis tool: Xilinx Vivado2017.2 • Timing constrain: 300MHz • 3.84 FPS@3.5W → 1.097 FPS/W 25
  • 26. Comparison 26 NVivia Pascal GTX1080Ti NetFPGA-SUME Speed [FPS] 20.64 3.84 Power [W] 60.0 3.5 Efficiency [FPS/W] 0.344 1.097
  • 27. Conclusion • Realized the DCNN on the FPGA • Time multiplexing • Nested RNS • MAC operation is realized by small LUTs • Functional decomposition are used as follows: • Bin2NRNS converter is realized by BRAMs • NRNS2Bin converter is realized by DSP blocks and BRAMs • Performance per power (FPS/W) • 3.19 times better than Pascal GPU 27