A Deep Convolutional
Neural Network Based on
Nested Residue Number System
Hiroki Nakahara1 Tsutomu Sasao2
1Ehime University, Japan
2Meiji University, Japan
1
Outline
• Background
• Deep convolutional neural network (DCNN)
• Residue number system (RNS)
• DCNN using nested RNS (NRNS)
• Experimental results
• Conclusion
2
Background
• Deep Neural Network
– Multi-layer neuron model
– Used in embedded vision systems
• FPGA realization is suitable for real-time systems
– Faster than a CPU
– Lower power consumption than a GPU
– Fixed-point representation is sufficient
• High performance per area is desired
3
Deep Convolutional
Neural Network (DCNN)
4
Artificial Neuron
[Figure: an artificial neuron summing the weighted inputs x0=1 (bias input), x1, ..., xN with weights w0 (bias), w1, ..., wN into the internal state u, which passes through the activation f(u) to produce the output y]
xi: Input signal
wi: Weight
u: Internal state
f(u): Activation function
(Sigmoid, ReLU, etc.)
y: Output signal
5
y = f(u),   u = Σ_{i=0}^{N} w_i x_i
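As a quick check of the two formulas above, here is a minimal NumPy sketch; the input and weight values are made up for illustration and are not from the talk.

```python
import numpy as np

def neuron(x, w, f=lambda u: np.maximum(u, 0.0)):
    """Artificial neuron: u = sum_{i=0}^{N} w_i * x_i, y = f(u).

    x[0] is fixed to 1 so that w[0] acts as the bias; f defaults to ReLU.
    """
    u = np.dot(w, x)   # internal state u
    return f(u)        # output signal y

# Made-up example values: three real inputs plus the constant bias input x0 = 1.
x = np.array([1.0, 0.5, -1.2, 2.0])   # x0 = 1, x1, x2, x3
w = np.array([0.1, 0.8, 0.3, -0.4])   # w0 (bias), w1, w2, w3
print(neuron(x, w))                   # ReLU(-0.66) = 0.0
```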
Deep Convolutional Neural Network(DCNN)
for ImageNet
• 2D convolutional layers, pooling layers, and a fully connected layer
6
2D Convolutional Layer
• Consumes more than 90% of the computation time
– Multiply-accumulate (MAC) operations are performed
7
z_{ij} = y_{ij} + Σ_{m=0}^{K-1} Σ_{n=0}^{K-1} x_{i+m, j+n} w_{mn}
xij: Input signal
yij: Bias
wmn: Weight
K: Kernel size (the kernel is K×K)
zij: Output signal
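A direct, unoptimized rendering of this MAC loop; the 5×5 input, 3×3 kernel, and zero bias below are illustrative assumptions, not the talk's layer sizes.

```python
import numpy as np

def conv2d(x, w, bias, K):
    """z[i][j] = bias[i][j] + sum_{m=0..K-1} sum_{n=0..K-1} x[i+m][j+n] * w[m][n]."""
    H, W = x.shape
    out_h, out_w = H - K + 1, W - K + 1
    z = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            acc = bias[i, j]
            for m in range(K):          # K*K multiply-accumulate operations per output
                for n in range(K):
                    acc += x[i + m, j + n] * w[m, n]
            z[i, j] = acc
    return z

# Illustrative sizes only: a 5x5 input map, a 3x3 all-ones kernel, zero bias.
x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3))
bias = np.zeros((3, 3))
print(conv2d(x, w, bias, K=3))
```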
FPGA
Realization of 2D Convolutional Layer
• Requires more than a billion MACs!
• Our realization
– Time multiplexing
– Nested Residue Number System (NRNS)
8
[Figure: a conventional realization streams data from off-chip memory into BRAMs that feed wide multiplier and adder trees; the proposed realization converts the data once (Binary2RNS converters), runs many small parallel multipliers per BRAM, and merges the results with RNS2Binary converters]
Residue Number System (RNS)
9
Residue Number System (RNS)
• Defined by a set of L pairwise coprime integer
constants 〈m1, m2, ..., mL〉
– No pair of moduli shares a common factor
– Typically, prime numbers are used as the moduli
• An arbitrary integer X can be uniquely
represented by a tuple of L integers
(X1, X2, ..., XL), where X_i = X mod m_i
• Dynamic range: M = Π_{i=1}^{L} m_i
10
Parallel Multiplication
• Multiplication on RNS with moduli set 〈3,4,5〉: X=8, Y=2, Z=X×Y=16
• Binary2RNS conversion: X=(2,0,3), Y=(2,2,2)
• Parallel modulo multiplications: Z=(4 mod 3, 0 mod 4, 6 mod 5)=(1,0,1)
• RNS2Binary conversion: (1,0,1) → 16
11
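The worked example above can be replayed in a few lines; the decoder here is a brute-force search over the dynamic range, used only for checking (the hardware uses the dedicated RNS2Binary converter described later).

```python
def to_rns(x, moduli):
    return tuple(x % m for m in moduli)

def rns_mul(a, b, moduli):
    # One independent (and, in hardware, parallel) multiplication per modulus.
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))

def from_rns(r, moduli):
    # Brute-force decode over the dynamic range M = 3*4*5 = 60, for verification only.
    M = 1
    for m in moduli:
        M *= m
    return next(x for x in range(M) if to_rns(x, moduli) == tuple(r))

moduli = (3, 4, 5)
X, Y = 8, 2
Xr, Yr = to_rns(X, moduli), to_rns(Y, moduli)   # (2,0,3), (2,2,2)
Zr = rns_mul(Xr, Yr, moduli)                    # (1,0,1)
print(Xr, Yr, Zr, from_rns(Zr, moduli))         # ... 16
```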
Binary2RNS Converter
12
X | X mod 2 | X mod 3 | X mod 4
0 |    0    |    0    |    0
1 |    1    |    1    |    1
2 |    0    |    2    |    2
3 |    1    |    0    |    3
4 |    0    |    1    |    0
5 |    1    |    2    |    1
Functional Decomposition
[Decomposition chart for a 4-input function: bound variables X1=(x1,x2) index the columns, free variables X2=(x3,x4) index the rows; only 2 distinct column patterns appear (column multiplicity = 2), so the bound part is encoded by a 1-bit function h(X1) and the output is looked up from (h(X1), x3, x4)]
• Single LUT: 2^4 × 1 = 16 [bit]
• Decomposed: 2^2 × 1 + 2^3 × 1 = 12 [bit]
13
Decomposition Chart for X mod 3
14
[Decomposition chart for the 5-bit function X mod 3: bound variables X2=(x3,x4,x5) index the columns 000-111, free variables X1=(x1,x2) index the rows, and each cell holds X mod 3 (0 mod 3 = 0, 1 mod 3 = 1, 2 mod 3 = 2, 3 mod 3 = 0, 4 mod 3 = 1, ...); only 3 distinct column patterns appear]
Decomposition Chart for X mod 3
15
[After encoding the bound variables by h(X2) = X2 mod 3, the chart shrinks to 3 columns (h = 0, 1, 2) with rows X1=(x1,x2); the second-stage table needs only (X1, h(X2)) as input]
h(X2): x3x4x5 = 000→0, 001→1, 010→2, 011→0, 100→1, 101→2, 110→0, 111→1
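To make the two charts concrete, the sketch below rebuilds the two small tables for the 5-bit function X mod 3, assuming x1 and x2 are the high-order bits, and checks that the two-level lookup agrees with a direct mod.

```python
# Functional decomposition of f(X) = X mod 3 for 5-bit X = (x1 x2 x3 x4 x5):
#   bound variables  X2 = (x3,x4,x5) -> encoded by h(X2) = X2 mod 3
#   free variables   X1 = (x1,x2)    -> second table g(X1, h(X2))
h = [x2 % 3 for x2 in range(8)]        # 0 1 2 0 1 2 0 1, the h(X2) table on the slide

# Since X = 8*X1 + X2 and 8 = 2 (mod 3), X mod 3 = (2*X1 + h(X2)) mod 3.
g = [[(2 * x1 + hv) % 3 for hv in range(3)] for x1 in range(4)]

def mod3_decomposed(x):
    x1, x2 = x >> 3, x & 0b111         # free (2 MSBs) / bound (3 LSBs) variables
    return g[x1][h[x2]]

assert all(mod3_decomposed(x) == x % 3 for x in range(32))
print("two-level lookup matches X mod 3 for every 5-bit X")
```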
Binary2RNS Converter
16
[Figure: the Binary2RNS converter is a bank of LUT cascades, one per modulus (X mod m1, X mod m2, ...); each cascade stage is stored in a BRAM]
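One common way to realize such a cascade, and the behaviour sketched here, is to scan X in fixed-width chunks while carrying only the partial remainder between stages, so every stage is one small table; the chunk width and test value below are arbitrary choices for illustration.

```python
def lut_cascade_mod(x_bits, m, w=4):
    """Evaluate X mod m with a cascade of identical lookup stages.

    x_bits is the binary expansion of X, MSB first, and its length must be a
    multiple of w; each stage consumes one w-bit chunk plus the partial
    remainder from the previous stage (what one BRAM stage would store).
    """
    assert len(x_bits) % w == 0
    # Pre-computed stage table: (previous remainder, next chunk) -> new remainder.
    table = {(r, c): (r * (1 << w) + c) % m for r in range(m) for c in range(1 << w)}
    r = 0
    for i in range(0, len(x_bits), w):
        chunk = int("".join(map(str, x_bits[i:i + w])), 2)
        r = table[(r, chunk)]
    return r

x = 2925
bits = [int(b) for b in f"{x:012b}"]       # 12 bits, scanned in three 4-bit chunks
print(lut_cascade_mod(bits, 11), x % 11)   # both print 10
```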
RNS2Binary Converter (m=30)
17
[Figure: CRT-based converter for the moduli 〈2,3,5〉 (M = 2·3·5 = 30): each residue is mapped through a small table to a weighted value, and the weighted values are summed by mod-M adders with carries between stages]
x1 → y1: 0→0, 1→15
x2 → y2: 0→0, 1→10, 2→20
x3 → y3: 0→0, 1→6, 2→12, 3→18, 4→24
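The converter shown here is the usual CRT-style reconstruction: each residue indexes a small table of weighted values, and mod-M adders sum the results. A sketch for the same M = 30 case follows; the computed weights 15, 10, 6 match the y1/y2/y3 tables above.

```python
from math import prod

def crt_weights(moduli):
    """CRT basis: the weight for m_i is (M/m_i) * ((M/m_i)^-1 mod m_i)."""
    M = prod(moduli)
    return [(M // m) * pow(M // m, -1, m) % M for m in moduli], M

def rns_to_binary(residues, moduli):
    weights, M = crt_weights(moduli)
    # One small table per residue (x_i -> weights[i] * x_i), then mod-M adders.
    return sum(w * x for w, x in zip(weights, residues)) % M

moduli = (2, 3, 5)
print(crt_weights(moduli)[0])            # [15, 10, 6]: the y1/y2/y3 tables on the slide
print(rns_to_binary((1, 2, 3), moduli))  # 23 (23 mod 2 = 1, 23 mod 3 = 2, 23 mod 5 = 3)
```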
Problem
• The moduli of an RNS are mutually prime
– so the circuits for the different moduli are all of different sizes
• Example: <7,11,13>
18
[Figure: for <7,11,13>, the mod-7 multiplier takes 3-bit operands and fits in 6-input LUTs, while the mod-11 and mod-13 multipliers take 4-bit operands and need 8-input LUTs; the Binary2RNS converter is realized by BRAMs and the RNS2Binary converter by DSP blocks and BRAMs]
DCNN using Nested RNS
19
Nested RNS
• (Z1, Z2, ..., Zi, ..., ZL) → (Z1, Z2, ..., (Zi1, Zi2, ..., Zij), ..., ZL)
• Ex: <7,11,13> × <7,11,13>
→ <7,<5,6,7>11,<5,6,7>13> × <7,<5,6,7>11,<5,6,7>13>
1. Reuse the same moduli set
2. Decompose a large (original) modulus into smaller ones
20
Example of Nested RNS
• 19×22 (= 418) on <7,<5,6,7>11,<5,6,7>13>
19×22
= <5,8,6> × <1,0,9>                                   (Binary2RNS conversion)
= <5,<3,2,1>11,<1,0,6>13> × <1,<0,0,0>11,<4,3,2>13>   (nesting: Binary2NRNS conversion)
= <5,<0,0,0>11,<4,0,5>13>                             (parallel modulo multiplications)
= <5,0,2>                                             (inner RNS2Bin, then reduce mod 11 and 13)
= 418                                                 (outer RNS2Bin)
21
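The derivation can be replayed step by step; the sketch below uses the same nested moduli and a brute-force decoder purely to keep the verification short.

```python
def to_rns(x, moduli):
    return tuple(x % m for m in moduli)

def from_rns(r, moduli):
    # Brute-force decode over the dynamic range, for verification only.
    M = 1
    for m in moduli:
        M *= m
    return next(x for x in range(M) if to_rns(x, moduli) == tuple(r))

outer, inner = (7, 11, 13), (5, 6, 7)

def nrns_mul(x, y):
    """Multiply on <7, <5,6,7>11, <5,6,7>13>: the mod-7 residues are multiplied
    directly; the mod-11/13 residues are multiplied inside the inner RNS, whose
    dynamic range (5*6*7 = 210) can hold the product before reduction."""
    xr, yr = to_rns(x, outer), to_rns(y, outer)        # <5,8,6> and <1,0,9>
    z = [(xr[0] * yr[0]) % outer[0]]                   # 5
    for i in (1, 2):
        zi = tuple((a * b) % m for a, b, m in
                   zip(to_rns(xr[i], inner), to_rns(yr[i], inner), inner))
        z.append(from_rns(zi, inner) % outer[i])       # inner RNS2Bin, reduce mod 11/13
    return tuple(z)

print(nrns_mul(19, 22))           # (5, 0, 2)
print(to_rns(19 * 22, outer))     # (5, 0, 2), i.e. 418 in <7,11,13>
```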
Realization of Nested RNS
22
[Figure: one MAC datapath on <7,<5,6,7>11,<5,6,7>13>: the Binary2NRNS converter (Bin2<7,11,13> followed by Bin2<5,6,7>, realized by BRAMs) produces 3- to 4-bit residues, every modulo multiplication then fits in a 6-input LUT, and the NRNS2Binary converter (<5,6,7>2Bin followed by <7,11,13>2Bin, realized by BRAMs and DSP blocks) restores the binary result]
Moduli Set for NRNS
• Conventional RNS (uses 23 moduli)
<3,4,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,
61,67,71,73,79,83>
• Applied the NRNS to moduli that are greater than 15
<3,4,5,7,11,13,
<3,4,5,7,11,13>17,
<3,4,5,7,11,13>19,
<3,4,5,7,11,13,<3,4,5,7,11,13>17>23,
<3,4,5,7,11,13,<3,4,5,7,11,13>17>29,
…, <3,4,5,7,11,13,<3,4,5,7,11,13>17>83>
23
All the 48-bit MAC operations are decomposed into 4-bit ones
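The sketch below checks only two basic properties of this outer moduli set: that the 23 moduli are pairwise coprime and what dynamic range M they provide; how that range is budgeted across the 48-bit MACs is not reproduced here.

```python
from math import gcd, prod
from itertools import combinations

moduli = [3, 4, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
          53, 59, 61, 67, 71, 73, 79, 83]

# RNS requirement: the moduli must be pairwise coprime.
assert all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))

M = prod(moduli)
print(f"{len(moduli)} moduli, dynamic range M of {M.bit_length()} bits")
```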
DCNN Architecture using the NRNS
24
[Figure: overall architecture: a sequencer and DDR3 controller stream data from the external DDR3 SODIMM into on-chip BRAMs; parallel Bin2NRNS converters feed 16 parallel modulo-mi 2D convolutional units, and tree-based NRNS2Bin converters merge their outputs back to binary]
Experimental Results
25
Implementation Setup
• FPGA board: Xilinx VC707
– FPGA: Virtex7 VX485T
– 1 GB DDR3 SODIMM
(800 MHz bus, 64-bit width)
• Realized the pre-trained
ImageNet DCNN from Convnet2
– 48-bit fixed-point precision
• Synthesis tool: Xilinx Vivado 2014.1
– Timing constraint: 400 MHz
26
Comparison with Other
Implementations
27
          | Precision    | Max. Freq. [MHz] | FPGA               | Performance [GOPS] | Performance per area [GOPS/Slice × 10^-4]
ASAP2009  | 16-bit fixed | 115              | Virtex5 LX330T     | 6.7                | 1.3
PACT2010  | fixed        | 125              | Virtex5 SX240T     | 7.0                | 1.9
FPL2009   | 48-bit fixed | 125              | Spartan-3A DSP3400 | 5.3                | 2.2
ISCA2010  | 48-bit fixed | 200              | Virtex5 SX240T     | 16.0               | 4.3
ICCD2013  | fixed        | 150              | Virtex6 LX240T     | 17.0               | 4.5
FPGA2015  | 32-bit float | 100              | Virtex7 VX485T     | 61.6               | 8.1
Proposed  | 48-bit fixed | 400              | Virtex7 VX485T     | 132.2              | 25.2
Conclusion
• Realized the DCNN on the FPGA
– Time multiplexing
– Nested RNS
• MAC operations are realized by small LUTs
• Functional decomposition is used as follows:
– Bin2NRNS converter is realized by BRAMs
– NRNS2Bin converter is realized by DSP blocks and
BRAMs
• Performance per area (GOPS/Slice)
– 5.86 times higher than ISCA2010's
28