This document is an update of the “fpgax February 2, 2019” presentation
https://www.slideshare.net/ryuz88/lut-network-fpgx201902
Japanese Version
https://www.slideshare.net/ryuz88/lutnetwork-revision2
BinaryBrain
https://github.com/ryuz/BinaryBrain
2. • This document is an update of the “fpgax February 2, 2019” presentation
• https://www.slideshare.net/ryuz88/lut-network-fpgx201902
• This is the English translation version
3. History of LUT-Network publishing
• BinaryBrain Version 1 (August 1, 2018 ~)
• I named it “LUT-Network”
• Flat programming
• Binary-LUT model (SIMD AVX2)
• Brute-force learning model
• Binary modulation model
• BinaryBrain Version 2 (September 2, 2018 ~)
• Layer-model programming
• Support for CNN
• Support for Verilog-RTL export
• Added back-propagation learning model
• Sparse-Affine model
• micro-MLP model
• BinaryBrain Version 3 (March 19, 2019 ~)
• Data objects support GPU (CUDA)
• Added Stochastic-LUT model
• Added regression sample
https://github.com/ryuz/BinaryBrain
4. What is Real-Time Computing?
• Technology that matches computing to real-world dynamics
• Computing in the living space
[Figure: humans and things in the real world connected through real-time devices (digital mirror, video conference, remote controller, exploration robot, care robot, AR glasses, autonomous control), contrasted with offline HPC]
YouTube movie: https://www.youtube.com/watch?v=wGRhw9bbiik&t=2s
5. Real-Time Binary-DNN architecture for FPGA
• von Neumann architecture: input device → memory → processor → output device.
Data enters memory first: high throughput, but long latency; best effort (variable fps).
• Dataflow architecture: input device → processor → output device, with memory used only to refer to past data: hard real-time and low latency. Dataflow programming suits real-time processing.
I invented the LUT-Network for real-time processing.
6. LUT-Network Overview
• Conventional DNN
1. Construct with perceptron nodes.
2. Do training.
3. Get the perceptrons’ weight parameters.
• LUT-Network
1. Construct with LUT nodes.
2. Do training.
3. Get the LUTs’ table parameters.
[Figure: a perceptron node with inputs x1…xn, weights w1…wn, threshold θ, and output y]
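As a sketch of the difference (my illustration, not BinaryBrain code): once training has produced the table parameters, prediction through a LUT node is a plain table lookup, with no multiply-accumulate at all.

```python
# Minimal sketch: inference through one trained 6-input LUT node.
import numpy as np

rng = np.random.default_rng(0)
table = rng.integers(0, 2, size=64)     # hypothetical learned table: one bit per 6-input pattern

def lut_forward(bits):
    """bits: six binary inputs (0/1); pack them into a table index and look up."""
    index = 0
    for i, b in enumerate(bits):
        index |= int(b) << i
    return int(table[index])

print(lut_forward([1, 0, 1, 1, 0, 0]))  # -> 0 or 1, straight from the table
```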
9. Features of LUT-Network
• Binary network for prediction on edge devices
• Classification and regression
• High density and high speed (300~400 MHz)
• Circuit size is determined prior to learning
• Real-time guarantees can be kept

                     Conventional DNN        LUT-Network
Recognition rate     Decided when learning   Best effort
System performance   Best effort             Decided when learning (real-time guarantee)
10. How do you learn the LUT? (3 ideas)
1. Brute-force learning
• Directly optimize the LUT tables to minimize the loss function over the training data.
• MLP (multi-layer perceptron) only; cannot be applied to CNNs.
• Learning a large network is difficult.
• Uses no gradients for learning
(possibly making it resistant to “adversarial examples”).
2. Learning with the micro-MLP model
• Applies the method of BDNN.
• Forward: binary; backward: FP32.
• Low-speed learning on GPU, high-speed prediction on FPGA.
3. Learning with the Stochastic-LUT model
• Forward: FP32; backward: FP32.
• High-speed, high-accuracy learning on GPU, and high-speed prediction on FPGA.
11. Brute-force learning
1. Initialize the LUTs with random numbers.
2. Fix the node’s output to 0 and to 1 in turn, and pass all the training data through.
3. Accumulate the loss for each LUT input value, and update each table entry in the direction that reduces the loss (see the toy sketch after the table).
input  frequency  loss with 0  loss with 1  new table value
0      37932      47813.7      48233.9      0
1      39482      50001.3      49692.9      1
2      37028      44698.9      44845.7      0
3      40640      49257.1      49331.0      0
4      27156      33998.4      33891.0      1
5      23930      29538.6      29495.2      1
6      29002      35197.3      35451.4      0
7      27786      33390.9      33466.9      0
8      43532      52741.1      52993.5      0
9      41628      49985.9      50388.5      0
10     49176      56521.4      56026.1      1
11     46542      54215.4      54284.9      0
・・・・
59     34268      41152.9      41215.8      0
60     22872      28852.4      29000.0      0
61     17930      22068.9      22112.9      0
62     24156      28213.2      28227.1      0
63     24194      28367.0      28450.4      0
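A runnable toy of this procedure (my simplification: the “network” is a single 6-input LUT and the loss is a 0/1 mismatch against a target label; the real method applies the same per-entry rule to each node inside a full network):

```python
# Toy brute-force LUT learning: accumulate the loss with the output forced to
# 0 and to 1, then keep whichever value gives the smaller accumulated loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5000, 6))      # binary training inputs
t = X.sum(axis=1) % 2                       # target: parity of the six inputs

loss0 = np.zeros(64)                        # accumulated loss, output forced to 0
loss1 = np.zeros(64)                        # accumulated loss, output forced to 1
index = (X << np.arange(6)).sum(axis=1)     # table entry each sample selects
for idx, label in zip(index, t):
    loss0[idx] += (0 != label)              # loss when this entry outputs 0
    loss1[idx] += (1 != label)              # loss when this entry outputs 1

table = (loss1 < loss0).astype(np.uint8)    # new table value per entry
print((table[index] == t).mean())           # training accuracy -> 1.0 for parity
```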
12. [Figure: from Dense-Affine to Sparse-Affine to the micro-MLP stack]
• Dense-Affine (fully connected) + BatchNormalization + Binary-Activation, mapped to the FPGA by logic synthesis: deep logic, low speed (100 MHz~200 MHz), middle performance. A single node cannot learn XOR.
• Sparse-Affine (my 1st idea) + BatchNormalization + Binary-Activation, mapped directly to LUTs: simple logic, high speed (300 MHz~400 MHz), low performance. A single node still cannot learn XOR.
• micro-MLP stack (my 2nd idea), mapped directly to LUTs: simple logic, high speed (300 MHz~400 MHz), high performance. Each LUT absorbs its BatchNormalization and a hidden layer, so this unit (the “micro-MLP”) can learn XOR.
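A minimal sketch of why the micro-MLP unit folds into a single LUT (the weights and the hidden size of 16 are stand-in assumptions, not trained values): since all six inputs are binary, enumerating the 64 input patterns through the trained sub-network fills the table.

```python
# Fold a (stand-in) trained micro-MLP into a 64-entry LUT by enumeration.
import itertools
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(6, 16)), rng.normal(size=16)   # hidden layer (BN assumed folded in)
W2, b2 = rng.normal(size=16), rng.normal()               # output neuron

def micro_mlp(x):
    h = (x @ W1 + b1 >= 0).astype(np.float32)   # binary hidden activations
    return int(h @ W2 + b2 >= 0)                # binary output

table = np.zeros(64, dtype=np.uint8)
for bits in itertools.product([0, 1], repeat=6):
    index = sum(b << i for i, b in enumerate(bits))
    table[index] = micro_mlp(np.array(bits, dtype=np.float32))
```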
13. Binary activation layer for the micro-MLP
• forward: Sign()
  y = 1 if x ≥ 0, 0 otherwise
• backward: hard-tanh()
  g(x) = 1 if |x| ≤ 1, 0 otherwise (the gradient passes only where |x| ≤ 1)
• Same as the Binary Connect method
• Batch Normalization uses the conventional one
[Figure: BatchNormalization → Binary-Activation stack mapped onto a LUT]
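In plain numpy, the forward/backward pair above could look like this (a sketch matching the formulas, not the BinaryBrain implementation):

```python
# Binary activation with a straight-through estimator (Binary Connect style).
import numpy as np

def binary_act_forward(x):
    return (x >= 0).astype(np.float32)       # y = 1 if x >= 0 else 0

def binary_act_backward(x, grad_out):
    return grad_out * (np.abs(x) <= 1.0)     # hard-tanh: gradient passes only where |x| <= 1

x = np.array([-2.0, -0.5, 0.3, 1.7])
print(binary_act_forward(x))                 # [0. 0. 1. 1.]
print(binary_act_backward(x, np.ones(4)))    # [0. 1. 1. 0.]
```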
14. Stochastic-LUT model learning
[Figure: computation graph of the 2-input Stochastic-LUT: binarized inputs x0 and x1, table values W0–W3, combined by multipliers and (1 − x) terms into a final adder producing y]
e.g.) 2-input LUT model: x0 and x1 are stochastic input variables; W0–W3 are the table values.
Probability that W0 is selected : (1 - x1) * (1 - x0)
Probability that W1 is selected : (1 - x1) * x0
Probability that W2 is selected : x1 * (1 - x0)
Probability that W3 is selected : x1 * x0
y = W0 * (1 - x1) * (1 - x0)
+ W1 * (1 - x1) * x0
+ W2 * x1 * (1 - x0)
+ W3 * x1 * x0
Because this calculation tree is differentiable, back-propagation can be applied.
The formula for the 6-input LUT is larger, but can be calculated in the same way.
By using the Stochastic-LUT model, it is possible to perform learning much faster and with
higher accuracy than the micro-MLP model.
No need for Batch-Normalization
No need for Activation
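The slide’s 2-input model and its gradients in plain numpy (the function names and the XOR-like example values are mine, not the BinaryBrain API):

```python
# 2-input Stochastic-LUT: forward from the selection probabilities, backward
# from differentiating y = sum_i Wi * P(Wi selected).
import numpy as np

def slut2_forward(x0, x1, W):
    p = np.array([(1 - x1) * (1 - x0),      # P(W0 selected)
                  (1 - x1) * x0,            # P(W1 selected)
                  x1 * (1 - x0),            # P(W2 selected)
                  x1 * x0])                 # P(W3 selected)
    return float(p @ W), p

def slut2_backward(x0, x1, W, p, grad_y):
    grad_W = grad_y * p                     # dy/dWi is the selection probability
    grad_x0 = grad_y * ((1 - x1) * (W[1] - W[0]) + x1 * (W[3] - W[2]))
    grad_x1 = grad_y * ((1 - x0) * (W[2] - W[0]) + x0 * (W[3] - W[1]))
    return grad_W, grad_x0, grad_x1

W = np.array([0.1, 0.9, 0.9, 0.1])          # table values resembling XOR
y, p = slut2_forward(0.8, 0.2, W)
print(y)                                    # ~0.64: leans toward 1, like XOR(1, 0)
```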
16. Learning / Prediction comparison

Network                    Matrix  Weight       Activation   Convolution (Deep Network)  FPGA mapping (1 node →)
Binary Connect             Dense   Binary       Real (FP32)  OK                          many adders
Binarized Neural Network   Dense   Binary       Binary       OK                          many XNORs
XNOR-Network               Dense   Binary       Binary       OK                          many XNORs
LUT-Network                Sparse  Real (FP32)  none         OK                          1 LUT

Learning performance on CPUs/GPUs and prediction performance on FPGA are rated “good” to “excellent” across these networks; the LUT-Network’s 1-node-to-1-LUT mapping gives it excellent prediction performance on FPGA, benchmarked against the other binary networks.
17. Demonstration 1 [MNIST MLP 1000fps]
[Figure: block diagram of the demo on a ZYBO Z7-20 with an original camera board. Raspberry Pi Camera V2 (Sony IMX219), 640x132 @ 1000 fps → MIPI-CSI RX → DNN (LUT-Net, RTL generated by offline learning with BinaryBrain on a PC) → SERDES TX/RX (FIN1216) → OLED (UG-9664HDDAG01) at 1000 fps. The PL runs Jelly; the PS (Linux) handles control, I2C, and DMA to DDR3 SDRAM; a debug view is sent over Ethernet to a PC X-Window.]
YouTube movie: https://www.youtube.com/watch?v=NJa77PZlQMI
26. Oversampling and binary modulation
• Oversampling and quantization by modulation:
• PWM (pulse-width modulation)
• delta-sigma modulation
• random dither, etc.
• For example, high-speed camera images inherently contain noise.
• An LPF (low-pass filter) removes the noise and dequantizes the signal,
so regression analysis becomes possible (see the sketch below).
• e.g.) the LPF can be constructed from IIR/FIR/Kalman filters
[Figure: quantization by modulation (with random noise or a local oscillator) → binary DNN → LPF. Human senses include an LPF.]
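A runnable toy of the modulation → LPF path (my example, not from the slides): random dither turns a real value into an oversampled bit stream, and a mean filter dequantizes it back, which is what makes regression through all-binary signals possible.

```python
# Random-dither binary modulation and LPF (mean) reconstruction.
import numpy as np

rng = np.random.default_rng(0)
value = 0.37                                          # "analog" input in [0, 1]
bits = (value > rng.random(1000)).astype(np.float32)  # 1000x oversampled dithered bits
print(bits.mean())                                    # LPF output ~= 0.37
```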
27. Architecture proposal for real-time processing
[Figure: Video-In → DNN → Video-Out, with frame memory fed back through ME/MC (motion estimation/compensation); similar to an IIR filter]
28. Next approach
• Improving the sparse-connection rules
• Currently, connections are selected at random; but real data has locality, as CNNs exploit.
• One method is to choose connection destinations probabilistically by node distance, e.g. with a Gaussian function (see the sketch below).
• I want to build stacked connections in a pyramid structure.
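A sketch of such a Gaussian connection rule (assumed 1-D node positions and sigma; this illustrates the proposal, not existing BinaryBrain behavior):

```python
# Pick each LUT's inputs from nearby previous-layer nodes, with probability
# falling off as a Gaussian of node distance.
import numpy as np

def gaussian_connections(n_prev, node_pos, n_inputs=6, sigma=8.0, seed=0):
    rng = np.random.default_rng(seed)
    distance = np.abs(np.arange(n_prev) - node_pos)   # 1-D distance to previous-layer nodes
    p = np.exp(-0.5 * (distance / sigma) ** 2)        # Gaussian weighting by distance
    return rng.choice(n_prev, size=n_inputs, replace=False, p=p / p.sum())

print(gaussian_connections(n_prev=256, node_pos=100))  # mostly nodes near index 100
```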
29. References
• BinaryConnect: Training Deep Neural Networks with binary weights during propagations
https://arxiv.org/pdf/1511.00363.pdf
• Binarized Neural Networks
https://arxiv.org/abs/1602.02505
• Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations
Constrained to +1 or -1
https://arxiv.org/abs/1602.02830
• XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
https://arxiv.org/abs/1603.05279
• Xilinx UltraScale Architecture Configurable Logic Block User Guide
https://japan.xilinx.com/support/documentation/user_guides/ug574-ultrascale-clb.pdf
30. My Profile
• Open-source programmer (hobbyist)
• Born in 1976; I live in Fukuoka City, Japan
• 1998~ published HOS (a real-time OS [μITRON])
• https://ja.osdn.net/projects/hos/
(ARM, H8, SH, MIPS, x86, Z80, AM, V850, MicroBlaze, etc.)
• 2008~ published Jelly (soft-core CPU for FPGA)
• https://github.com/ryuz/jelly
• http://ryuz.my.coocan.jp/jelly/toppage.html
• 2018~ published LUT-Network
• https://github.com/ryuz/BinaryBrain
• Real-Time AR glasses project (my current hobby)
• Real-Time glasses (camera [IMX219] & OLED, 1000 fps)
https://www.youtube.com/watch?v=wGRhw9bbiik
• Real-Time GPU (no-frame-buffer architecture)
https://www.youtube.com/watch?v=vl-lhSOOlSk
• Real-Time DNN (LUT-Network)
https://www.youtube.com/watch?v=aYuYrYxztBU