This document is an update of the “fpgax February 2, 2019” presentation
https://www.slideshare.net/ryuz88/lut-network-fpgx201902
Japanese Version
https://www.slideshare.net/ryuz88/lutnetwork-revision2
BinaryBrain
https://github.com/ryuz/BinaryBrain
2. • This document is an update of the “fpgax February 2, 2019” presentation
• https://www.slideshare.net/ryuz88/lut-network-fpgx201902
• This is the English translation version
3. History of LUT-Network publishing
• BinaryBrain Version 1 (August 1, 2018 ~)
• I named it “LUT-Network”
• Flat programming
• Binary-LUT model (SIMD AVX2)
• Brute-force learning model
• Binary modulation model
• BinaryBrain Version 2 (September 2, 2018 ~)
• Layer-model programming
• Support for CNN
• Support for Verilog-RTL export
• Added back-propagation learning model
• Sparse-Affine model
• micro-MLP model
• BinaryBrain Version 3 (March 19, 2019 ~)
• Data objects support GPU (CUDA)
• Added Stochastic-LUT model
• Added regression sample
https://github.com/ryuz/BinaryBrain
4. What is Real-Time Computing?
• Technology that matches computing to real-world dynamics
• Computing in the living space
[Figure: humans and things in the real world connected through real-time devices (digital mirror, video conference, remote controller, exploration robot, care robot, AR glasses, autonomous control), contrasted with offline HPC]
YouTube movie: https://www.youtube.com/watch?v=wGRhw9bbiik&t=2s
5. Real-Time Binary-DNN architecture for FPGA
• von Neumann architecture: input device → memory → processor → output device.
Data enters memory first: high throughput, but long latency; best effort (variable fps).
• Dataflow architecture: input device → processor → output device, with memory used only to refer to past data: hard real-time and low latency. Dataflow programming suits real-time processing.
I invented the LUT-Network for real-time processing.
6. LUT-Network Overview
• Conventional DNN
1. Construct with perceptron nodes.
2. Do training.
3. Get the perceptrons’ weight parameters.
• LUT-Network
1. Construct with LUT nodes.
2. Do training.
3. Get the LUTs’ table parameters.
[Figure: a perceptron node with inputs x1…xn, weights w1…wn, threshold θ, and output y]
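As a sketch of the difference (my illustration, not BinaryBrain code): once training has produced the table parameters, prediction through a LUT node is a plain table lookup, with no multiply-accumulate at all.

```python
# Minimal sketch: inference through one trained 6-input LUT node.
import numpy as np

rng = np.random.default_rng(0)
table = rng.integers(0, 2, size=64)     # hypothetical learned table: one bit per 6-input pattern

def lut_forward(bits):
    """bits: six binary inputs (0/1); pack them into a table index and look up."""
    index = 0
    for i, b in enumerate(bits):
        index |= int(b) << i
    return int(table[index])

print(lut_forward([1, 0, 1, 1, 0, 0]))  # -> 0 or 1, straight from the table
```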
9. Features of LUT-Network
• Binary network for prediction on edge devices
• Classification and regression
• High density and high speed (300~400 MHz)
• Circuit size is determined prior to learning
• Real-time guarantees can be kept

                     Conventional DNN        LUT-Network
Recognition rate     Decided when learning   Best effort
System performance   Best effort             Decided when learning (real-time guarantee)
10. How do you learn the LUT? (3 ideas)
1. Brute-force learning
• Directly optimize the LUT tables to minimize the loss function over the training data.
• MLP (multi-layer perceptron) only; cannot be applied to CNNs.
• Learning a large network is difficult.
• Uses no gradients for learning
(possibly making it resistant to “adversarial examples”).
2. Learning with the micro-MLP model
• Applies the method of BDNN.
• Forward: binary; backward: FP32.
• Low-speed learning on GPU, high-speed prediction on FPGA.
3. Learning with the Stochastic-LUT model
• Forward: FP32; backward: FP32.
• High-speed, high-accuracy learning on GPU, and high-speed prediction on FPGA.
11. Brute-force learning
1. Initialize the LUTs with random numbers.
2. Fix the node’s output to 0 and to 1 in turn, and pass all the training data through.
3. Accumulate the loss for each LUT input value, and update each table entry in the direction that reduces the loss (see the toy sketch after the table).
input  frequency  loss with 0  loss with 1  new table value
0      37932      47813.7      48233.9      0
1      39482      50001.3      49692.9      1
2      37028      44698.9      44845.7      0
3      40640      49257.1      49331.0      0
4      27156      33998.4      33891.0      1
5      23930      29538.6      29495.2      1
6      29002      35197.3      35451.4      0
7      27786      33390.9      33466.9      0
8      43532      52741.1      52993.5      0
9      41628      49985.9      50388.5      0
10     49176      56521.4      56026.1      1
11     46542      54215.4      54284.9      0
・・・・
59     34268      41152.9      41215.8      0
60     22872      28852.4      29000.0      0
61     17930      22068.9      22112.9      0
62     24156      28213.2      28227.1      0
63     24194      28367.0      28450.4      0
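A runnable toy of this procedure (my simplification: the “network” is a single 6-input LUT and the loss is a 0/1 mismatch against a target label; the real method applies the same per-entry rule to each node inside a full network):

```python
# Toy brute-force LUT learning: accumulate the loss with the output forced to
# 0 and to 1, then keep whichever value gives the smaller accumulated loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5000, 6))      # binary training inputs
t = X.sum(axis=1) % 2                       # target: parity of the six inputs

loss0 = np.zeros(64)                        # accumulated loss, output forced to 0
loss1 = np.zeros(64)                        # accumulated loss, output forced to 1
index = (X << np.arange(6)).sum(axis=1)     # table entry each sample selects
for idx, label in zip(index, t):
    loss0[idx] += (0 != label)              # loss when this entry outputs 0
    loss1[idx] += (1 != label)              # loss when this entry outputs 1

table = (loss1 < loss0).astype(np.uint8)    # new table value per entry
print((table[index] == t).mean())           # training accuracy -> 1.0 for parity
```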
12. [Figure: from Dense-Affine to Sparse-Affine to the micro-MLP stack]
• Dense-Affine (fully connected) + BatchNormalization + Binary-Activation, mapped to the FPGA by logic synthesis: deep logic, low speed (100 MHz~200 MHz), middle performance. A single node cannot learn XOR.
• Sparse-Affine (my 1st idea) + BatchNormalization + Binary-Activation, mapped directly to LUTs: simple logic, high speed (300 MHz~400 MHz), low performance. A single node still cannot learn XOR.
• micro-MLP stack (my 2nd idea), mapped directly to LUTs: simple logic, high speed (300 MHz~400 MHz), high performance. Each LUT absorbs its BatchNormalization and a hidden layer, so this unit (the “micro-MLP”) can learn XOR.
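A minimal sketch of why the micro-MLP unit folds into a single LUT (the weights and the hidden size of 16 are stand-in assumptions, not trained values): since all six inputs are binary, enumerating the 64 input patterns through the trained sub-network fills the table.

```python
# Fold a (stand-in) trained micro-MLP into a 64-entry LUT by enumeration.
import itertools
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(6, 16)), rng.normal(size=16)   # hidden layer (BN assumed folded in)
W2, b2 = rng.normal(size=16), rng.normal()               # output neuron

def micro_mlp(x):
    h = (x @ W1 + b1 >= 0).astype(np.float32)   # binary hidden activations
    return int(h @ W2 + b2 >= 0)                # binary output

table = np.zeros(64, dtype=np.uint8)
for bits in itertools.product([0, 1], repeat=6):
    index = sum(b << i for i, b in enumerate(bits))
    table[index] = micro_mlp(np.array(bits, dtype=np.float32))
```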
13. Binary activation layer for the micro-MLP
• forward: Sign()
  y = 1 if x ≥ 0, 0 otherwise
• backward: hard-tanh()
  g(x) = 1 if |x| ≤ 1, 0 otherwise (the gradient passes only where |x| ≤ 1)
• Same as the Binary Connect method
• Batch Normalization uses the conventional one
[Figure: BatchNormalization → Binary-Activation stack mapped onto a LUT]
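In plain numpy, the forward/backward pair above could look like this (a sketch matching the formulas, not the BinaryBrain implementation):

```python
# Binary activation with a straight-through estimator (Binary Connect style).
import numpy as np

def binary_act_forward(x):
    return (x >= 0).astype(np.float32)       # y = 1 if x >= 0 else 0

def binary_act_backward(x, grad_out):
    return grad_out * (np.abs(x) <= 1.0)     # hard-tanh: gradient passes only where |x| <= 1

x = np.array([-2.0, -0.5, 0.3, 1.7])
print(binary_act_forward(x))                 # [0. 0. 1. 1.]
print(binary_act_backward(x, np.ones(4)))    # [0. 1. 1. 0.]
```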
14. Stochastic-LUT model learning
[Figure: computation graph of the 2-input Stochastic-LUT: binarized inputs x0 and x1, table values W0–W3, combined by multipliers and (1 − x) terms into a final adder producing y]
e.g.) 2-input LUT model: x0 and x1 are stochastic input variables; W0–W3 are the table values.
Probability that W0 is selected : (1 - x1) * (1 - x0)
Probability that W1 is selected : (1 - x1) * x0
Probability that W2 is selected : x1 * (1 - x0)
Probability that W3 is selected : x1 * x0
y = W0 * (1 - x1) * (1 - x0)
+ W1 * (1 - x1) * x0
+ W2 * x1 * (1 - x0)
+ W3 * x1 * x0
Because this calculation tree is differentiable, back-propagation can be applied.
The formula for the 6-input LUT is larger, but can be calculated in the same way.
By using the Stochastic-LUT model, it is possible to perform learning much faster and with
higher accuracy than the micro-MLP model.
No need for Batch-Normalization
No need for Activation
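The slide’s 2-input model and its gradients in plain numpy (the function names and the XOR-like example values are mine, not the BinaryBrain API):

```python
# 2-input Stochastic-LUT: forward from the selection probabilities, backward
# from differentiating y = sum_i Wi * P(Wi selected).
import numpy as np

def slut2_forward(x0, x1, W):
    p = np.array([(1 - x1) * (1 - x0),      # P(W0 selected)
                  (1 - x1) * x0,            # P(W1 selected)
                  x1 * (1 - x0),            # P(W2 selected)
                  x1 * x0])                 # P(W3 selected)
    return float(p @ W), p

def slut2_backward(x0, x1, W, p, grad_y):
    grad_W = grad_y * p                     # dy/dWi is the selection probability
    grad_x0 = grad_y * ((1 - x1) * (W[1] - W[0]) + x1 * (W[3] - W[2]))
    grad_x1 = grad_y * ((1 - x0) * (W[2] - W[0]) + x0 * (W[3] - W[1]))
    return grad_W, grad_x0, grad_x1

W = np.array([0.1, 0.9, 0.9, 0.1])          # table values resembling XOR
y, p = slut2_forward(0.8, 0.2, W)
print(y)                                    # ~0.64: leans toward 1, like XOR(1, 0)
```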
16. Learning / Prediction comparison

Network                    Matrix  Weight       Activation   Convolution (Deep Network)  FPGA mapping (1 node →)
Binary Connect             Dense   Binary       Real (FP32)  OK                          many adders
Binarized Neural Network   Dense   Binary       Binary       OK                          many XNORs
XNOR-Network               Dense   Binary       Binary       OK                          many XNORs
LUT-Network                Sparse  Real (FP32)  none         OK                          1 LUT

Learning performance on CPUs/GPUs and prediction performance on FPGA are rated “good” to “excellent” across these networks; the LUT-Network’s 1-node-to-1-LUT mapping gives it excellent prediction performance on FPGA, benchmarked against the other binary networks.
17. Demonstration 1 [MNIST MLP 1000fps]
[Figure: block diagram of the demo on a ZYBO Z7-20 with an original camera board. Raspberry Pi Camera V2 (Sony IMX219), 640x132 @ 1000 fps → MIPI-CSI RX → DNN (LUT-Net, RTL generated by offline learning with BinaryBrain on a PC) → SERDES TX/RX (FIN1216) → OLED (UG-9664HDDAG01) at 1000 fps. The PL runs Jelly; the PS (Linux) handles control, I2C, and DMA to DDR3 SDRAM; a debug view is sent over Ethernet to a PC X-Window.]
YouTube movie: https://www.youtube.com/watch?v=NJa77PZlQMI
26. Oversampling and binary modulation
• Oversampling and quantization by modulation:
• PWM (pulse-width modulation)
• delta-sigma modulation
• random dither, etc.
• For example, high-speed camera images inherently contain noise.
• An LPF (low-pass filter) removes the noise and dequantizes the signal,
so regression analysis becomes possible (see the sketch below).
• e.g.) the LPF can be constructed from IIR/FIR/Kalman filters
[Figure: quantization by modulation (with random noise or a local oscillator) → binary DNN → LPF. Human senses include an LPF.]
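A runnable toy of the modulation → LPF path (my example, not from the slides): random dither turns a real value into an oversampled bit stream, and a mean filter dequantizes it back, which is what makes regression through all-binary signals possible.

```python
# Random-dither binary modulation and LPF (mean) reconstruction.
import numpy as np

rng = np.random.default_rng(0)
value = 0.37                                          # "analog" input in [0, 1]
bits = (value > rng.random(1000)).astype(np.float32)  # 1000x oversampled dithered bits
print(bits.mean())                                    # LPF output ~= 0.37
```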
27. Architecture proposal for real-time processing
[Figure: Video-In → DNN → Video-Out, with frame memory fed back through ME/MC (motion estimation/compensation); similar to an IIR filter]
28. Next approach
• Improving the sparse-connection rules
• Currently, connections are selected at random; but real data has locality, as CNNs exploit.
• One method is to choose connection destinations probabilistically by node distance, e.g. with a Gaussian function (see the sketch below).
• I want to build stacked connections in a pyramid structure.
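A sketch of such a Gaussian connection rule (assumed 1-D node positions and sigma; this illustrates the proposal, not existing BinaryBrain behavior):

```python
# Pick each LUT's inputs from nearby previous-layer nodes, with probability
# falling off as a Gaussian of node distance.
import numpy as np

def gaussian_connections(n_prev, node_pos, n_inputs=6, sigma=8.0, seed=0):
    rng = np.random.default_rng(seed)
    distance = np.abs(np.arange(n_prev) - node_pos)   # 1-D distance to previous-layer nodes
    p = np.exp(-0.5 * (distance / sigma) ** 2)        # Gaussian weighting by distance
    return rng.choice(n_prev, size=n_inputs, replace=False, p=p / p.sum())

print(gaussian_connections(n_prev=256, node_pos=100))  # mostly nodes near index 100
```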
29. References
• BinaryConnect: Training Deep Neural Networks with binary weights during propagations
https://arxiv.org/pdf/1511.00363.pdf
• Binarized Neural Networks
https://arxiv.org/abs/1602.02505
• Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations
Constrained to +1 or -1
https://arxiv.org/abs/1602.02830
• XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
https://arxiv.org/abs/1603.05279
• Xilinx UltraScale Architecture Configurable Logic Block User Guide
https://japan.xilinx.com/support/documentation/user_guides/ug574-ultrascale-clb.pdf
30. My Profile
• Open-source programmer (hobbyist)
• Born in 1976; I live in Fukuoka City, Japan
• 1998~ published HOS (a real-time OS [μITRON])
• https://ja.osdn.net/projects/hos/
(ARM, H8, SH, MIPS, x86, Z80, AM, V850, MicroBlaze, etc.)
• 2008~ published Jelly (soft-core CPU for FPGA)
• https://github.com/ryuz/jelly
• http://ryuz.my.coocan.jp/jelly/toppage.html
• 2018~ published LUT-Network
• https://github.com/ryuz/BinaryBrain
• Real-Time AR glasses project (my current hobby)
• Real-Time glasses (camera [IMX219] & OLED, 1000 fps)
https://www.youtube.com/watch?v=wGRhw9bbiik
• Real-Time GPU (no-frame-buffer architecture)
https://www.youtube.com/watch?v=vl-lhSOOlSk
• Real-Time DNN (LUT-Network)
https://www.youtube.com/watch?v=aYuYrYxztBU