"Trade-offs in Implementing Deep Neural Networks on FPGAs," a Presentation from Auviz Systems

Copyright © 2015 Auviz Systems
Nagesh Gupta
12 May 2015
Trade-offs in Implementing Deep Neural
Networks on FPGAs

Auviz Systems
• Startup that specializes in implementing & optimizing algorithms on FPGAs
• Offers libraries of different classes of algorithms
  • AuvizCV: optimized OpenCV algorithms
  • AuvizLA: optimized BLAS
  • AuvizDNN: optimized deep neural networks
• Also develops custom algorithms in Computer Vision, Linear Algebra,
  Deep Learning & Machine Learning
• Available as OpenCL function calls for software users to abstract the
  complexity of using an FPGA
• Visit our booth & see AlexNet running on a Xilinx FPGA!

The Time for Artificial Intelligence &
Machine Learning
• Sources: Cisco/Statista, Facebook research, IT Business Edge

Machine Learning Moving to the Data Center
Performance/watt
Programming model &
use model
Microsoft Azure ML: provides Machine Learning as a service on the cloud
IBM Watson at Jeopardy: one of the best demonstrations of Machine Learning
Amazon AWS ML & Google Predictive Analytics: other Machine Learning
services on the cloud

Convolutional Neural Networks (CNNs)
• A form of Deep Neural Networks, used for various “recognition” tasks
• AlexNet [2] is a CNN configuration that was used to classify
  1.2 million images

Components of AlexNet: Convolution layers
• A convolution layer has multiple stages
  • 3D convolutions
  • Activation: using the ReLU function, max(x, 0)
  • Max pooling: sub-sampling function that selects the max value
    within a neighborhood
[Figure: 3D convolutions, activation (ReLU), sub-sampling (max pooling)]
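The activation and max pooling stages can be sketched in a few lines of plain Python. This is an illustration only, not the Auviz implementation: the function names, the 2x2 window with stride 2, and the toy 4x4 feature map are all assumptions made for the example.

```python
def relu(feature_map):
    """Apply max(x, 0) to every value of a 2D feature map (a list of rows)."""
    return [[max(x, 0.0) for x in row] for row in feature_map]


def max_pool(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size neighborhood."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Toy 4x4 feature map (hypothetical values).
fmap = [[-1.0, 2.0, 0.5, -3.0],
        [4.0, -2.0, 1.5, 0.0],
        [0.0, -1.0, -2.0, -4.0],
        [3.0, 1.0, -0.5, 2.5]]
activated = relu(fmap)
pooled = max_pool(activated)
print(pooled)  # [[4.0, 1.5], [3.0, 2.5]]
```

Both stages are element-wise or small-window operations, so they are cheap compared with the 3D convolutions that precede them.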

Dense Layers in AlexNet
• Dense layers are fully connected: each output node is a function of
  all the input nodes
• The first 2 dense layers can be represented as a matrix-vector
  multiplication operation
  • Layer 6 has 9216 inputs which are multiplied with a weight matrix to
    create 4096 outputs
  • Layer 7 has 4096 inputs which are multiplied with a different weight
    matrix to create 4096 outputs
• The output layer uses SoftMax to classify the input image into one of
  1000 classes
[Figure: Layer 6, Layer 7, output layer]
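A dense layer as a matrix-vector product, followed by SoftMax, might look like the sketch below. The sizes are toy values chosen for readability (AlexNet's layer 6 would use a 4096 x 9216 weight matrix), and the helper names are hypothetical.

```python
import math


def dense(weights, inputs):
    """Matrix-vector product: each output is a weighted sum of all inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy layer: 3 inputs, 2 outputs (hypothetical weights).
weights = [[1.0, 0.0, -1.0],
           [0.5, 0.5, 0.5]]
inputs = [2.0, 1.0, 1.0]

scores = dense(weights, inputs)            # [1.0, 2.0]
probs = softmax(scores)
predicted_class = probs.index(max(probs))  # index of the most likely class

# One multiply-accumulate per weight: layer 6 needs 9216 * 4096 of them.
layer6_macs = 9216 * 4096
print(scores, predicted_class, layer6_macs)
```

Because every weight is used exactly once per input image, the number of weights fetched equals the number of multiply-accumulates, which is why dense layers are dominated by data transfers.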

Implementing 3D Convolutions
• Sequential implementation
  • Implementation follows the convolution equations
  • Resource utilization will be very low, but the latency at 200 MHz
    will be 22 s for the 2nd layer
• High-level synthesis (HLS) can be used for the implementation, as
  shown in [3]
• Get better performance by parallelizing the implementation
[Figure: weight matrices, input feature maps, output feature maps]
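A sequential implementation that follows the convolution equations directly is essentially a six-deep loop nest. The sketch below is a plain Python illustration under simplifying assumptions (no padding, square filters, hypothetical function and variable names); it also counts multiply-accumulates to show where the latency comes from.

```python
def conv3d_direct(inputs, weights, stride=1):
    """Direct 3D convolution.

    inputs:  N feature maps, each R x C (nested lists)
    weights: M filters, each N x K x K
    Returns (M output feature maps, number of multiply-accumulates).
    """
    n_in = len(inputs)
    rows, cols = len(inputs[0]), len(inputs[0][0])
    k = len(weights[0][0])
    out_rows = (rows - k) // stride + 1
    out_cols = (cols - k) // stride + 1
    outputs = []
    macs = 0
    for m in range(len(weights)):              # each output feature map
        out_map = []
        for r in range(out_rows):              # each output row
            out_row = []
            for c in range(out_cols):          # each output column
                acc = 0.0
                for n in range(n_in):          # each input feature map
                    for i in range(k):         # filter rows
                        for j in range(k):     # filter columns
                            acc += (weights[m][n][i][j]
                                    * inputs[n][r * stride + i][c * stride + j])
                            macs += 1
                out_row.append(acc)
            out_map.append(out_row)
        outputs.append(out_map)
    return outputs, macs


# Toy example: 2 input maps of 3x3, one 2x2 filter of all ones.
inputs = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
          [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]]
weights = [[[[1.0, 1.0], [1.0, 1.0]],
            [[1.0, 1.0], [1.0, 1.0]]]]
outputs, macs = conv3d_direct(inputs, weights)
print(outputs, macs)  # [[[16.0, 20.0], [28.0, 32.0]]] 32
```

With one multiply-accumulate issued at a time, the run time grows with the product of all six loop bounds, which is what makes the sequential version slow for a layer with hundreds of feature maps.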

Computations vs. Data Transfers in AlexNet
[Chart: computations and data transfers, logarithmic axis from 1.E+00 to 1.E+09]
• Computation latency, 2nd convolution layer
  • With 512 single precision floating point operations running in
    parallel, the 2nd convolution layer takes 2.2 ms to complete at
    200 MHz
• Data transfer latency, 2nd convolution layer
  • With 64-bit DDR at 1.3 Gb/s, the single precision floating point
    data fetch latency is around 0.5 ms
3D convolutions require more computations, while data transfers are
higher for the dense layers

3D Convolution—Parallel Implementation
• An 11x11 weight matrix with 3 input feature maps requires 121*3
multiplications and 121*3 additions
• With 363 multiply units and 363 adders, this can be done in 1 cycle
• The FPGA resources required for each single precision floating point
operation are 2-5 DSP blocks and 200-400 LUTs
• Implementing this in parallel will require ~1200 DSPs and ~75000 LUTs
[Figure: 11x11 weight matrix x 11x11 input feature map = 1 output value]

Increasing Throughput With Pipelining
• Pipelining is a hardware concept to achieve higher throughput
• Helpful with complex multi-cycle operations—works by registering
intermediate results
• Pipeline 3D convolutions on one dimension & parallelize the other
• For example, convolve the weight matrix with an input feature
map in parallel, and pipeline for different feature maps
• Zhang et al. [3] convolve a set of input feature maps with a set of
weight matrices in parallel, and pipeline across the size of the input
feature map
[Figure: input feature maps (NxRxC), M weight filters of size NxKxK, and
output feature maps (MxR’xC’), with tile sizes Tn, Tm, Tr and Tc]
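The tiling idea can be sketched as a loop nest in which the outer loops step through tiles and the innermost loops cover one Tm x Tn block of output and input feature maps. The code below loosely follows the loop structure described in [3] but is only an assumption-laden illustration: it uses stride 1 and no padding, the function name is hypothetical, and Python runs every loop serially, so it shows the structure rather than any speed-up.

```python
def conv3d_tiled(inputs, weights, tr, tc, tm, tn):
    """Tiled 3D convolution (stride 1, no padding).

    tr, tc: tile sizes over output rows and columns
    tm, tn: tile sizes over output and input feature maps
    """
    n_in = len(inputs)
    m_out = len(weights)
    k = len(weights[0][0])
    out_rows = len(inputs[0]) - k + 1
    out_cols = len(inputs[0][0]) - k + 1
    out = [[[0.0] * out_cols for _ in range(out_rows)] for _ in range(m_out)]

    for row in range(0, out_rows, tr):
        for col in range(0, out_cols, tc):
            for to in range(0, m_out, tm):
                for ti in range(0, n_in, tn):
                    # On an FPGA, the data for this tile would be held on-chip.
                    for i in range(k):
                        for j in range(k):
                            for r in range(row, min(row + tr, out_rows)):
                                for c in range(col, min(col + tc, out_cols)):
                                    # The two loops below are the candidates
                                    # for parallel hardware (Tm x Tn units).
                                    for m in range(to, min(to + tm, m_out)):
                                        for n in range(ti, min(ti + tn, n_in)):
                                            out[m][r][c] += (
                                                weights[m][n][i][j]
                                                * inputs[n][r + i][c + j])
    return out


# Toy example: 2 input maps of 3x3; filter m has every weight equal to m + 1.
inputs = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
          [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]]
weights = [[[[float(m + 1)] * 2 for _ in range(2)] for _ in range(2)]
           for m in range(3)]
tiled = conv3d_tiled(inputs, weights, tr=1, tc=2, tm=2, tn=1)
print(tiled)
```

The tile sizes in the example do not divide the loop bounds evenly on purpose, to show that the `min(...)` guards handle partial tiles.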

Mapping 3D Convolutions into Matrix Multiplications
• A simple way is to flatten the feature maps and create an array of
  feature maps; below is an illustration for the first layer of AlexNet
• The weight matrices are flattened and the input feature maps are
  rearranged for each column to have the neighborhood required for
  convolutions
[Figure: Y (96 x 3025, matrix of output feature maps) = W (96 x 363,
matrix of weight coefficients) x X (363 x 3025, matrix of input feature
maps), where 363 = 3 x 11 x 11 and 3025 = 55 x 55]
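The rearrangement into a matrix multiplication can be sketched as follows. The helper names (`im2col`, `flatten_weights`, `matmul`) are hypothetical and the example uses a tiny input rather than AlexNet's first layer, where W would be 96 x 363 and X would be 363 x 3025.

```python
def im2col(inputs, k, stride=1):
    """Build X: each column holds the N*K*K neighborhood of one output pixel."""
    rows, cols = len(inputs[0]), len(inputs[0][0])
    out_rows = (rows - k) // stride + 1
    out_cols = (cols - k) // stride + 1
    columns = []
    for r in range(out_rows):
        for c in range(out_cols):
            columns.append([fmap[r * stride + i][c * stride + j]
                            for fmap in inputs
                            for i in range(k) for j in range(k)])
    # Transpose so X has N*K*K rows and out_rows*out_cols columns.
    return [list(row) for row in zip(*columns)]


def flatten_weights(weights):
    """Build W: one row of N*K*K coefficients per output feature map."""
    return [[w for fmap in filt for row in fmap for w in row]
            for filt in weights]


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


# Toy example: 2 input maps of 3x3; filter m has every weight equal to m + 1.
inputs = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
          [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]]
weights = [[[[float(m + 1)] * 2 for _ in range(2)] for _ in range(2)]
           for m in range(3)]

X = im2col(inputs, k=2)        # 8 rows (2*2*2), 4 columns (2*2 outputs)
W = flatten_weights(weights)   # 3 rows, 8 columns
Y = matmul(W, X)               # 3 rows, 4 columns: flattened output maps
print(Y)
```

The price of this mapping is that X repeats input values wherever neighborhoods overlap, so it trades extra storage for a regular matrix-multiply structure.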

Implementation Challenges
• A larger number of compute units exhausts the FPGA resources
  • Each compute unit takes a few hundred LUTs and 3-5 DSPs
• Data must be organized so that the compute units stay fully utilized
  • Need to read a lot of data in parallel
  • Data has to be stored on-chip to enable parallel access
• Routing turns out to be a bigger challenge
  • Proper data organization, architecture & tools are the way to
    overcome it
[Chart: bits required per cycle (0 to 80000) vs. parallelism (256, 512,
768); series: bits per operation]

Using Alternate Data Representations
• Single precision floating point
  • Uses 32 bits to represent each data value
  • Requires more DSPs (3-5) to implement multiply/accumulate
• Fixed point
  • 16-bit fixed point representation would suffice for many
    applications [4]
  • Stochastic rounding techniques perform similarly to single precision
    floating point representation [5]
• Half precision
  • Uses 16 bits to represent data
  • Significant reduction in routing & overall FPGA resources
• Mixed representation
  • Use fixed point or half precision representation for some layers and
    single precision representation for other layers
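As a rough illustration of the fixed point option, the sketch below converts values to a 16-bit signed fixed point format using round-to-nearest and a stochastic rounding rule of the kind discussed in [5]. The split into 8 fractional bits, the saturation behavior and all function names are assumptions made for the example, not details taken from the presentation.

```python
import math
import random


def saturate(value, total_bits=16):
    """Clamp an integer to the signed range of total_bits."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, value))


def to_fixed(value, frac_bits=8, total_bits=16):
    """Round-to-nearest conversion to a signed fixed point integer."""
    scaled = value * (1 << frac_bits)
    return saturate(math.floor(scaled + 0.5), total_bits)


def to_fixed_stochastic(value, frac_bits=8, total_bits=16, rng=random):
    """Round up with probability equal to the fractional remainder."""
    scaled = value * (1 << frac_bits)
    lower = math.floor(scaled)
    round_up = rng.random() < (scaled - lower)
    return saturate(lower + (1 if round_up else 0), total_bits)


def from_fixed(fixed, frac_bits=8):
    """Convert a fixed point integer back to a float."""
    return fixed / (1 << frac_bits)


nearest = to_fixed(0.3)  # 0.3 * 256 = 76.8, rounds to 77

# Stochastic rounding returns 76 or 77; the average approaches 76.8.
rng = random.Random(0)
samples = [to_fixed_stochastic(0.3, rng=rng) for _ in range(10000)]
mean_sample = sum(samples) / len(samples)
print(nearest, from_fixed(nearest), mean_sample)
```

Round-to-nearest always maps 0.3 to the same code, while stochastic rounding is unbiased on average, which is the property [5] relies on.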

A complete CNN on the FPGA using OpenCL
• OpenCL tools enable software programmers to use the FPGA accelerator
  without learning hardware methodologies
• Programmer calls OpenCL functions to accelerate on the FPGA
[Figure: Configure & setup, 3D convolutions, dense layers, Softmax]

Performance of AlexNet on FPGAs
FPGAs can achieve an impressive 14 images/sec/Watt, compared to
4 images/sec/Watt for high-end GPUs such as the Tesla K40

Summary
• 3D convolutions are a key part of a CNN, and are compute intensive
• In FPGAs, 3D convolutions can be implemented efficiently with a
  parallel & pipelined implementation
• FPGA resources (gates & routing) will be the critical factors in
  achieving a highly parallel implementation
• OpenCL implementation tools, such as Xilinx SDAccel, simplify the
  implementation task and provide a software flow
• Alternate data representations can be used to reduce the complexity
• Mixed data representations can simplify the computations without
  compromising on the performance
• FPGAs are capable of delivering high performance at a suitable power
  profile for the data center

References
• [1] K. Ovtcharov et al., Accelerating Deep Convolutional Neural
  Networks Using Specialized Hardware, Microsoft Research, 2015
• [2] A. Krizhevsky, I. Sutskever and G. E. Hinton, ImageNet Classification
  with Deep Convolutional Neural Networks, Advances in Neural
  Information Processing Systems, 2012
• [3] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao and J. Cong, Optimizing
  FPGA-based Accelerator Design for Deep Convolutional Neural
  Networks, FPGA'2015, 2015
• [4] C. Farabet, Y. LeCun, K. Kavukcuoglu, E. Culurciello, B. Martini,
  P. Akselrod and S. Talay, Large-scale FPGA-based Convolutional
  Networks, in Machine Learning on Very Large Data Sets, 2011
• [5] S. Gupta, A. Agrawal, K. Gopalakrishnan and P. Narayanan, Deep
  Learning with Limited Numerical Precision, arXiv preprint
  arXiv:1502.02551, 2015

Nagesh Gupta
12 May 2015
Deep Neural Networks in FPGAs

Computations vs. Data Transfers

Convolution layers
Input size   Input feature maps   Output feature maps   Filter size   Computations   Total data transfer
224 x 224    3                    96                    11x11         110 * 10^6     255 * 10^3
27 x 27      96                   256                   5x5           448 * 10^6     728 * 10^3
13 x 13      256                  384                   3x3           150 * 10^6     993 * 10^3
13 x 13      384                  384                   3x3           224 * 10^6     1457 * 10^3
13 x 13      384                  256                   3x3           150 * 10^6     959 * 10^3

Dense layers
Input data   Weight matrix   Computations   Data transfers
9216         9216 x 4096     38 * 10^6      38 * 10^6
4096         4096 x 4096     16 * 10^6      16 * 10^6
4096         4096 x 1000     4 * 10^6       4 * 10^6
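The computation counts in these tables can be approximated by counting one multiply-accumulate per weight per output position. The sketch below is a back-of-the-envelope check under stated assumptions: the first layer produces a 55 x 55 output (as on the matrix-mapping slide), layers 2 to 5 produce outputs the same size as their inputs, and every output map is connected to every input map. The function names are hypothetical.

```python
def conv_macs(out_size, in_maps, out_maps, k):
    """Multiply-accumulates for a convolution layer with square outputs."""
    return out_size * out_size * in_maps * out_maps * k * k


def dense_macs(n_inputs, n_outputs):
    """Multiply-accumulates for a fully connected layer."""
    return n_inputs * n_outputs


# (output size, input feature maps, output feature maps, filter size)
conv_layers = [
    (55, 3, 96, 11),
    (27, 96, 256, 5),
    (13, 256, 384, 3),
    (13, 384, 384, 3),
    (13, 384, 256, 3),
]
# (inputs, outputs)
dense_layers = [(9216, 4096), (4096, 4096), (4096, 1000)]

conv_counts = [conv_macs(*layer) for layer in conv_layers]
dense_counts = [dense_macs(*layer) for layer in dense_layers]

for count in conv_counts + dense_counts:
    print(f"{count / 1e6:.1f} * 10^6")
```

Under these assumptions the first layer comes out at about 105 * 10^6, slightly below the 110 * 10^6 in the table; the table value would be consistent with a 56 x 56 output, but the presentation does not state which size was used. The remaining rows agree with the table after rounding.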