FPGA Inference - DellEMC SURFsara
1. Superpowered to Low Power: Deploying supercomputer-trained deep learning models for inference on Intel® FPGAs
Lucas A. Wilson, PhD
HPC & AI Engineering, Dell EMC
@lucasawilson
2. Dell EMC / Intel / SURFsara Collaboration
Lucas A. Wilson, Vineet Gundecha, and Alex Filby
Valeriu Codreanu and Damian Podareanu
Vikram Saletore and Shawn Slockers
3. Two Phases of Machine Learning
Training: learn a suitable function approximation that responds correctly to the overwhelming majority of test cases.
Inference / Deployment: use the learned function approximation to respond to new cases.
5. Superpowered Neural Network Training

Time to solution, in seconds, by training configuration (P = number of processes, BZ = per-worker batch size, GBZ = global batch size):

DenseNet121, P=1, BZ=8               386,845 s
DenseNet121, P=64, BZ=64, GBZ=4096     8,319 s
VGG16, P=128, GBZ=8192                16,532 s
ResNet50, P=512, GBZ=4096              1,742 s
ResNet50, P=512, GBZ=8192              1,362 s
ResNet50, P=800, GBZ=8000                825 s
ResNet50, P=1024, GBZ=8192               675 s

4.5 DAYS to reach a solution with DenseNet121 using 2 Intel® Xeon® Scalable Gold 6148 processors
11.25 MINUTES to reach a solution with ResNet50 using 512 Intel® Xeon® Scalable Gold 6148 processors!
573x FASTER total time to solution going from 1 to 256 Dell EMC PowerEdge C6420 nodes
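The headline numbers on this slide can be checked with a few lines of arithmetic, using the two endpoint times from the chart:

```python
# Sanity-check the callout numbers from the training-time chart.
densenet_1proc_s = 386_845    # DenseNet121, P=1, BZ=8 (slowest run)
resnet_1024proc_s = 675       # ResNet50, P=1024, GBZ=8192 (fastest run)

print(densenet_1proc_s / 86_400)             # seconds -> days (~4.5 days)
print(resnet_1024proc_s / 60)                # seconds -> minutes (11.25 minutes)
print(densenet_1proc_s / resnet_1024proc_s)  # overall speedup (~573x)
```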
6. The Price of Performance
256x Dell EMC PowerEdge C6420 nodes, ~450 W* each (Intel® Xeon® Gold 6148 + 6148F) (*measured median usage per node)
675 sec. runtime to train the model
21.6 kWh of energy to train the radiologist model
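The 21.6 kWh figure follows directly from the numbers on this slide:

```python
# Energy to train: 256 nodes at ~450 W median draw for 675 seconds.
nodes = 256
watts_per_node = 450
runtime_s = 675

joules = nodes * watts_per_node * runtime_s   # 77,760,000 J
kwh = joules / 3_600_000                      # 1 kWh = 3.6 MJ
print(kwh)  # 21.6 kWh
```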
7. 21.6 kWh of Electricity is Equivalent to…
Emitting 20 lbs of CO2 by burning anthracite coal (https://www.eia.gov/tools/faqs/faq.php?id=73&t=11)
Keeping 360 60 W light bulbs on for 1 hour
Running a hair dryer for 18 hours
Running a Student Cluster Challenge system for 7 hrs!* (*at maximum allowed power)
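The light-bulb and hair-dryer equivalences fall out of the energy figure. The 1.2 kW dryer rating is an assumption (it is the wattage implied by the slide's 18-hour figure, not stated in the deck):

```python
# Back out the equivalences from the 21.6 kWh training-energy figure.
energy_kwh = 21.6
bulb_kw = 0.060   # one 60 W light bulb
dryer_kw = 1.2    # assumed hair-dryer rating, implied by the 18-hour figure

print(energy_kwh / bulb_kw)   # 360 bulb-hours, i.e. 360 bulbs for 1 hour
print(energy_kwh / dryer_kw)  # 18 hours of hair drying
```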
11. Preparing the Model for Inference* (*according to the instructions)
1. Remove Horovod nodes from the TensorFlow graph (tensorflow/optimize_for_inference.py)
2. Apply checkpoint weights and generate a binary file (tensorflow/freeze_graph.py)
3. Provide the binary file to the Intel® distribution of OpenVINO™ toolkit
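A sketch of what the two TensorFlow 1.x tool invocations above look like. The file paths and the input/output node names (`input_1`, `dense_1/Softmax`) are illustrative placeholders, not values from the talk:

```python
# Hypothetical invocations of the two TensorFlow 1.x tools named above.
# All file paths and node names are illustrative placeholders.
optimize_cmd = [
    "python", "-m", "tensorflow.python.tools.optimize_for_inference",
    "--input", "trained_graph.pb",
    "--output", "inference_graph.pb",
    "--input_names", "input_1",
    "--output_names", "dense_1/Softmax",
]
freeze_cmd = [
    "python", "-m", "tensorflow.python.tools.freeze_graph",
    "--input_graph", "inference_graph.pb",
    "--input_checkpoint", "model.ckpt",
    "--output_node_names", "dense_1/Softmax",
    "--output_graph", "frozen_model.pb",
]
# subprocess.run(optimize_cmd, check=True) would execute the first step.
print(" ".join(freeze_cmd))
```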
12. Preparing the Model for Inference* (*the way that actually worked)
1. Save checkpoint files while training
2. Load the weights into an equivalent Keras model and generate a TF checkpoint and protobuf (make sure to set training mode to False (0))
3. Apply checkpoint weights and generate a binary file (tensorflow/freeze_graph.py)
4. Provide the binary file to the Intel® distribution of OpenVINO™ toolkit
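The working path above can be sketched roughly as follows. The model constructor and file paths are hypothetical; the essential detail from the slide is setting the Keras learning phase to 0 (inference mode) before exporting:

```python
def export_for_openvino(weights_path, export_dir):
    """Rough sketch of the checkpoint -> Keras -> graph export path (TF 1.x era).

    File names and the DenseNet121 constructor call are illustrative; the key
    step from the slide is K.set_learning_phase(0) before building the model.
    """
    import tensorflow as tf
    from tensorflow.keras import backend as K
    from tensorflow.keras.applications import DenseNet121

    # Inference mode: disables dropout and fixes batch-norm statistics.
    K.set_learning_phase(0)

    model = DenseNet121(weights=None)   # equivalent Keras model
    model.load_weights(weights_path)    # weights checkpointed during training

    sess = K.get_session()
    # Write a TF checkpoint plus a graph protobuf for freeze_graph.py.
    tf.compat.v1.train.Saver().save(sess, export_dir + "/model.ckpt")
    tf.io.write_graph(sess.graph_def, export_dir, "model.pbtxt")
```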
14. Flashing the Bitstream to Intel® PAC with Arria® 10

aocl program <target> <bitstream_dir>/2-0-1_RC_FP11_ResNet50-101.aocx

The bitstream filename encodes the precision (FP11) and the topology (ResNet50-101); <target> selects the device. 4 targets are available in the test system: acl[0-3].
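As the slide notes, the bitstream filename encodes the precision and topology. A small illustrative parser; the field layout is inferred from this one filename, so treat it as an assumption rather than a general Intel naming rule:

```python
def parse_bitstream_name(filename):
    """Split an .aocx filename of the form <version>_RC_<precision>_<topology>.aocx.

    Layout inferred from '2-0-1_RC_FP11_ResNet50-101.aocx'; other bitstreams
    may not follow the same convention.
    """
    stem = filename.rsplit(".", 1)[0]                 # drop the .aocx extension
    version, _rc, precision, topology = stem.split("_")
    return precision, topology

print(parse_bitstream_name("2-0-1_RC_FP11_ResNet50-101.aocx"))  # ('FP11', 'ResNet50-101')
```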
15. Executing the Model on FPGA/CPU using the Intel® distribution of OpenVINO™ toolkit
Pipeline: image preprocessing → convolution/activation/batch norm → classification
Input: 150,528 B of image data (224x224x3)
Output: 114,744 B of pre-classification model output (28,686 x 32b)
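The buffer sizes quoted on the slide check out (224x224x3 bytes in, 28,686 32-bit values out), and a heterogeneous FPGA/CPU run can be sketched with the 2019-era OpenVINO Inference Engine Python API. The model paths are placeholders, and exact API names vary across OpenVINO releases, so this is a sketch rather than the code used in the talk:

```python
# Buffer sizes quoted on the slide.
INPUT_BYTES = 224 * 224 * 3   # 150,528 B of uint8 image data
OUTPUT_BYTES = 28_686 * 4     # 114,744 B of 32-bit pre-classification output

def run_on_fpga(model_xml, model_bin, image):
    """Sketch of a HETERO:FPGA,CPU inference run (2019-era OpenVINO API).

    Paths and input names depend on the converted model; API details are
    from Inference Engine documentation of that period, not from the talk.
    """
    from openvino.inference_engine import IECore

    ie = IECore()
    net = ie.read_network(model=model_xml, weights=model_bin)
    # Layers without FPGA support fall back to the CPU plugin.
    exec_net = ie.load_network(network=net, device_name="HETERO:FPGA,CPU")
    input_name = next(iter(net.inputs))
    return exec_net.infer({input_name: image})

print(INPUT_BYTES, OUTPUT_BYTES)  # 150528 114744
```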
In addition to raw throughput, time-to-solution improved dramatically. Because only the sequential DenseNet model converged to an acceptable accuracy, the comparison is made against that runtime. The ResNet50 runs were able to exceed the generalization accuracy of the sequential DenseNet model while being highly scalable.
There are many reasons why an organization would want to deploy AI models at a location other than the data center where the model was trained. It may make more sense to deploy the model to a data center closer to where the model will be used. The model may be intended for deployment
During data streaming, the PAC draws an additional 9-11 W (41-43 W total).