This document provides a summary of a presentation about quantized neural network inference on FPGAs using FINN and LogicNets. It discusses:
- Xilinx Research Labs in Dublin and their work on quantized machine learning for Xilinx devices.
- How neural network quantization can improve efficiency by reducing precision while trading off accuracy, and how this is well-suited for FPGAs.
- The FINN toolflow which includes quantization-aware training in PyTorch with Brevitas, the FINN compiler to map networks to hardware, and deployment with PYNQ.
- LogicNets, which further improves efficiency by unfolding DNNs into fully pipelined datapath circuits.
6. Benefits of Quantization on FPGAs
Xilinx UltraScale+ MPSoC ZU19EG (Vivado HLS, conservative estimates):

Precision | On-chip weights | Approx. peak GOPS
1b        | ~60 M           | 66 000
4b        | ~30 M           | 20 000
8b        | ~10 M           | 4 000
16b       | ~5 M            | 1 000
32b       | ~2 M            | 300

Going from 32b to 1b buys roughly 30x more on-chip weight capacity (memory) and roughly 200x more peak GOPS (compute): trillions of quantized operations per second, and weights can stay entirely on-chip.
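The memory column of the table follows from simple arithmetic: capacity is on-chip bits divided by bits per weight. A minimal sketch, assuming a ballpark of ~70 Mb of combined BRAM+URAM for a large UltraScale+ part (an illustrative figure, not an exact ZU19EG spec):

```python
# capacity = on-chip bits / bits per weight
# ON_CHIP_BITS is an assumed ballpark, not a datasheet value
ON_CHIP_BITS = 70_000_000

capacity = {bits: ON_CHIP_BITS // bits for bits in (1, 4, 8, 16, 32)}
for bits, n in capacity.items():
    print(f"{bits:2d}b weights: ~{n / 1e6:.0f} M fit on-chip")
```

The 1b-vs-32b ratio is exactly the bit-width ratio (32x for memory), which matches the ~30x annotation in the table.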
Great for energy efficiency! But what about accuracy?
14. Quantization-Aware Training in PyTorch with Brevitas
[Toolflow diagram: QNN training in PyTorch with Brevitas -> FINN Compiler (frontends, transformations, dataflow backend) -> deployment with PYNQ]
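The operation at the heart of quantization-aware training can be sketched without Brevitas: in the forward pass, weights are passed through a "fake quantization" step (quantize, then dequantize back to real values), while gradients flow through as if the rounding were not there (a straight-through estimator). A minimal NumPy sketch of the forward-pass quantizer, with illustrative bit widths and values:

```python
import numpy as np

def fake_quant(x, bits, x_max=1.0):
    # uniform symmetric quantization: snap x onto a grid with
    # 2**(bits-1)-1 positive levels, then map back to real values
    levels = 2 ** (bits - 1) - 1
    scale = x_max / levels
    return np.clip(np.round(x / scale), -levels, levels) * scale

w = np.array([0.37, -0.92, 0.05])
w4 = fake_quant(w, bits=4)  # only 15 distinct values remain
```

Brevitas wraps this idea into drop-in quantized PyTorch layers, so the bit width becomes a training-time hyperparameter rather than a post-hoc compression step.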
16. The FINN Compiler
17. An Overview of the FINN Compiler
› Python library of graph transformations
» Each consumes and produces an ONNX graph
› User calls a sequence of transformations to create their own flow
» Example end-to-end flows are provided to get started
[Compiler flow diagram: ONNX import -> streamlining -> hardware mapping -> resource allocation (guided by a hardware cost model) -> code generation of a synthesizable description against the FINN HLS library -> Vivado synthesis and PAR; a software library provides the host run-time on the FPGA platform]
https://github.com/Xilinx/finn
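The transformation pattern described above can be sketched in plain Python. All names below are illustrative, not the real finn API: each pass consumes a graph and returns the (possibly modified) graph plus a flag saying whether anything changed, and the user chains passes into a flow:

```python
class Transformation:
    """One graph-rewriting pass: consumes a graph, produces a graph."""
    def apply(self, graph):
        raise NotImplementedError

class RemoveIdentityNodes(Transformation):
    # toy pass: drop no-op nodes from a graph modeled as a node list
    def apply(self, graph):
        kept = [n for n in graph if n != "Identity"]
        return kept, len(kept) != len(graph)

def run_flow(graph, transforms):
    # apply each pass repeatedly until it reaches a fixed point
    for t in transforms:
        changed = True
        while changed:
            graph, changed = t.apply(graph)
    return graph

graph = ["Conv", "Identity", "Relu", "Identity"]
graph = run_flow(graph, [RemoveIdentityNodes()])
```

Returning a changed-flag lets the driver rerun a pass until it converges, which is why each transformation can stay small and composable.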
18. Deployment with PYNQ
19. Deployment with PYNQ for Python Productivity
› Use PYNQ-provided Python abstractions and drivers
› User provides a NumPy array in, calls the driver, gets a NumPy array out
» Internally uses the PYNQ DMA driver to write/read NumPy arrays to/from I/O streams
from pynq import Overlay, allocate
import numpy as np

# load the FINN-generated bitstream (filename and DMA instance
# name are illustrative; they depend on the block design)
overlay = Overlay("finn_design.bit")
dma = overlay.axi_dma_0

# numpy array shapes for packed i/o
ishape_packed = (1, 49, 2)
oshape_packed = (1, 1, 40)

# allocate DMA-accessible buffers for input and output
ibuf_packed_device = allocate(shape=ishape_packed, dtype=np.uint8)
obuf_packed = allocate(shape=oshape_packed, dtype=np.uint8)

# set up the DMA
dma.sendchannel.transfer(ibuf_packed_device)
dma.recvchannel.transfer(obuf_packed)
# wait until all transfers complete
dma.sendchannel.wait()
dma.recvchannel.wait()
https://github.com/Xilinx/PYNQ