Presented on July 20, 2011 in Gold Room (Regular session) at 11:40 - 12:00pm
http://ersaconf.org/ersa11/program/wed.php
The International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'11), Las Vegas, Nevada, USA
Accelerating Real-time processing of the ATST Adaptive Optics System using Coarse-grained Parallel Hardware Architectures
1. ERSA, Las Vegas, Nevada, July 2011
Vivek Venugopal (vivek@vivekvenugopal.net)
National Solar Observatory,
Sunspot, New Mexico
5. HOAO Real-time system
[Figure: block diagram of the HOAO real-time system. Processing blocks: WFS camera, cross-correlation slope computation, offscale slope detection, matrix multiply, actuator servos, deformable mirror, average slope, tip/tilt servos, tip/tilt mirror, data collection process, Zernike offload. Inputs: dark field image, flat field image, reference image, offscale slope tolerance, slope offsets, reconstruction matrix, actuator offsets, actuator gains, servo parameters.]
• 1750 sub-apertures and 1900 actuators
6. Camera data format
[Figure: two camera data halves, each 960 x 480 pixels]
• Camera data consists of two halves of 960x480 pixels
• Each half of camera data sent to FPGA using 12 channels
7. Scenario 1: FPGA-DSP
[Figure: each camera data half feeds one FPGA (FPGA 1, FPGA 2) over 12 optical fiber channels; each FPGA connects over 12 channels to 96 DSPs.]
• Pixel unpacking task - FPGA
• Processing - DSPs
8. Scenario 2: FPGA-DSP
[Figure: each camera data half feeds one FPGA (FPGA 1, FPGA 2) over 12 optical fiber channels; each FPGA connects over 12 channels to 48 DSPs.]
• Pixel unpacking, dark and flat correction - FPGA
• Cross-correlation and reconstruction matrix processing - DSPs
9. Dark and flat correction
• Dark pixel and flat pixel stored in RAM
• Flat corrected product is concatenated and written to FIFO
• Flat accumulated value can be used to update the reference image
[Figure: per-pixel datapath, repeated for the lanes labeled pixel0, pixel1, ... pixel16. The 8-bit dark_pixel is subtracted from the 10-bit pixel, the 10-bit difference is multiplied by the 8-bit flat_pixel to give the 18-bit flat_product, and an accumulator produces flat_acc.]
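The dark and flat correction datapath on this slide can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the FPGA implementation: the names `dark_flat_correct`, `concat_products` and `accumulate` are hypothetical, clamping negative differences to zero is an assumption, and so is placing the first lane in the most significant bits of the FIFO word. The bit widths (10-bit pixel, 8-bit dark and flat values, 18-bit product) follow the slide.

```python
PIXEL_BITS = 10                        # raw pixel width from the slide
REF_BITS = 8                           # dark_pixel and flat_pixel width
PRODUCT_BITS = PIXEL_BITS + REF_BITS   # 18-bit flat_product


def dark_flat_correct(pixels, dark, flat):
    """Return (pixel - dark) * flat for each lane."""
    products = []
    for p, d, f in zip(pixels, dark, flat):
        if not 0 <= p < (1 << PIXEL_BITS):
            raise ValueError("pixel does not fit in 10 bits")
        if not (0 <= d < (1 << REF_BITS) and 0 <= f < (1 << REF_BITS)):
            raise ValueError("dark/flat value does not fit in 8 bits")
        diff = max(p - d, 0)           # assumption: clamp negative results to zero
        products.append(diff * f)      # always fits in 18 bits
    return products


def concat_products(products):
    """Concatenate the 18-bit products into one FIFO word (first lane in the MSBs)."""
    word = 0
    for product in products:
        word = (word << PRODUCT_BITS) | product
    return word


def accumulate(flat_acc, products):
    """Add the new flat products to the running per-lane accumulators."""
    return [a + p for a, p in zip(flat_acc, products)]
```

With 16 lanes per word the concatenated output would be 16 x 18 = 288 bits wide.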
10. Pixel unpacking & Dark and flat correction
[Figure: per-channel pipeline, shown for channel 1, channel 2 and channel 12 (12 channels carry 1/2 camera). Blocks in each channel: data receiver, unpack, FIFO, dark-flat correction/accumulator. Shared blocks: synchronizer/counters, RAM holding the dark and flat reference image values, PCIe system bus. Width labels: 16, 128, 160, 256, 288. Timing labels: 206.8 ns and 20 ns.]
Clock domains: clock period = 9.42 ns (clock rate = 106.15 MHz) and clock period = 5 ns (clock rate = 200 MHz)
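The pixel unpacking step can be illustrated with a short sketch. The slides do not give the packing format, so this assumes 10-bit pixels packed back-to-back, most significant bit first, with zero padding at the tail; both function names are hypothetical and `pack_pixels` exists only to produce test input.

```python
def pack_pixels(pixels, bits_per_pixel=10):
    """Pack pixel values back-to-back, MSB first, padding the tail with zeros."""
    mask = (1 << bits_per_pixel) - 1
    word = 0
    for value in pixels:
        word = (word << bits_per_pixel) | (value & mask)
    total_bits = len(pixels) * bits_per_pixel
    pad = (-total_bits) % 8            # zero bits needed to reach a byte boundary
    return (word << pad).to_bytes((total_bits + pad) // 8, "big")


def unpack_pixels(packed, bits_per_pixel=10):
    """Split a byte string into fixed-width pixel values (MSB first)."""
    total_bits = len(packed) * 8
    count = total_bits // bits_per_pixel
    # Drop trailing pad bits that do not form a complete pixel.
    word = int.from_bytes(packed, "big") >> (total_bits - count * bits_per_pixel)
    mask = (1 << bits_per_pixel) - 1
    return [(word >> ((count - 1 - i) * bits_per_pixel)) & mask
            for i in range(count)]
```

Sixteen 10-bit pixels occupy 160 bits (20 bytes), the same width as the 160 label in the figure.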
11. Scenario 3: FPGA-GPU or FPGA-CPU
[Figure: each camera data half feeds one FPGA (FPGA 1, FPGA 2) over 12 optical fiber channels; both FPGAs connect to the GPU/CPU over the PCI-e bus.]
• Pixel unpacking, dark and flat correction - FPGA
• Cross-correlation and reconstruction matrix processing - GPU or CPU
13. Process mapping and partitioning
[Figure: FPGA stage: raw pixels (20x20) pass through dark correction (dark pixels, 20x20) and flat correction (flat pixels, 20x20). GPU stage: 2D cross-correlation with the reference pixels (20x20), then find maximum, then x and y interpolation.]
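15. find_max and interpolation routines
• Find the maximum value and its index
• Find x and y shifts using the interpolation equations
\[
\begin{aligned}
\text{num\_x} &= \text{max\_value} - \text{out}(\text{shifted\_y\_index},\ \text{shifted\_x\_index} - 1) \\
\text{den\_x} &= 2 \cdot \text{max\_value} - \text{out}(\text{shifted\_y\_index},\ \text{shifted\_x\_index} - 1) - \text{out}(\text{shifted\_y\_index},\ \text{shifted\_x\_index} + 1) \\
x &= (\text{shifted\_x\_index} - 0.5) + \frac{\text{num\_x}}{\text{den\_x}} \\
\text{num\_y} &= \text{max\_value} - \text{out}(\text{shifted\_y\_index} - 1,\ \text{shifted\_x\_index}) \\
\text{den\_y} &= 2 \cdot \text{max\_value} - \text{out}(\text{shifted\_y\_index} - 1,\ \text{shifted\_x\_index}) - \text{out}(\text{shifted\_y\_index} + 1,\ \text{shifted\_x\_index}) \\
y &= (\text{shifted\_y\_index} - 0.5) + \frac{\text{num\_y}}{\text{den\_y}}
\end{aligned}
\]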
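The find_max and interpolation equations on this slide can be sketched as below. This is an illustration under assumptions rather than the GPU or FPGA routine: the function names are hypothetical, the raw array indices of the peak are used for shifted_x_index and shifted_y_index (how the slides shift the index is not specified), and the fallback to the integer index for a border peak or a zero denominator is an added guard.

```python
def find_max(out):
    """Return (max_value, y_index, x_index) for a 2D list of correlation values."""
    best = (out[0][0], 0, 0)
    for y, row in enumerate(out):
        for x, value in enumerate(row):
            if value > best[0]:
                best = (value, y, x)
    return best


def interpolate_shift(out):
    """Sub-pixel peak position (x, y) using the slide's num/den equations."""
    max_value, yi, xi = find_max(out)
    rows, cols = len(out), len(out[0])
    x, y = float(xi), float(yi)        # fallback when interpolation is not possible

    if 0 < xi < cols - 1:
        num_x = max_value - out[yi][xi - 1]
        den_x = 2 * max_value - out[yi][xi - 1] - out[yi][xi + 1]
        if den_x != 0:
            x = (xi - 0.5) + num_x / den_x

    if 0 < yi < rows - 1:
        num_y = max_value - out[yi - 1][xi]
        den_y = 2 * max_value - out[yi - 1][xi] - out[yi + 1][xi]
        if den_y != 0:
            y = (yi - 0.5) + num_y / den_y

    return x, y
```

When the two neighbours of the peak are equal, num/den is 0.5 and the result is exactly the integer index, which is why the equations start from the index minus 0.5.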
16. GPU results
[Figure: two bar charts of time in us versus number of images, comparing the Tesla C1060 and Tesla C2050. Left panel, FFT correlation (axis 0 to 2200 us), for 1 and 50 images; bar labels 1889, 1619, 1510, 1188. Right panel, 7x7 correlation (axis 0 to 400 us), for 1, 50 and 584 images; bar labels 313, 307, 301, 278, 279, 281.]
Note: Least time indicates better performance
17. Reconstruction routine
[Figure: the reconstruction matrix (1900x3500) is multiplied by the 3500 x and y shifts for 1750 sub-aperture images (1750 x shifts and 1750 y shifts) to give the accumulated values for 1900 actuators.]
[Figure: bar chart of time in us (log scale, 1 to 100000) by device: Tesla C1060, Tesla C2050, DSP, CPU. Bar labels: 46769, 964, 956, 229.]
• 1750 sub-aperture x and y shifts
• 3500 x 1900 reconstruction matrix
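The reconstruction step is a matrix-vector product, sketched below at toy size. The function name `reconstruct` is hypothetical, and stacking all x shifts before all y shifts is an assumption; the slides only state that 1750 x and y shifts (3500 values) map to 1900 actuator values.

```python
def reconstruct(matrix, x_shifts, y_shifts):
    """Multiply the reconstruction matrix by the stacked shift vector.

    matrix has one row per actuator and one column per shift value
    (1900 rows and 3500 columns at full size).
    """
    shifts = list(x_shifts) + list(y_shifts)   # assumption: x block, then y block
    for row in matrix:
        if len(row) != len(shifts):
            raise ValueError("matrix row length must equal the number of shifts")
    # One multiply-accumulate chain per actuator.
    return [sum(r * s for r, s in zip(row, shifts)) for row in matrix]
```

At full size this is 1900 x 3500 = 6,650,000 multiply-accumulates per frame, which is why the routine is timed separately on each device.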
18. Scenario 4: FPGA-GPU or FPGA-CPU
[Figure: each camera data half feeds one FPGA (FPGA 1, FPGA 2) over 12 optical fiber channels; both FPGAs connect to the GPU/CPU over the PCI-e bus.]
• Pixel unpacking, dark and flat correction, cross-correlation - FPGA
• Reconstruction matrix processing - GPU or CPU
19. Cross-correlation
• Configure 400x392 (49x8 bits/pixel) RAM bank (RAM0-RAM19) with pre-computed reference pixels
• Multiply each pixel with corresponding reference pixel
[Figure: the 18-bit flat_product0 (flatcorr_value) is multiplied by each of the 8-bit reference pixels ref_pixel0 to ref_pixel48, read as one 392-bit ref_pixel word, giving the 26-bit products xcorr_product0 to xcorr_product48 (1274 bits of xcorr_value_per_pixel).]
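The multiply step on this slide, followed by the sub-aperture accumulation mentioned on the timing slides, can be sketched as below. The name `xcorr_subaperture` and the layout of `ref_table` (one list of 49 pre-computed reference values per pixel, one value per point of the 7x7 correlation map) are assumptions drawn from the 49x8 bits/pixel RAM description, not the actual FPGA design.

```python
N_SHIFTS = 49   # 7 x 7 correlation map, one reference value per shift


def xcorr_subaperture(flat_products, ref_table):
    """Accumulate pixel * reference products into 49 correlation values.

    flat_products: dark/flat corrected pixels of one sub-aperture (400 for 20x20).
    ref_table: for each pixel, its 49 pre-computed reference pixels.
    """
    if len(flat_products) != len(ref_table):
        raise ValueError("one reference entry is needed per pixel")
    acc = [0] * N_SHIFTS
    for pixel, refs in zip(flat_products, ref_table):
        if len(refs) != N_SHIFTS:
            raise ValueError("each reference entry must hold 49 values")
        for s in range(N_SHIFTS):
            acc[s] += pixel * refs[s]   # 18-bit x 8-bit gives a 26-bit product
    return acc
```

The 49 accumulated values form the correlation map that the find_max and interpolation routines then search for the peak.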
21. Timing
[Figure: timing diagram. Signals: Rxdata from transceiver; unpacked data written to FIFO; unpacked data read from FIFO; dark-flat output; input to xcorr_pixel module; output from xcorr_pixel; output from sub-aperture accumulator per channel. Interval labels: 123.73 ns, 40 ns, 95 ns, 15 ns, 40 ns, 20 ns, 16 ns, 91 ns.]
• Each data packet is available from the FIFO after 95 ns
• 95 ns * 5 packets * 10 rows = 4.75 us to read the data from the FIFO
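The FIFO read time quoted above is a simple product, reproduced here so the figures can be varied. The function and parameter names are hypothetical; the values 95 ns, 5 packets and 10 rows come from the slide.

```python
def fifo_read_time_us(packet_ns=95, packets=5, rows=10):
    """Time to read all packets from the FIFO, in microseconds."""
    return packet_ns * packets * rows / 1000.0


print(fifo_read_time_us())   # 95 ns * 5 * 10 = 4750 ns
```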
22. Timing with find_max & interpolation
[Figure: find_max and interpolation datapath. Blocks: find_max (takes the sub-aperture accumulated value and outputs max_value and max_index), index calculation (max_index - 1, max_index + 1, max_index - 7, max_index + 7), index decoder (x_shift1, x_shift2, y_shift1, y_shift2), shifted index calculation (shifted_x - 0.5, shifted_y - 0.5), 2*max_value, and x and y shift calc (outputs x and y). Width labels: 6, 32, 35, 36, 1715.]
[Figure: timing diagram. Labels: 5.84 us (output from sub-aperture accumulation across 6 channels), 0.060 us and 0.060 us (xy shifts).]
23. GPU vs FPGA vs DSP
[Figure: timeline comparison. Rows: camera readout, data transfer through PCIe x16, C2050 GPU 1, C2050 GPU 2, C2050 GPU 3, FPGA, DSP. Interval labels: 100 us, 225 us, 300.93 us.]
C2050 GPU throughput = 525.93 us
FPGA throughput = 280 us
96 DSPs throughput = 495 us
24. Conclusions
• DSP: excellent performance but not cost-effective
• GPU: fast SIMD architectures - suitable for certain tasks
• FPGA: MIMD architectures, custom I/O, meets latency and throughput constraints
Slide idea: David Pellerin, Impulse Accelerated Technology
25. Future work
Resources              Virtex-6 XC6VLX550T   Virtex-7 XC7V2000T
Slice logic resources  549,888               1,954,560
I/O pins               840                   850
GTX transceivers       36                    36
• Investigate performance improvement after partitioning 3 channels between FPGAs. Total of 5 FPGAs for processing each half of camera data
• Throughput sustained even if the processes are partitioned over multiple FPGAs
• Promising because of increased logic density in Virtex-7 FPGAs
28. Synthesis estimates for Virtex-6 FPGA
• Implement dark, flat correction only: resources used 288 out of 687,360 (1%)
• Implement the correlation for single channel up to the sub-aperture accumulator within the channel (without the final 12 channel accumulation): resources used 2,578 out of 687,360 (1%)
Device utilization summary:
Slice Logic Utilization:
  Number of Slice Registers:  992448 out of 687360  144% (*)
  Number of Slice LUTs:      1126081 out of 343680  327% (*)
    Number used as Logic:    1125853 out of 343680  327% (*)
    Number used as Memory:       228 out of  99200
      Number used as SRL:         37