Accelerating Real-time processing of the ATST Adaptive Optics System using Coarse-grained Parallel Hardware Architectures

ERSA, Las Vegas, Nevada, July 2011

Vivek Venugopal (vivek@vivekvenugopal.net)
National Solar Observatory,
Sunspot, New Mexico

Advanced Technology
Solar Telescope


Adaptive Optics system

[Figure: block diagram of the adaptive optics loop. Labeled elements: uncorrected light, tip/tilt mirror, deformable mirror (DM), beamsplitter, corrected light, Shack-Hartmann lenslet array, CCD camera, processors, tilt drive signal, DM drive signal.]

HOAO Real-time system
[Figure: block diagram of the HOAO real-time pipeline. Labeled blocks: WFS camera, dark field, flat field, reference image, cross-correlation, slope computation, offscale slope detection (offscale slope tolerance), slope offsets, matrix multiply (reconstruction matrix), actuator servos (actuator gains, actuator offsets, servo parameters), deformable mirror, average slope, tip/tilt servos, tip/tilt mirror, data collection process, Zernike offload.]

• 1750 sub-apertures and 1900 actuators

Camera data format

[Figure: the camera frame shown as two camera data halves of 960 x 480 pixels each.]

• Camera data consists of two halves of 960 x 480 pixels
• Each half of the camera data is sent to an FPGA using 12 channels
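As a rough sanity check on the data volume, the sketch below works out the pixel counts implied by the camera data format. It assumes the pixels of each half are split evenly across the 12 channels, which the slides do not state explicitly; all constant names are illustrative.

```python
# Pixel bookkeeping for the camera data format (names are illustrative).
HALF_WIDTH, HALF_HEIGHT = 960, 480   # pixels per camera data half (from the slide)
HALVES = 2
CHANNELS_PER_HALF = 12

pixels_per_half = HALF_WIDTH * HALF_HEIGHT
pixels_per_frame = HALVES * pixels_per_half
# Assumption: an even split of each half across its 12 channels.
pixels_per_channel = pixels_per_half // CHANNELS_PER_HALF

print(pixels_per_half, pixels_per_frame, pixels_per_channel)
```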

Scenario 1: FPGA-DSP

[Figure: each camera data half reaches its own FPGA (FPGA 1, FPGA 2) over 12 optical fiber channels, and each FPGA is labeled with 12 channels on the DSP side. 96 DSPs in total.]

• Pixel unpacking task - FPGA
• Processing - DSPs

Scenario 2: FPGA-DSP

[Figure: each camera data half reaches its own FPGA (FPGA 1, FPGA 2) over 12 optical fiber channels, and each FPGA is labeled with 12 channels on the DSP side. 48 DSPs in total.]

• Pixel unpacking, dark and flat correction - FPGA
• Cross-correlation and reconstruction matrix processing - DSPs

Dark and flat correction
• Dark pixel and flat pixel stored in RAM
• Flat corrected product is concatenated and written to FIFO
• Flat accumulated value can be used to update the reference image

[Figure: dark and flat correction datapath, repeated per pixel lane (lanes labeled pixel0, pixel 1, ... pixel16). In each lane the 8-bit dark_pixel is subtracted from the 10-bit pixel, the 10-bit difference is multiplied by the 8-bit flat_pixel to give an 18-bit flat_product (flat_product0 ... flat_product16), and an accumulator produces the flat_acc value.]
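The dark and flat correction can be sketched in software as follows. This is a minimal model, not the FPGA implementation: it follows the bit widths on the slide (10-bit pixel, 8-bit dark and flat values, 18-bit product), takes the lane count of 16 from the 160-bit unpacked word on the following slide, and assumes that negative differences are clamped to zero, which the slides do not specify. The function name `dark_flat_correct` is hypothetical.

```python
import numpy as np

def dark_flat_correct(pixels, dark, flat):
    """Return (pixel - dark) * flat per lane, using the slide's bit widths."""
    pixels = np.asarray(pixels, dtype=np.int64)   # 10-bit raw pixels
    dark = np.asarray(dark, dtype=np.int64)       # 8-bit dark reference
    flat = np.asarray(flat, dtype=np.int64)       # 8-bit flat-field value
    # Assumption: negative differences are clamped to zero so the result
    # stays unsigned; the slides do not say how the hardware handles this.
    diff = np.clip(pixels - dark, 0, 2**10 - 1)
    return diff * flat                            # fits in 10 + 8 = 18 bits

LANES = 16  # 160-bit unpacked word = 16 pixels x 10 bits
rng = np.random.default_rng(0)
pixels = rng.integers(0, 2**10, LANES)
dark = rng.integers(0, 2**8, LANES)
flat = rng.integers(0, 2**8, LANES)

flat_product = dark_flat_correct(pixels, dark, flat)

# Running sum per lane, standing in for the accumulator on the slide.
flat_acc = np.zeros(LANES, dtype=np.int64)
flat_acc += flat_product
```

Sixteen 18-bit products concatenate to 288 bits, which matches the bus width shown at the output of the dark-flat correction block.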

Pixel unpacking & dark and flat correction

[Figure: per-channel datapath for one half of the camera (12 channels: channel 1, channel 2, ... channel 12). Each channel runs Receiver, data unpack, FIFO and dark-flat correction/accumulator, with bus widths labeled 16, 160, 128 and 288 bits. Shared blocks: synchronizer/counters, the dark and flat reference image value RAMs (256-bit label) and the PCIe system bus. Timing labels: 206.8 ns and 20 ns.]

Clock domains shown: clock period = 9.42 ns (clock rate = 106.15 MHz) and clock period = 5 ns (clock rate = 200 MHz).

Scenario 3: FPGA-GPU or FPGA-CPU

[Figure: each camera data half reaches its own FPGA (FPGA 1, FPGA 2) over 12 optical fiber channels; both FPGAs connect to the GPU/CPU over the PCI-e bus.]

• Pixel unpacking, dark and flat correction - FPGA
• Cross-correlation and reconstruction matrix processing - GPU or CPU

Nvidia Tesla C2050 GPU

• Nvidia Tesla C2050: 14 streaming multiprocessors with 32 cores each (SIMD), clocked at 1.15 GHz
• 3 GB on-board RAM
• Kernel-based execution
• 1.03 TFLOPS single precision
• 515.2 GFLOPS double precision

[Figure: the GPU as Multiprocessor 1, Multiprocessor 2, ... Multiprocessor 14. Each multiprocessor contains an instruction cache, two warp schedulers, two dispatch units, a register file, two groups of 16 cores, 16 load/store units, 4 SFUs, an interconnection network, 64 KB shared memory/L1 cache and a uniform cache.]

Reference: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

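Process mapping and partitioning

[Figure: processing chain split between FPGA and GPU. FPGA: raw pixels (20x20) and dark pixels (20x20) feed the dark correction; flat pixels (20x20) feed the flat correction. GPU: reference pixels (20x20) feed the 2D cross-correlation, followed by find maximum and x and y interpolation.]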

Correlation routines

1. FFT correlation
[Figure: the flat corrected image and the reference image each pass through an FFT, followed by complex conjugate multiplication and an IFFT.]

2. 7x7 correlation
[Figure: the original reference image (26x26 pixels) yields 49 precomputed reference regions (Region 1, Region 2, ... Region 49) of 20x20 pixels each. Caption: Precomputed Reference pixels 20x20 (49 regions).]
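Both correlation routines can be sketched in NumPy as below. This is an illustration under stated assumptions, not the GPU kernels used in the talk: the FFT variant is applied to two 20x20 images and is therefore a circular correlation, and the 7x7 variant multiplies a 20x20 image with each of the 49 shifted 20x20 regions cut from a 26x26 reference. All function names and the random test data are hypothetical.

```python
import numpy as np

def fft_correlation(image, reference):
    """Circular cross-correlation: IFFT(FFT(image) * conj(FFT(reference)))."""
    f_img = np.fft.fft2(image)
    f_ref = np.fft.fft2(reference)
    return np.real(np.fft.ifft2(f_img * np.conj(f_ref)))

def precompute_regions(reference, size=20):
    """Cut every shifted size x size region out of a square reference image."""
    n = reference.shape[0] - size + 1          # 26 - 20 + 1 = 7 shifts per axis
    return np.stack([reference[dy:dy + size, dx:dx + size]
                     for dy in range(n) for dx in range(n)])

def correlation_7x7(image, regions):
    """Multiply-accumulate the image with each region to get an n x n map."""
    n = int(round(np.sqrt(len(regions))))
    return np.tensordot(regions, image, axes=([1, 2], [0, 1])).reshape(n, n)

rng = np.random.default_rng(0)
reference26 = rng.standard_normal((26, 26))
regions = precompute_regions(reference26)      # shape (49, 20, 20)

# 7x7 routine: an image equal to the region at offset (2, 4) peaks there.
image20 = reference26[2:22, 4:24]
peak_7x7 = np.unravel_index(np.argmax(correlation_7x7(image20, regions)), (7, 7))

# FFT routine: a reference rolled by (3, 5) pixels peaks at (3, 5).
reference20 = reference26[:20, :20]
shifted = np.roll(reference20, (3, 5), axis=(0, 1))
peak_fft = np.unravel_index(np.argmax(fft_correlation(shifted, reference20)), (20, 20))
```

The 7x7 routine trades the FFTs for 49 multiply-accumulate passes over 400 pixels, which is the form that maps onto the per-pixel multipliers in the FPGA scenario.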

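find_max and interpolation routines

• Find the maximum value and its index
• Find x and y shifts using the interpolation equations

\begin{align*}
\mathrm{num}_x &= \mathrm{max\_value} - \mathrm{out}(\mathrm{shifted\_y\_index},\ \mathrm{shifted\_x\_index} - 1) \\
\mathrm{den}_x &= 2 \cdot \mathrm{max\_value} - \mathrm{out}(\mathrm{shifted\_y\_index},\ \mathrm{shifted\_x\_index} - 1) - \mathrm{out}(\mathrm{shifted\_y\_index},\ \mathrm{shifted\_x\_index} + 1) \\
x &= (\mathrm{shifted\_x\_index} - 0.5) + \frac{\mathrm{num}_x}{\mathrm{den}_x} \\
\mathrm{num}_y &= \mathrm{max\_value} - \mathrm{out}(\mathrm{shifted\_y\_index} - 1,\ \mathrm{shifted\_x\_index}) \\
\mathrm{den}_y &= 2 \cdot \mathrm{max\_value} - \mathrm{out}(\mathrm{shifted\_y\_index} - 1,\ \mathrm{shifted\_x\_index}) - \mathrm{out}(\mathrm{shifted\_y\_index} + 1,\ \mathrm{shifted\_x\_index}) \\
y &= (\mathrm{shifted\_y\_index} - 0.5) + \frac{\mathrm{num}_y}{\mathrm{den}_y}
\end{align*}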
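The interpolation equations above can be sketched as follows. The sketch assumes that the shifted index is simply the row and column of the peak in the correlation map `out`, and it rejects peaks on the border because the equations need both neighbours; neither detail is stated on the slide, and the function names are hypothetical.

```python
import numpy as np

def find_max(out):
    """Return the maximum value of the correlation map and its (y, x) index."""
    y_idx, x_idx = np.unravel_index(np.argmax(out), out.shape)
    return out[y_idx, x_idx], int(y_idx), int(x_idx)

def interpolate_shift(out):
    """Sub-pixel peak position from the three-point interpolation equations."""
    max_value, y_idx, x_idx = find_max(out)
    rows, cols = out.shape
    # Assumption: a peak on the border is treated as an error.
    if not (0 < y_idx < rows - 1 and 0 < x_idx < cols - 1):
        raise ValueError("peak on the border: neighbours are missing")
    # den_x and den_y are non-zero as long as the peak is not a flat plateau.
    num_x = max_value - out[y_idx, x_idx - 1]
    den_x = 2 * max_value - out[y_idx, x_idx - 1] - out[y_idx, x_idx + 1]
    num_y = max_value - out[y_idx - 1, x_idx]
    den_y = 2 * max_value - out[y_idx - 1, x_idx] - out[y_idx + 1, x_idx]
    x = (x_idx - 0.5) + num_x / den_x
    y = (y_idx - 0.5) + num_y / den_y
    return x, y

# A 7x7 map sampled from a paraboloid peaking at x = 3.3, y = 2.8.
yy, xx = np.mgrid[0:7, 0:7]
out = 100.0 - (xx - 3.3) ** 2 - (yy - 2.8) ** 2
x_shift, y_shift = interpolate_shift(out)
```

For a quadratic peak the three-point formula recovers the true position exactly, which is why the paraboloid makes a convenient check.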

GPU results

[Figure: two bar charts of time in us against number of images, comparing Tesla C1060 and Tesla C2050.
FFT correlation (1 and 50 images): bar labels 1889, 1619, 1510 and 1188 us.
7x7 correlation (1, 50 and 584 images): bar labels 313, 307, 301, 278, 279 and 281 us.]

Note: Least time indicates better performance

Reconstruction routine

[Figure: the x and y shifts for 1750 sub-aperture images form a 3500-element vector, which is multiplied by the reconstruction matrix (1900x3500) to give accumulated values for 1900 actuators.]

[Figure: bar chart of time in us (log scale from 1 to 100000) per device for Tesla C1060, Tesla C2050, DSP and CPU. Bar labels: 46769, 964, 956 and 229 us.]

• 1750 sub-aperture x and y shifts
• 3500 x 1900 reconstruction matrix
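The reconstruction step is a dense matrix-vector product, sketched below with the sizes from the slide. The matrix contents are random placeholders, and stacking all x shifts ahead of all y shifts is an assumption; the slides do not give the ordering of the slope vector.

```python
import numpy as np

N_SUBAPS, N_ACTUATORS = 1750, 1900

rng = np.random.default_rng(0)
x_shifts = rng.standard_normal(N_SUBAPS, dtype=np.float32)
y_shifts = rng.standard_normal(N_SUBAPS, dtype=np.float32)

# Assumption: the slope vector is all x shifts followed by all y shifts.
slopes = np.concatenate([x_shifts, y_shifts])            # 3500 elements

# Placeholder reconstruction matrix: one row per actuator.
recon = rng.standard_normal((N_ACTUATORS, 2 * N_SUBAPS), dtype=np.float32)

actuator_values = recon @ slopes                         # 1900 elements
macs_per_frame = N_ACTUATORS * 2 * N_SUBAPS              # multiply-accumulates
```

Each frame therefore costs 6.65 million multiply-accumulate operations, one per matrix element.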

Scenario 4: FPGA-GPU or FPGA-CPU

[Figure: each camera data half reaches its own FPGA (FPGA 1, FPGA 2) over 12 optical fiber channels; both FPGAs connect to the GPU/CPU over the PCI-e bus.]

• Pixel unpacking, dark and flat correction, cross-correlation - FPGA
• Reconstruction matrix processing - GPU or CPU

Cross-correlation

• Configure 400x392 (49x8 bits/pixel) RAM bank (RAM0-RAM19) with pre-computed reference pixels
• Multiply each pixel with the corresponding reference pixel

[Figure: xcorr_pixel datapath. The 18-bit flat_product0 (flatcorr_value) is multiplied by each of the 8-bit reference pixels ref_pixel0 to ref_pixel48, supplied as a 392-bit ref_pixel word, giving the 26-bit products xcorr_product0 to xcorr_product48. Together these form the 1274-bit xcorr_value_per_pixel.]
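The bus widths on this slide follow from the per-pixel multiply, as the sketch below checks. It models one flat-corrected pixel multiplied by its 49 reference pixels; the function name `xcorr_pixel` is borrowed from the module label on the slide, but the code is only an illustration of the arithmetic, not the hardware design.

```python
N_REGIONS = 49                 # 7x7 precomputed reference regions
FLAT_BITS, REF_BITS = 18, 8    # widths of flat_product and ref_pixel
PRODUCT_BITS = FLAT_BITS + REF_BITS

def xcorr_pixel(flat_product, ref_pixels):
    """Multiply one flat-corrected pixel by each of its 49 reference pixels."""
    if len(ref_pixels) != N_REGIONS:
        raise ValueError("expected one reference pixel per region")
    return [flat_product * ref for ref in ref_pixels]

ref_word_bits = N_REGIONS * REF_BITS          # width of the ref_pixel word
xcorr_word_bits = N_REGIONS * PRODUCT_BITS    # width of xcorr_value_per_pixel

# Worst case: largest 18-bit pixel times largest 8-bit reference pixel.
products = xcorr_pixel(2**FLAT_BITS - 1, [2**REF_BITS - 1] * N_REGIONS)
```

The 400 rows of the RAM bank would match the 400 pixels of a 20x20 sub-aperture, although the slide does not spell this out.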

Sub-aperture format

• Sub-aperture regions in 480 columns x 1 row per channel
• Accumulate pixels per sub-aperture in each channel

[Figure: table mapping channel number and the 16 pixel positions of each word (15 down to 0) to sub-aperture numbers 0 to 23. Three subap_accumulator blocks serve the channel groups #1,#2,#7,#8, then #3,#4,#9,#10, then #5,#6,#11,#12. Each block takes the 1274-bit inputs xcorr_pixel0 to xcorr_pixel15 and produces the 1715-bit outputs subap0_acc to subap23_acc.]
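The 1715-bit accumulator width is consistent with summing the 26-bit products over a 20x20 sub-aperture, as this small check shows. The derivation is an inference from the widths on the slides (1274 = 49 x 26 and 1715 = 49 x 35), not something the slides state, and the helper `accumulate` is hypothetical.

```python
import math

N_REGIONS = 49
PRODUCT_BITS = 26
PIXELS_PER_SUBAP = 20 * 20

# Summing 400 values needs ceil(log2(400)) = 9 extra bits of headroom.
acc_bits = PRODUCT_BITS + math.ceil(math.log2(PIXELS_PER_SUBAP))
acc_word_bits = N_REGIONS * acc_bits

def accumulate(products_per_pixel):
    """Sum the 49 products of every pixel into 49 running totals."""
    acc = [0] * N_REGIONS
    for products in products_per_pixel:
        for k, product in enumerate(products):
            acc[k] += product
    return acc

# Worst case: every product of every pixel is at its 26-bit maximum.
worst_case = accumulate([[2**PRODUCT_BITS - 1] * N_REGIONS] * PIXELS_PER_SUBAP)
```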

Timing

[Figure: timing diagram for one channel. Signals, top to bottom: Rxdata from transceiver; unpacked data written to FIFO; unpacked data read from FIFO; dark-flat output; input to xcorr_pixel module; output from xcorr_pixel; output from sub-aperture accumulator per channel. Interval labels: 123.73 ns, 40 ns, 95 ns, 15 ns, 40 ns, 20 ns, 16 ns and 91 ns.]

• Each data packet is available from the FIFO after 95 ns
• 95 ns * 5 packets * 10 rows = 4.75 us to read the data from the FIFO
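The FIFO read time quoted above is a straight product of the three figures, reproduced here for reference. The constant names are illustrative; the slide gives only the numbers.

```python
FIFO_READ_NS = 95   # time until each data packet is available from the FIFO
PACKETS = 5         # packets, as quoted on the slide
ROWS = 10           # rows, as quoted on the slide

total_ns = FIFO_READ_NS * PACKETS * ROWS
total_us = total_ns / 1000

print(f"{total_us} us to read the data from the FIFO")
```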

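Timing with find_max & interpolation

[Figure: datapath from the sub-aperture accumulated value (1715 bits) to the x and y shifts. find_max outputs max_value (35 bits) and max_index (6 bits). The index calculation forms max_index - 1, max_index + 1, max_index - 7 and max_index + 7, and the index decoder selects x_shift1, x_shift2, y_shift1 and y_shift2 (35 bits each). Together with 2*max_value (36 bits) these feed the x and y shift calc, which outputs x and y (32 bits each). The shifted index calculation supplies shifted_x - 0.5 and shifted_y - 0.5 (32 bits each).]

[Figure: timing diagram. The output from sub-aperture accumulation across 6 channels is labeled 5.84 us, followed by two 0.060 us intervals leading to the xy shifts.]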
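The index calculation in the diagram addresses the four neighbours of the peak directly in the flattened correlation map. A minimal sketch, assuming the 49 values are stored row by row in a 7x7 map (the slides do not give the storage order) and treating a peak on the border as an error:

```python
MAP_SIDE = 7   # 7x7 correlation map, 49 values, indices 0..48

def neighbour_indices(max_index):
    """Flat indices of the left, right, upper and lower neighbours of the peak."""
    row, col = divmod(max_index, MAP_SIDE)
    # Assumption: a peak on the border has no valid neighbour on one side.
    if row in (0, MAP_SIDE - 1) or col in (0, MAP_SIDE - 1):
        raise ValueError("peak on the border of the correlation map")
    return {
        "x_shift1": max_index - 1,          # left neighbour
        "x_shift2": max_index + 1,          # right neighbour
        "y_shift1": max_index - MAP_SIDE,   # neighbour one row up
        "y_shift2": max_index + MAP_SIDE,   # neighbour one row down
    }

centre = neighbour_indices(24)   # peak in the middle of the map
index_bits = (MAP_SIDE * MAP_SIDE - 1).bit_length()
```

Indices up to 48 fit in 6 bits, which agrees with the 6-bit max_index in the diagram.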

GPU vs FPGA vs DSP

[Figure: timeline comparing the three options, with time marks at 100 us, 225 us and 300.93 us. Rows: camera readout; data transfer through PCIe x16; C2050 GPU 1, C2050 GPU 2 and C2050 GPU 3; FPGA; DSP. The camera readout, PCIe transfer and C2050 GPU 1 rows appear a second time at the bottom of the chart.]

• C2050 GPU throughput = 525.93 us
• FPGA throughput = 280 us
• 96 DSPs throughput = 495 us

Conclusions

[Figure: GPU and FPGA panels.]

• DSP: excellent performance but not cost-effective
• GPU: fast SIMD architectures - suitable for certain tasks
• FPGA: MIMD architectures, custom I/O, meets latency and throughput constraints

Slide idea: David Pellerin, Impulse Accelerated Technology

Future work

Resources               Virtex-6 XC6VLX550T   Virtex-7 XC7V2000T
Slice logic resources   549,888               1,954,560
I/O pins                840                   850
GTX transceivers        36                    36

• Investigate performance improvement after partitioning 3 channels between FPGAs. Total of 5 FPGAs for processing each half of camera data
• Throughput sustained even if the processes are partitioned over multiple FPGAs
• Promising because of increased logic density in Virtex-7 FPGAs

Discussion

Questions

Email: vivek@vivekvenugopal.net

Top level design

[Figure: top-level block diagram with channel1_top through channel12_top. Each channel contains FCFPGA data unpack (160 bits), a FIFO, dark_flat_acc_top (288 bits), Flatcorr, xcorr_pixel_channel with xcorr_pixel (refim_in 392 bits, output 1274 bits x16) and a sub-aperture accumulator (ch1278_subap_accumulator, ch561112_subap_accumulator) with subap_acc_out (1715 bits) x24. Shared blocks: channel_cycle_count, subap_row_count, refim_fetch_addr_decoder (address decoder), RAM bank (RAM0-RAM19), the xcorr state machine (xcorr_sm) and 24subap_12ch_accumulator with subap_acc_out (1715 bits) x24. Control signals: addr_decoder_ce, xcorr_pixel_ce, subap_acc_ce, subap_acc_12ch_ce, flat_fifo_rd.]

Synthesis estimates for Virtex-6 FPGA

• Implement dark, flat correction only: resources used 288 out of 687,360 (1%)
• Implement the correlation for single channel up to the sub-aperture accumulator within the channel (without the final 12 channel accumulation): resources used 2,578 out of 687,360 (1%)

Device utilization summary:
Slice Logic Utilization:
  Number of Slice Registers:   992448 out of 687360   144% (*)
  Number of Slice LUTs:       1126081 out of 343680   327% (*)
    Number used as Logic:     1125853 out of 343680   327% (*)
    Number used as Memory:        228 out of  99200
      Number used as SRL:          37

Virtex-6 FPGA Board
[Figure: DiniFPGA board block diagram.]
