1. Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs
Fisnik Kraja, Alin Murarasu, Georg Acher, Arndt Bode
Chair of Computer Architecture, Technische Universität München, Germany
2012 IEEE Aerospace Conference, 3-10 March 2012, Big Sky, Montana
2. The main points
• The motivation statement
• Description of the SAR 2DFMFI application
• Description of the benchmarked architecture
• Results of sequential optimizations and thread parallelization on the CPU
• Porting SAR Image Reconstruction to CUDA
• Comparison of CPU and GPU results
• Summary and conclusions
3. Motivation
• On-board space-based processing should be increased
• Future space applications come with high performance requirements
  – HRWS SAR: 1 TeraFLOPS, 603.1 Gbit/s throughput
• Heterogeneous (CPU+GPU) architectures might be
the solution
• Novel accelerator designs integrate CPUs and graphics processing modules on a single chip
4. SAR Image Reconstruction
• Synthetic Data Generation (SDG): synthetic SAR returns from a uniform grid of point reflectors
• SAR Sensor Processing (SSP): the reconstructed SAR image is obtained by applying the 2D Fourier Matched Filtering and Interpolation
• Data flow: Raw Data -> Reconstructed Image

Problem sizes per SCALE factor:
SCALE |   mc |     n |     m |    nx
   10 | 1600 |  3290 |  3808 |  2474
   20 | 3200 |  6460 |  7616 |  4926
   30 | 4800 |  9630 | 11422 |  7380
   60 | 9600 | 19140 | 22844 | 14738
5. SAR Sensor Processing Profiling

SSP Processing Step                                                | Computation Type | Execution Time (%) | Size & Layout
1.  Filter the echoed signal                                       | 1d_Fw_FFT        |  1.1               | [mc x n]
2.  Transposition                                                  |                  |  0.3               | [n x mc]
3.  Signal compression along slow-time                             | CEXP, MAC        |  1.1               | [n x mc]
4.  Narrow-bandwidth polar format reconstruction along slow-time   | 1d_Fw_FFT        |  0.5               | [n x mc]
5.  Zero-pad the compressed signal in the spatial frequency domain |                  |  0.4               | [n x mc]
6.  Transform back the zero-padded spatial spectrum                | 1d_Bw_FFT        |  5.2               | [n x m]
7.  Slow-time decompression                                        | CEXP, MAC        |  2.3               | [n x m]
8.  Digitally-spotlighted SAR signal spectrum                      | 1d_Fw_FFT        |  5.2               | [n x m]
9.  Generate the Doppler domain representation of the reference    | CEXP, MAC        |  3.4               | [n x m]
    signal's complex conjugate
10. Circumvent edge processing effects                             | 2D_FFT_shift     |  0.4               | [n x m]
11. 2D interpolation from a wedge to a rectangular area:           | MAC, Sin, Cos    | 69                 | [nx x m]
    input[n x m] -> output[nx x m]
12. Transform the Doppler domain image into a spatial domain       | 1d_Bw_FFT        | 10                 | [m x nx]
    image: IFFT[nx x m] -> Transpose -> FFT[m x nx]
13. Transform into a viewable image                                | CABS             |  1.1               | [m x nx]
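Step 11 dominates the runtime (69%) because every output sample needs trigonometric weights plus multiply-accumulates over a small stencil. The following is only a hypothetical sketch of that loop shape, not the paper's code; KW, nearest_src(), and phase() are illustrative placeholders for the wedge-to-grid stencil width, source index, and kernel phase.

#include <math.h>

#define KW 4  /* assumed one-sided stencil width (illustrative) */

/* Placeholder helpers: the real source index and kernel phase follow the
 * SAR wedge geometry, which is not reproduced here. */
static int nearest_src(int i, int k, int n)
{
    int s = i + k;
    return s < 0 ? 0 : (s >= n ? n - 1 : s);   /* clamp to the input row */
}
static float phase(int i, int k) { return 0.1f * i + 0.2f * k; }

/* Interpolate an [n x m] wedge-domain input to an [nx x m] rectangular output. */
void interpolate(const float *in_re, const float *in_im,
                 float *out_re, float *out_im, int n, int m, int nx)
{
    for (int j = 0; j < m; j++)                /* cross-range bins */
        for (int i = 0; i < nx; i++) {         /* interpolated range bins */
            float re = 0.0f, im = 0.0f;
            for (int k = -KW; k <= KW; k++) {  /* small stencil */
                int   s  = nearest_src(i, k, n);
                float wr = cosf(phase(i, k));  /* the Sin/Cos operations */
                float wi = sinf(phase(i, k));
                re += wr * in_re[j * n + s] - wi * in_im[j * n + s];  /* MACs */
                im += wr * in_im[j * n + s] + wi * in_re[j * n + s];
            }
            out_re[j * nx + i] = re;
            out_im[j * nx + i] = im;
        }
}

Under these assumptions, each output sample costs 2*(2*KW+1) sine/cosine evaluations plus the surrounding MACs, which is why this step is the prime candidate for the GPU's Special Function Units (slide 11).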
6. The Benchmarked Architecture
• The dual-socket ccNUMA system:
  – 2 Intel Nehalem CPUs (4 cores each) @ 2.13 GHz
  – 2 x 6 GB = 12 GB shared memory
  – 32 nm process
  – Board TDP = 120 W
• 2 accelerators, each with an NVIDIA Tesla C2070 GPU:
  – 14 Streaming Multiprocessors
  – 448 scalar cores @ 1.15 GHz
  – 6 GB of GDDR5 memory (5.25 GB available if ECC is enabled)
  – 40 nm process
  – Board TDP = 238 W
• GPUs attached via PCI Express 2.0 (up to 36 lanes) through the Input/Output Controller
[Diagram: two 4-core CPUs, each with 6 GB of memory, connected through the I/O controller and PCIe 2.0 to the two GPUs]
8. CPU Thread Parallelization
• The vectorized code is:
  – 27% faster in sequential
  – 16% faster in parallel
• A very well optimized sequential code impacts the scalability of the application

                      Sequential | 8 Threads (OpenMP) | Best fftw_threads | 16 Threads (HT)
Elapsed Time             733.5   |       183.5        |      122.5        |     100.7
Elapsed Time (vect)      537.41  |       161.97       |      103.06       |      84.36
Speedup                    1     |         3.997      |        5.988      |       7.284
Speedup (vect)             1     |         3.318      |        5.215      |       6.370
[Chart: elapsed times and speedups per threading configuration]
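As a concrete illustration of the two threading levels compared above, here is a minimal sketch (not the benchmarked code) combining FFTW's internal threading (the "fftw_threads" configuration) with an OpenMP loop over independent rows; rows, n, and the loop body are placeholders.

#include <fftw3.h>   /* link with -lfftw3 -lfftw3_threads -fopenmp */
#include <omp.h>

void fft_rows(fftw_complex *data, int rows, int n, int nthreads)
{
    fftw_init_threads();                 /* enable FFTW's own thread pool    */
    fftw_plan_with_nthreads(nthreads);   /* the "fftw_threads" configuration */

    /* Batched plan: one 1-D FFT per row, executed with nthreads internally */
    fftw_plan p = fftw_plan_many_dft(1, &n, rows,
                                     data, NULL, 1, n,
                                     data, NULL, 1, n,
                                     FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);
    fftw_destroy_plan(p);

    /* Non-FFT steps (CEXP, MAC, ...) are parallelized with plain OpenMP */
    #pragma omp parallel for num_threads(nthreads)
    for (int r = 0; r < rows; r++) {
        /* per-row compression/decompression work goes here */
    }
}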
9. Introduction to CUDA
• CUDA kernels are executed by parallel threads
• A group of threads forms a thread block
• Shared memory is available to the threads within one block
• Thread blocks are mapped to SMs in warps (32 threads) that receive the same instruction (SIMD)
• Branches impact the efficiency of the SIMD units
• Exploiting the locality of the algorithms ensures performance
• The limited amount of device memory brings the need for slow PCIe communications
[Diagram: a grid of thread blocks (B) mapped onto the GPU]
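A minimal kernel illustrating these concepts (illustrative, not from the paper): threads are grouped into blocks, each block stages data in shared memory, and the guard branch shows where divergence can hurt the SIMD units.

// Illustrative kernel: one grid of blocks, shared memory per block, and a
// guard branch that can cause SIMD divergence in the last block.
__global__ void scale_kernel(float *data, float alpha, int n)
{
    __shared__ float tile[256];                       // shared within one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // global thread index
    if (i < n) tile[threadIdx.x] = data[i];           // divergent on the last block
    __syncthreads();                                  // block-wide barrier
    if (i < n) data[i] = alpha * tile[threadIdx.x];
}

// Launch: 256 threads per block = 8 warps of 32 threads each.
// scale_kernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);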
10. Porting SAR Application to CUDA
• 2D data tiling for loops
  – Tile elements are computed by a block of threads
  – The tiling technique increases the number of active blocks, thereby raising the level of occupancy
  – On the Tesla C2070 device: max 1024 threads per block, so TILE_DIM = 32 (32 x 32 = 1024)
• Thread (tx, ty) in block (bx, by) computes row (by*TILE_DIM+ty) and column (bx*TILE_DIM+tx) of the data set, as in the sketch below
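A sketch of this tiling scheme, under the stated TILE_DIM = 32; the kernel body is a placeholder for the per-element work.

#define TILE_DIM 32

// Thread (tx,ty) in block (bx,by) computes element
// (row, col) = (by*TILE_DIM+ty, bx*TILE_DIM+tx) of the data set.
__global__ void process_tile(const float *in, float *out, int rows, int cols)
{
    int row = blockIdx.y * TILE_DIM + threadIdx.y;
    int col = blockIdx.x * TILE_DIM + threadIdx.x;
    if (row < rows && col < cols)
        out[row * cols + col] = in[row * cols + col];  // per-element work here
}

// Launch with one 32x32 block per tile (1024 threads, the C2070 maximum):
// dim3 block(TILE_DIM, TILE_DIM);
// dim3 grid((cols + TILE_DIM - 1) / TILE_DIM, (rows + TILE_DIM - 1) / TILE_DIM);
// process_tile<<<grid, block>>>(d_in, d_out, rows, cols);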
11. CUDA Implementation Discussions
• The CUFFT library provides a simple interface for computing parallel FFTs (sketched below)
  – Batch execution for multiple 1-dimensional transforms
  – Drawback: the memory needed on the host side increases with:
    • the size of the transform
    • the number of transforms configured in the batch
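A minimal sketch of a batched 1-D CUFFT transform of the kind described above; the transform size and batch count are illustrative.

#include <cufft.h>

// Batched execution: one plan computes `batch` independent 1-D transforms of
// size n in a single call (host staging memory grows with n * batch).
void batched_fft(cufftComplex *d_data, int n, int batch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);           // plan the whole batch
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD); // in-place forward FFTs
    cufftDestroy(plan);
}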
• Operations missing in CUDA (a device-side workaround is sketched after this list):
  – Library functions like cexp() and cabs()
  – Atomic operations on floating-point variables
• Transcendental instructions execute efficiently on the Special Function Units (SFUs):
  – sine
  – cosine
  – square root
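Since cexp() is unavailable in device code, a common workaround, sketched here with a hypothetical helper name, builds it from expf() and the SFU-backed __sincosf() intrinsic.

// Hypothetical device helper replacing the missing cexp():
// e^(x+iy) = e^x * (cos y + i sin y); __sincosf() runs on the SFUs.
__device__ inline float2 cexpf_dev(float2 z)
{
    float s, c;
    __sincosf(z.y, &s, &c);            // fast SFU sine and cosine
    float r = expf(z.x);
    return make_float2(r * c, r * s);
}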
12. Performance Results
• CPU vs GPU
  – Better performance on the GPU
  – Better power efficiency on the CPU
• Small scale vs large scale
  – For small-scale images (SCALE < 20), the data set fits completely in the GPU memory
  – For large-scale images (SCALE > 30), the data set does not fit in the GPU memory

Speedup over the sequential CPU version:
           CPU_Seq | CPU 8 Threads | CPU 16 Threads |   GPU
Scale=10      1    |    7.9474     |     8.8247     | 11.0488
Scale=20      1    |    7.6237     |     8.1752     | 10.6159
Scale=30      1    |    6.0354     |     7.0146     | 10.2855
Scale=60      1    |    5.2145     |     6.3704     | 10.2364
13. Using both CPU and GPU for processing
• Programming heterogeneous systems is impacted by:
  – Data dependencies
  – Scheduling algorithms
  – System resources
• Frequent transfers between CPU and GPU should be avoided
• Profiling is needed to identify the parts of the code that will benefit from executing on the GPU
• In our case, it was decided to execute only the interpolation loop (70% of the total execution time) on the GPU (sketched below), in order to avoid transfers in steps like:
  – FFT_SHIFT
  – Transposition
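A host-side sketch of this split (buffer names, sizes, and interpolate_kernel are illustrative, not the paper's code): the interpolation input crosses the PCIe bus once, the kernel runs, and only the result comes back; FFT_SHIFT and transposition stay on the CPU.

// Hypothetical kernel implementing the interpolation step on the device.
__global__ void interpolate_kernel(const float2 *in, float2 *out,
                                   int n, int m, int nx);

// One H2D copy, one kernel, one D2H copy: no intermediate results
// cross the PCIe bus.
void reconstruct_on_gpu(const float2 *h_in, float2 *h_out,
                        int n, int m, int nx)
{
    float2 *d_in, *d_out;
    size_t in_sz  = (size_t)n  * m * sizeof(float2);
    size_t out_sz = (size_t)nx * m * sizeof(float2);
    cudaMalloc((void **)&d_in,  in_sz);
    cudaMalloc((void **)&d_out, out_sz);

    cudaMemcpy(d_in, h_in, in_sz, cudaMemcpyHostToDevice);      // single H2D
    dim3 block(32, 32);                                          // TILE_DIM = 32
    dim3 grid((nx + 31) / 32, (m + 31) / 32);
    interpolate_kernel<<<grid, block>>>(d_in, d_out, n, m, nx);  // the 70% step
    cudaMemcpy(h_out, d_out, out_sz, cudaMemcpyDeviceToHost);    // single D2H

    cudaFree(d_in);
    cudaFree(d_out);
}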
14. Using Multiple GPU Devices
• OpenMP + CUDA: one OpenMP thread per device (pattern sketched below)
  – Separate GPU context per thread
• Each thread independently calls:
  – Memory management functions
  – CUDA kernels
• 2 approaches:
  – The same image is reconstructed by 2 GPUs
    • Bottlenecks in the QPI (remote accesses) and PCIe links
  – Separate images are reconstructed on 2 separate GPUs (pipelined version)
    • Reduced CPU <-> GPU data transfers
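A minimal sketch of this OpenMP + CUDA pattern; the per-image work is elided.

#include <omp.h>

// One OpenMP thread per GPU; each thread gets its own context, allocations,
// and kernel launches (pipelined version: one image per device).
void reconstruct_two_images(void)
{
    #pragma omp parallel num_threads(2)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);     // bind this host thread to GPU 0 or GPU 1

        // Each thread now independently calls cudaMalloc/cudaMemcpy and
        // launches its kernels for its own image; no GPU <-> GPU traffic.
    }
}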
16. Summary and Conclusions
• Porting the SAR application to CUDA requires knowledge of the underlying hardware and of the CUDA paradigm.
• For the SAR application, GPUs offer better performance than CPUs
  – But CPUs are more power-efficient
• Heterogeneous computing improves performance, but the Performance/Watt ratio is impacted by the number of CPU <-> GPU transfers.
• Static scheduling of CUDA kernels offers no flexibility in heterogeneous computing environments.
• When using multiple GPU devices, it is very important to reduce the number of CPU <-> GPU and GPU <-> GPU transfers.
17. Thank You!
Questions?
Fisnik Kraja
Chair of Computer Architecture
Technische Universität München
kraja@in.tum.de