Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs
Fisnik Kraja, Alin Murarasu, Georg Acher, Arndt Bode
Chair of Computer Architecture, Technische Universität München, Germany




   2012 IEEE Aerospace Conference, 3-10 March 2012, Big Sky, Montana
The main points

• The motivation statement
• Description of the SAR 2DFMFI application
• Description of the benchmarked architecture
• Results of sequential optimizations and thread
  parallelization on the CPU
• Porting SAR Image Reconstruction to CUDA
• Comparison of CPU and GPU Results
• Summary and conclusions


2/24/2012                                          2
Motivation

• On-board space-based processing should be
  increased
• Future space applications have high performance
  requirements
      – HRWS SAR: 1 Tera FLOPS, 603.1 Gbit/s throughput
• Heterogeneous (CPU+GPU) architectures might be
  the solution
• Novel accelerator designs integrate CPUs and
  graphics processing modules on one chip

SAR Image Reconstruction

Synthetic Data Generation (SDG):
Synthetic SAR returns from a uniform grid of point reflectors

SAR Sensor Processing (SSP):
The reconstructed SAR image is obtained by applying 2D Fourier
Matched Filtering and Interpolation

                    Raw Data          Reconstructed Image
      SCALE        mc        n            m         nx
         10      1600     3290         3808       2474
         20      3200     6460         7616       4926
         30      4800     9630        11422       7380
         60      9600    19140        22844      14738




SAR Sensor Processing Profiling

       SSP Processing Step                                            Computation   Execution   Size &
                                                                      Type          Time in %   Layout
1.     Filter the echoed signal                                       1d_Fw_FFT        1.1      [mc x n]
2.     Transposition is needed                                                         0.3      [n x mc]
3.     Signal compression along slow-time                             CEXP, MAC        1.1      [n x mc]
4.     Narrow-bandwidth polar format reconstruction along slow-time   1d_Fw_FFT        0.5      [n x mc]
5.     Zero pad the spatial frequency domain's compressed signal                       0.4      [n x mc]
6.     Transform back the zero-padded spatial spectrum                1d_Bw_FFT        5.2      [n x m]
7.     Slow-time decompression                                        CEXP, MAC        2.3      [n x m]
8.     Digitally-spotlighted SAR signal spectrum                      1d_Fw_FFT        5.2      [n x m]
9.     Generate the Doppler domain representation of the reference    CEXP, MAC        3.4      [n x m]
       signal's complex conjugate
10.    Circumvent edge processing effects                             2D-FFT_shift     0.4      [n x m]
11.    2D interpolation from a wedge to a rectangular area:           MAC, Sin, Cos    69       [nx x m]
       input[n x m] -> output[nx x m]
12.    Transform from the Doppler domain image into a spatial domain  1d_Bw_FFT        10       [m x nx]
       image. IFFT[nx x m] -> Transpose -> FFT[m x nx]                1d_Bw_FFT
13.    Transform into a viewable image                                CABS             1.1      [m x nx]



The Benchmarked Architecture

• The dual socket ccNUMA
     – 2 Intel Nehalem CPUs, 4 cores each @ 2.13 GHz
     – 2 x 6 GB = 12 GB shared memory
     – 32 nm
     – Board TDP = 120 W

• 2 accelerators with NVIDIA Tesla C2070 GPUs, each:
     – 14 Streaming Multiprocessors
     – 448 scalar cores @ 1.15 GHz
     – 6 GB of GDDR5 memory
             • 5.25 GB available (if ECC enabled)
     – 40 nm
     – Board TDP = 238 W

[Figure: block diagram showing two CPUs (4 cores each), each with 6 GB of memory, an Input/Output Controller, PCI Express 2.0 (up to 36 lanes), and two GPUs]

CPU Sequential Optimizations

[Figure: two bar charts, elapsed time in seconds and speedup, per compiler and optimization step]

  Group            Step     Elapsed time (s)    Speedup
  GCC 4.6          O0           1606.7          1
  GCC 4.6          O1           1241.03         1.294650
  GCC 4.6          O2           1201.6          1.337133
  GCC 4.6          O3           1208.66         1.329323
  ICC 12.0         O0           1060.8          1.514611
  ICC 12.0         O1            861.5          1.865002
  ICC 12.0         O2            773.5          2.077181
  ICC 12.0         O3            761.3          2.110468
  Vectorization    SSE4          751.9          2.136853
  Vectorization    F_t           582.83         2.756721
  Vectorization    cexp          562.9          2.854325
  FFTW             MEA           537.41         2.989709
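The speedup row can be reproduced from the elapsed times. The short sketch below assumes, as the value 1 for GCC 4.6 O0 suggests, that every speedup is measured against the unoptimized GCC build. The dictionary keys are labels chosen here for illustration.

```python
# Elapsed times in seconds, copied from the slide.
elapsed = {
    "GCC 4.6 O0": 1606.7,
    "GCC 4.6 O1": 1241.03,
    "GCC 4.6 O2": 1201.6,
    "GCC 4.6 O3": 1208.66,
    "ICC 12.0 O0": 1060.8,
    "ICC 12.0 O1": 861.5,
    "ICC 12.0 O2": 773.5,
    "ICC 12.0 O3": 761.3,
    "Vectorization SSE4": 751.9,
    "Vectorization F_t": 582.83,
    "Vectorization cexp": 562.9,
    "FFTW MEA": 537.41,
}

# Assumption: the baseline is the unoptimized GCC 4.6 build (speedup = 1).
baseline = elapsed["GCC 4.6 O0"]

# Speedup = baseline time / optimized time.
speedup = {name: baseline / seconds for name, seconds in elapsed.items()}

for name, value in speedup.items():
    print(f"{name:20s} {value:.6f}")
```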

CPU Thread Parallelization

[Figure: two bar charts, elapsed time and speedup, per parallelization variant]

                          Best          fftw_threads    8 Threads      16_Threads
                          Sequential                    OpenMP         HT
Elapsed Time (s)          733.5         183.5           122.5          100.7
Elapsed Time (vect) (s)   537.41        161.97          103.06         84.36
Speedup                   1             3.997275204     5.987755102    7.284011917
Speedup (vect)            1             3.317960116     5.214535222    6.370436226

The vectorized code is
  – 27% faster in sequential
  – 16% faster in parallel

A very well optimized sequential code impacts the scalability of the
application
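The 27% and 16% figures and the scalability remark can be checked from the elapsed times. The sketch below is only arithmetic on the slide's numbers. The parallel efficiency column (speedup divided by thread count) is an addition made here for illustration and assumes 8 threads for the OpenMP run and 16 for the HT run.

```python
# Elapsed times in seconds from the slide: (non-vectorized, vectorized).
times = {
    "Best Sequential": (733.5, 537.41),
    "8 Threads OpenMP": (122.5, 103.06),
    "16_Threads HT": (100.7, 84.36),
}

def percent_faster(slow, fast):
    """Relative time saved by the faster variant, in percent."""
    return 100.0 * (slow - fast) / slow

gain = {name: percent_faster(plain, vect) for name, (plain, vect) in times.items()}

# Speedup of each parallel run over its own sequential baseline.
seq_plain, seq_vect = times["Best Sequential"]
speedup_plain_8 = seq_plain / times["8 Threads OpenMP"][0]
speedup_vect_8 = seq_vect / times["8 Threads OpenMP"][1]

# Illustrative metric (not on the slide): efficiency = speedup / threads.
efficiency_plain_8 = speedup_plain_8 / 8
efficiency_vect_8 = speedup_vect_8 / 8

for name, value in gain.items():
    print(f"{name:18s} vectorized code is {value:.1f}% faster")
print(f"efficiency, 8 threads: plain {efficiency_plain_8:.2f}, vect {efficiency_vect_8:.2f}")
```

The vectorized code has the lower efficiency although it is faster in absolute terms, which is one way to read the remark that a well optimized sequential code impacts scalability.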


Introduction to CUDA

•   CUDA kernels are executed by
    parallel threads.

•   A group of threads forms a thread
    block.

•   Shared memory among the
    threads in one block

•   Thread blocks are mapped to SMs
    in warps (32 threads) that receive
    the same instruction (SIMD)

•   Branches impact the efficiency of
    SIMD units

•   Exploiting the locality of the
    algorithms ensures performance

•   Limited amount of memory brings
    the need for slow PCIe
    communications

[Figure: grid of thread blocks (B)]


Porting SAR Application to CUDA
• 2D Data Tiling for Loops
     – Tile elements are computed
       by a block of threads
     – The tiling technique increases
       the number of active
       blocks, thereby increasing the
       level of occupancy
     – On the Tesla C2070 device:
       max 1024 threads per
       block.
             • TILE_DIM=32 (32x32=1024)

• Thread (tx, ty) in block (bx, by) calculates
     – row (by*TILE_DIM+ty) and
     – column (bx*TILE_DIM+tx)
  of the data set.
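The index mapping can be emulated on the host to see how blocks and threads cover a data set. The sketch below is plain Python, not CUDA. The function names are hypothetical, and the bounds check for partially filled edge tiles is an assumption added here, since the slide does not say how edges are handled.

```python
TILE_DIM = 32  # 32 x 32 = 1024 threads per block, the C2070 limit quoted above

def grid_dims(rows, cols, tile=TILE_DIM):
    """Number of blocks (grid_x, grid_y) needed to cover rows x cols."""
    # Ceiling division so that partially filled edge tiles get a block too.
    return ((cols + tile - 1) // tile, (rows + tile - 1) // tile)

def element_for_thread(bx, by, tx, ty, tile=TILE_DIM):
    """Row and column handled by thread (tx, ty) of block (bx, by)."""
    row = by * tile + ty
    col = bx * tile + tx
    return row, col

def covered_elements(rows, cols, tile=TILE_DIM):
    """Emulate every (block, thread) pair and list the elements they touch."""
    grid_x, grid_y = grid_dims(rows, cols, tile)
    visited = []
    for by in range(grid_y):
        for bx in range(grid_x):
            for ty in range(tile):
                for tx in range(tile):
                    row, col = element_for_thread(bx, by, tx, ty, tile)
                    # Assumed guard: threads outside the data set do nothing.
                    if row < rows and col < cols:
                        visited.append((row, col))
    return visited

# Grid for the Scale=10 reconstructed image (m = 3808 rows, nx = 2474 columns).
print(grid_dims(3808, 2474))
```

With this mapping every element of the data set is visited by exactly one thread, which can be confirmed on a small case such as `covered_elements(5, 7, tile=4)`.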




CUDA Implementation Discussions

 • The CUFFT library provides a simple interface for computing parallel FFTs
       – Batch execution for multiple 1-dimensional transforms
       – Drawback: memory needed on the host side increases with:
            • Size of the transform
            • Number of the configured transforms in the batch

 • Operations missing in CUDA:
       – Library functions like cexp() and cabs()
       – Atomic operations on floating-point variables

 • Transcendental instructions execute efficiently on Special Function Units
   (SFUs):
       – sine
       – cosine
       – square root
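Both missing library functions can be expressed with the operations listed above. The sketch below shows the arithmetic in Python using Euler's formula. It is not the authors' device code, and the function names simply mirror the C names.

```python
import cmath
import math

def cexp(re, im):
    """exp(re + i*im) = exp(re) * (cos(im) + i*sin(im)), returned as (real, imag)."""
    scale = math.exp(re)
    return scale * math.cos(im), scale * math.sin(im)

def cabs(re, im):
    """Magnitude of re + i*im via a square root."""
    return math.sqrt(re * re + im * im)

# Compare against the standard library for one sample value.
z = complex(0.5, 2.0)
real, imag = cexp(z.real, z.imag)
reference = cmath.exp(z)
print(abs(real - reference.real), abs(imag - reference.imag))
print(cabs(3.0, 4.0))
```

For a pure phase term (re = 0) the exponential scale factor is 1, so only the sine and cosine remain.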


Performance Results

• CPU vs GPU
    – Better performance on the
      GPU
    – Better power efficiency on
      the CPU

• Small Scale vs Large Scale
    – For small scale images
      (SCALE < 20), the data set
      fits completely in the GPU
      memory
    – For large scale images
      (SCALE > 30), the data set
      does not fit in the GPU
      memory

[Figure: bar chart of speedup per scale and platform]

Speedup        CPU_Seq    CPU 8      CPU 16     GPU
                          Threads    Threads
Scale=10          1       7.9474     8.8247     11.0488
Scale=20          1       7.6237     8.1752     10.6159
Scale=30          1       6.0354     7.0146     10.2855
Scale=60          1       5.2145     6.3704     10.2364
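A rough footprint estimate illustrates why the large data sets may not fit on the device. The sketch below rests on assumptions that are not in the slides: 8 bytes per element (single-precision complex), only the interpolation input [n x m] and output [nx x m] resident at the same time, and the 5.25 GB figure read as binary gigabytes. Real allocations may differ.

```python
BYTES_PER_ELEMENT = 8                  # assumption: single-precision complex
GPU_MEMORY_BYTES = int(5.25 * 2**30)   # 5.25 GB available with ECC (read as GiB)

# n, m, nx per scale, taken from the data set table in the deck.
SIZES = {
    10: {"n": 3290, "m": 3808, "nx": 2474},
    60: {"n": 19140, "m": 22844, "nx": 14738},
}

def interpolation_bytes(n, m, nx):
    """Bytes for the interpolation input [n x m] plus output [nx x m]."""
    return (n * m + nx * m) * BYTES_PER_ELEMENT

def fits_on_gpu(scale):
    """True if both interpolation buffers fit in the assumed GPU memory."""
    return interpolation_bytes(**SIZES[scale]) <= GPU_MEMORY_BYTES

for scale in SIZES:
    gib = interpolation_bytes(**SIZES[scale]) / 2**30
    print(f"Scale={scale}: {gib:.2f} GiB, fits: {fits_on_gpu(scale)}")
```

Under these assumptions Scale=10 fits easily and Scale=60 does not, which agrees with the small-scale and large-scale statements above.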


Using both CPU and GPU for processing

 • Programming heterogeneous systems is impacted by:
       – Data dependencies
       – Scheduling algorithms
       – System resources

 • Frequent transfers between CPU and GPU should be avoided

 • Profiling is needed to identify the parts of the code that will
   benefit from executing on the GPU

 • In our case, it was decided to execute only the Interpolation Loop
   (70% of the total execution time) on the GPU, in order to
   avoid transfers in steps like:
       – FFT_SHIFT
       – Transposition
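Offloading only part of the work bounds the achievable gain, which Amdahl's law makes explicit. The sketch below is a generic model, not the authors' measurement method. The parameter names are chosen here, and the sample speedup factors are illustrative values rather than results from the slides.

```python
def overall_speedup(offloaded_fraction, offload_speedup, rest_speedup=1.0):
    """Amdahl-style estimate of the overall speedup.

    offloaded_fraction: share of sequential run time moved to the GPU
    offload_speedup:    speedup of that share on the GPU
    rest_speedup:       speedup of the remaining share (e.g. CPU threads)
    """
    rest_fraction = 1.0 - offloaded_fraction
    return 1.0 / (rest_fraction / rest_speedup
                  + offloaded_fraction / offload_speedup)

# Interpolation loop: about 70% of the total execution time.
P = 0.70

# Upper bound if the other 30% stays sequential, however fast the GPU is.
print(overall_speedup(P, float("inf")))

# Illustrative values only: the remaining 30% also runs on CPU threads.
print(overall_speedup(P, offload_speedup=20.0, rest_speedup=6.0))
```

The first call shows that accelerating the interpolation alone cannot exceed roughly a 3.3x gain, so the steps left on the CPU also need to run in parallel for the combined version to pay off.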

Using Multiple GPU Devices

• OpenMP + CUDA:
  One OpenMP thread per device
      – Separate GPU context
            • Each thread independently calls
                – Memory management functions
                – CUDA kernels

• 2 Approaches
      – The same image is reconstructed by 2 GPUs
            • Bottlenecks in the QPI (remote accesses) and PCIe links
      – Separate images are reconstructed on 2 separate GPUs
        (Pipelined version)
            • Reduced CPU <-> GPU data transfers
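The one-thread-per-device pattern for separate images can be outlined with host threads. The sketch below uses Python threads in place of OpenMP, and `reconstruct_on_device` is a hypothetical placeholder for the per-device memory management and kernel launches. The round-robin assignment of images is also an assumption made here.

```python
import threading

NUM_DEVICES = 2  # the benchmarked system has two GPUs

def reconstruct_on_device(device_id, image):
    """Hypothetical placeholder for the work one thread issues to its device."""
    return (device_id, f"reconstructed {image}")

def worker(device_id, assigned, results):
    # Each thread owns one device and processes only its own images.
    for index, image in assigned:
        results[index] = reconstruct_on_device(device_id, image)

def reconstruct_all(images):
    """Distribute whole images round-robin over the devices, one thread each."""
    results = [None] * len(images)
    threads = []
    for device_id in range(NUM_DEVICES):
        assigned = [(i, img) for i, img in enumerate(images)
                    if i % NUM_DEVICES == device_id]
        thread = threading.Thread(target=worker,
                                  args=(device_id, assigned, results))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    return results

print(reconstruct_all(["image_a", "image_b", "image_c"]))
```

Because each image stays on a single device, no image data has to move between the two GPUs in this scheme.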


Results Updated

[Figure: bar chart of speedup per scale and platform]

Speedup      CPU_Seq    CPU 8      CPU 16     GPU        GPU+CPU    2GPUs      2GPUs
                        Threads    Threads                                     Pipelined
Scale=10        1       7.9474     8.8247     11.0488    10.3740     2.3472     5.3086
Scale=20        1       7.6237     8.1752     10.6159    11.7166     5.7588    11.5306
Scale=30        1       6.0354     7.0146     10.2855    11.6952     8.8412    13.4404
Scale=60        1       5.5136     6.5883     10.2364    12.5270    11.3020    17.4938




Summary and Conclusions

• Porting the SAR application to CUDA requires knowledge of the
  underlying hardware and of the CUDA paradigm.

• For the SAR application, GPUs offer better performance than CPUs
      – But CPUs are more power efficient

• Heterogeneous computing improves performance, but the
  Performance/Watt ratio is impacted by the number of CPU <-> GPU
  transfers.

• Static scheduling of CUDA kernels offers no flexibility in
  heterogeneous computing environments.

• When using multiple GPU devices, it is very important to reduce the
  number of CPU <-> GPU and GPU <-> GPU transfers.

Thank You!



         Questions?

                            Fisnik Kraja
                       Chair of Computer Architecture
                      Technische Universität München
                                     kraja@in.tum.de

Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 

Dernier (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 

Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs

  • 1. Performance Evaluation of SAR Image Reconstruction on CPUs and GPUs
      Fisnik Kraja, Alin Murarasu, Georg Acher, Arndt Bode
      Chair of Computer Architecture, Technische Universität München, Germany
      2012 IEEE Aerospace Conference, 3-10 March 2012, Big Sky, Montana
  • 2. The main points
      • The motivation statement
      • Description of the SAR 2DFMFI application
      • Description of the benchmarked architecture
      • Results of sequential optimizations and thread parallelization on the CPU
      • Porting SAR Image Reconstruction to CUDA
      • Comparison of CPU and GPU results
      • Summary and conclusions
  • 3. Motivation
      • On-board space-based processing should be increased
      • Future space applications have high performance requirements
          – HRWS SAR: 1 Tera FLOPS, 603.1 Gbit/s throughput
      • Heterogeneous (CPU+GPU) architectures might be the solution
      • Novel accelerator designs integrate CPUs and graphics processing modules on one chip
  • 4. SAR Image Reconstruction
      • Synthetic Data Generation (SDG): synthetic SAR returns from a uniform grid of point reflectors
      • SAR Sensor Processing (SSP): the reconstructed SAR image is obtained by applying the 2D Fourier Matched Filtering and Interpolation
      • Figure labels: Raw Data, Reconstructed Image
      • Data set dimensions per scale:
          SCALE      mc       n       m      nx
             10    1600    3290    3808    2474
             20    3200    6460    7616    4926
             30    4800    9630   11422    7380
             60    9600   19140   22844   14738
  • 5. SAR Sensor Processing Profiling
      Each SSP processing step is followed by its computation type, its execution time in %, and its size and layout.
       1. Filter the echoed signal: 1d_Fw_FFT, 1.1 %, [mc x n]
       2. Transposition is needed: 0.3 %, [n x mc]
       3. Signal compression along slow-time: CEXP, MAC, 1.1 %, [n x mc]
       4. Narrow-bandwidth polar format reconstruction along slow-time: 1d_Fw_FFT, 0.5 %, [n x mc]
       5. Zero pad the spatial frequency domain's compressed signal: 0.4 %, [n x mc]
       6. Transform back the zero-padded spatial spectrum: 1d_Bw_FFT, 5.2 %, [n x m]
       7. Slow-time decompression: CEXP, MAC, 2.3 %, [n x m]
       8. Digitally-spotlighted SAR signal spectrum: 1d_Fw_FFT, 5.2 %, [n x m]
       9. Generate the Doppler domain representation of the reference signal's complex conjugate: CEXP, MAC, 3.4 %, [n x m]
      10. Circumvent edge processing effects: 2D-FFT_shift, 0.4 %, [n x m]
      11. 2D interpolation from a wedge to a rectangular area, input[n x m] -> output[nx x m]: MAC, Sin, Cos, 69 %, [nx x m]
      12. Transform from the Doppler domain image into a spatial domain image, IFFT[nx x m] -> Transpose -> FFT[m x nx]: 1d_Bw_FFT, 1d_Bw_FFT, 10 %, [m x nx]
      13. Transform into a viewable image: CABS, 1.1 %, [m x nx]
  • 6. The Benchmarked Architecture
      • The dual socket ccNUMA
          – 2 Intel Nehalem CPUs, 4 cores @ 2.13 GHz
          – 2 x 6 GB = 12 GB shared memory
          – 32 nm
          – Board TDP = 120 W
      • 2 accelerators with NVIDIA Tesla C2070 GPUs, each:
          – 14 Streaming Multiprocessors
          – 448 scalar cores @ 1.15 GHz
          – 6 GB of GDDR5 memory (5.25 GB available if ECC is enabled)
          – 40 nm
          – Board TDP = 238 W
      • Diagram: two CPUs (4 cores each), each with 6 GB of memory, connected through an Input/Output Controller and PCI Express 2.0 (up to 36 lanes) to two GPUs, each with 6 GB of memory
  • 7. CPU Sequential Optimizations
      Two bar charts: elapsed time in seconds and speedup per configuration. The last four bars fall under the chart groups "Vectorization" and "FFTW".
          Configuration    Elapsed time (s)    Speedup
          GCC 4.6 O0            1606.7         1
          GCC 4.6 O1            1241.03        1.294650
          GCC 4.6 O2            1201.6         1.337133
          GCC 4.6 O3            1208.66        1.329323
          ICC 12.0 O0           1060.8         1.514611
          ICC 12.0 O1            861.5         1.865002
          ICC 12.0 O2            773.5         2.077181
          ICC 12.0 O3            761.3         2.110468
          SSE4                   751.9         2.136853
          F_t                    582.83        2.756721
          cexp                   562.9         2.854325
          MEA                    537.41        2.989709
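The speedup values on this slide follow from dividing the baseline time (GCC 4.6 at O0) by each configuration's elapsed time. A minimal Python sketch of that arithmetic, using elapsed times from the slide; the function name `speedup` is an illustrative choice, not something the authors define:

```python
def speedup(baseline_seconds, optimized_seconds):
    """Return how many times faster the optimized run is than the baseline."""
    return baseline_seconds / optimized_seconds

# Elapsed times in seconds, taken from the slide.
baseline = 1606.7          # GCC 4.6, O0
icc_o3 = 761.3             # ICC 12.0, O3
fully_optimized = 537.41   # last bar of the chart (MEA)

print(round(speedup(baseline, icc_o3), 2))
print(round(speedup(baseline, fully_optimized), 2))
```

The second value is the overall gain of the sequential optimizations, which is why the fully optimized run serves as the sequential reference in the parallel results.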
  • 8. CPU Thread Parallelization
      Two bar charts: elapsed time in seconds and speedup, each for the original and the vectorized code.
          Configuration        Elapsed time    Elapsed time (vect)    Speedup        Speedup (vect)
          Sequential              733.5            537.41             1              1
          Best fftw_threads       183.5            161.97             3.997275204    3.317960116
          8 Threads OpenMP        122.5            103.06             5.987755102    5.214535222
          16_Threads HT           100.7             84.36             7.284011917    6.370436226
      • The vectorized code is
          – 27% faster in sequential
          – 16% faster in parallel
      • A very well optimized sequential code impacts the scalability of the application
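The 27% and 16% figures appear to be relative reductions in elapsed time rather than speedup ratios. A short Python check under that assumption, using the sequential and 16-thread times from the slide; `percent_faster` is a hypothetical helper name:

```python
def percent_faster(original_seconds, vectorized_seconds):
    """Relative reduction in elapsed time, expressed in percent."""
    return (original_seconds - vectorized_seconds) / original_seconds * 100.0

sequential_gain = percent_faster(733.5, 537.41)   # sequential runs
parallel_gain = percent_faster(100.7, 84.36)      # 16 threads with HT

print(round(sequential_gain), round(parallel_gain))
```

Under this reading the vectorized code gains less in the parallel case, which matches the slide's remark that a well optimized sequential code scales less well.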
  • 9. Introduction to CUDA
      • CUDA kernels are executed by parallel threads.
      • A group of threads forms a thread block.
      • Memory is shared among the threads in one block
      • Thread blocks are mapped to SMs and run in warps (32 threads) that receive the same instruction (SIMD)
      • Branches impact the efficiency of SIMD units
      • Exploiting the locality of the algorithms ensures performance
      • The limited amount of memory brings the need for slow PCIe communications
  • 10. Porting SAR Application to CUDA
      • 2D data tiling for loops
          – Tile elements are computed by a block of threads
          – The tiling technique increases the number of active blocks, thereby increasing the level of occupancy
          – On the Tesla C2070 device: max 1024 threads per block, so TILE_DIM=32 (32x32=1024)
      • Thread (tx, ty) in block (bx, by) is to calculate
          – row (by*TILE_DIM+ty) and
          – column (bx*TILE_DIM+tx) of the data set.
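The index mapping on this slide can be checked without a GPU. The sketch below is a plain Python emulation of the arithmetic, not CUDA code; the function names, the ceiling-based grid size, and the bounds guard for partial edge tiles are assumptions added for illustration, since the slide gives only the row and column formulas:

```python
import math

TILE_DIM = 32  # 32 x 32 = 1024 threads, the per-block limit quoted for the C2070

def element_for_thread(bx, by, tx, ty, tile_dim=TILE_DIM):
    """Return the (row, column) that thread (tx, ty) of block (bx, by) computes."""
    return by * tile_dim + ty, bx * tile_dim + tx

def grid_dims(rows, cols, tile_dim=TILE_DIM):
    """Blocks needed along x (columns) and y (rows) to cover the data set."""
    return math.ceil(cols / tile_dim), math.ceil(rows / tile_dim)

def covered_elements(rows, cols, tile_dim=TILE_DIM):
    """Emulate the launch and collect every in-bounds element that is visited."""
    blocks_x, blocks_y = grid_dims(rows, cols, tile_dim)
    visited = []
    for by in range(blocks_y):
        for bx in range(blocks_x):
            for ty in range(tile_dim):
                for tx in range(tile_dim):
                    row, col = element_for_thread(bx, by, tx, ty, tile_dim)
                    if row < rows and col < cols:  # assumed guard for edge tiles
                        visited.append((row, col))
    return visited

print(element_for_thread(bx=1, by=2, tx=3, ty=4))  # (68, 35)
```

Calling `covered_elements(70, 40)` visits each of the 2800 elements exactly once, which is the property a tiled kernel needs when the data set dimensions are not multiples of TILE_DIM.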
  • 11. CUDA Implementation Discussions
      • The CUFFT library provides a simple interface for computing parallel FFTs
          – Batch execution for multiple 1-dimensional transforms
          – Drawback: the memory needed on the host side increases with:
              • Size of the transform
              • Number of configured transforms in the batch
      • Operations missing in CUDA:
          – Library functions like cexp() and cabs()
          – Atomic operations on floating-point variables
      • Transcendental instructions execute efficiently on Special Function Units (SFUs):
          – sine
          – cosine
          – square root
  • 12. Performance Results
      • CPU vs GPU
          – Better performance on the GPU
          – Better power efficiency on the CPU
      • Small scale vs large scale
          – For small scale images (SCALE < 20), the data set fits completely in the GPU memory
          – For large scale images (SCALE > 30), the data set does not fit in the GPU memory
      • Speedup over the sequential CPU run:
                      CPU_Seq    CPU 8 Threads    CPU 16 Threads    GPU
          Scale=10       1          7.9474           8.8247        11.0488
          Scale=20       1          7.6237           8.1752        10.6159
          Scale=30       1          6.0354           7.0146        10.2855
          Scale=60       1          5.2145           6.3704        10.2364
  • 13. Using both CPU and GPU for processing
      • Programming heterogeneous systems is impacted by:
          – Data dependencies
          – Scheduling algorithms
          – System resources
      • Frequent transfers between CPU and GPU should be avoided
      • Profiling is needed to identify the parts of the code that will benefit from executing on the GPU
      • In our case, it was decided to execute only the Interpolation Loop (70% of the total execution time) on the GPU, in order to avoid transfers in steps like:
          – FFT_SHIFT
          – Transposition
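Offloading only the interpolation loop caps the gain that the GPU alone can deliver. As a rough illustration (not a calculation from the slides), Amdahl's law gives the overall speedup when a fraction of the run time is accelerated and the rest is left unchanged; the per-loop GPU speedups below are hypothetical, and transfer time is ignored:

```python
def overall_speedup(offloaded_fraction, offload_speedup):
    """Amdahl's law: speedup of the whole run when only a fraction is accelerated."""
    remaining = 1.0 - offloaded_fraction
    return 1.0 / (remaining + offloaded_fraction / offload_speedup)

fraction = 0.70  # share of run time spent in the interpolation loop (from the slide)
for loop_speedup in (5, 10, 100):  # hypothetical GPU speedups for the loop alone
    print(loop_speedup, round(overall_speedup(fraction, loop_speedup), 2))
```

With 70% offloaded, the bound is 1 / (1 - 0.7), roughly 3.3x relative to the same CPU configuration running the remaining steps, so the CPU-side steps also need to run in parallel for the combined system to go beyond that.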
  • 14. Using Multiple GPU Devices
      • OpenMP + CUDA: one OpenMP thread per device
          – Separate GPU context
      • Each thread independently calls
          – Memory management functions
          – CUDA kernels
      • 2 approaches
          – The same image is reconstructed by 2 GPUs
              • Bottlenecks in the QPI (remote accesses) and PCIe links
          – Separate images are reconstructed on 2 separate GPUs (pipelined version)
              • Reduced CPU <-> GPU data transfers
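The pipelined approach, where one host thread drives one device and each device reconstructs whole images, can be sketched as follows. This is a Python `threading` stand-in for the OpenMP + CUDA setup described on the slide; `reconstruct_on_device` is a hypothetical placeholder that makes no real GPU calls, and the shared work queue is an assumed scheduling choice:

```python
import queue
import threading

def reconstruct_on_device(device_id, image_id):
    """Placeholder for the per-image GPU work (hypothetical, no real CUDA calls)."""
    return f"image {image_id} reconstructed on device {device_id}"

def worker(device_id, pending, results, lock):
    """One host thread drives one device and pulls whole images from a shared queue."""
    while True:
        try:
            image_id = pending.get_nowait()
        except queue.Empty:
            return
        outcome = reconstruct_on_device(device_id, image_id)
        with lock:
            results[image_id] = (device_id, outcome)

def run_pipelined(num_devices, num_images):
    """Start one thread per device and wait until every image is processed."""
    pending = queue.Queue()
    for image_id in range(num_images):
        pending.put(image_id)
    results, lock = {}, threading.Lock()
    threads = [threading.Thread(target=worker, args=(d, pending, results, lock))
               for d in range(num_devices)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

results = run_pipelined(num_devices=2, num_images=6)
print(sorted(results))
```

Because each image stays on a single device, no intermediate data has to cross between the two GPUs, which is the property the slide credits for the reduced transfers.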
  • 15. Results Updated
      Speedup over the sequential CPU run:
                      CPU_Seq    CPU 8 Threads    CPU 16 Threads    GPU        GPU+CPU    2GPUs      2GPUs Pipelined
          Scale=10       1          7.9474           8.8247        11.0488    10.3740     2.3472       5.3086
          Scale=20       1          7.6237           8.1752        10.6159    11.7166     5.7588      11.5306
          Scale=30       1          6.0354           7.0146        10.2855    11.6952     8.8412      13.4404
          Scale=60       1          5.5136           6.5883        10.2364    12.5270    11.3020      17.4938
  • 16. Summary and Conclusions
      • Porting the SAR application to CUDA requires knowledge of the underlying hardware and of the CUDA paradigm.
      • For the SAR application, GPUs offer better performance than CPUs
          – But CPUs are more power efficient
      • Heterogeneous computing improves performance, but the Performance/Watt ratio is impacted by the number of CPU <-> GPU transfers.
      • Static scheduling of CUDA kernels offers no flexibility in heterogeneous computing environments
      • When using multiple GPU devices, it is very important to reduce the number of CPU <-> GPU and GPU <-> GPU transfers.
  • 17. Thank You! Questions? Fisnik Kraja Chair of Computer Architecture Technische Universität München kraja@in.tum.de