GPU Compute in Medical and Print Imaging

Until recently, GPU compute relied on discrete GPUs for a fairly limited set of academic and supercomputing workloads. With the increased performance of the integrated GPU inside an Accelerated Processing Unit (APU), the introduction of Heterogeneous System Architecture (HSA) devices, and the proliferation of programming tools, GPU compute is making its way into mainstream applications. This presentation covers GPU compute and HSA, focusing on applications of GPU compute in the Medical and Print Imaging segments. It reviews example performance data and makes the case that GPU compute can deliver tangible benefits.

Published in: Technology

  1. GPU Compute in Medical and Print Imaging. Amey Deosthali, Director, Embedded Imaging
  2. Medical Imaging Trends
     • System optimization and miniaturization: high-end systems of yesterday becoming portables of today
     • Increased use of 3D/4D imaging: advances in visualization and increased use of 3D/4D imaging for improved diagnosis
     • Integration of modalities and advanced features: endoscopic ultrasound, augmented reality, robotic endoscopy
     • Increased system cost pressures: expanding emerging markets, regulatory pressures, increased competition
  3. Print Imaging Trends
     • Diagram: traditional multi-function printer architecture versus GPU-compute-based multi-function printer architecture (SoC with GPU)
     • Scalable software
     • Scalable architecture
     • System cost savings
  4. GPU Compute and AMD APU
     • GPU compute in imaging: Medical and Print Imaging workloads are well suited for GPU compute
     • HSA can deliver significant benefits in the field of imaging
     • AMD APUs integrate a GPU with support for Heterogeneous System Architecture (HSA)
  5. GPU COMPUTE IN MEDICAL IMAGING
  6. Typical Ultrasound Imaging Pipeline (block diagram; the legend marks the GPU-friendly blocks)
     • Front end: transducer, transmitter, receiver, beamforming
     • Echo processing: IQ demodulation, envelope detection, log compression, filters (edge enhancement, speckle reduction), frame averaging, frequency/time compounding, 2D image formation
     • Color flow processing: wall filter, velocity estimation, color flow analysis, spatial Doppler
     • Scan conversion
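As a rough illustration of two of the echo-processing blocks named on this slide, the sketch below applies envelope detection (magnitude of each IQ sample) followed by log compression to 8-bit display values. It is not taken from the presentation: the function names, the 60 dB dynamic range, and the peak-relative scaling are assumptions chosen for the example.

```python
import math

def envelope(i_samples, q_samples):
    """Envelope detection: magnitude of each complex IQ sample."""
    return [math.hypot(i, q) for i, q in zip(i_samples, q_samples)]

def log_compress(env, dynamic_range_db=60.0):
    """Map envelope values to 0..255 over an assumed dynamic range in dB."""
    peak = max(env)
    if peak <= 0.0:
        return [0] * len(env)
    out = []
    for v in env:
        # Level in dB relative to the peak; zero samples sit at the floor.
        db = 20.0 * math.log10(v / peak) if v > 0.0 else -dynamic_range_db
        db = max(db, -dynamic_range_db)
        out.append(round(255 * (db + dynamic_range_db) / dynamic_range_db))
    return out

# Hypothetical IQ samples: a strong echo, one 20 dB weaker, and silence.
pixels = log_compress(envelope([3.0, 0.3, 0.0], [4.0, 0.4, 0.0]))
```

Each output value depends only on its own input sample (plus one global peak), which is the kind of per-sample independence that tends to map well onto GPU work-items.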
  7. GPU Compute for SW Beamforming
     • Faster scans and improved image quality: evolution in algorithm complexity with GPU; reconstruct whole image plane
     • Access to raw data: fast data transfer and efficient use of system memory
     • Simplified architecture: scalable SW-defined architecture
     • Data path: JESD-204b, 64-256 I/O channels; a bridge converts JESD-204b to PCIe; Gen 3 PCIe® x16 with DirectGMA support for 10+ GB/s
     • Image formation (plane wave imaging): FK Stolts with optimized FFT/iFFT; IQ demodulation and log compression; GPU coherent compounding
     • Image post processing (GPU + CPU):
       • Separable filters: Sobel and box filters
       • Non-separable filter: Laplacian of Gaussian
       • De-speckle filter: median filter
       • Frequency domain filter: Gaussian blur and edge enhancement filters
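The slide groups the box filter under separable filters. As a minimal sketch of what separability means (not code from the presentation; the function names and the test image are made up for illustration), a 3x3 box sum can be computed directly or as a horizontal 1x3 pass followed by a vertical 3x1 pass, with identical results:

```python
def box_sum_2d(img):
    """Direct 3x3 box sum over interior pixels (9 reads per output pixel)."""
    h, w = len(img), len(img[0])
    return [[sum(img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1))
             for x in range(1, w - 1)]
            for y in range(1, h - 1)]

def box_sum_separable(img):
    """Same result from a horizontal pass followed by a vertical pass."""
    h, w = len(img), len(img[0])
    # Horizontal pass: each row shrinks by the two border columns.
    rows = [[row[x - 1] + row[x] + row[x + 1] for x in range(1, w - 1)]
            for row in img]
    # Vertical pass over the row sums (3 + 3 reads per output pixel).
    return [[rows[y - 1][x] + rows[y][x] + rows[y + 1][x] for x in range(w - 2)]
            for y in range(1, h - 1)]

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
direct = box_sum_2d(image)
separable = box_sum_separable(image)
```

For a k x k kernel the separable form needs about 2k reads per pixel instead of k squared, which is why separable and non-separable filters are usually optimized differently.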
  8. SW Beamforming on AMD APU
     • Data path: acquisition device, RF data over DirectGMA (> 10 GB/s), software beamformer on iGPU or dGPU
     • OpenCL™ implementation of FK Stolts algorithm. Stages shown in the diagram: 1D FFT Z, 1D FFT X, shift and transpose, FK interpolation, 1D IFFT, transpose
     • SW beamformer performance [1]:
       Input                        APU       dGPU
       256 channels, 2048 samples   1.95 ms   0.47 ms
       128 channels, 2048 samples   1.15 ms   0.29 ms
     • Images: processed output; 5x5 median filter
  9. Speckle Noise Reduction (block diagram from IQ demodulation output to speckle reduction output). Blocks shown: down-sample by 2, subtract, multiply with coefficients, up-sample by 2, gamma correction, Sobel, diffusion, pixel correction
  10. Speckle Noise Reduction Optimization
     • Combine multiple functions into a single kernel
       • Get more compute per byte of global memory access
       • Reduce kernel launch delay overheads
     • Reduce use of temporary buffers and buffer copies
     • Reduce CPU bottlenecks that require blocking calls by moving operations to the GPU
     • Optimize the pipeline with "in order" enqueue of OpenCL commands
     • Diagram: blocks A to E are regrouped as Block A & B, Block C & D, and Block E (multiple OpenCL kernels each). Labels: downsample + memcpy; downsample + optimized memcpy; color conversion, edge detection, diffusion, normalization, gamma correction, image enhancement
     • CPU path: 4.10 ms. GPU path [2]: 1.01 ms
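The first two optimizations on this slide (fewer kernels, fewer temporary buffers) can be illustrated on the CPU with two of the per-pixel stages the slide names, normalization and gamma correction. This is only a sketch of the idea in plain Python, not the OpenCL code behind the measurements; the function names and the gamma exponent are assumptions.

```python
def normalize(v, lo, hi):
    """Scale v from [lo, hi] to [0, 1]."""
    return (v - lo) / (hi - lo)

def gamma_correct(v, gamma=0.5):
    """Per-pixel gamma curve; the exponent is an illustrative value."""
    return v ** gamma

def unfused(pixels, lo, hi):
    """Three passes, each writing a full-size temporary buffer."""
    normalized = [normalize(p, lo, hi) for p in pixels]
    corrected = [gamma_correct(p) for p in normalized]
    return [round(255 * p) for p in corrected]

def fused(pixels, lo, hi):
    """One pass: each pixel is read once and written once."""
    return [round(255 * gamma_correct(normalize(p, lo, hi))) for p in pixels]

samples = [0.0, 25.0, 100.0]
fused_out = fused(samples, 0.0, 100.0)
unfused_out = unfused(samples, 0.0, 100.0)
```

Both versions produce the same values; the fused one does the same arithmetic with one read and one write per pixel, which corresponds to "more compute per byte of global memory access" when the stages run as a single GPU kernel.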
  11. Code Migration and Optimization Process
     1. Profile: identify target workloads to convert
     2. Convert: move target workloads from CPU to GPU
     3. Block optimization: combine multiple CPU calls into a single OpenCL kernel
     4. Buffer optimization: reduce use of temporary buffers and buffer copies
     5. Pipeline optimization: move low-workload CPU operations to the GPU to reduce blocking calls
     6. Reduce kernel launch delay: "in order" enqueue of OpenCL commands
  12. Sobel Filter Optimization
     • Pipeline (same in both configurations): 8-bit grayscale image (1920x1080), median filter, IPP 8 to 32-bit float, Sobel and Sobel magnitude, max and min
     • A: CPU-optimized modules
     • B: Sobel filter migrated to the GPU with OpenCL (CPU-optimized and GPU-optimized modules)
     • Timings shown: 6.51 ms and 19.47 ms
     • 2X faster computation time with migration of a single module to the GPU [3]
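For readers unfamiliar with the "Sobel and Sobel magnitude" and "max and min" stages, here is a small CPU reference sketch. It is an illustration only: it uses the common 3x3 Sobel kernels rather than the 5x5 filter size listed in the endnotes, leaves the border at zero, and all names are hypothetical.

```python
import math

KX = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))   # horizontal gradient kernel
KY = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))   # vertical gradient kernel

def sobel_magnitude(img):
    """Gradient magnitude for interior pixels; the 1-pixel border stays 0.0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    p = float(img[y + dy - 1][x + dx - 1])  # 8-bit to float
                    gx += KX[dy][dx] * p
                    gy += KY[dy][dx] * p
            out[y][x] = math.hypot(gx, gy)
    return out

def min_max(mag):
    """Global minimum and maximum of the magnitude image."""
    flat = [v for row in mag for v in row]
    return min(flat), max(flat)

# Hypothetical 4x4 test image with a vertical edge between columns 1 and 2.
image = [[0, 0, 255, 255] for _ in range(4)]
mag = sobel_magnitude(image)
lo, hi = min_max(mag)
```

Every output pixel is computed from a fixed neighborhood with no dependence on other outputs, which is what makes this module a natural first candidate for migration to an OpenCL kernel; the min/max step is a reduction and is typically handled separately.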
  13. GPU COMPUTE IN PRINT IMAGING
  14. Print and Scan Image Pipeline
  15. Accelerated RIP Pipeline
     • Open source Ghostscript PostScript renderer accelerated using GPU [4]
     • Software stack: AMD G-Series reference board; Ubuntu 14.04 Linux OS; KMD GFX driver; OCL 2.0 runtime; OGL 4.3 runtime; OCL code; GLSL libraries; C libraries; Ghostscript app
     • Pipeline blocks shown: PDF files on disk, PDL interpreter, element decompose, generate glyph bitmaps, DMA, raster, color conversion, planarize (blocks labeled GPU), DMA, bitmap file on RAMdisk
     • Legend: OpenCL; GL Shader Language (GLSL); CPU operating in host memory; GPU operating in device memory
  16. GPU compute can deliver a large increase in PPM performance [4]
     RIP pipeline acceleration: PPM (pages per minute) performance of the Ghostscript RIP pipeline, test case 2
       Resolution   Device   Legacy code (no GPU accel.)   GPU-accelerated code   Speedup
       600 dpi      GX-412   101.8 PPM                     244.3 PPM              2.4x
       600 dpi      GX-424   164 PPM                       370 PPM                2.3x
       1200 dpi     GX-412   27.6 PPM                      76.6 PPM               2.8x
       1200 dpi     GX-424   44 PPM                        111 PPM                2.5x
  17. GPU compute can free up CPU for other value-added tasks [4]
     RIP pipeline acceleration: CPU load reduction. CPU load is the average load across all 4 CPU cores of the G-Series devices under test.
     • Chart: average CPU load (%) versus PPM, test case 2 at 600 DPI
     • Chart: average CPU load (%) versus PPM, test case 2 at 1200 DPI
     • Series in both charts: legacy code (no GPU acceleration) on GX-424 and GX-412; GPU-accelerated code on GX-424 and GX-412
  18. Optical Character Recognition: Tesseract Project Accelerated using GPU
     • Optical Character Recognition (OCR) project: Tesseract, an open source OCR engine (diagram: Tesseract flow)
     • GPU compute for OCR:
       • Most of the image preprocessing and character recognition is GPU friendly
       • The data structures in the word recognition phase are not very GPU friendly
     • Expected future improvements: Deep Neural Network (DNN) for character recognition
  19. Optical Character Recognition: Demo Performance
     Processing time measured for the above modules with CPU processing and GPU-accelerated processing [5] (time in seconds)
                               AMD APU 95W   AMD APU 35W
       Non-OpenCL (CPU only)   23.65         46.2
       OpenCL (GPU compute)    16.79         36.3
       Gain                    41%           27%
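The slide does not say how the Gain row is defined. The figures are consistent with CPU time divided by GPU time, minus one (a throughput gain), rather than with the fraction of time saved. The short check below works through both readings; the function names are made up for this example, and the interpretation is an inference from the numbers, not a statement from the presentation.

```python
def throughput_gain_percent(cpu_seconds, gpu_seconds):
    """Extra work per unit time when moving from CPU-only to GPU compute."""
    return round((cpu_seconds / gpu_seconds - 1.0) * 100)

def time_saved_percent(cpu_seconds, gpu_seconds):
    """Share of the original processing time that is eliminated."""
    return round((1.0 - gpu_seconds / cpu_seconds) * 100)

# Times in seconds, taken from the table on this slide.
gains = {
    "95W": throughput_gain_percent(23.65, 16.79),
    "35W": throughput_gain_percent(46.2, 36.3),
}
saved = {
    "95W": time_saved_percent(23.65, 16.79),
    "35W": time_saved_percent(46.2, 36.3),
}
```

Under the first reading the results match the slide (41% and 27%); expressed as time saved, the same measurements correspond to roughly 29% and 21%.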
  20. Core Scan Processing Algorithms
     • AMD worked with a customer to accelerate a partial scan pipeline using OpenCL on AMD APU and GPU
     • The scan pipeline includes several image processing algorithms such as grayscale conversion, edge detection, rotation, and color conversion
     • GPU compute can deliver significant improvement in processing time compared to CPU-based processing [6], which translates to faster scan time and higher scan PPM
     Iterative algorithm optimization on AMD APU (execution time):
       Algorithm            CPU optimized*   OpenCL optimized   OpenCL optimized, fused code
       Grayscale            13.5 ms          4.6 ms (2.9x)      -
       Median               25.6 ms          3.1 ms (8.3x)      -
       Grayscale + Median   39.1 ms          7.9 ms (5.0x)      5.9 ms (6.6x)
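The last table row compares running grayscale conversion and the median filter as two steps against a single fused step. The sketch below shows one way such a fusion can look, in plain Python rather than OpenCL. It is not the customer code that was measured: the 3x3 window, the integer BT.601 luma weights, the border handling, and all names are assumptions.

```python
def to_gray(rgb):
    """Integer BT.601 luma (an assumption; the slide names no formula)."""
    r, g, b = rgb
    return (299 * r + 587 * g + 114 * b) // 1000

def two_pass(img):
    """Grayscale into an intermediate buffer, then a 3x3 median over it."""
    h, w = len(img), len(img[0])
    gray = [[to_gray(p) for p in row] for row in img]   # intermediate buffer
    out = [row[:] for row in gray]                      # border copied as-is
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [gray[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = sorted(window)[4]
    return out

def fused(img):
    """Single pass: grayscale is computed on the fly inside the median window."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if 0 < y < h - 1 and 0 < x < w - 1:
                window = [to_gray(img[y + dy][x + dx])
                          for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
                out[y][x] = sorted(window)[4]
            else:
                out[y][x] = to_gray(img[y][x])
    return out

# Hypothetical 3x3 RGB image: uniform dark pixels with one bright outlier.
image = [[(10, 10, 10)] * 3 for _ in range(3)]
image[1][1] = (255, 255, 255)
result = fused(image)
```

The fused version never stores the full grayscale image, at the cost of recomputing the conversion for overlapping windows. On a GPU that recomputation is often cheaper than an extra round trip through global memory, although the actual trade-off depends on the device and kernel design.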
  21. Partial scan pipeline acceleration [7] [8]
     • Color conversion
     • Document detect and alignment correction
     • Quality improvement
  22. CONCLUSION
  23. The Future is Bright with GPU Compute
     • Improve quality of human care with improved accuracy
     • Empower new experiences with next generation technology
     • Enhance performance while reducing system cost
  24. Endnotes
     [1] Testing by AMD performance labs. Measured performance of OpenCL™ implementation of FK Stolts algorithm on AMD APU and AMD FirePro GPU. System Configuration: AMD Lamar development board with Windows® 10, AMD RX427BB 35W APU, 2.7/3.6 GHz, 2133 MHz DDR3, 8GB RAM. Discrete GPU: AMD FirePro™ W9100 GPU, 275W, 5.2 TFLOPS SP, 16GB GDDR5, 512-bit memory interface, Windows 10. Driver version 15.200.1045-150622a
     [2] Testing by AMD performance labs. Measured performance of Speckle Noise Reduction pipeline with and without GPU acceleration, multi-threaded CPU compiler option. Image size: 768 x 252, active ROI was 712 x 252. System Configuration: AMD Lamar development board with Windows® 10, AMD RX427BB 35W APU, 2.7/3.6 GHz, 2133 MHz DDR3, 8GB RAM. Discrete GPU: AMD FirePro™ W9100 GPU, 275W, 5.2 TFLOPS SP, 16GB GDDR5, 512-bit memory interface, Windows 10. Driver version 16.20-160405a-301215E
     [3] Testing by AMD performance labs. Measured performance of Sobel Filter with and without GPU acceleration. 8.2 Multi Threaded Library. Image resolution: 1920x1080. Sobel filter size: 5x5. System Configuration: Advantech ComE board with Windows 7 64-bit, AMD RX425BB, 35W, 2.5/3.4 GHz, 1866 MHz DDR3, 4GB RAM, AMD driver version: 14.502.1001.1001, OpenCL 1.2
     [4] Testing by AMD performance labs. Measured performance of Raster Image Processing with and without GPU acceleration. System Configuration: AMD GX-424CC: 25W, 2.4 GHz, 1866 MHz DDR3, 8GB RAM, AMD GX-412HC: 7W, 1.2 GHz, 1333 MHz DDR3, 8 GB RAM. Ubuntu 14.04 with AMD Catalyst Driver 14.301.1001
  25. Endnotes
     [5] Testing by AMD performance labs. Measured performance of Optical Character Recognition using Tesseract open source code with and without GPU acceleration. System Configuration: AMD APU 95W: AMD A10-7850K APU with Radeon™ HD Graphics, 3.7/4.0 GHz, AMD APU 35W: AMD A10-7400P APU with Radeon™ HD Graphics, 2.7/3.6 GHz. Windows® 8.1, OpenCL™ 1.2, version 1084.4
     [6] Testing by AMD performance labs. Measured performance of scan pipeline performance using proprietary customer code with and without GPU acceleration. System Configuration: AMD Olive Hill+ development board, AMD RX427BB: 25W, 2.7 GHz, 1600 MHz DDR3, 8GB RAM, Windows 8.1, AMD Catalyst 14.29 drivers and OpenCL™ 1.2
  26. Endnotes
     [7] Testing by AMD performance labs. Measured performance of partial scan pipeline using proprietary customer code. System Configuration: AMD Olive Hill+ development board with AMD RX427BB: 35W, 2.7 GHz, 1600 MHz DDR3, 8GB RAM, Ubuntu 14.04 and AMD Catalyst driver 14.29
     [8] Testing by AMD performance labs. Measured performance of partial scan pipeline using proprietary customer code. System Configuration: 2015 MacBook Pro with Intel Core i7-4980HQ 2.8 GHz, 16 GB DDR3L RAM. AMD Radeon™ R9 M370X Graphics, 2GB GDDR5, Mac OS X 10.10.3. AMD Catalyst 14.29
  27. Disclaimer The information contained herein is for informational purposes only, and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD’s products are as set forth in a signed agreement between the parties or in AMD's Standard Terms and Conditions of Sale. AMD's products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD's product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice. AMD does not provide a license/sublicense to any intellectual property rights relating to any standards, including but not limited to any audio and/or video codec technologies such as AVC/H.264/MPEG-4, AVC, VC-1, MPEG-2, and DivX/xVid. AMD, the AMD Arrow logo, AMD Catalyst, AMD CrossFire, AMD CrossFireX, AMD Radeon, ATI Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc.
Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. Windows and DirectX are registered trademarks of Microsoft Corporation. ARM is a registered trademark of ARM Limited. 3DMark is a trademark of Futuremark Corporation. DivX is a registered trademark of DivX, Inc. HDMI is a trademark of HDMI Licensing, LLC. Linux is a registered trademark of Linus Torvalds. OpenCL is a trademark of Apple Inc. used by permission of Khronos. PCIe and PCI Express are registered trademarks of PCI-SIG Corporation. © 2016 Advanced Micro Devices, Inc. All rights reserved.
  28. THANK YOU