OFC/NFOEC: GPU-based Parallelization of System Modeling
- 2. Outline
• Motivation
• Numerical System Modeling
• GPU-Parallelization
• Comparison of Speedup and Accuracy
• Conclusion
© 2013 ADVA Optical Networking. All rights reserved.
- 3. Acknowledgments
The author would like to acknowledge the help and
contributions of
Adam Chachaj – Krone Messtechnik
Heinrich Müller – TU Dortmund
Peter Krummrich – TU Dortmund
Markus Roppelt – ADVA Optical Networking
Michael Eiselt – ADVA Optical Networking
- 4. Motivation
- 5. In Short: Computational Performance
Graphical Processing Unit
(GPU)
vs.
CPU Cluster
- 6. Increase in GFlop/s
• GPU performance is growing even faster than predicted by Moore's
law and is significantly higher than CPU performance
• GPUs are also attractive for general-purpose computing
(complex numerical simulations)
- 7. Optical System Modeling
• Simulation of (long-haul) optical transmission systems requires a
numerical solution of the nonlinear Schrödinger equation
• Accurate simulation of nonlinear fiber effects needs small step sizes,
which leads to a high computational effort
• Precise estimation of the bit error ratio (BER) uses Monte-Carlo
simulations for PMD and noise
• This requires a high number of simulated bits
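For reference, one common scalar form of the nonlinear Schrödinger equation is written out below. The slides do not state the equation, so the symbols and the sign convention are assumptions: A(z,t) is the slowly varying field envelope in a retarded time frame, α the attenuation coefficient, β2 the group-velocity dispersion parameter, and γ the nonlinear coefficient. Other texts use different signs or include higher-order terms.

```latex
\frac{\partial A}{\partial z}
  = -\frac{\alpha}{2} A
    - \frac{i \beta_2}{2} \frac{\partial^2 A}{\partial t^2}
    + i \gamma |A|^2 A
```

The first two terms on the right-hand side are linear in A (loss and dispersion); the last term is the Kerr nonlinearity.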
- 8. Split-Step Fourier Method (SSFM)
• Splits the nonlinear Schrödinger equation into linear and nonlinear parts
• Solves the linear and nonlinear parts separately
• The linear part is solved in the frequency domain and the nonlinear
part in the time domain (an acceptable approximation for small step sizes)
[Figure: chain of alternating FFT and IFFT operations; one FFT/IFFT pair forms one split-step]
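The split-step idea can be sketched in a few lines. This is a minimal illustration, not the implementation from the talk: it assumes a lossless scalar fiber, normalized units, a simple (non-symmetric) splitting, and one particular sign convention for the dispersion term. The function names `fft`, `ifft` and `ssfm` and all parameter values are hypothetical. It uses only the Python standard library and runs on the CPU in double precision.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(x):
    """Inverse FFT with the 1/n scaling applied once."""
    n = len(x)
    return [v / n for v in fft(x, inverse=True)]


def ssfm(field, dt, length, n_steps, beta2, gamma):
    """Propagate a sampled field through a lossless fiber (assumed model)."""
    n = len(field)
    h = length / n_steps
    # Angular frequency of each FFT bin, in standard FFT ordering.
    omega = [2 * math.pi * (k if k < n // 2 else k - n) / (n * dt)
             for k in range(n)]
    # Dispersion phase for one step, applied in the frequency domain.
    lin = [cmath.exp(0.5j * beta2 * w * w * h) for w in omega]
    for _ in range(n_steps):
        spectrum = fft(field)                           # FFT
        field = ifft([s * p for s, p in zip(spectrum, lin)])  # IFFT
        # Kerr phase rotation, applied sample by sample in the time domain.
        field = [a * cmath.exp(1j * gamma * abs(a) ** 2 * h) for a in field]
    return field


# Hypothetical example: Gaussian pulse in normalized units.
n, dt = 64, 1.0
pulse = [complex(math.exp(-0.5 * ((i - n / 2) * dt / 4.0) ** 2))
         for i in range(n)]
out = ssfm(pulse, dt, length=10.0, n_steps=20, beta2=-1.0, gamma=0.1)
energy_in = sum(abs(a) ** 2 for a in pulse)
energy_out = sum(abs(a) ** 2 for a in out)
print(energy_in, energy_out)
```

Each pass through the loop costs one FFT and one IFFT, so the FFT dominates the run time once the number of steps grows into the thousands. In this lossless sketch both sub-steps only rotate phases, so the pulse energy should stay constant up to rounding, which makes a convenient sanity check.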
- 9. Speedup Factor (GPU vs CPU)
[Figure: speedup factor, GPU vs. CPU; panels: single precision (SP) and double precision (DP)]
Legend
DP: Nvidia CUDA FFT
SP: FFT using pre-calculated twiddle factors
• Single-precision arithmetic has much higher performance on GPUs
(because their main target group is computer gaming)
• Longer block lengths allow better parallelization
• A single-precision implementation is therefore desirable
- 10. Accuracy (in single precision)
Legend
CUFFT: Nvidia CUDA FFT
FFTW: Fastest Fourier Transform in the West
IPP: Intel Integrated Performance Primitives
LUT: LUT-based FFT (trigonometric functions precalculated in DP)
• Total accuracy of SSFM dominated by FFT accuracy
• Backward error grows linearly with increasing number of FFTs
• CUDA FFT shows considerably higher error than other FFT
implementations
- 11. Analysis: Accuracy
Why is the accuracy of CUFFT in SP relatively low?
• FFT accuracy depends crucially on the accuracy of the "twiddle factors"
(trigonometric function values)
• The hardware implementation of trigonometric functions in SP on GPUs is
optimized for peak performance, not accuracy
What can be done to increase accuracy in single precision?
• Implement a Taylor series expansion (slow!)
• Compute the trigonometric functions in DP on the CPU and store them in
a look-up table on the GPU
(especially suited to the split-step Fourier method, with its thousands
of FFTs of similar length)
J. C. Schatzman, SIAM J. Scientific Comput. (1996).
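The look-up-table approach can be sketched as follows. This is an illustration under stated assumptions, not the GPU code from the talk: the names `to_float32`, `make_twiddle_lut` and `fft_lut` are hypothetical, the code runs on the CPU, and because Python floats are double precision only the table entries are rounded to single precision. The butterflies themselves still run in DP, so the sketch shows the table structure rather than the SP error behaviour.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def make_twiddle_lut(n):
    """Twiddle factors exp(-2*pi*i*k/n) for k < n/2.

    Each value is computed in double precision and rounded once for
    single-precision storage.
    """
    lut = []
    for k in range(n // 2):
        angle = -2.0 * math.pi * k / n
        lut.append(complex(to_float32(math.cos(angle)),
                           to_float32(math.sin(angle))))
    return lut


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_lut(x, lut):
    """Iterative radix-2 FFT that reads every twiddle factor from the table."""
    n = len(x)                      # must be a power of two, n == 2 * len(lut)
    bits = n.bit_length() - 1
    a = [x[bit_reverse(i, bits)] for i in range(n)]
    size = 2
    while size <= n:
        half = size // 2
        stride = n // size          # table step for this stage
        for start in range(0, n, size):
            for k in range(half):
                t = lut[k * stride] * a[start + k + half]
                a[start + k + half] = a[start + k] - t
                a[start + k] = a[start + k] + t
        size *= 2
    return a


# The table is built once and reused for every FFT of the same length.
n = 16
lut = make_twiddle_lut(n)
x = [complex(math.sin(i), math.cos(3 * i)) for i in range(n)]
spectrum = fft_lut(x, lut)
```

Because no trigonometric function is evaluated inside the transform, the cost of building the table is paid once and amortized over all FFTs of that length, which is the situation the split-step method creates.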
- 12. Illustrative Example
[Figure: CUDA FFT (SP) vs. LUT-based FFT (SP); curves for GPU and CPU]
• The look-up-table-based FFT provides significantly higher accuracy in
single-precision arithmetic
• The look-up table holds pre-calculated twiddle-factor values
Source: S. Pachnicke, et al, OFC 2011.
- 13. System Analysis (SSFM Simulation)
[Figure: required OSNR deviation for BER = 10^-3, in dB; GPU simulation (in SP or DP) vs. CPU simulation (in DP); 11 x 112 Gb/s CP-QPSK]
• GPU double precision results are (almost) identical to CPU results
• The OSNR penalty of our single precision implementation remains below
0.1 dB up to approximately 125,000 split-steps
Source: S. Pachnicke, IEEE ICTON, 2010.
- 14. Combined Simulation in SP & DP
• Calculate an approximate division of the parameter space into strata
by fast simulations with single precision.
(In the figure, the ellipses represent parameter combinations for which
bit errors occur during transmission.)
• Execute simulations with double precision accuracy sparsely in the
different strata to assess the BER.
• Combined simulation with single and double precision and automatic
(algorithmic) choice of amount of single precision simulations
P. Serena, et al, IEEE JLT, 2009.
S. Pachnicke, et al, OFC 2011.
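The two-stage idea can be illustrated with a toy stratified estimator. This is only a sketch under loudly stated assumptions and not the algorithm of the cited papers: the "simulations" are stand-in functions on Gaussian random numbers, the three strata and the 0.05 offset are invented for illustration, and the number of accurate evaluations per stratum is fixed by hand, whereas the slides describe an automatic choice. All names are hypothetical.

```python
import random


def stratified_estimate(samples, cheap_stratum, accurate_error,
                        per_stratum, rng):
    """Estimate an error probability with few calls to accurate_error.

    cheap_stratum(s)  -> stratum label from the fast (SP-like) model
    accurate_error(s) -> True/False from the slow (DP-like) model
    """
    strata = {}
    for s in samples:                       # fast pass over every sample
        strata.setdefault(cheap_stratum(s), []).append(s)
    estimate = 0.0
    for members in strata.values():
        weight = len(members) / len(samples)
        # The slow model is evaluated only on a small random subset.
        picked = rng.sample(members, min(per_stratum, len(members)))
        rate = sum(accurate_error(s) for s in picked) / len(picked)
        estimate += weight * rate
    return estimate


def cheap_stratum(s):
    """Toy fast model: a slightly offset metric sorted into three strata."""
    metric = s + 0.05                       # offset mimics reduced precision
    if metric < 1.8:
        return "clear"
    if metric < 2.2:
        return "border"
    return "error"


def accurate_error(s):
    """Toy slow model: an error occurs above a fixed threshold."""
    return s > 2.0


rng = random.Random(1)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
estimate = stratified_estimate(samples, cheap_stratum, accurate_error,
                               per_stratum=200, rng=rng)
print(estimate)
```

In this toy setup the slow model is called at most 600 times instead of 20,000, and the estimate is unbiased even though the fast model is slightly wrong, because the fast model only decides where samples are grouped, not whether an error is counted.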
- 15. Discussion
The robustness of the algorithm was checked by deliberately
selecting a high number of split-steps (880,000)
• Results of combined (SP & DP) GPU simulations match well with results obtained
from CPU simulations in DP
• Speedup of up to a factor of 180 possible compared to CPU
• Stratified Monte-Carlo sampling allows an algorithmic choice of the number of DP
simulations required for a given accuracy
Source: S. Pachnicke, et al, OFC 2011.
- 16. Design Advantages
• GPU parallelization allows simulation of a long-distance 80-channel WDM system on
a PC in reasonable time
Source: C. Xia, D. van den Borne, OFC, 2011
• Result: The system performance can be estimated much more precisely than with
CPU-based simulations (which typically model only 10-channel WDM systems)
- 17. Conclusion
• GPUs offer a much higher computational peak performance
than CPUs
• Full benefit of GPU power only in single precision
• Single precision accuracy can be increased by pre-computing the
trigonometric function values for FFTs
• Speedup in simulation time of more than a factor of 100 possible
compared to CPU
- 18. Further Reading
• N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, J. Manferdelli, “High
Performance Discrete Fourier Transforms on Graphics Processors”, Proc. of
IEEE conference on Supercomputing (SC), article no. 2 (2008).
• S. Pachnicke, “Fiber-Optic Transmission Networks: Efficient Design and
Dynamic Operation”, Springer (2011).
• J. C. Schatzman, “Accuracy of the Discrete Fourier Transform and the Fast
Fourier Transform”, SIAM J. Scientific Comput. 17, 1150-1166 (1996).
• G. Falcao, V. Silva, L. Sousa, “How GPUs can outperform ASICs for fast LDPC
decoding”, Proc. of ACM International Conference on Supercomputing
(ICS), 390-399 (2009).
• J. A. Stratton, S. S. Stone, W.-M. W. Hwu, “MCUDA: An Efficient
Implementation of CUDA Kernels for Multi-core CPUs”, Lecture Notes in
Computer Science 5335, 16-30 (2008).
• R. R. Exposito, G. L. Taboada, S. Ramos, J. Tourino, R. Doallo, “General-purpose
computation on GPUs for high performance cloud computing”, Wiley J.
Concurrency and Computation 24 (2012).
- 19. Thank you
spachnicke@advaoptical.com
IMPORTANT NOTICE
The content of this presentation is strictly confidential. ADVA Optical Networking is the exclusive owner or licensee of the
content, material, and information in this presentation. Any reproduction, publication or reprint, in whole or in part, is strictly
prohibited.
The information in this presentation may not be accurate, complete or up to date, and is provided without warranties or
representations of any kind, either express or implied. ADVA Optical Networking shall not be responsible for and disclaims any
liability for any loss or damages, including without limitation, direct, indirect, incidental, consequential and special damages,
alleged to have been caused by or in connection with using and/or relying on the information contained in this presentation.
Copyright © for the entire content of this presentation: ADVA Optical Networking.