AMD’s math libraries can support a range of programmers from hobbyists to ninja programmers. Kent Knox from AMD’s library team introduces you to OpenCL libraries for linear algebra, FFT, and BLAS, and shows you how to leverage the speed of OpenCL through the use of these libraries.
Review the material presented in the AMD Math libraries webinar in this deck.
For more:
Visit the AMD Developer Forums: http://devgurus.amd.com/welcome
Watch the replay: www.youtube.com/user/AMDDevCentral
Follow us on Twitter: https://twitter.com/AMDDevCentral
2. 2 | HETEROGENEOUS MATH LIBRARIES | DECEMBER 16, 2014
AGENDA
LIBRARIES COVERED
A survey of available libraries
clMATH
‒ clBLAS
‒ clFFT
ACML
clMAGMA
Bolt
CLMATHLIBRARIES
clMathLibraries is a GitHub organization for OpenCL™ math-related subprojects
https://github.com/clMathLibraries
Currently hosting two subprojects: clBLAS & clFFT
CLBLAS - HTTPS://GITHUB.COM/CLMATHLIBRARIES/CLBLAS
clBLAS implements the NetLib BLAS functionality with OpenCL
‒ Level 3 – Matrix x Matrix operations, O( N^3 ), compute bound
‒ Level 2 – Matrix x Vector operations, O( N^2 ), mostly memory bound
‒ Level 1 – Vector x Vector operations, O( N ), memory bound
The API is in the same style as NetLib, but appends OpenCL structures
‒ clblasStatus clblasSgemm(
      clblasOrder order, clblasTranspose transA, clblasTranspose transB,
      size_t M, size_t N, size_t K,
      cl_float alpha,
      const cl_mem A, size_t offA, size_t lda,
      const cl_mem B, size_t offB, size_t ldb,
      cl_float beta,
      cl_mem C, size_t offC, size_t ldc,
      cl_uint numCommandQueues, cl_command_queue* commandQueues,
      cl_uint numEventsInWaitList, const cl_event* eventWaitList,
      cl_event* events )
clBLAS assumes that the user is comfortable with OpenCL programming
‒ The host code is responsible for detecting/choosing devices, transferring memory and synchronizing operations
API
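To make the arithmetic behind the clblasSgemm signature concrete, here is a minimal CPU-only sketch of what a Level 3 SGEMM computes, C = alpha * A * B + beta * C. It is not clBLAS code: the function name sgemm_reference is hypothetical, it assumes column-major storage (the case clBLAS selects with clblasColumnMajor) with no transposition (clblasNoTrans), and it leaves out the offsets, command queues and events that the OpenCL API adds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference routine, not part of clBLAS.
// Computes C = alpha * A * B + beta * C for column-major, non-transposed
// operands. A is M x K (leading dimension lda), B is K x N (ldb),
// C is M x N (ldc). Element (row i, column j) lives at index i + j * ld.
void sgemm_reference(std::size_t M, std::size_t N, std::size_t K,
                     float alpha,
                     const std::vector<float>& A, std::size_t lda,
                     const std::vector<float>& B, std::size_t ldb,
                     float beta,
                     std::vector<float>& C, std::size_t ldc)
{
    // Leading dimensions must cover a full column, and the buffers must
    // be large enough for the shapes requested.
    assert(lda >= M && ldb >= K && ldc >= M);
    assert(A.size() >= lda * K && B.size() >= ldb * N && C.size() >= ldc * N);

    // Three nested loops over M, N and K: this is the O(N^3) work that
    // makes Level 3 routines compute bound.
    for (std::size_t j = 0; j < N; ++j) {
        for (std::size_t i = 0; i < M; ++i) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i + k * lda] * B[k + j * ldb];
            }
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

A routine like this is mainly useful as a correctness check: run it on small matrices and compare against the result read back from the device.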
CLBLAS - HTTPS://GITHUB.COM/CLMATHLIBRARIES/CLBLAS
A proof-of-concept Python wrapper for clBLAS has been started, but so far only sgemm is wrapped
‒ https://github.com/clMathLibraries/clBLAS/tree/master/src/wrappers/python
‒ Based on Cython
‒ Works with PyOpenCL to manage OpenCL state
‒ Would love help from the community to finish this
The community wrote a Julia wrapper for clBLAS
‒ https://github.com/JuliaGPU/CLBLAS.jl
API
CLBLAS - HTTPS://GITHUB.COM/CLMATHLIBRARIES/CLBLAS
clBLAS contains a Tune tool for finding better OpenCL kernels
• The user is responsible for running the tool on their machine as a preprocessing step
• The tool creates a kernel database file (.kdb) that contains the best-performing kernel for a given BLAS routine
• The .kdb file is specific to an OpenCL device and is named after that device, e.g. tahiti.kdb
• Example
  • export CLBLAS_STORAGE_PATH=/usr/local/lib
  • ./tune --gemm --double
CLFFT - HTTPS://GITHUB.COM/CLMATHLIBRARIES/CLFFT
clFFT implements an FFTW-inspired interface with OpenCL
‒ Provides a fast and accurate platform for calculating discrete Fourier transforms
‒ Supports 1D, 2D, and 3D transforms with a batch size that can be greater than 1
‒ Supports dimension lengths that can be any mix of powers of 2, 3, and 5
‒ Supports single and double precision floating point formats
clFFT assumes that the user is comfortable with OpenCL programming
‒ The host code is responsible for detecting/choosing devices, transferring memory and synchronizing operations
The community wrote a Python wrapper for clFFT
‒ https://github.com/geggo/gpyfft
The community wrote a Julia wrapper for clFFT
‒ https://github.com/JuliaGPU/CLFFT.jl
API
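The length rule on this slide (any mix of powers of 2, 3, and 5) can be checked on the host before a transform is set up. The sketch below is plain C++ with no clFFT calls; both function names are hypothetical helpers, and the rule encoded is only the one stated in this deck.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of clFFT: true when n is positive and has
// no prime factors other than 2, 3 and 5.
bool is_supported_length(std::size_t n)
{
    if (n == 0) {
        return false;
    }
    const std::size_t radices[] = {2, 3, 5};
    for (std::size_t r : radices) {
        while (n % r == 0) {
            n /= r;   // strip every factor of this radix
        }
    }
    return n == 1;    // anything left over is an unsupported prime factor
}

// Hypothetical helper: smallest supported length that is >= n, which is a
// candidate size if the application can pad its data.
std::size_t next_supported_length(std::size_t n)
{
    assert(n >= 1);
    while (!is_supported_length(n)) {
        ++n;
    }
    return n;
}
```

Whether padding is acceptable depends on the application, since a padded transform is a different transform from the original one.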
CLFFT - HTTPS://GITHUB.COM/CLMATHLIBRARIES/CLFFT
clFFT contains the concept of ‘plans’, which allows the library to tune OpenCL kernels at runtime
• Users set all FFT state in an FFT plan object when initializing
• Call ‘BakePlan’ on the plan object to tell the library to generate and compile (JIT) the kernel outside of performance-sensitive loops
• Reuse those plans as much as possible!
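The plan idea is easier to see in miniature. The sketch below is a CPU-only analogy, not the clFFT API: DftPlan is a hypothetical class whose bake() precomputes a twiddle-factor table (standing in for clFFT's kernel generation and compilation) and whose execute() is a naive O(n^2) forward DFT that reuses the table on every call.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for an FFT plan. It only illustrates the
// "set up once, bake once, execute many times" pattern.
class DftPlan {
public:
    explicit DftPlan(std::size_t length) : n_(length) {}

    // One-time, comparatively expensive setup: precompute exp(-2*pi*i*k/n)
    // for k = 0 .. n-1. Call this outside performance-sensitive loops.
    void bake()
    {
        const double pi = std::acos(-1.0);
        twiddle_.resize(n_);
        for (std::size_t k = 0; k < n_; ++k) {
            const double angle =
                -2.0 * pi * static_cast<double>(k) / static_cast<double>(n_);
            twiddle_[k] = std::polar(1.0, angle);
        }
        baked_ = true;
    }

    // Repeatable step: naive forward DFT that reuses the baked table.
    std::vector<std::complex<double>>
    execute(const std::vector<std::complex<double>>& in) const
    {
        assert(baked_ && in.size() == n_);
        std::vector<std::complex<double>> out(n_);
        for (std::size_t k = 0; k < n_; ++k) {
            std::complex<double> sum(0.0, 0.0);
            for (std::size_t j = 0; j < n_; ++j) {
                sum += in[j] * twiddle_[(j * k) % n_];
            }
            out[k] = sum;
        }
        return out;
    }

private:
    std::size_t n_;
    bool baked_ = false;
    std::vector<std::complex<double>> twiddle_;
};
```

In use, one plan is created and baked per transform size, then execute() is called for every batch of data of that size. Creating and baking a new plan inside the loop would repeat the setup cost on each iteration, which is what the advice to reuse plans is meant to avoid.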
CLFFT - HTTPS://GITHUB.COM/CLMATHLIBRARIES/CLFFT
PERFORMANCE
clFFT v2.3.1 is included in ACML v6.1
‒ This version contains optimizations not yet pushed into the public GitHub repo
‒ You can use the clFFT.h header file from GitHub to compile your application, then use the binary from ACML
Benchmark system
‒ 64-bit Linux
‒ FirePro W9100
‒ Catalyst Pro 14.301.1010
‒ AMD A10-7850K
ACML 6 INTRODUCES HETEROGENEOUS COMPUTE
LEVERAGING CLMATH LIBRARIES TO ACCELERATE WITH OPENCL
OpenCL can be a difficult language to learn
‒ There exist legacy applications that won’t be ported to OpenCL
‒ Their developers might be willing to sacrifice peak performance for program portability
ACML 6 includes clBLAS & clFFT as new backends
‒ ACML hides all OpenCL programming from end users
‒ Client programs do not need to change at all; they only relink against ACML 6
When ACML determines that a particular BLAS or FFT call will benefit from offloading computation, it does so transparently, without the client program being aware of it
ACML 6 keeps the same API!
NEW FFTW WRAPPER
ACML 6 now ships with fftw.h
FFTW programs can link with ACML 6 to offload computation onto OpenCL devices
No changes in host code required!
ACMLSCRIPT
ACML includes a new scripting language that expresses the logic ACML uses to offload computation
• The scripting language uses Lua, with custom ACML callback functions
  • http://www.lua.org/
• Refer to chapter 7 of the ACML documentation for more information on how to modify or create your own scripts
ACMLSCRIPT: 3-PART VIDEO TUTORIALS
ACMLScript: Part 1
ACMLScript: Part 2
ACMLScript: Part 3
https://www.youtube.com/user/AMDDevCentral
ACML- HTTPS://GITHUB.COM/CLMATHLIBRARIES/CLFFT
PERFORMANCE
ACML v6.0 sgemm (results were already slightly dated at the time of the presentation)
‒ In the chart, the green line is equivalent to max(blue, red)
‒ ACML runs the computation on the host processor if the problem is too small to benefit from GPU acceleration
Benchmark system
‒ AMD A10-7850K, CPU & GPU
‒ 64-bit Linux
‒ Catalyst 14.301.1001
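The fallback to the host processor for small problems can be pictured as a size-based dispatch. The sketch below is illustrative only: ACML's real decision logic lives in its ACMLScript (Lua) files and is not shown in this deck, so the names Backend, gemm_flops and choose_backend are hypothetical, and the crossover value is a placeholder that would have to come from benchmarking a given machine.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names; this is not ACML's API or its actual heuristic.
enum class Backend { HostCpu, OpenClDevice };

// Floating-point operation count of an M x N x K GEMM: one multiply and
// one add per term of each inner product.
double gemm_flops(std::size_t m, std::size_t n, std::size_t k)
{
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Offload only when the problem is large enough to amortize transfer and
// launch overhead. crossover_flops is an assumed, machine-specific tuning
// value, not a number taken from ACML.
Backend choose_backend(std::size_t m, std::size_t n, std::size_t k,
                       double crossover_flops)
{
    assert(crossover_flops > 0.0);
    return gemm_flops(m, n, k) >= crossover_flops ? Backend::OpenClDevice
                                                  : Backend::HostCpu;
}
```

If the crossover is chosen where the CPU and GPU performance curves intersect, the dispatched performance follows whichever curve is higher at each problem size, which is the max(blue, red) behavior the slide points out.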
CLMAGMA
clMAGMA implements LAPACK functionality with OpenCL acceleration
https://bitbucket.org/icl/clmagma
Maintained by the University of Tennessee, Knoxville
CLMAGMA
LEVERAGING CLMATH LIBRARIES TO ACCELERATE WITH OPENCL
The newest release, v1.3, supports
‒ LU, QR and Cholesky factorizations
‒ Linear and least squares solvers
‒ Reductions to Hessenberg, bidiagonal and tridiagonal forms
‒ Eigen and singular value problem solvers
‒ Orthogonal transformation routines
clMAGMA uses clBLAS as the GPU compute backend
‒ It currently provides static load balancing between CPU & GPU cores
Multi-GPU support
v1.3 adds support for Windows and Mac OS X
BOLT
Bolt implements parallel C++ STL functionality with C++ AMP & OpenCL acceleration
Bolt on GitHub
Maintained by AMD
BOLT
PARALLEL STL
Bolt provides containers and algorithms that enable clients to accelerate C++ code with minimal GPU knowledge
‒ Sorts
‒ Reductions
‒ Transforms
‒ Scans
Bolt provides containers such as bolt::cl::device_vector<>
Through control structures, clients control where data is allocated and computed (minimal knowledge of C++ AMP or OpenCL is helpful here)
Bolt provides support for both OpenCL & C++ AMP paths
BOLT
EXAMPLE CODE
    #include <bolt/cl/device_vector.h>
    #include <bolt/cl/scan.h>
    #include <vector>
    #include <numeric>

    int main()
    {
        size_t length = 1024;

        // Create a device_vector (device-side storage) and initialize it to 1
        bolt::cl::device_vector< int > boltInput( length, 1 );
        // Calculate the inclusive_scan of the device_vector, in place
        bolt::cl::inclusive_scan( boltInput.begin( ), boltInput.end( ), boltInput.begin( ) );

        // Create an std vector (host-side storage) and initialize it to 1
        std::vector< int > stdInput( length, 1 );
        // Calculate the inclusive_scan of the std vector, in place
        bolt::cl::inclusive_scan( stdInput.begin( ), stdInput.end( ), stdInput.begin( ) );

        return 0;
    }
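For comparison, the same inclusive scan can be written on the host with the standard library alone, which gives a reference to check a Bolt result against. This is a sketch with hypothetical function names and no Bolt dependency; it relies only on std::partial_sum from <numeric>.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical reference helper: inclusive scan (running sum) on the CPU.
// output[i] = input[0] + input[1] + ... + input[i].
std::vector<int> inclusive_scan_reference(const std::vector<int>& input)
{
    std::vector<int> output(input.size());
    std::partial_sum(input.begin(), input.end(), output.begin());
    return output;
}

// Scanning `length` ones must yield 1, 2, ..., length, which is the
// result expected from the slide's example with length = 1024.
void check_scan_of_ones(std::size_t length)
{
    const std::vector<int> result =
        inclusive_scan_reference(std::vector<int>(length, 1));
    for (std::size_t i = 0; i < length; ++i) {
        assert(result[i] == static_cast<int>(i) + 1);
    }
}
```

Comparing the contents of a scanned container element by element against inclusive_scan_reference is a simple way to validate an accelerated path on small inputs.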
Q&A & CONTACT INFO
For More Info:
Follow us on Twitter: @AMDDevCentral
Visit our forums: http://devgurus.amd.com/welcome
Visit our website: www.developer.amd.com
Watch the replay: www.youtube.com/user/AMDDevCentral
Download the presentation: www.slideshare.net/DevCentralAMD