SIMD extensions have been a feature of choice for processor manufacturers for a couple of decades. Designed to exploit data parallelism in applications at the instruction level, these extensions still require a high level of expertise or the use of potentially fragile compiler support or vendor-specific libraries. While a large fraction of their theoretical accelerations can be obtained using such tools, exploiting such hardware becomes tedious as soon as application portability across hardware is required.
Accessing such capabilities directly from C++ code could be a major improvement in many use cases. Different takes on this have been proposed, either by the community or as actual standard proposals. Solutions include pragma-based annotations, standard algorithm policies, full-blown compiler support and libraries.
In this talk we will present one such solution, the Boost.SIMD library (currently being proposed as a Boost library), which takes a library approach to these issues.
We will go over the basic notions required to grasp SIMD programming in general. Then we will discuss the different existing approaches. We will describe the Boost.SIMD API and its design to demonstrate how it solves the issues raised by the current idiomatic way of writing SIMD-enabled code. Design issues like standard algorithm integration, memory handling and how to fill the gaps in SIMD instruction sets will be discussed. Finally, we show its performance on a subset of well-known benchmarks.
2. NumScale in a few words
Our company
French start-up specialized in software performance
We sell C++ libraries to master modern hardware performance
Consulting & training on all things C++ or HPC
NumScale and C++
Member of the ISO C++ French National Body
Enthusiastic user & contributor to OSS projects
Involved in the European C++ community
4. Is SIMD that obscure?
Let’s have a poll
Who needs performance in their daily job?
Who knows about parallel programming?
Who knows about SIMD/multimedia extensions?
Who uses SIMD extensions like SSE, AVX, VMX or NEON?
Who has nightmares because of those?
9. What is SIMD? - A French cuisine approach
Some workload to process
A regular CPU: Single Instruction, Single Data processing
A SIMD-enabled CPU: Single Instruction, Multiple Data processing
14. What is SIMD? - For real
[Figure: SISD vs. SIMD processing, showing instructions, data and results]
Principles
Wide registers store N > 1 values.
Special instructions process those registers.
Code uses a blitter-like approach.
Benefits
Speed-up of N on cache-hot data
Avoid premature scale-out
Maximize FLOPS/Watt
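To make the wide-register idea concrete, here is a minimal portable sketch. The `vec4` type and the `add`, `add_sisd` and `add_simd` helpers are hypothetical stand-ins written for this illustration, not part of any library; on real hardware the inner lane loop of `add` would be a single instruction on an N = 4 register.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 128-bit register holding N = 4 floats.
struct vec4 { std::array<float, 4> lane; };

// Models one SIMD instruction: all four lanes are processed together.
inline vec4 add(vec4 const& a, vec4 const& b)
{
  vec4 r;
  for (std::size_t i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
  return r;
}

// SISD: one addition per loop iteration.
inline std::vector<float> add_sisd(std::vector<float> const& a,
                                   std::vector<float> const& b)
{
  assert(a.size() == b.size());
  std::vector<float> r(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) r[i] = a[i] + b[i];
  return r;
}

// SIMD style: four additions per iteration, then a scalar tail.
inline std::vector<float> add_simd(std::vector<float> const& a,
                                   std::vector<float> const& b)
{
  assert(a.size() == b.size());
  std::vector<float> r(a.size());
  std::size_t i = 0;
  for (; i + 4 <= a.size(); i += 4)
  {
    vec4 va{{a[i], a[i + 1], a[i + 2], a[i + 3]}};
    vec4 vb{{b[i], b[i + 1], b[i + 2], b[i + 3]}};
    vec4 vr = add(va, vb);
    for (std::size_t k = 0; k < 4; ++k) r[i + k] = vr.lane[k];
  }
  for (; i < a.size(); ++i) r[i] = a[i] + b[i];  // leftover elements
  return r;
}
```

Both functions return the same values; the SIMD-style loop simply runs a quarter as many iterations over the bulk of the data.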
17. The Many Ways to Vectorize
Implicit vectorization
Auto-Vectorization
Compiler hints
Explicit vectorization
Language extensions
SIMD Intrinsics libraries
Vector Intrinsics
Inline Assembly (let’s just say no right now)
19. Implicit vectorization
Auto-vectorizer
Compile-time analysis of loop nest
May use special hints
Only safe transformations are applied
template <typename T>
void f(T* __restrict a, T* __restrict b, int size)
{
  // __restrict is the common compiler spelling; C99 restrict is not a C++ keyword
  #pragma ivdep
  for (int i = 0; i < size; ++i)
    a[i] += b[i];
}
20. OpenMP4
Principle
Flag a loop as one that must be vectorized
Support for reductions & SIMD functions
User is in charge of checking validity
template <typename T> T f(T* a, T* b, int size)
{
  T res = 0;
  #pragma omp simd reduction(+:res)
  for (int i = 0; i < size; ++i)
  {
    a[i] += b[i];
    res += b[i];
  }
  return res;
}
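The "SIMD functions" support mentioned above can be sketched with `#pragma omp declare simd`, which asks the compiler to also emit a vector variant of a function so that it can be called from a vectorized loop. The names `scale_add` and `saxpy_sum` are made up for this example. Without OpenMP enabled at compile time the pragmas are ignored and the code stays scalar.

```cpp
#include <cstddef>
#include <vector>

// Request a SIMD variant of this function in addition to the scalar one.
#pragma omp declare simd
inline float scale_add(float x, float y) { return 2.f * x + y; }

// Updates a in place and returns the sum of the updated values.
inline float saxpy_sum(std::vector<float>& a, std::vector<float> const& b)
{
  float res = 0.f;
  int const size = static_cast<int>(a.size());

  // The user asserts that vectorizing this loop is valid.
  #pragma omp simd reduction(+:res)
  for (int i = 0; i < size; ++i)
  {
    a[i] = scale_add(a[i], b[i]);
    res += a[i];
  }
  return res;
}
```

As the slide notes, validity is the user's responsibility: the compiler does not check that the loop is free of dependencies.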
21. SIMD Intrinsics Library
Principle
Wrap SIMD computations in a library
Support SIMD idioms with algorithms or other abstractions
Improve portability across compilers
Examples
Agner Fog’s x86 library
Vc, NOVA
gSIMD, Cyme
Boost.SIMD
25. Explicit usage of intrinsics
// Altivec
// reinterpret as u16
short0 = (__vector unsigned short)a0;
short1 = (__vector unsigned short)a1;
// shifting constant (only the low 5 bits are used, so -16 acts as 16)
shift = vec_splat_u32(-16);
// rotate by 16 bits: swaps the two 16-bit halves of each 32-bit lane
sf = vec_rl(a1, shift);
// Compute high part of the product
high = vec_msum( short0, (__vector unsigned short)sf
               , vec_splat_u32(0)
               );
// Complete by adding low part of the 16 bits product
return vec_add( vec_sl(high, shift)
              , vec_mulo(short0, short1)
              );
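The arithmetic behind this fragment is easier to follow on a single lane. The sketch below is a scalar model written for illustration (the function name `mul32_from_16` is hypothetical): it rebuilds a 32-bit product, modulo 2^32, from 16-bit by 16-bit products, which is what the intrinsic sequence does for every lane at once.

```cpp
#include <cstdint>

// Scalar model of one 32-bit lane: a0 * a1 modulo 2^32,
// using only 16 x 16 -> 32 bit products.
inline std::uint32_t mul32_from_16(std::uint32_t a0, std::uint32_t a1)
{
  std::uint32_t const lo0 = a0 & 0xFFFFu, hi0 = a0 >> 16;
  std::uint32_t const lo1 = a1 & 0xFFFFu, hi1 = a1 >> 16;

  // Cross products: the role of vec_msum once the halves of a1 are swapped.
  std::uint32_t const high = hi0 * lo1 + lo0 * hi1;

  // Low product: the role of vec_mulo.
  std::uint32_t const low = lo0 * lo1;

  // The hi0 * hi1 term would be shifted by 32 bits, so it vanishes modulo 2^32.
  return (high << 16) + low;
}
```

The result matches the wrapping product of the two 32-bit inputs, which is why five intrinsics are needed for what reads as a single `*` in scalar code.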
26. Implicit vs Explicit SIMD
Implicit
Automatic dependency analysis (e.g. reductions).
Recognises idioms with data dependencies.
Non-inline functions are scalar.
Limited support for outer-loop vectorisation.
Relies on the compiler’s vectorizable patterns library.
Explicit
No dependency analysis.
Recognises idioms without data dependencies.
Non-inline functions can be vectorised.
Outer loops can be vectorised.
May be more cross-compiler portable.
28. From bSIMD to Boost.SIMD
bSIMD
NumScale closed source software for SIMD programming
Explicit SIMD library
Supports x86, PPC, ARM architectures
Provides domain-specific algorithms
Boost.SIMD
Open Source sub-part of bSIMD
Supports x86 and Power6
Provides std-like algorithms
29. The Boost.SIMD register abstraction
pack<T,N>
Usable as a regular Value Type
Wraps a block of N contiguous elements of type T
pack<T> picks the optimal N for current hardware
Constraints
T is a fundamental type
logical<T> is used to handle booleans
N must be a power of 2.
What if?
Let C be the number of elements of type T that fit in the current hardware register
If N == C, use the native register directly
If N < C, use a scalar array (for now)
If N > C, use an aggregate of 2 x pack<T,N/2>
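The three-way choice above can be expressed as a compile-time selection. This is only a sketch under stated assumptions: the 16-byte register size is assumed, and `storage`, `native_cardinal` and `select_storage` are hypothetical names, not the library's internals.

```cpp
#include <cstddef>

enum class storage { native, scalar_array, aggregate };

// Assumption for this sketch: 16-byte registers (e.g. SSE, VMX or NEON).
constexpr std::size_t register_bytes = 16;

// C: how many elements of type T fit in one hardware register.
template <class T>
constexpr std::size_t native_cardinal() { return register_bytes / sizeof(T); }

// Picks the storage strategy for a pack of N elements of type T.
template <class T, std::size_t N>
constexpr storage select_storage()
{
  static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of 2");
  return N == native_cardinal<T>() ? storage::native
       : N <  native_cardinal<T>() ? storage::scalar_array
       :                             storage::aggregate;  // 2 x pack<T,N/2>
}
```

Because the aggregate case splits into two halves of size N/2, applying the rule recursively always ends on native registers when N is a power of 2.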
31. Getting data into packs
Constructors
pack<T,N> x(U v) : fill x with N copies of v
pack<T,N> x(U v...) : fill x with (v0,v1,...)
pack<T,N> x(T* ptr) : load N elements from aligned memory ptr
pack<T,N> x(It b, It e) : load N elements from the range [b,e)
Explicit Memory Load
load<T>(U* ptr [,Offset o] )
load<T>(mask_ptr<U> ptr [,Offset o] )
aligned_load<T>(U* ptr [,int o] )
aligned_load<T>(mask_ptr<U> ptr [,int o] )
Misaligned Loads
aligned_load<T,N>(ptr)
Load from an unaligned address with a statically known misalignment
Optimized to be faster than an unaligned load
Explicit Memory Store
store<T>(U* ptr [,Offset o] )
store<T>(mask_ptr<U> ptr [,Offset o] )
aligned_store<T>(U* ptr [,int o] )
aligned_store<T>(mask_ptr<U> ptr [,int o] )
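The difference between the aligned and unaligned variants comes down to a promise about the address. The toy below illustrates that contract with standard C++ only; `pack4f`, `is_aligned` and the three free functions are hypothetical and merely echo the names on the slide, they are not the Boost.SIMD implementations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy 16-byte pack of four floats.
struct pack4f { float lane[4]; };

inline bool is_aligned(void const* p, std::size_t alignment)
{
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Unaligned load: valid for any address.
inline pack4f load(float const* ptr)
{
  pack4f p;
  std::memcpy(p.lane, ptr, sizeof p.lane);
  return p;
}

// Aligned load: the caller promises a 16-byte aligned address,
// which lets real hardware use its faster aligned instruction.
inline pack4f aligned_load(float const* ptr)
{
  assert(is_aligned(ptr, 16));
  return load(ptr);
}

// Unaligned store of all four lanes.
inline void store(pack4f const& p, float* ptr)
{
  std::memcpy(ptr, p.lane, sizeof p.lane);
}
```

In this toy the aligned variant only checks its precondition; the point is that the alignment guarantee is part of the function's interface rather than something discovered at run time.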
34. Supported operations on pack
Basic Operators
All operators are available with possible scalar mixing
No conversion or promotion
Comparisons
==, !=, <, <=, >, >= perform SIMD comparisons.
compare_equal, compare_less perform reductive comparisons.
Other properties
Models RandomAccessRange
p[i] returns a proxy to access the register's internal value
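The two kinds of comparison differ in what they return: one boolean per lane, or a single boolean for the whole pack. A minimal sketch on toy types follows; `pack4`, `mask4`, `lanewise_less` and `all_equal` are hypothetical names chosen for this illustration.

```cpp
#include <array>
#include <cstddef>

using pack4 = std::array<float, 4>;  // toy pack
using mask4 = std::array<bool, 4>;   // toy logical pack

// SIMD comparison: yields one boolean per lane.
inline mask4 lanewise_less(pack4 const& a, pack4 const& b)
{
  mask4 m{};
  for (std::size_t i = 0; i < 4; ++i) m[i] = a[i] < b[i];
  return m;
}

// Reductive comparison: yields a single bool for the whole pack.
inline bool all_equal(pack4 const& a, pack4 const& b)
{
  for (std::size_t i = 0; i < 4; ++i)
    if (a[i] != b[i]) return false;
  return true;
}
```

The lane-wise result is typically fed to a selection or masked operation, while the reductive result can drive an ordinary `if`.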
35. Selection of available functions
Arithmetic
saturated arithmetics
long multiplication
float/int conversion
round, floor, ceil, trunc
sqrt, cbrt
hypot
average
random
min/max
rounded division and remainder
Bitwise
select
andnot, ornot
popcnt
ffs
ror, rol
rshr, rshl
twopower
IEEE
ilogb
frexp
ldexp
next/prev
ulpdist
exponent/mantissa
Predicates
comparison to zero
negated comparisons
is_unord, is_nan, is_invalid
is_odd, is_even
majority
36. Selection of available functions
Reduction
any, all, none
nbtrue
minimum/maximum
sum
product, dot product
SWAR
group/split
combine/slice
splatted reduction
cumsum
sort
shuffle
interleaving
deinterleaving
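Reductions and scans inside a register are usually built from a few shift-and-add steps rather than a lane-by-lane loop. The sketch below shows that idea for a cumulative sum on a toy 4-lane pack; `slide_up`, `add`, `cumsum` and `sum` are hypothetical helpers written for this example, with `slide_up` standing in for a lane shift instruction.

```cpp
#include <array>
#include <cstddef>

using pack4 = std::array<int, 4>;  // toy pack

// Moves lanes toward higher indices by k positions, filling with zeros.
inline pack4 slide_up(pack4 const& p, std::size_t k)
{
  pack4 r{};
  for (std::size_t i = k; i < 4; ++i) r[i] = p[i - k];
  return r;
}

// Lane-wise addition.
inline pack4 add(pack4 a, pack4 const& b)
{
  for (std::size_t i = 0; i < 4; ++i) a[i] += b[i];
  return a;
}

// Inclusive prefix sum in log2(4) = 2 shift-and-add steps.
inline pack4 cumsum(pack4 p)
{
  p = add(p, slide_up(p, 1));
  p = add(p, slide_up(p, 2));
  return p;
}

// Horizontal sum: the last lane of the prefix sum.
inline int sum(pack4 const& p) { return cumsum(p)[3]; }
```

For a pack of N lanes the same scheme needs log2(N) steps, doubling the shift distance each time.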
37. {and, Shuffling, Permutation}_{1,0,2}
Principles
Data reordering is the #1 technique in SIMD
Use cases: transpose, AoS/SoA transformations
Turn memory access into computations
Support for specific permutations
Support for arbitrary shuffling patterns
Basic permutations
reverse, broadcast
interleave, deinterleave
repeat, slide
runtime lookup
Shuffle
Arbitrary permutation of elements
Optimizable if known at compile-time
Available for one or two parameters
Shuffle with an explicit index pattern (-1 yields zero)
pack<float,4> a{1,2,3,4};
pack<float,4> b{10,20,30,40};
// r1 = [1 1 4 4]
auto r1 = shuffle<0,0,3,3>(a);
// r2 = [1 0 0 10]
auto r2 = shuffle<0,-1,-1,4>(a,b);
Shuffle with a metafunction computing each index
struct reverse_
{
  template <class I, class C> struct apply
    : std::integral_constant<int, C::value - I::value - 1>
  {};
};
// res = [4 3 2 1]
pack<float> res = shuffle<reverse_>(a);
Shuffle with a constexpr function computing each index
constexpr int mix_half(int i, int c)
{
  return i < c/2 ? i+c : i;
}
// res = [10 20 3 4]
pack<float> res = shuffle<pattern<mix_half>>(a,b);
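The index-pattern form can be modelled in a few lines of standard C++. This is a sketch on `std::array`, not the library's implementation: the `pack` alias, `pick` and both `shuffle` overloads are written for this illustration, and the conventions (-1 yields zero, indices from N upward read the second input) are inferred from the expected results quoted on the slide.

```cpp
#include <array>
#include <cstddef>

template <std::size_t N>
using pack = std::array<float, N>;  // toy pack

// Selects one output lane: -1 yields zero, [0,N) reads a, [N,2N) reads b.
template <std::size_t N>
float pick(int index, pack<N> const& a, pack<N> const& b)
{
  if (index < 0) return 0.f;
  std::size_t const i = static_cast<std::size_t>(index);
  return i < N ? a[i] : b[i - N];
}

// Two-input shuffle driven by a compile-time index pattern.
template <int... Is, std::size_t N>
pack<N> shuffle(pack<N> const& a, pack<N> const& b)
{
  static_assert(sizeof...(Is) == N, "one index per output lane");
  return {{ pick(Is, a, b)... }};
}

// One-input shuffle: same pattern rules, reading a single pack.
template <int... Is, std::size_t N>
pack<N> shuffle(pack<N> const& a)
{
  return shuffle<Is...>(a, a);
}
```

Since the pattern is a template argument, a real implementation can map well-known patterns to a single hardware permutation instruction at compile time.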
42. Integration with the Standard Library
Algorithms :
SIMD transform
SIMD reduce
Use generic functor/lambda for mixing scalar/SIMD
Allocators
Ranges :
boost::simd::input_range
boost::simd::output_range
boost::simd::aligned_input_range
boost::simd::aligned_output_range
boost::simd::segmented_input_range
boost::simd::segmented_output_range
43. Integration with the Standard Library
std::vector<float, simd::allocator<float>> v(N);
simd::transform( v.begin(), v.end()
               , [](auto const& p)
                 {
                   return p * 2.f;
                 }
               );
44. Integration with the Standard Library
std::vector<float, simd::allocator<float>> i(N), o(N);
auto x = simd::reduce( i.data(), i.data()+N
                     , 0.f
                     );
auto y = simd::reduce( i.data(), i.data()+N, 0.f
                     , [](auto&& a, auto&& e){ return a+e*e; }
                     , 0.f, simd::plus
                     );
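The reason a generic lambda is used for mixing scalar and SIMD code is that the same functor has to run on whole packs and on the leftover scalar elements. A minimal in-place sketch of that structure follows, assuming C++14; `pack4` and `simd_transform` are toy names for this example and do not reproduce the signature of `simd::transform`.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Toy pack with just enough arithmetic for the example.
struct pack4
{
  std::array<float, 4> lane;

  friend pack4 operator*(pack4 p, float s)
  {
    for (auto& x : p.lane) x *= s;
    return p;
  }
};

// In-place transform: whole packs first, then a scalar tail.
// f must be callable with both pack4 and float.
template <class F>
void simd_transform(float* first, float* last, F f)
{
  for (; last - first >= 4; first += 4)
  {
    pack4 p;
    std::copy(first, first + 4, p.lane.begin());
    p = f(p);
    std::copy(p.lane.begin(), p.lane.end(), first);
  }
  for (; first != last; ++first) *first = f(*first);
}
```

A call such as `simd_transform(data, data + n, [](auto const& p) { return p * 2.f; })` instantiates the lambda once for `pack4` and once for `float`, so the caller writes the computation a single time.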
47. Basic Functions
Single precision math functions (cycles/value)
Hardware: Core i7 SandyBridge, AVX
Function          Range          std   Scalar   SIMD
exp               [−10, 10]      46    38       7
log               [−10, −10]     42    37       5
asin              [−1, 1]        40    35       13
cos               [−20π, 20π]    66    47       6
restricted_(cos)  [−π/4, π/4]    32    9        1.3
48. Julia set generator
Generate a fractal image using the Julia function
Largely compute-bound
Challenge : Workload depends on pixel location
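49. Julia set generator
template <class T> auto julia(T const& a, T const& b)
{
  as_integer_t<T> res{0};
  std::size_t iter{0};
  T x{0}, y{0};
  bool active;
  do {
    auto x2 = x * x;
    auto y2 = y * y;
    // lanes still inside the escape radius
    auto mask = x2 + y2 < 4;
    auto xy = 2 * x * y;
    x = x2 - y2 + a;
    y = xy + b;
    // count one more iteration for the lanes still active
    res = if_inc(mask, res);
    // mask is local to the loop body, so evaluate it before leaving the body
    active = any(mask);
  } while(active && iter++ < 256);
  return res;
}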
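To check the control flow of this kernel without any SIMD library, the same loop can be run on plain `float`. In the sketch below, `any` and `if_inc` are hypothetical scalar stand-ins whose behaviour is assumed from their names (test a mask, increment where the mask is true), and `julia_scalar` is a name made up for this example.

```cpp
#include <cstddef>

// Assumed scalar stand-ins for the SIMD helpers used by the kernel.
inline bool any(bool m) { return m; }
inline int if_inc(bool m, int v) { return m ? v + 1 : v; }

// Same control flow as the kernel above, for a single pixel.
inline int julia_scalar(float a, float b)
{
  int res = 0;
  std::size_t iter = 0;
  float x = 0.f, y = 0.f;
  bool active = false;
  do {
    float const x2 = x * x;
    float const y2 = y * y;
    bool const mask = x2 + y2 < 4.f;  // still inside the escape radius
    float const xy = 2.f * x * y;
    x = x2 - y2 + a;
    y = xy + b;
    res = if_inc(mask, res);          // count iterations while inside
    active = any(mask);
  } while (active && iter++ < 256);
  return res;
}
```

In the SIMD version the loop keeps running until every lane has escaped, which is the workload imbalance the slide calls out: a pack costs as much as its slowest pixel.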
50. Julia set generator
Timing w/ Boost.SIMD and other solutions, from "An Evaluation of Current SIMD Programming Models for C++", Pohl et al., 2015
51. Interaction with Boost.Odeint
Coupled/Uncoupled Roessler system
Written by Mario Mulansky
Showcases the effects of both cache and SIMD
Uses Boost.Odeint for the ODE system
Uses Boost.SIMD to vectorize the system
Results
Minimal disruption in the code
Overall 3x performance gain
See the whole code at https://github.com/mariomulansky/olsos
52. Interaction with Boost.Odeint
template <class S, class D>
void operator()(const S& x_, D& dxdt_, double t) const
{
  auto x    = boost::begin(x_);
  auto dxdt = boost::begin(dxdt_);
  const int N = boost::size(x_);

  for (int j = 1; j < N/dim - 1; ++j)
  {
    const int i = j*dim;
    dxdt[i]     = -1.0*x[i + 1] - x[i + 2]
                + m_d * (x[i - dim] + x[i + dim] - 2.0 * x[i]);
    dxdt[i + 1] = x[i] + m_a * x[i + 1];
    dxdt[i + 2] = m_b + x[i + 2] * (x[i] - m_c);
  }
}
56. Conclusion
High level SIMD in C++11/14
Designing a C++ library for low-level performance primitives is possible
C++11/14 features play nicely with SIMD intrinsics
SIMD-specific idioms map to modern C++ components
Boost.SIMD
To be proposed for review this fall
Find us on https://github.com/numscale/boost.simd
Tests and feedback welcome
57. This talk would not have been feasible without
The bSIMD team
Lead Developer : Charly Chevalier
Developers : Jean-Thierry Lapresté, Guillaume Quintin
Tests and Doc : Alan Kelly, Kenny Peou
Our supporters
Tim Blenchman, our earliest adopter
Serge Guelton, for integrating Boost.SIMD into pythran
Mario Mulansky, for his work with Boost.SIMD & Boost.Odeint
Sylvain Jubertie, Ian Masliah, for testing Boost.SIMD in clever ways