SIMD machines — machines capable of evaluating the same instruction on several elements of data in parallel — are nowadays commonplace and diverse, be it in supercomputers, desktop computers or even mobile ones. Numerous tools and libraries can make use of that technology to speed up their computations, yet it could be argued that there is no library that provides a satisfying minimalistic, high-level and platform-agnostic interface for the C++ developer.
3. Unlocked software performance
Parallelism is everywhere
The Obvious One
Multi-cores
Many-cores
Distributed systems
The Embedded One
Pipeline
Superscalar, out-of-order CPUs
SIMD instruction sets
5. Unlocked software performance
What is SIMD?
[Figure: SISD vs. SIMD, showing how instructions and data produce results]
Principles
Single Instruction, Multiple Data
Each operation is applied to N values held in a single register of fixed size (128, 256 or 512 bits)
Can be up to N times faster than the regular scalar ALU
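A minimal sketch of the principle, assuming a 128-bit register that holds N = 4 floats. The names `lanes`, `add_simd` and the two op-count helpers are hypothetical; the inner loop only emulates what the hardware does in one instruction.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a 128-bit register holding N = 4 floats.
constexpr std::size_t N = 4;
using lanes = std::array<float, N>;

// SISD: one addition per instruction.
inline float add_scalar(float a, float b) { return a + b; }

// SIMD (emulated): one call stands for one instruction applied to all N lanes.
inline lanes add_simd(lanes const& a, lanes const& b)
{
  lanes r{};
  for (std::size_t i = 0; i < N; ++i) r[i] = a[i] + b[i];
  return r;
}

// Number of add instructions needed to process n values.
inline std::size_t scalar_op_count(std::size_t n) { return n; }
inline std::size_t simd_op_count(std::size_t n)   { return (n + N - 1) / N; }
```

For 16 values this gives 16 scalar additions against 4 SIMD additions, which is where the "up to N times faster" bound comes from.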
6. Unlocked software performance
Why use SIMD?
Speedups of 2x to 16x, which can be combined with other sources of parallelism
Reduces computing time without changing the infrastructure
Gives good results for all kinds of regular or less regular computation patterns
11. Unlocked software performance
SIMD the good ol' way: int32 * int32 -> int32
// Altivec
// Reinterpret both operands as vectors of unsigned 16-bit values
short0 = (__vector unsigned short)a0;
short1 = (__vector unsigned short)a1;
// Rotation count: only the low 5 bits are used, so -16 acts as 16
shift = vec_splat_u32(-16);
// Swap the 16-bit halves of each 32-bit element of a1
sf = vec_rl(a1, shift);
// Compute the high part of the product (sum of the cross products)
high = vec_msum( short0, (__vector unsigned short)sf
               , vec_splat_u32(0)
               );
// Complete by adding the low part, i.e. the product of the low 16 bits
return vec_add( vec_sl(high, shift)
              , vec_mulo(short0, short1)
              );
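The arithmetic behind this sequence can be checked on one element in plain C++. This is a scalar sketch, not Altivec code; `mul32_from_16` is a hypothetical name. It relies on the identity a * b mod 2^32 = ((a_hi * b_lo + a_lo * b_hi) << 16) + a_lo * b_lo, where the a_hi * b_hi term vanishes because it is shifted out of the 32-bit result.

```cpp
#include <cstdint>

// 32-bit product (modulo 2^32) rebuilt from 16-bit halves, one element at a time.
inline std::uint32_t mul32_from_16(std::uint32_t a, std::uint32_t b)
{
  std::uint32_t const a_lo = a & 0xFFFFu, a_hi = a >> 16;
  std::uint32_t const b_lo = b & 0xFFFFu, b_hi = b >> 16;

  // Cross products: the role played by the multiply-sum step above.
  std::uint32_t const cross = a_hi * b_lo + a_lo * b_hi;

  // Shift the cross products into the high half, then add the low product.
  return (cross << 16) + a_lo * b_lo;
}
```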
12. Unlocked software performance
Isn't it the compiler's job?
Autovectorization
Issues with:
memory constraints
the vectorisability of the code, which must be obvious
What about library functions?
The compiler may be confused or miss a critical clue
Conclusion
Explicit SIMD can guarantee the level of performance
Challenge: keeping multi-architecture SIMD code up to date
13. Unlocked software performance
Our approach
High-level abstraction
Designing a SIMD Domain-Specific Embedded Language (DSEL)
Abstracting SIMD registers as data blocks
High-level optimisation at expression scope
Integration within C++
Make SIMD code generic and cross-hardware
Integration with the standard library
Use modern C++ idioms
15. Unlocked software performance
SIMD abstraction
pack<T,N>
pack<T, N>: SIMD register of N elements of type T
pack<T>: same, with the optimal N for the current hardware
Behaves as a value of type T, but applies each operation to all its elements at once.
Constraints
T is a fundamental type
logical<T> is used to handle booleans
N must be a power of 2.
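To make the interface concrete, here is a small emulation of `pack<T, N>` on top of `std::array`. It is a hypothetical sketch, not the Boost.SIMD implementation: the default N assumes 128-bit registers, and only `+` and `*` are provided.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>

// Hypothetical emulation of the pack<T, N> interface, not the library's code.
// The default N assumes 128-bit registers.
template <class T, std::size_t N = 16 / sizeof(T)>
struct pack
{
  static_assert(std::is_arithmetic<T>::value, "T must be a fundamental arithmetic type");
  static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of 2");

  std::array<T, N> data;

  pack() : data{} {}
  pack(T v) : data{} { data.fill(v); }   // broadcast one scalar to every lane

  T operator[](std::size_t i) const { return data[i]; }
  static constexpr std::size_t size() { return N; }

  // Hidden friends, so that pack + T and T + pack also work via the broadcast.
  friend pack operator+(pack const& a, pack const& b)
  {
    pack r;
    for (std::size_t i = 0; i < N; ++i) r.data[i] = static_cast<T>(a.data[i] + b.data[i]);
    return r;
  }
  friend pack operator*(pack const& a, pack const& b)
  {
    pack r;
    for (std::size_t i = 0; i < N; ++i) r.data[i] = static_cast<T>(a.data[i] * b.data[i]);
    return r;
  }
};
```

With this sketch, `pack<float> r = a * 2.f + a;` reads like scalar code while every lane is updated.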
16. Unlocked software performance
Operations on pack
Basic Operators
All language operators are available: pack<T> ⊕ pack<T>, pack<T> ⊕ T, T ⊕ pack<T>
No conversion or promotion, though:
uint8_t(255) + uint8_t(1) = uint8_t(0)
Comparisons
==, !=, <, <=, > and >= perform SIMD comparisons.
compare_equal and compare_less return a single boolean.
Other properties
Models RandomAccessFusionSequence and RandomAccessRange
p[i] returns a proxy that gives access to the register's internal value
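The two points that differ most from scalar C++ can be emulated in a few lines: arithmetic stays in the element type, and there are two kinds of comparison. The types `u8x4` and `mask4` and the function names below are hypothetical stand-ins for a 4-lane pack of uint8_t and its logical counterpart.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using u8x4  = std::array<std::uint8_t, 4>;   // hypothetical 4-lane pack of uint8_t
using mask4 = std::array<bool, 4>;           // hypothetical stand-in for logical lanes

// Lane-wise addition that stays in uint8_t: the result wraps modulo 256.
inline u8x4 add(u8x4 const& a, u8x4 const& b)
{
  u8x4 r{};
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = static_cast<std::uint8_t>(a[i] + b[i]);
  return r;
}

// SIMD comparison: one boolean per lane, as the == operator on packs would give.
inline mask4 equal_lanes(u8x4 const& a, u8x4 const& b)
{
  mask4 m{};
  for (std::size_t i = 0; i < m.size(); ++i) m[i] = (a[i] == b[i]);
  return m;
}

// Whole-pack comparison: a single boolean, in the spirit of compare_equal.
inline bool equal_all(u8x4 const& a, u8x4 const& b) { return a == b; }
```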
17. Unlocked software performance
Memory access
Loading and Storing
aligned_load / aligned_store and load / store
Support for statically known misalignment
Support for conditional and sparse access (depending on the hardware)
Examples
aligned_load< pack<T, N> >(p, i) loads a pack from the aligned address p + i.
Main memory: ... 0D 0E 0F 10 11 12 13 14 15 16 17 18 ...
aligned_load< pack<float> >(0x10, 0) returns [10 11 12 13]
aligned_load< pack<T, N>, Offset >(p, i) loads a pack from address p + i, where p + i - Offset is an aligned address.
Main memory: ... 0D 0E 0F 10 11 12 13 14 15 16 17 18 ...
aligned_load< pack<float>, 2 >(0x10, 2) returns [12 13 14 15]
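One way to picture a load with statically known misalignment: read the two aligned blocks that surround the data, then pick the lanes. The sketch below emulates this on an int buffer. `aligned_load_sketch` is a hypothetical name, "aligned" means an element index that is a multiple of the pack width relative to `p`, and `p` itself is assumed aligned. How the library actually implements it on each architecture is not shown in the slides.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t width = 4;            // assumed pack width
using pack4 = std::array<int, width>;

// Emulated load of the pack starting at p + i, where p + i - Offset is aligned.
template <std::size_t Offset = 0>
pack4 aligned_load_sketch(int const* p, std::size_t i)
{
  static_assert(Offset < width, "Offset must be smaller than the pack width");
  assert(i >= Offset && (i - Offset) % width == 0);

  int const* lo = p + (i - Offset);   // aligned block holding the first lane
  int const* hi = lo + width;         // next aligned block
  pack4 r{};
  for (std::size_t k = 0; k < width; ++k)
    r[k] = (k + Offset < width) ? lo[k + Offset] : hi[k + Offset - width];
  return r;
}
```

Because Offset is a template parameter, the lane selection is fixed at compile time, which is what lets real hardware use a constant shuffle instead of a slower unaligned load.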
20. Unlocked software performance
Shuffle and Swizzle
General Principles
Elements of a SIMD register can be permuted by the hardware
Turns complex memory accesses into computations
Provided by the shuffle function
Examples:
// a = [ 1 2 3 4 ]
pack<float> a = enumerate< pack<float> >(1);
// b = [ 10 11 12 13 ]
pack<float> b = enumerate< pack<float> >(10);
// res = [ 4 12 0 10 ]
pack<float> res = shuffle<3,6,-1,4>(a,b);
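The index convention in this example can be read off the result: indices 0 to 3 select from a, 4 to 7 select from b, and -1 produces zero. A scalar emulation under that reading, with hypothetical names `pick` and `shuffle_sketch` and a fixed width of 4:

```cpp
#include <array>
#include <cstddef>

using pack4 = std::array<float, 4>;

// Selects one lane: indices 0..3 read from a, 4..7 read from b, -1 yields zero.
template <int I>
float pick(pack4 const& a, pack4 const& b)
{
  static_assert(I >= -1 && I < 8, "index out of range");
  return I < 0 ? 0.f
               : (I < 4 ? a[static_cast<std::size_t>(I)]
                        : b[static_cast<std::size_t>(I - 4)]);
}

// Compile-time permutation of two packs, emulated lane by lane.
template <int I0, int I1, int I2, int I3>
pack4 shuffle_sketch(pack4 const& a, pack4 const& b)
{
  return pack4{{ pick<I0>(a, b), pick<I1>(a, b), pick<I2>(a, b), pick<I3>(a, b) }};
}
```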
struct reverse_
{
  template <class I, class C>
  struct apply : mpl::int_<C::value - I::value - 1> {};
};
// res = [ n n-1 ... 2 1 ]
pack<float> res = shuffle<reverse_>(a);
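The meta-function form maps each destination index I to a source index, given the cardinal C of the pack. A self-contained sketch of that idea, using plain int template parameters instead of MPL integral constants (an assumption made to avoid the Boost dependency; `reverse_sketch` and `permute_sketch` are hypothetical names):

```cpp
#include <array>
#include <cstddef>

// Permutation described as a meta-function: lane I of the result reads
// lane apply<I, C>::value of the input, where C is the number of lanes.
struct reverse_sketch
{
  template <int I, int C>
  struct apply { static constexpr int value = C - I - 1; };
};

// Applies a permutation meta-function to a 4-lane pack.
template <class Perm>
std::array<float, 4> permute_sketch(std::array<float, 4> const& a)
{
  return {{ a[static_cast<std::size_t>(Perm::template apply<0, 4>::value)],
            a[static_cast<std::size_t>(Perm::template apply<1, 4>::value)],
            a[static_cast<std::size_t>(Perm::template apply<2, 4>::value)],
            a[static_cast<std::size_t>(Perm::template apply<3, 4>::value)] }};
}
```

The same meta-function works for any pack width, which is the advantage over a hard-coded index list.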
23. Unlocked software performance
Integration with the STL
Algorithms:
SIMD transform
SIMD fold
Use a polymorphic functor or lambda to mix scalar and SIMD code
Iterators:
Provide SIMD-aware traversals
boost::simd::(aligned_)(input/output_)iterator
boost::simd::direct_output_iterator
boost::simd::shifted_iterator
24. Unlocked software performance
Integration with the STL
std::vector<float, simd::allocator<float> > v(N);
simd::transform( v.begin(), v.end()
               , [](auto const& p)
                 {
                   return p * 2.f;
                 }
               );
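Why the functor has to be polymorphic becomes clear when the loop is written out: full packs are processed first, and whatever is left over goes through the scalar overload. The sketch below emulates this in place on a `std::vector<float>`. `transform_sketch`, `twice` and the width of 4 are assumptions for illustration, not the library's algorithm.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t W = 4;                 // assumed pack width
using pack4 = std::array<float, W>;

// Polymorphic functor: the same operation for scalars and (emulated) packs.
struct twice
{
  float operator()(float x) const { return x * 2.f; }
  pack4 operator()(pack4 const& p) const
  {
    pack4 r{};
    for (std::size_t i = 0; i < W; ++i) r[i] = p[i] * 2.f;
    return r;
  }
};

// In-place transform: full packs first, then a scalar loop for the remainder.
template <class F>
void transform_sketch(std::vector<float>& v, F f)
{
  std::size_t const full = v.size() - v.size() % W;
  std::size_t i = 0;
  for (; i < full; i += W)
  {
    pack4 p{};
    for (std::size_t k = 0; k < W; ++k) p[k] = v[i + k];   // emulated load
    p = f(p);
    for (std::size_t k = 0; k < W; ++k) v[i + k] = p[k];   // emulated store
  }
  for (; i < v.size(); ++i) v[i] = f(v[i]);                // scalar tail
}
```

A generic lambda such as the one on the slide plays the role of `twice`: one body, instantiated once for the pack type and once for the scalar type.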
26. Unlocked software performance
Hardware Optimisations
Problem:
Most SIMD hardware supports fused operations such as fma (fused multiply-add).
These optimisations must remain transparent to the user.
Boost.SIMD uses expression templates to detect and optimise these patterns automatically.
Examples:
a * b + c becomes fma(a, b, c)
a + b * c becomes fma(b, c, a)
!(a <= b) becomes is_nle(a, b)
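The mechanism can be shown on scalars: the multiplication returns a node instead of a result, and the addition that consumes the node emits the fused operation. This is a deliberately tiny sketch with hypothetical types `value` and `product`; the library's expression templates are far more general.

```cpp
#include <cmath>

// Hypothetical scalar wrapper; a real library would hold a SIMD register.
struct value { float v; };

// Node produced by a * b: the product is not computed yet.
struct product
{
  float a, b;
  operator value() const { return value{a * b}; }   // used when no addition follows
};

inline product operator*(value a, value b) { return product{a.v, b.v}; }

// product + value and value + product are both fused into a single fma.
inline value operator+(product p, value c) { return value{std::fma(p.a, p.b, c.v)}; }
inline value operator+(value c, product p) { return value{std::fma(p.a, p.b, c.v)}; }
inline value operator+(value a, value b)   { return value{a.v + b.v}; }
```

The user still writes `a * b + c`; overload resolution selects the fused path at compile time, so there is no runtime cost for the detection.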
27. Unlocked software performance
Supported Hardware
Open Source
Intel SSE2-4, AVX
PowerPC VMX
Proprietary
ARM Neon
Intel AVX2, XOP, FMA3, FMA4
Intel MIC
28. Unlocked software performance
Other functions ...
Arithmetic
saturated arithmetic
long multiplication
float/int conversion
round, floor, ceil, trunc
sqrt, cbrt
hypot
average
random
min/max
rounded division and remainder
Bitwise
select
andnot, ornot
popcnt
ffs
ror, rol
rshr, rshl
twopower
IEEE
ilogb, frexp
ldexp
next/prev
ulpdist
Predicates
comparison to zero
negated comparisons
is_unord, is_nan,
is_invalid
is_odd, is_even
majority
29. Unlocked software performance
Reduction and SWAR operations
Reduction
any, all
nbtrue
minimum/maximum,
posmin/posmax
sum
product, dot product
SWAR
group/split
splatted reduction
cumsum
sort
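The difference between the two families is the shape of the result: a reduction collapses a pack into one scalar, while a SWAR operation returns a pack. A scalar emulation of a few of them, where the exact semantics (splatted reduction broadcasts the reduced value, cumsum is an inclusive prefix sum) are assumptions inferred from the names:

```cpp
#include <array>
#include <cstddef>

using pack4 = std::array<int, 4>;

// Reduction: collapses all lanes into a single scalar.
inline int sum_lanes(pack4 const& p) { return p[0] + p[1] + p[2] + p[3]; }

// Reduction on a predicate: true if at least one lane is non-zero.
inline bool any_lane(pack4 const& p)
{
  for (std::size_t i = 0; i < p.size(); ++i)
    if (p[i] != 0) return true;
  return false;
}

// Splatted reduction: the reduced value is broadcast to every lane.
inline pack4 splatted_sum(pack4 const& p)
{
  pack4 r{};
  r.fill(sum_lanes(p));
  return r;
}

// Cumulative sum: lane i holds p[0] + ... + p[i].
inline pack4 cumsum(pack4 const& p)
{
  pack4 r = p;
  for (std::size_t i = 1; i < r.size(); ++i) r[i] += r[i - 1];
  return r;
}
```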
31. Unlocked software performance
Basic Functions
Single-precision trigonometric functions
Hardware: Core i7 Sandy Bridge, AVX; timings in cycles per value
Function   Range            std   Scalar   SIMD
exp        [−10, 10]        46    38       7
log        [−10, −10]       42    37       5
asin       [−1, 1]          40    35       13
cos        [−20π, 20π]      66    47       6
fast_cos   [−π/4, π/4]      32    9        1.3
32. Unlocked software performance
Julia set generator
Generate a fractal image using the Julia function
Purely compute-bound
Challenge: the workload depends on the pixel location
33. Unlocked software performance
Julia set generator
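template <class T> typename meta::as_integer<T>::type
julia(T const& a, T const& b)
{
  typedef typename meta::as_integer<T>::type i_t;
  typename meta::as_logical<T>::type mask;   // lanes that have not escaped yet
  i_t iter(0);                               // per-lane iteration count
  std::size_t i = 0;
  T x(0), y(0);                              // start the iteration at zero
  do {
    T x2 = x * x;
    T y2 = y * y;
    T xy = T(2) * x * y;
    x = x2 - y2 + a;
    y = xy + b;
    mask = x2 + y2 < T(4);
    iter = selinc(mask, iter);               // increment only the active lanes
  } while(any(mask) && i++ < 256);           // stop once every lane has escaped
  return iter;
}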
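The `any(mask)` test is what ties the workload to the pixel location: a pack keeps iterating until its slowest lane has escaped. A scalar emulation with four lanes makes this visible. `escape_counts` and the explicit `max_iter` parameter are hypothetical, and the per-lane `if` stands in for selinc on the mask.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t W = 4;                 // assumed pack width
using fpack = std::array<float, W>;
using ipack = std::array<int, W>;

// Emulated SIMD escape-time loop: every lane runs the same instructions,
// and the loop ends only when no lane is still inside the radius-2 disc.
inline ipack escape_counts(fpack const& a, fpack const& b, int max_iter)
{
  fpack x{}, y{};
  ipack iter{};
  bool any_active = true;
  for (int i = 0; any_active && i < max_iter; ++i)
  {
    any_active = false;
    for (std::size_t k = 0; k < W; ++k)
    {
      float const x2 = x[k] * x[k];
      float const y2 = y[k] * y[k];
      float const xy = 2.f * x[k] * y[k];
      x[k] = x2 - y2 + a[k];
      y[k] = xy + b[k];
      bool const inside = x2 + y2 < 4.f;              // this lane's mask bit
      if (inside) { ++iter[k]; any_active = true; }   // selinc, then any
    }
  }
  return iter;
}
```

When neighbouring pixels escape after very different iteration counts, the fast lanes idle while the slow one finishes, which limits the achievable speedup.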
34. Unlocked software performance
Julia set generator
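[Figure: Julia set generator, cycles per element (cpe) versus image size (256 x 256, 512 x 512, 1024 x 1024, 2048 x 2048) for scalar SSE2, simd SSE2, simd AVX and simd AVX2. Speedup labels on the bars: x2.93, x2.99, x3.02, x3.03; x6.64, x6.94, x6.09, x6.16; x6.52, x6.81, x5.97, x6.05.]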
35. Unlocked software performance
Motion Detection
Based on Manzanera’s Sigma Delta algorithm
Background subtraction
Pixel intensities modeled as Gaussian distributions
Challenge: low arithmetic intensity
36. Unlocked software performance
Motion Detection
template <typename T>
T sigma_delta(T& bkg, T const& frm, T& var)
{
  // Move the background estimate one step towards the current frame
  bkg = selinc(bkg < frm, seldec(bkg > frm, bkg));
  T dif = dist(bkg, frm);
  T mul = muls(dif, 3);
  // Update the variance estimate only where the pixel differs from the background
  var = if_else( dif != T(0)
               , selinc(var < mul, seldec(var > mul, var))
               , var
               );
  return if_zero_else_one( dif < var );
}
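For readers unfamiliar with the helper functions, here is a scalar reading of one Sigma-Delta step on a single pixel. The helper semantics are assumptions inferred from their names (conditional increment and decrement, absolute distance, multiplication saturated at 255 for 8-bit pixels, and 0 when the condition holds, 1 otherwise); `muls_u8` and `sigma_delta_scalar` are hypothetical names.

```cpp
#include <algorithm>
#include <cstdlib>

// Scalar readings of the helpers (assumed semantics, see the text above).
inline int selinc(bool c, int x) { return c ? x + 1 : x; }          // increment where c holds
inline int seldec(bool c, int x) { return c ? x - 1 : x; }          // decrement where c holds
inline int dist(int a, int b)    { return std::abs(a - b); }        // absolute distance
inline int muls_u8(int a, int b) { return std::min(a * b, 255); }   // saturated at 8 bits

// One Sigma-Delta update for one pixel; returns 1 when motion is detected.
inline int sigma_delta_scalar(int& bkg, int frm, int& var)
{
  bkg = selinc(bkg < frm, seldec(bkg > frm, bkg));
  int const dif = dist(bkg, frm);
  int const mul = muls_u8(dif, 3);
  if (dif != 0)
    var = selinc(var < mul, seldec(var > mul, var));
  return dif < var ? 0 : 1;
}
```

Each step is a handful of compares and unit increments per pixel, which illustrates the low arithmetic intensity named as the challenge.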
37. Unlocked software performance
Motion Detection
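[Figure: Motion detection, frames per second (FPS, axis in units of 10^4) versus frame size (480 x 640 @5, 600 x 800 @5, 1024 x 1280 @5, 1080 x 1920 @5, 2160 x 3840 @5) for scalar SSE2, simd SSE2, simd AVX and simd AVX2. Speedup labels on the bars: x3.80, x6.63, x3.64, x5.63, x3.50, x6.69, x6.53, x5.75, x5.74, x5.71, x26.26, x26.96, x19.05, x17.19, x10.81.]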
38. Unlocked software performance
Sparse Tridiagonal Solver
Solve Ax = b with sparse A and multiple x
Application in fluid mechanics
Challenge: using SIMD despite the sparsity
Solution: shuffle for local densification
41. Unlocked software performance
Conclusion
Boost.SIMD
We can design a C++ library for low level performance primitives
A lot of success with a lot of different applications
Find us on https://github.com/jfalcou/nt2
Tests, comments and feedback welcome
Soon
Rewrite before submission to Boost is in progress
More architectures to come!