Pragmatic Speedup with Boost.SIMD
Unlocked software performance
Joel Falcou
NumScale
February 24, 2016

Challenges of SIMD programming

Parallelism is everywhere
The Obvious One
Multi-cores
Many-cores
Distributed systems
The Embedded One
Pipeline
Superscalar, out-of-order CPUs
SIMD instruction sets

What is SIMD?
[Figure: SISD versus SIMD, showing instructions applied to data to produce results]
Principles
Single Instruction, Multiple Data
Each operation is applied to N values held in a single register of fixed size (128, 256 or 512 bits)
Can be up to N times faster than the regular scalar ALU

Why use SIMD?
Speedups of 2x to 16x that can be combined with other sources of parallelism
Reduces computing time without changing the infrastructure
Gives great results for all kinds of regular and less regular computing patterns

1001 flavors of SIMD
Intel x86
MMX: 64 bits; float, double
SSE: 128 bits; float
SSE2: 128 bits; int8, int16, int32, int64, double
SSE3, SSSE3
SSE4a (AMD)
SSE4.1, SSE4.2
AVX: 256 bits; float, double
AVX2: 256 bits; int8, int16, int32, int64
FMA3
FMA4, XOP (AMD)
MIC: 512 bits; float, double, int32, int64
PowerPC
AltiVec: 128 bits; int8, int16, int32, int64, float
Cell SPU and VSX: 128 bits; int8, int16, int32, int64, float, double
QPX: 512 bits; double
ARM
VFP: 64 bits; float, double
NEON: 64 bits and 128 bits; float, int8, int16, int32, int64

SIMD the good ol' way: int32 * int32 -> int32
// NEON
return vmul_s32(a0, a1);  // 64-bit
return vmulq_s32(a0, a1); // 128-bit

// SSE4.1
return _mm_mullo_epi32(a0, a1);

// SSE2
return
  _mm_or_si128(
    _mm_and_si128(
      _mm_mul_epu32(a0, a1),
      _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)
    ),
    _mm_slli_si128(
      _mm_and_si128(
        _mm_mul_epu32(_mm_srli_si128(a0, 4), _mm_srli_si128(a1, 4)),
        _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)
      ),
      4
    )
  );
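The SSE2 sequence above is easier to follow as a scalar model. The sketch below uses no intrinsics: `u32x4`, `mul_keep_low` and `toy_mullo` are hypothetical names introduced only for illustration, and lanes are treated as unsigned because the low 32 bits of a product are the same for signed and unsigned operands.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical 4-lane register model, not an SSE type.
using u32x4 = std::array<std::uint32_t, 4>;

// Models _mm_mul_epu32 on one 64-bit slot followed by the 0xffffffff mask:
// a full 32 x 32 -> 64 bit unsigned product of which only the low half is kept.
inline std::uint32_t mul_keep_low(std::uint32_t x, std::uint32_t y)
{
  std::uint64_t wide = static_cast<std::uint64_t>(x) * y;
  return static_cast<std::uint32_t>(wide & 0xffffffffu);
}

// Scalar model of the SSE2 sequence: even lanes first, then odd lanes.
inline u32x4 toy_mullo(u32x4 const& a, u32x4 const& b)
{
  u32x4 r{};
  // _mm_mul_epu32(a0, a1) only reads lanes 0 and 2.
  r[0] = mul_keep_low(a[0], b[0]);
  r[2] = mul_keep_low(a[2], b[2]);
  // Shifting the inputs right by 4 bytes brings lanes 1 and 3 into
  // positions 0 and 2; shifting the masked product left by 4 bytes
  // moves the results back to lanes 1 and 3 before the final OR.
  r[1] = mul_keep_low(a[1], b[1]);
  r[3] = mul_keep_low(a[3], b[3]);
  return r;
}
```

Two multiplies, two masks, three shifts and one OR replace what SSE4.1 does in a single instruction, which is the maintenance burden the slide points at.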

// Altivec
// reinterpret as u16
short0 = (__vector unsigned short)a0;
short1 = (__vector unsigned short)a1;
// shifting constant (only the low 5 bits are used, so -16 shifts by 16)
shift = vec_splat_u32(-16);
sf = vec_rl(a1, shift);
// Compute high part of the product
high = vec_msum(short0, (__vector unsigned short)sf, vec_splat_u32(0));
// Complete by adding low part of the 16 bits product
return vec_add(vec_sl(high, shift), vec_mulo(short0, short1));

Isn't it the compiler's job?
Autovectorization
Issues with:
memory constraints
the vectorizability of the code must be obvious
What about library functions?
The compiler may be confused or miss a critical clue
Conclusion
Explicit SIMD can guarantee the level of performance
Challenge: keeping multi-architecture SIMD code up to date

Our approach
High-level abstraction
Designing a SIMD Domain-Specific Embedded Language (DSEL)
Abstracting SIMD registers as data blocks
High-level optimisation at expression scope
Integration within C++
Make SIMD code generic and cross-hardware
Integration with the standard library
Use modern C++ idioms

Boost.SIMD†
† Boost.SIMD is a candidate for inclusion in Boost

SIMD abstraction
pack<T,N>
pack<T, N>: SIMD register of N elements of type T
pack<T>: same, with an optimal N for the current hardware
Behaves as a value of type T but applies operations to all its elements at once.
Constraints
T is a fundamental type
logical<T> is used to handle booleans
N must be a power of 2.
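To make the "value of type T applied to all elements at once" idea concrete, here is a minimal scalar model. It is not the Boost.SIMD implementation: `toy_pack` and `splat` are hypothetical names, and the lanes live in a `std::array` rather than in a hardware register.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a SIMD register: N lanes of type T.
template <class T, std::size_t N>
struct toy_pack
{
  static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of 2");
  std::array<T, N> lanes;

  // Broadcast one scalar into every lane.
  static toy_pack splat(T v)
  {
    toy_pack p;
    p.lanes.fill(v);
    return p;
  }
};

// pack + pack: one logical operation applied to every lane.
template <class T, std::size_t N>
toy_pack<T, N> operator+(toy_pack<T, N> const& a, toy_pack<T, N> const& b)
{
  toy_pack<T, N> r;
  for (std::size_t i = 0; i < N; ++i) r.lanes[i] = a.lanes[i] + b.lanes[i];
  return r;
}

// pack + scalar: the scalar is broadcast first, then added lane by lane.
template <class T, std::size_t N>
toy_pack<T, N> operator+(toy_pack<T, N> const& a, T b)
{
  return a + toy_pack<T, N>::splat(b);
}
```

A real pack maps the loop body to a single instruction; the model only shows the semantics a user of the type can rely on.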

Operations on pack
Basic Operators
All language operators are available: pack<T> ⊕ pack<T>, pack<T> ⊕ T, T ⊕ pack<T>
No conversion or promotion though:
uint8_t(255) + uint8_t(1) = uint8_t(0)
Comparisons
==, !=, <, <=, > and >= perform SIMD comparisons.
compare_equal and compare_less return a single boolean.
Other properties
Models RandomAccessFusionSequence and RandomAccessRange
p[i] returns a proxy to access the register's internal value
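The two points that most often surprise newcomers, the absence of integer promotion and the difference between a lane-wise comparison and a whole-pack comparison, can be sketched with plain arrays. The names `u8x4`, `mask4`, `add`, `equal_lanes` and `equal_all` are hypothetical and only model the behaviour described on the slide.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using u8x4  = std::array<std::uint8_t, 4>;  // hypothetical 4-lane pack of uint8_t
using mask4 = std::array<bool, 4>;          // hypothetical per-lane boolean result

// Lane-wise addition that stays in uint8_t: results wrap modulo 256.
inline u8x4 add(u8x4 const& a, u8x4 const& b)
{
  u8x4 r{};
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = static_cast<std::uint8_t>(a[i] + b[i]);
  return r;
}

// Models operator==: one boolean per lane.
inline mask4 equal_lanes(u8x4 const& a, u8x4 const& b)
{
  mask4 m{};
  for (std::size_t i = 0; i < m.size(); ++i) m[i] = (a[i] == b[i]);
  return m;
}

// Models compare_equal: a single boolean for the whole pack.
inline bool equal_all(u8x4 const& a, u8x4 const& b)
{
  return a == b;
}
```

In scalar C++ the expression `uint8_t(255) + uint8_t(1)` is promoted to `int` and yields 256; the lane-wise model wraps to 0 because the result has to fit back into the same register lane.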

Memory access
Loading and Storing
aligned_load/aligned_store and load/store
Support for statically known misalignment
Support for conditional and sparse access (depending on hardware support)
Examples
aligned_load< pack<T, N> >(p, i) loads a pack from the aligned address p + i.
Main memory: ... 0D 0E 0F 10 11 12 13 14 15 16 17 18 ...
aligned_load<pack<float>>(0x10, 0) returns [10 11 12 13]
aligned_load< pack<T, N>, Offset >(p, i) loads a pack from the address p + i, where p + i - Offset is aligned.
Main memory: ... 0D 0E 0F 10 11 12 13 14 15 16 17 18 ...
aligned_load<pack<float>, 2>(0x10, 2) returns [12 13 14 15]
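One way to read the two examples is that a statically known misalignment lets the library build the result from aligned loads only. The sketch below models that reading with element indices instead of byte addresses; `toy_aligned_load` and `toy_misaligned_load` are hypothetical names, and the two-load strategy is an assumption about how such a load can be implemented, not a description of Boost.SIMD internals.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t lanes = 4;             // assumed pack cardinal
using pack4 = std::array<int, lanes>;

// In this toy model an index is "aligned" when it sits on a pack boundary.
constexpr bool is_aligned(std::size_t k) { return k % lanes == 0; }

// Models aligned_load<pack>(p, i): p + i must be aligned.
inline pack4 toy_aligned_load(int const* mem, std::size_t p, std::size_t i)
{
  assert(is_aligned(p + i));
  pack4 r{};
  for (std::size_t k = 0; k < lanes; ++k) r[k] = mem[p + i + k];
  return r;
}

// Models aligned_load<pack, Offset>(p, i): p + i - Offset must be aligned.
// The result starts at p + i and is stitched together from two aligned loads,
// so the caller must make sure both neighbouring packs are readable.
template <std::size_t Offset>
pack4 toy_misaligned_load(int const* mem, std::size_t p, std::size_t i)
{
  static_assert(Offset < lanes, "Offset must be smaller than the cardinal");
  std::size_t start = p + i - Offset;
  pack4 lo = toy_aligned_load(mem, start, 0);
  pack4 hi = toy_aligned_load(mem, start + lanes, 0);
  pack4 r{};
  for (std::size_t k = 0; k < lanes; ++k)
    r[k] = (k + Offset < lanes) ? lo[k + Offset] : hi[k + Offset - lanes];
  return r;
}
```

With memory filled so that each element equals its own index, the model reproduces both slide examples: index 0x10 with no offset gives elements 0x10 to 0x13, and index 0x12 with Offset 2 gives elements 0x12 to 0x15.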

Shuffle and Swizzle
General Principles
Elements of a SIMD register can be permuted by the hardware
Turns complex memory accesses into computations
Provided by the shuffle function
Examples:
// a = [ 1 2 3 4 ]
pack<float> a = enumerate< pack<float> >(1);
// b = [ 10 11 12 13 ]
pack<float> b = enumerate< pack<float> >(10);
// res = [ 4 12 0 10 ]
pack<float> res = shuffle<3, 6, -1, 4>(a, b);
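The index convention can be read off the example: indices 0 to 3 select from a, indices 4 to 7 select from b, and -1 produces a zero. A scalar sketch of that convention follows; `f32x4`, `pick` and `toy_shuffle` are hypothetical names, not the library's.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using f32x4 = std::array<float, 4>;   // hypothetical 4-lane pack of float

// Selects one element: -1 yields zero, 0..3 reads a, 4..7 reads b.
inline float pick(int index, f32x4 const& a, f32x4 const& b)
{
  if (index < 0) return 0.f;
  return index < 4 ? a[static_cast<std::size_t>(index)]
                   : b[static_cast<std::size_t>(index - 4)];
}

// Compile-time indices, as in the slide: one index per destination lane.
template <int I0, int I1, int I2, int I3>
f32x4 toy_shuffle(f32x4 const& a, f32x4 const& b)
{
  static_assert(I0 >= -1 && I0 < 8, "index out of range");
  static_assert(I1 >= -1 && I1 < 8, "index out of range");
  static_assert(I2 >= -1 && I2 < 8, "index out of range");
  static_assert(I3 >= -1 && I3 < 8, "index out of range");
  return {{pick(I0, a, b), pick(I1, a, b), pick(I2, a, b), pick(I3, a, b)}};
}
```

Because the indices are template parameters, a real implementation can pick the cheapest permutation instruction for each pattern at compile time.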
struct reverse_
{
  template <class I, class C>
  struct apply : mpl::int_<C::value - I::value - 1> {};
};
// res = [n n-1 ... 2 1]
pack<float> res = shuffle<reverse_>(a);
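The meta-function form describes a permutation by a formula from destination index I and cardinal C to source index. The same idea can be sketched without MPL; `reverse_perm`, `source` and `toy_permute` are hypothetical names used only here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical permutation policy: maps destination lane i of an n-lane
// pack to the source lane it reads, using the slide's formula C - I - 1.
struct reverse_perm
{
  static constexpr std::size_t source(std::size_t i, std::size_t n)
  {
    return n - i - 1;
  }
};

// Applies the policy to every destination lane.
template <class Perm, class T, std::size_t N>
std::array<T, N> toy_permute(std::array<T, N> const& a)
{
  std::array<T, N> r{};
  for (std::size_t i = 0; i < N; ++i) r[i] = a[Perm::source(i, N)];
  return r;
}
```

Since the formula depends on the cardinal, one policy works unchanged for 4, 8 or 16 lanes.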

Integration with the STL
Algorithms:
SIMD transform
SIMD fold
Use a polymorphic functor or lambda to mix scalar and SIMD code
Iterators:
Provide SIMD-aware traversals
boost::simd::(aligned_)(input/output_)iterator
boost::simd::direct_output_iterator
boost::simd::shifted_iterator

std::vector<float, simd::allocator<float> > v(N);
simd::transform( v.begin(), v.end()
               , [](auto const& p)
                 {
                   return p * 2.f;
                 }
               );
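Why the functor has to be polymorphic becomes clear once the loop is written out: full packs go through the SIMD overload and the leftover elements go through the scalar one. The sketch below is a model under that assumption; `f32x4`, `toy_transform` and `twice` are hypothetical names, and the in-place signature is chosen for brevity rather than copied from the library.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t width = 4;                      // assumed pack cardinal
struct f32x4 { std::array<float, width> lanes; };     // hypothetical pack

// Lane-wise product with a scalar, so the same functor body
// compiles for both f32x4 and float.
inline f32x4 operator*(f32x4 const& p, float s)
{
  f32x4 r{};
  for (std::size_t i = 0; i < width; ++i) r.lanes[i] = p.lanes[i] * s;
  return r;
}

// Toy in-place transform: full packs first, then a scalar tail.
template <class F>
void toy_transform(std::vector<float>& v, F f)
{
  std::size_t i = 0;
  for (; i + width <= v.size(); i += width)
  {
    f32x4 p{};
    for (std::size_t k = 0; k < width; ++k) p.lanes[k] = v[i + k];  // load
    p = f(p);                                                       // SIMD call
    for (std::size_t k = 0; k < width; ++k) v[i + k] = p.lanes[k];  // store
  }
  for (; i < v.size(); ++i) v[i] = f(v[i]);                         // scalar call
}

// Polymorphic functor: one body, instantiated for packs and for scalars.
struct twice
{
  template <class T>
  T operator()(T const& x) const { return x * 2.f; }
};
```

A generic lambda such as the one on the slide plays the same role as `twice`: its call operator is a template, so it is instantiated once per argument type.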

// Averages three neighbouring values; defined before use.
struct average
{
  template <class T>
  typename T::value_type operator()(T const& t) const
  {
    typename T::value_type d(1./3);
    return (t[0] + t[1] + t[2]) * d;
  }
};
std::vector<float, simd::allocator<float> > in(N), o(N);
std::transform( simd::shifted_iterator<3>(in.begin())
              , simd::shifted_iterator<3>(in.end())
              , simd::aligned_output_begin(o.begin())
              , average()
              );

Hardware Optimisations
Problem:
Most SIMD hardware supports fused operations like fma.
Those optimisations must remain transparent to the user
We use Expression Templates so Boost.SIMD auto-optimizes those patterns.
Examples:
a * b + c becomes fma(a, b, c)
a + b * c becomes fma(b, c, a)
!(a < b) becomes is_nlt(a, b)
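The rewriting of `a * b + c` into a fused operation can be sketched with a very small expression template. This is a deliberately reduced model, not the library's machinery: `value`, `product` and `fused_calls` are hypothetical names, and `std::fma` stands in for the hardware instruction.

```cpp
#include <cassert>
#include <cmath>

struct value { float v; };

// Counts how often the fused path is taken (for demonstration only).
inline int& fused_calls() { static int n = 0; return n; }

// Node produced by value * value: it remembers its operands instead of
// computing, so a following + can still see the multiplication.
struct product
{
  float lhs, rhs;
  // Evaluates the product when it is used on its own.
  operator value() const { return value{lhs * rhs}; }
};

inline product operator*(value a, value b) { return product{a.v, b.v}; }
inline value   operator+(value a, value b) { return value{a.v + b.v}; }

// a * b + c and c + a * b are recognised by overload resolution and fused.
inline value operator+(product p, value c)
{
  ++fused_calls();
  return value{std::fma(p.lhs, p.rhs, c.v)};
}
inline value operator+(value c, product p)
{
  ++fused_calls();
  return value{std::fma(p.lhs, p.rhs, c.v)};
}
// a * b + c * d: fuse the left product, evaluate the right one first.
inline value operator+(product p, product q)
{
  ++fused_calls();
  return value{std::fma(p.lhs, p.rhs, q.lhs * q.rhs)};
}
```

The user still writes ordinary arithmetic; the choice between a separate multiply and add and a fused instruction is made by the types of the intermediate expressions.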

Supported Hardware
Open Source
Intel SSE2-4, AVX
PowerPC VMX
Proprietary
ARM NEON
Intel AVX2, XOP, FMA3, FMA4
Intel MIC

Other functions ...
Arithmetic
saturated arithmetic
long multiplication
float/int conversion
round, floor, ceil, trunc
sqrt, cbrt
hypot
average
random
min/max
rounded division and remainder
Bitwise
select
andnot, ornot
popcnt
ffs
ror, rol
rshr, rshl
twopower
IEEE
ilogb, frexp
ldexp
next/prev
ulpdist
Predicates
comparison to zero
negated comparisons
is_unord, is_nan, is_invalid
is_odd, is_even
majority

Reduction and SWAR operations
Reduction
any, all
nbtrue
minimum/maximum, posmin/posmax
sum
product, dot product
SWAR
group/split
splatted reduction
cumsum
sort
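The difference between the two families is the shape of the result: a reduction collapses a pack into one scalar, while a SWAR operation works across lanes but still returns a pack. A scalar sketch of a few of the listed operations follows; all `toy_` names are hypothetical and the semantics are inferred from the function names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using f32x4 = std::array<float, 4>;   // hypothetical 4-lane pack of float
using mask4 = std::array<bool, 4>;    // hypothetical 4-lane logical pack

// Reductions: pack in, scalar out.
inline bool toy_any(mask4 const& m) { return m[0] || m[1] || m[2] || m[3]; }
inline bool toy_all(mask4 const& m) { return m[0] && m[1] && m[2] && m[3]; }
inline int  toy_nbtrue(mask4 const& m)
{
  int n = 0;
  for (bool b : m) n += b ? 1 : 0;
  return n;
}
// Pairwise order, as a log2(N)-step horizontal add would do it.
inline float toy_sum(f32x4 const& p) { return (p[0] + p[1]) + (p[2] + p[3]); }

// SWAR: pack in, pack out.
inline f32x4 toy_splatted_sum(f32x4 const& p)
{
  float s = toy_sum(p);
  return {{s, s, s, s}};              // the reduction result in every lane
}
inline f32x4 toy_cumsum(f32x4 const& p)
{
  f32x4 r{};
  float running = 0.f;
  for (std::size_t i = 0; i < r.size(); ++i) { running += p[i]; r[i] = running; }
  return r;
}
```

Keeping a reduction result splatted in a pack avoids a round trip through a scalar register when the next operation is again lane-wise.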

Basic Functions
Single-precision trigonometric and exponential functions
Hardware: Core i7 Sandy Bridge, AVX; results in cycles per value
Function    Range           std    Scalar    SIMD
exp         [−10, 10]       46     38        7
log         [−10, −10]      42     37        5
asin        [−1, 1]         40     35        13
cos         [−20π, 20π]     66     47        6
fast_cos    [−π/4, π/4]     32     9         1.3

Julia set generator
Generate a fractal image using the Julia function
Purely compute-bound
Challenge: workload depends on pixel location

Julia set generator
template <class T> typename meta::as_integer<T>::type
julia(T const& a, T const& b)
{
  typename meta::as_integer<T>::type iter(0);
  std::size_t i = 0;
  bool active;
  T x(0), y(0);
  do {
    T x2 = x * x;
    T y2 = y * y;
    T xy = T(2) * x * y;
    x = x2 - y2 + a;
    y = xy + b;
    // only lanes still inside the radius-2 disc keep counting
    iter = selinc(x2 + y2 < T(4), iter);
    active = any(x2 + y2 < T(4));
  } while(active && i++ < 256);
  return iter;
}
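The challenge named earlier, a workload that depends on pixel location, shows up as lanes that finish at different times. The scalar model below makes this visible: every lane runs until the slowest one is done, and a per-lane mask decides whether the counter still advances. `toy_julia` and the array types are hypothetical, and the conditional increment stands in for what `selinc` appears to do in the slide.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t lanes = 4;              // assumed pack cardinal
using f32x4 = std::array<float, lanes>;
using i32x4 = std::array<int, lanes>;

inline i32x4 toy_julia(f32x4 const& a, f32x4 const& b)
{
  f32x4 x{}, y{};
  i32x4 iter{};
  std::size_t i = 0;
  bool any_active;
  do
  {
    any_active = false;
    for (std::size_t k = 0; k < lanes; ++k)
    {
      float x2 = x[k] * x[k];
      float y2 = y[k] * y[k];
      float xy = 2.f * x[k] * y[k];
      x[k] = x2 - y2 + a[k];
      y[k] = xy + b[k];
      bool mask = x2 + y2 < 4.f;        // lane-wise comparison
      if (mask) ++iter[k];              // masked increment
      any_active = any_active || mask;  // horizontal "any"
    }
  } while (any_active && i++ < 256);
  return iter;
}
```

Lanes that have already escaped keep being computed, but their mask is false so their counters stay frozen; the loop only stops once no lane is active or the iteration cap is reached. This is why speedups on such kernels stay below the lane count.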

Julia set generator
[Figure: bar chart of cycles per element (cpe) versus image size (256 x 256, 512 x 512, 1024 x 1024, 2048 x 2048). Series: scalar SSE2, simd SSE2, simd AVX, simd AVX2. Speedup labels on the bars: x2.93, x2.99, x3.02, x3.03, x6.64, x6.94, x6.09, x6.16, x6.52, x6.81, x5.97, x6.05.]

Motion Detection
Based on Manzanera's Sigma-Delta algorithm
Background subtraction
Pixel intensity modeled as Gaussian distributions
Challenge: low arithmetic intensity

Motion Detection
template <typename T>
T sigma_delta(T& bkg, T const& frm, T& var)
{
  bkg = selinc(bkg < frm, seldec(bkg > frm, bkg));
  T dif = dist(bkg, frm);
  T mul = muls(dif, 3);
  var = if_else( dif != T(0)
               , selinc(var < mul, seldec(var > mul, var))
               , var
               );
  return if_zero_else_one( dif < var );
}
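A per-pixel scalar version helps when checking the SIMD kernel. The sketch below follows the same steps on a single 8-bit pixel; `pixel` and `toy_sigma_delta` are hypothetical names, and the meaning of `dist` (absolute difference) and `muls` (saturated multiply) is inferred from their names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using pixel = std::uint8_t;

// One Sigma-Delta step for a single pixel; returns 1 when motion is detected.
inline pixel toy_sigma_delta(pixel& bkg, pixel frm, pixel& var)
{
  // Move the background estimate one step toward the current frame.
  if (bkg < frm)      ++bkg;
  else if (bkg > frm) --bkg;

  // Absolute difference between background and frame.
  pixel dif = static_cast<pixel>(bkg > frm ? bkg - frm : frm - bkg);

  // Three times the difference, saturated to the pixel range.
  pixel mul = static_cast<pixel>(std::min(3 * dif, 255));

  // Move the variance one step toward mul, but only where something changed.
  if (dif != 0)
  {
    if (var < mul)      ++var;
    else if (var > mul) --var;
  }

  // 0 when the difference is below the variance, 1 otherwise.
  return dif < var ? 0 : 1;
}
```

Every branch here becomes a mask-and-select in the SIMD version, which is why the kernel has so few arithmetic operations per byte loaded.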

Motion Detection
[Figure: bar chart of frames per second (FPS, axis scaled by 10^4) versus frame size (480 x 640, 600 x 800, 1024 x 1280, 1080 x 1920, 2160 x 3840, each marked @5). Series: scalar SSE2, simd SSE2, simd AVX, simd AVX2. Speedup labels on the bars: x3.80, x6.63, x3.64, x5.63, x3.50, x6.69, x6.53, x5.75, x5.74, x5.71, x26.26, x26.96, x19.05, x17.19, x10.81.]

Sparse Tridiagonal Solver
Solve Ax = b with sparse A and multiple x
Application in fluid mechanics
Challenge: using SIMD despite the sparsity
Solution: shuffle for local densification

Conclusion
Boost.SIMD
We can design a C++ library for low-level performance primitives
Many successes across many different applications
Find us on https://github.com/jfalcou/nt2
Tests, comments and feedback welcome
Soon
A rewrite is in progress before submission to Boost
More architectures to come!