Designing C++ portable SIMD support

Joel Falcou
NumScale
CppCon 2016

NumScale in a few words
Our company
French start-up specialized in software performance
We sell C++ libraries to master modern hardware performance
Consulting & training on all things C++ or HPC
NumScale and C++
Member of the ISO C++ French National Body
Enthusiastic user & contributor to OSS projects
Involved in the European C++ community

Is SIMD that obscure?
Let's have a poll
Who needs performance in their daily job?
Who knows about parallel programming?
Who knows about SIMD/multimedia extensions?
Who uses SIMD extensions like SSE, AVX, VMX or NEON?
Who has nightmares because of those?

What is SIMD? - A French cuisine approach
Some workload to process
A regular CPU: Single Instruction, Single Data processing
A SIMD-enabled CPU: Single Instruction, Multiple Data processing

What is SIMD? - For real
[Figure: instructions, data and results for SISD and SIMD processing]
Principles
Wide registers store N > 1 values.
Special instructions process those registers.
Code uses a blitter-like approach.
Benefits
Speed-up of N on cache-hot data
Avoid premature scale-out
Maximize FLOPS/Watt
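
As a minimal sketch of the principle above, assuming a hypothetical width of four lanes and plain arrays instead of real registers, the code below contrasts a scalar loop with a loop that handles four values per step and finishes with a scalar tail. The names `add_scalar` and `add_blocked` are illustrative only; on real hardware the inner four-lane block would map to a single SIMD instruction.

```cpp
#include <cstddef>

// One element per iteration (SISD).
void add_scalar(float* a, float const* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        a[i] += b[i];
}

// Four elements per iteration, mimicking a 128-bit register of floats.
void add_blocked(float* a, float const* b, std::size_t n)
{
    constexpr std::size_t lanes = 4;       // assumed register width
    std::size_t i = 0;
    for (; i + lanes <= n; i += lanes)     // main body: whole blocks
        for (std::size_t k = 0; k < lanes; ++k)
            a[i + k] += b[i + k];
    for (; i < n; ++i)                     // tail: leftover elements
        a[i] += b[i];
}
```

Both functions produce the same result; the blocked form only makes the "N values per instruction" structure explicit.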

1,001 flavors of SIMD
Intel x86
  MMX               64 bits           int8, int16, int32
  SSE               128 bits          float
  SSE2              128 bits          int8, int16, int32, int64, double
  SSE3, SSSE3
  SSE4a (AMD)
  SSE4.1, SSE4.2
  AVX               256 bits          float, double
  AVX2              256 bits          int8, int16, int32, int64
  FMA3
  FMA4, XOP (AMD)
  AVX512            512 bits          float, double, int32, int64
PowerPC
  VMX               128 bits          int8, int16, int32, int64, float
  VSX               128 bits          int8, int16, int32, int64, float, double
  QPX               256 bits          double
ARM
  VFP               64 bits           float, double
  NEON              64 and 128 bits   double, float, int8, int16, int32, int64
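
With this many flavors, code usually has to find out at compile time which one is enabled. The sketch below relies on the predefined macros that GCC, Clang and compatible compilers set for each extension; the function name `simd_flavor` is hypothetical, and the list is deliberately incomplete.

```cpp
// Returns the name of the widest SIMD extension enabled at compile time.
// Relies on predefined compiler macros; falls back to "scalar" if none is set.
inline char const* simd_flavor()
{
#if defined(__AVX512F__)
    return "AVX512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_2__)
    return "SSE4.2";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    return "NEON";
#elif defined(__VSX__)
    return "VSX";
#elif defined(__ALTIVEC__)
    return "VMX";
#else
    return "scalar";
#endif
}
```

The result depends on the compiler flags in use (for example `-mavx2` or `-march=native` with GCC and Clang), not on the machine the binary later runs on.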

The Many Ways to Vectorize
Implicit vectorization
Auto-vectorization
Compiler hints
Explicit vectorization
Language extensions
SIMD intrinsics libraries
Vector intrinsics
Inline assembly (let's just say no right now)

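Auto-vectorizer
Compile-time analysis of loop nests
May use special hints
Only safe transformations are applied

    // restrict is a C99 keyword; C++ compilers spell it __restrict.
    template <typename T>
    void f(T* __restrict a, T const* __restrict b, int size)
    {
        // Compiler-specific hint: ignore assumed loop-carried dependencies.
    #pragma ivdep
        for (int i = 0; i < size; ++i)
            a[i] += b[i];
    }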

OpenMP 4
Principle
Flag a loop as one that must be vectorized
Support for reductions & SIMD functions
User is in charge of checking validity

    template <typename T>
    T f(T* a, T const* b, int size)
    {
        T res = 0;
        // Vectorize the loop; res is accumulated as a sum reduction.
    #pragma omp simd reduction(+:res)
        for (int i = 0; i < size; ++i)
        {
            a[i] += b[i];
            res  += b[i];
        }
        return res;
    }
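
The "SIMD functions" part of OpenMP 4 can be sketched as follows: `#pragma omp declare simd` asks the compiler to also generate a vector version of a function so that it can be called from a vectorized loop. The function names `axpy` and `saxpy` are illustrative. Without OpenMP support enabled (for example `-fopenmp` or `-fopenmp-simd` on GCC and Clang) the pragmas are ignored and the code simply runs as scalar code.

```cpp
#include <cstddef>

// Ask the compiler to also emit a vector variant of this function.
#pragma omp declare simd
inline float axpy(float a, float x, float y)
{
    return a * x + y;
}

void saxpy(float a, float const* x, float* y, std::size_t n)
{
    // The loop is flagged as vectorizable; the call resolves to the
    // vector variant of axpy when the compiler generates one.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        y[i] = axpy(a, x[i], y[i]);
}
```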

SIMD Intrinsics Library
Principle
Wrap SIMD computations in a library
Support SIMD idioms with algorithms or other abstractions
Improve portability across compilers
Examples
Agner Fog’s x86 library
Vc, NOVA
gSIMD, Cyme
Boost.SIMD

Explicit usage of intrinsics
32-bit integer multiplication, target by target:

    // NEON
    return vmul_s32(a0, a1);   // 64-bit
    return vmulq_s32(a0, a1);  // 128-bit

    // SSE4.1
    return _mm_mullo_epi32(a0, a1);

    // SSE2
    return
      _mm_or_si128(
        _mm_and_si128(
          _mm_mul_epu32(a0, a1),
          _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)
        )
      , _mm_slli_si128(
          _mm_and_si128(
            _mm_mul_epu32( _mm_srli_si128(a0, 4)
                         , _mm_srli_si128(a1, 4)
                         )
          , _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)
          )
        , 4
        )
      );

    // Altivec
    // reinterpret as u16
    short0 = (__vector unsigned short)a0;
    short1 = (__vector unsigned short)a1;
    // shifting constant
    shift = vec_splat_u32(-16);
    sf = vec_rl(a1, shift);
    // Compute high part of the product
    high = vec_msum( short0, (__vector unsigned short)sf
                   , vec_splat_u32(0)
                   );
    // Complete by adding low part of the 16 bits product
    return vec_add( vec_sl(high, shift)
                  , vec_mulo(short0, short1)
                  );
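
The Altivec version is long because it rebuilds a 32-bit product from 16-bit pieces. As a sketch of the arithmetic behind that kind of decomposition (plain scalar C++, not the intrinsics themselves; the name `mul32_from_16` is hypothetical), the low 32 bits of a product only need the low-by-low term plus the two cross terms shifted left by 16:

```cpp
#include <cstdint>

// Low 32 bits of a * b, computed from 16-bit halves only.
std::uint32_t mul32_from_16(std::uint32_t a, std::uint32_t b)
{
    std::uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    std::uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;
    // Cross terms; the high-by-high term would be shifted out entirely.
    std::uint32_t cross = a_lo * b_hi + a_hi * b_lo;   // wraps modulo 2^32
    return (cross << 16) + a_lo * b_lo;
}
```

A library hides this kind of per-target rewrite behind a single multiplication operator.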

Implicit vs Explicit SIMD
Implicit
Automatic dependency analysis (e.g. reductions).
Recognises idioms with data dependencies.
Non-inline functions are scalar.
Limited support for outer-loop vectorisation.
Relies on the compiler's vectorizable patterns library.
Explicit
No dependency analysis.
Recognises idioms without data dependencies.
Non-inline functions can be vectorised.
Outer loops can be vectorised.
May be more cross-compiler portable.

bSIMD and Boost.SIMD†
† Boost.SIMD is a candidate for acceptance as a Boost Library

From bSIMD to Boost.SIMD
bSIMD
NumScale closed-source software for SIMD programming
Explicit SIMD library
Supports x86, PPC, ARM architectures
Provides domain-specific algorithms
Boost.SIMD
Open-source subset of bSIMD
Supports x86 and Power6
Provides standard-library-like algorithms

The Boost.SIMD register abstraction
pack<T,N>
Usable as a regular value type
Wraps a block of N contiguous elements of type T
pack<T> picks the optimal N for the current hardware
Constraints
T is a fundamental type
logical<T> is used to handle booleans
N must be a power of 2
What if?
Let C be the number of elements of type T that fit in the current hardware register
If N == C, use the native register directly
If N < C, use a scalar array (for now)
If N > C, use an aggregate of 2 x pack<T,N/2>
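
The storage rule above can be sketched with a toy type. This is not Boost.SIMD's implementation: `toy_pack`, `storage_kind`, `pick_storage` and the explicit `C` template parameter are all hypothetical, and a `std::array` stands in for the native register.

```cpp
#include <array>
#include <cstddef>

enum class storage_kind { native, scalar_array, aggregate };

// C is the number of T elements in one hardware register (supplied by the user here).
constexpr storage_kind pick_storage(std::size_t n, std::size_t c)
{
    return n == c ? storage_kind::native
         : n <  c ? storage_kind::scalar_array
                  : storage_kind::aggregate;
}

template <class T, std::size_t N, std::size_t C,
          storage_kind K = pick_storage(N, C)>
struct toy_pack;

// N == C: stands in for a single native register.
template <class T, std::size_t N, std::size_t C>
struct toy_pack<T, N, C, storage_kind::native>
{
    std::array<T, N> data;
    static constexpr std::size_t native_registers() { return 1; }
    T&       operator[](std::size_t i)       { return data[i]; }
    T const& operator[](std::size_t i) const { return data[i]; }
};

// N < C: plain scalar array, no native register involved.
template <class T, std::size_t N, std::size_t C>
struct toy_pack<T, N, C, storage_kind::scalar_array>
{
    std::array<T, N> data;
    static constexpr std::size_t native_registers() { return 0; }
    T&       operator[](std::size_t i)       { return data[i]; }
    T const& operator[](std::size_t i) const { return data[i]; }
};

// N > C: two half-sized packs, split recursively until N == C.
template <class T, std::size_t N, std::size_t C>
struct toy_pack<T, N, C, storage_kind::aggregate>
{
    toy_pack<T, N / 2, C> lo, hi;
    static constexpr std::size_t native_registers()
    { return 2 * toy_pack<T, N / 2, C>::native_registers(); }
    T&       operator[](std::size_t i)       { return i < N / 2 ? lo[i] : hi[i - N / 2]; }
    T const& operator[](std::size_t i) const { return i < N / 2 ? lo[i] : hi[i - N / 2]; }
};
```

With C = 4, a `toy_pack<float, 16, 4>` ends up as four native-sized blocks, while user code still indexes it as one 16-element value.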

Getting data into packs
Constructors
pack<T,N> x(U v) : fill x with N copies of v
pack<T,N> x(U v...) : fill x with (v0,v1,...)
pack<T,N> x(T* ptr) : load N elements from aligned memory ptr
pack<T,N> x(It b, It e) : load N elements from the range [b,e)
Explicit Memory Load
load<T>(U* ptr [,Offset o] )
load<T>(mask_ptr<U> ptr [,Offset o] )
aligned_load<T>(U* ptr [,int o] )
aligned_load<T>(mask_ptr<U> ptr [,int o] )

Misaligned Loads
aligned_load<T,N>(ptr)
Load from an unaligned address with static misalignment
Optimized to be faster than an unaligned load

Explicit Memory Store
store<T>(U* ptr [,Offset o] )
store<T>(mask_ptr<U> ptr [,Offset o] )
aligned_store<T>(U* ptr [,int o] )
aligned_store<T>(mask_ptr<U> ptr [,int o] )
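
The split between aligned and unaligned variants comes from the hardware: aligned accesses require the address to sit on a register-sized boundary. A minimal sketch of that distinction, using a hypothetical `float4` stand-in for a 128-bit register and hypothetical helpers `is_aligned`, `toy_load` and `toy_store` (none of these are Boost.SIMD names):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct float4 { float v[4]; };   // stand-in for a 128-bit register

// True if ptr sits on an alignment-byte boundary.
inline bool is_aligned(void const* ptr, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// Unaligned load: valid for any address.
inline float4 toy_load(float const* ptr)
{
    float4 r;
    std::memcpy(r.v, ptr, sizeof r.v);
    return r;
}

// Unaligned store: valid for any address.
inline void toy_store(float4 const& r, float* ptr)
{
    std::memcpy(ptr, r.v, sizeof r.v);
}
```

An aligned variant would typically check `is_aligned(ptr, 16)` as a precondition and then use the faster aligned instruction.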

Supported operations on pack
Basic Operators
All operators are available, with possible scalar mixing
No implicit conversion or promotion
Comparisons
==, !=, <, <=, >, >= perform SIMD comparisons.
compare_equal, compare_less perform reductive comparisons.
Other properties
Models RandomAccessRange
p[i] returns a proxy to access the register's internal value
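
The difference between the two kinds of comparison can be sketched on plain arrays: a SIMD comparison yields one boolean per lane, while a reductive comparison yields a single boolean for the whole value. The names `lanewise_less` and `all_equal` are hypothetical, and treating the reductive form as "all lanes equal" is an assumption about its semantics.

```cpp
#include <array>
#include <cstddef>

using lanes4 = std::array<float, 4>;
using mask4  = std::array<bool, 4>;

// Lane-wise comparison: one boolean per lane, as operator< would give on a pack.
inline mask4 lanewise_less(lanes4 const& a, lanes4 const& b)
{
    mask4 m{};
    for (std::size_t i = 0; i < 4; ++i) m[i] = a[i] < b[i];
    return m;
}

// Reductive comparison: a single boolean for the whole value.
inline bool all_equal(lanes4 const& a, lanes4 const& b)
{
    for (std::size_t i = 0; i < 4; ++i)
        if (a[i] != b[i]) return false;
    return true;
}
```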

Selection of available functions
Arithmetic
saturated arithmetic
long multiplication
float/int conversion
round, floor, ceil, trunc
sqrt, cbrt
hypot
average
random
min/max
rounded division and remainder
Bitwise
select
andnot, ornot
popcnt
ffs
ror, rol
rshr, rshl
twopower
IEEE
ilogb
frexp
ldexp
next/prev
ulpdist
exponent/mantissa
Predicates
comparison to zero
negated comparisons
is_unord, is_nan, is_invalid
is_odd, is_even
majority
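
As an example of what one entry in this list means, saturated arithmetic clamps results to the range of the type instead of wrapping around. A scalar sketch for 8-bit integers, with the hypothetical name `saturated_add`; a SIMD version applies the same rule to every lane at once:

```cpp
#include <cstdint>

// Unsigned: clamp to [0, 255] instead of wrapping.
inline std::uint8_t saturated_add(std::uint8_t a, std::uint8_t b)
{
    unsigned sum = unsigned(a) + unsigned(b);
    return static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
}

// Signed: clamp to [-128, 127].
inline std::int8_t saturated_add(std::int8_t a, std::int8_t b)
{
    int sum = int(a) + int(b);
    if (sum > 127)  sum = 127;
    if (sum < -128) sum = -128;
    return static_cast<std::int8_t>(sum);
}
```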

Selection of available functions
Reduction
any, all, none
nbtrue
minimum/maximum
sum
product, dot product
SWAR
group/split
combine/slice
splatted reduction
cumsum
sort
shuffle
interleaving
deinterleaving
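
Operations such as cumsum work inside a register by combining shifted copies of the data rather than looping over elements. A sketch of that idea on a four-lane array follows; the names `int4`, `slide_up`, `add4` and `cumsum4` are hypothetical, and this is one classic way to compute a prefix sum, not necessarily the one the library uses.

```cpp
#include <array>
#include <cstddef>

using int4 = std::array<int, 4>;

// Shift lanes towards higher indices by step, filling with zeros.
inline int4 slide_up(int4 const& x, std::size_t step)
{
    int4 r{};
    for (std::size_t i = step; i < 4; ++i) r[i] = x[i - step];
    return r;
}

// Lane-wise addition.
inline int4 add4(int4 a, int4 const& b)
{
    for (std::size_t i = 0; i < 4; ++i) a[i] += b[i];
    return a;
}

// Inclusive prefix sum in log2(4) = 2 slide-and-add steps.
inline int4 cumsum4(int4 x)
{
    x = add4(x, slide_up(x, 1));   // each lane adds its left neighbour
    x = add4(x, slide_up(x, 2));   // each lane adds the partial sum two lanes to the left
    return x;
}
```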

{and, Shuffling, Permutation}<1,0,2>
Principles
Data reordering is the #1 technique in SIMD
Use cases: transpose, AoS/SoA transformations
Turn memory accesses into computations
Support for specific permutations
Support for arbitrary shuffling patterns
Basic permutations
reverse, broadcast
interleave, deinterleave
repeat, slide
runtime lookup
Shuffle
Arbitrary permutation of elements
Optimizable if known at compile-time
Available for one or two parameters

Shuffle with compile-time indices

    pack<float,4> a{1, 2, 3, 4};
    pack<float,4> b{10, 20, 30, 40};
    // r1 = [1 1 4 4]
    auto r1 = shuffle<0,0,3,3>(a);
    // r2 = [1 0 0 10]
    auto r2 = shuffle<0,-1,-1,4>(a, b);
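
The index convention in these two calls (indices below N read the first argument, indices from N upward read the second, and -1 produces zero) can be sketched on `std::array`. The names `pick_lane` and `toy_shuffle` are hypothetical and this is a scalar model only; a real implementation maps each index pattern to the best available permute instruction.

```cpp
#include <array>
#include <cstddef>

// Select one output lane: -1 gives zero, [0,N) reads a, [N,2N) reads b.
template <class T, std::size_t N>
T pick_lane(int i, std::array<T, N> const& a, std::array<T, N> const& b)
{
    if (i < 0) return T(0);
    return static_cast<std::size_t>(i) < N ? a[static_cast<std::size_t>(i)]
                                           : b[static_cast<std::size_t>(i) - N];
}

// Two-argument shuffle with compile-time indices.
template <int... Is, class T, std::size_t N>
std::array<T, sizeof...(Is)> toy_shuffle(std::array<T, N> const& a,
                                         std::array<T, N> const& b)
{
    return {{ pick_lane(Is, a, b)... }};
}

// One-argument shuffle: the second source is all zeros.
template <int... Is, class T, std::size_t N>
std::array<T, sizeof...(Is)> toy_shuffle(std::array<T, N> const& a)
{
    return toy_shuffle<Is...>(a, std::array<T, N>{});
}
```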

Shuffle with a meta-function

    struct reverse_
    {
      // I is the output index, C the pack cardinal
      template <class I, class C> struct apply
        : std::integral_constant<int, C::value - I::value - 1>
      {};
    };
    // res = [4 3 2 1]
    pack<float> res = shuffle<reverse_>(a);

Shuffle with a constexpr function

    constexpr int mix_half(int i, int c)
    {
      return i < c/2 ? i + c : i;
    }
    // res = [10 20 3 4]
    pack<float> res = shuffle<pattern<mix_half>>(a, b);
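
One way such a function-driven pattern can be expanded at compile time is with `std::index_sequence` (C++14). The sketch below is an assumption about the general technique, not the library's code; `toy_pattern` and `toy_pattern_impl` are hypothetical names, and the pattern function is passed as a function-pointer template argument.

```cpp
#include <array>
#include <cstddef>
#include <utility>

constexpr int mix_half(int i, int c)
{
    return i < c / 2 ? i + c : i;
}

// Evaluate F(i, N) for every output lane i; indices >= N read from b.
template <int (*F)(int, int), class T, std::size_t N, std::size_t... I>
std::array<T, N> toy_pattern_impl(std::array<T, N> const& a,
                                  std::array<T, N> const& b,
                                  std::index_sequence<I...>)
{
    return {{ (F(int(I), int(N)) < int(N)
                 ? a[std::size_t(F(int(I), int(N)))]
                 : b[std::size_t(F(int(I), int(N)) - int(N))])... }};
}

template <int (*F)(int, int), class T, std::size_t N>
std::array<T, N> toy_pattern(std::array<T, N> const& a, std::array<T, N> const& b)
{
    return toy_pattern_impl<F>(a, b, std::make_index_sequence<N>{});
}
```

Because every index is a constant expression, an implementation is free to inspect the whole pattern and select a specialised instruction for it.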

Integration with the Standard Library
Algorithms:
SIMD transform
SIMD reduce
Use generic functor/lambda for mixing scalar/SIMD
Allocators
Ranges:
boost::simd::input_range
boost::simd::output_range
boost::simd::aligned_input_range
boost::simd::aligned_output_range
boost::simd::segmented_input_range
boost::simd::segmented_output_range

    std::vector<float, simd::allocator<float>> v(N);
    simd::transform( v.begin(), v.end()
                   , [](auto const& p)
                     {
                       return p * 2.f;
                     }
                   );

    std::vector<float, simd::allocator<float>> i(N), o(N);
    auto x = simd::reduce( i.data(), i.data() + N
                         , 0.f
                         );
    auto y = simd::reduce( i.data(), i.data() + N, 0.f
                         , [](auto&& a, auto&& e) { return a + e * e; }
                         , 0.f, simd::plus
                         );
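
The reason a generic lambda is used is that the same callable gets applied to whole blocks in the main loop and to single scalars in the tail. A self-contained sketch of that structure, with a hypothetical `float4` block type and a hypothetical `toy_transform` that takes an explicit output pointer (this is not the Boost.SIMD signature):

```cpp
#include <cstddef>

struct float4 { float v[4]; };   // stand-in for a pack of 4 floats

inline float4 operator*(float4 a, float s)
{
    for (float& x : a.v) x *= s;
    return a;
}

template <class F>
void toy_transform(float const* first, float const* last, float* out, F f)
{
    constexpr std::ptrdiff_t lanes = 4;
    // Main body: f is called on a whole block.
    for (; last - first >= lanes; first += lanes, out += lanes)
    {
        float4 in;
        for (std::ptrdiff_t k = 0; k < lanes; ++k) in.v[k] = first[k];
        float4 r = f(in);
        for (std::ptrdiff_t k = 0; k < lanes; ++k) out[k] = r.v[k];
    }
    // Tail: f is called on the leftover scalars.
    for (; first != last; ++first, ++out)
        *out = f(*first);
}
```

Calling it with `[](auto const& p) { return p * 2.f; }` instantiates the lambda once for `float4` and once for `float`.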

Basic Functions
Single precision math functions (cycles/value)
Hardware: Core i7 SandyBridge, AVX

    Function           Range            std   Scalar   SIMD
    exp                [−10, 10]        46    38       7
    log                [−10, −10]       42    37       5
    asin               [−1, 1]          40    35       13
    cos                [−20π, 20π]      66    47       6
    restricted_(cos)   [−π/4, π/4]      32    9        1.3

Julia set generator
Generate a fractal image using the Julia function
Largely compute-bound
Challenge: workload depends on pixel location

    template <class T> auto julia(T const& a, T const& b)
    {
      as_integer_t<T> res{0};
      std::size_t max_iter{0};
      T x{0}, y{0};
      // Declared outside the loop so the while condition can see it.
      decltype(x < y) mask{};
      do {
        auto x2 = x * x;
        auto y2 = y * y;
        mask = x2 + y2 < 4;          // lanes still inside the escape radius
        auto xy = 2 * x * y;
        x = x2 - y2 + a;
        y = xy + b;
        res = if_inc(mask, res);     // increment only where mask is true
      } while (any(mask) && max_iter++ < 256);
      return res;
    }
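
To see how the masked loop handles lanes that finish at different times, here is a scalar model of the same iteration over four lanes. The names `julia_lanes` and `any_lane` are hypothetical and plain arrays stand in for packs. All lanes keep iterating until no lane is active, but the counter only advances in lanes whose mask is still true.

```cpp
#include <array>
#include <cstddef>

using float4 = std::array<float, 4>;
using int4   = std::array<int, 4>;
using mask4  = std::array<bool, 4>;

inline bool any_lane(mask4 const& m) { return m[0] || m[1] || m[2] || m[3]; }

int4 julia_lanes(float4 const& a, float4 const& b)
{
    int4 res{};
    float4 x{}, y{};
    mask4 mask{};
    std::size_t max_iter = 0;
    do {
        for (std::size_t i = 0; i < 4; ++i)
        {
            float x2 = x[i] * x[i];
            float y2 = y[i] * y[i];
            mask[i]  = x2 + y2 < 4;       // is this lane still active?
            float xy = 2 * x[i] * y[i];
            x[i] = x2 - y2 + a[i];
            y[i] = xy + b[i];
            if (mask[i]) ++res[i];        // if_inc: count active lanes only
        }
    } while (any_lane(mask) && max_iter++ < 256);
    return res;
}
```

The cost of a block is set by its slowest lane, which is why the workload imbalance between pixels matters for SIMD efficiency.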

Julia set generator
Timing w/ Boost.SIMD and other solutions
[Figure: timing chart, from "An Evaluation of Current SIMD Programming Models for C++", Pohl et al., 2015]

Interaction with Boost.Odeint
Coupled/Uncoupled Roessler system
Written by Mario Mulansky
Showcases the effects of both cache and SIMD
Uses Boost.Odeint for the ODE system
Uses Boost.SIMD to vectorize the system
Results
Minimal disruption in the code
Overall 3x performance gain
See the whole code at https://github.com/mariomulansky/olsos

    template <class S, class D>
    void operator()(const S& x_, D& dxdt_, double t) const
    {
      auto x    = boost::begin(x_);
      auto dxdt = boost::begin(dxdt_);
      const int N = boost::size(x_);
      for (int j = 1; j < N/dim - 1; ++j)
      {
        const int i = j * dim;
        dxdt[i]     = -1.0 * x[i + 1] - x[i + 2] +
                      m_d * (x[i - dim] + x[i + dim] - 2.0 * x[i]);
        dxdt[i + 1] = x[i] + m_a * x[i + 1];
        dxdt[i + 2] = m_b + x[i + 2] * (x[i] - m_c);
      }
    }

    // Scalar call
    using state_type = std::vector<double>;
    state_type x(N);
    odeint::runge_kutta4<state_type> rk4;
    odeint::integrate_const(rk4, roessler, x, 0.0, T, dt);

    // Boost.SIMD call
    using alloc_t    = simd::allocator<double>;
    using state_type = std::vector<pack<double>, alloc_t>;
    state_type x(N / pack<double>::static_size);
    odeint::runge_kutta4<state_type> rk4;
    odeint::integrate_const(rk4, roessler, x, 0.0, T, dt);

Conclusion
High-level SIMD in C++11/14
Designing a C++ library for low-level performance primitives is possible
C++11/14 features play nicely with SIMD intrinsics
SIMD-specific idioms map to modern C++ components
Boost.SIMD
To be proposed for review this fall
Find us on https://github.com/numscale/boost.simd
Tests and feedback welcome

This talk would not have been feasible without
The bSIMD team
Lead Developer: Charly Chevalier
Developers: Jean-Thierry Lapresté, Guillaume Quintin
Tests and Doc: Alan Kelly, Kenny Peou
Our supporters
Tim Blenchman, our earliest adopter
Serge Guelton, for integrating Boost.SIMD into pythran
Mario Mulansky, for his work with Boost.SIMD & Boost.Odeint
Sylvain Jubertie, Ian Masliah, for testing Boost.SIMD in clever ways
