Designing C++ portable SIMD support
Joel Falcou
CppCon 2016
NumScale in a few words
Our company
French start-up specialized in software performance
We sell C++ libraries to master modern hardware performance
Consulting & training on all things C++ or HPC
NumScale and C++
Member of the ISO C++ French National Body
Enthusiastic user & contributor to OSS projects
Involved in the European C++ community
What in the world is SIMD ?
Is SIMD that obscure ?
Let’s have a poll
Who needs performance in their daily job ?
Who knows about parallel programming ?
Who knows about SIMD/multimedia extensions ?
Who uses SIMD extensions like SSE, AVX, VMX or NEON ?
Who has nightmares because of those ?
What is SIMD ? - A French cuisine approach
Some workload to process
A regular CPU: Single Instruction, Single Data processing
A SIMD enabled CPU: Single Instruction, Multiple Data processing
What is SIMD ? - For real
Wide registers store N > 1 values.
Special instructions process all N values at once
Code uses a blitter-like approach.
Speed-up of N on cache-hot data
Avoid premature scale-out
Maximize FLOPS per watt
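As a rough illustration of wide registers and special instructions, here is a minimal sketch of one loop that handles four floats per step. It assumes a compiler that defines __SSE__ on x86 and falls back to plain scalar code elsewhere; add_arrays is a name made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Adds b into a. With SSE, four floats travel through one 128-bit register per step.
inline void add_arrays(float* a, float const* b, std::size_t n)
{
  assert(n == 0 || (a != nullptr && b != nullptr));
  std::size_t i = 0;
#if defined(__SSE__)
  for (; i + 4 <= n; i += 4)
  {
    __m128 va = _mm_loadu_ps(a + i);           // load 4 floats, no alignment required
    __m128 vb = _mm_loadu_ps(b + i);
    _mm_storeu_ps(a + i, _mm_add_ps(va, vb));  // one instruction adds all 4 lanes
  }
#endif
  for (; i < n; ++i)                           // scalar tail, or the whole loop without SSE
    a[i] += b[i];
}
```

The scalar tail is what keeps the function correct when the size is not a multiple of the register width.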
1,001 flavors of SIMD
Intel x86
MMX             64 bits              int8, int16, int32
SSE             128 bits             float
SSE2            128 bits             int8, int16, int32, int64, double
SSE4.1, SSE4.2
AVX             256 bits             float, double
AVX2            256 bits             int8, int16, int32, int64
AVX512          512 bits             float, double, int32, int64
PowerPC
VMX             128 bits             int8, int16, int32, int64, float
VSX             128 bits             int8, int16, int32, int64, float, double
QPX             512 bits             double
ARM
VFP             64 bits              float, double
NEON            64 and 128 bits      double, float, int8, int16, int32, int64
The Many Ways to Vectorize
Implicit vectorization
Compiler hints
Explicit vectorization
Language extensions
SIMD Intrinsics libraries
Vector Intrinsics
Inline Assembly (let’s just say no right now)
Implicit vectorization
Compile-time analysis of loop nest
May use special hints
Only safe transformations are applied
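template <typename T>
// __restrict is a compiler extension; C++ has no restrict keyword
void f(T* __restrict a, T* __restrict b, int size)
{
  // hint: assume the loop carries no dependencies
  #pragma ivdep
  for (int i = 0; i < size; ++i)
    a[i] += b[i];
}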
Flag the loop as one that must be vectorized
Support for reductions & SIMD functions
User is in charge of checking validity
template <typename T> T f(T* a, T* b, int size)
{
  T res = 0;
  #pragma omp simd reduction(+:res)
  for (int i = 0; i < size; ++i)
  {
    a[i] += b[i];
    res += b[i];
  }
  return res;
}
SIMD Intrinsics Library
Wrap SIMD computations in a library
Support SIMD idioms with algorithms or other abstractions
Improve portability across compilers
Agner Fog’s x86 library
gSIMD, Cyme
Explicit usage of intrinsics
// NEON
return vmul_s32(a0, a1);  // 64-bit
return vmulq_s32(a0, a1); // 128-bit
// SSE4.1
return _mm_mullo_epi32(a0, a1);
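// SSE2: no 32-bit low multiply, so combine two 32x32->64 multiplies
__m128i const mask = _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0);
// products of lanes 0 and 2, low 32 bits kept
__m128i const even = _mm_and_si128(_mm_mul_epu32(a0, a1), mask);
// products of lanes 1 and 3, low 32 bits kept
__m128i const odd = _mm_and_si128( _mm_mul_epu32( _mm_srli_si128(a0, 4)
                                                 , _mm_srli_si128(a1, 4) )
                                  , mask );
return _mm_or_si128(even, _mm_slli_si128(odd, 4));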
// Altivec
// reinterpret as u16
auto short0 = (__vector unsigned short)a0;
auto short1 = (__vector unsigned short)a1;
// shifting constant
auto shift = vec_splat_u32(-16);
auto sf = vec_rl(a1, shift);
// Compute high part of the product
auto high = vec_msum( short0, (__vector unsigned short)sf
                    , vec_splat_u32(0)
                    );
// Complete by adding low part of the 16 bits product
return vec_add( vec_sl(high, shift)
              , vec_mulo(short0, short1)
              );
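The Altivec version rebuilds a 32-bit product from 16-bit multiplies. A scalar sketch of the arithmetic it relies on may help to read it; mullo_from_halves is a hypothetical helper written for this note, not part of any library.

```cpp
#include <cstdint>

// Low 32 bits of a*b rebuilt from 16-bit halves:
//   a*b = (a_hi*b_hi << 32) + ((a_hi*b_lo + a_lo*b_hi) << 16) + a_lo*b_lo
// The first term vanishes modulo 2^32, the cross terms play the role of
// vec_msum on the half-swapped operand, and a_lo*b_lo matches vec_mulo.
constexpr std::uint32_t mullo_from_halves(std::uint32_t a, std::uint32_t b)
{
  std::uint32_t const a_lo = a & 0xFFFFu, a_hi = a >> 16;
  std::uint32_t const b_lo = b & 0xFFFFu, b_hi = b >> 16;
  std::uint32_t const cross = a_lo * b_hi + a_hi * b_lo;  // may wrap, harmless after the shift
  return (cross << 16) + a_lo * b_lo;
}
```

Unsigned wrap-around is well defined in C++, so the sketch gives the same low 32 bits as a full 64-bit multiply followed by truncation.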
Implicit vs Explicit SIMD
Implicit
Automatic dependency analysis (e.g. reductions).
Recognises idioms with data dependencies.
Non-inline functions stay scalar.
Limited support for outer-loop vectorisation.
Relies on the compiler's vectorizable patterns library.
Explicit
No dependency analysis.
Recognises idioms without data dependencies.
Non-inline functions can be vectorized.
Outer loops can be vectorized.
May be more cross-compiler portable.
bSIMD and Boost.SIMD†
† Boost.SIMD is a candidate for acceptance as a Boost Library
From bSIMD to Boost.SIMD
bSIMD
NumScale closed-source software for SIMD programming
Explicit SIMD library
Supports x86, PPC, ARM architectures
Provides domain-specific algorithms
Boost.SIMD
Open-source sub-part of bSIMD
Supports x86 and Power6
Provides STD-like algorithms
The Boost.SIMD register abstraction
Usable as a regular Value Type
Wraps a block of contiguous N elements of type T
pack<T> picks the optimal N for current hardware
T is a fundamental type
logical<T> is used to handle booleans
N must be a power of 2.
What if N differs from the native size ?
Let C be the number of elements of type T in the current hardware register
If N == C, use the native register directly
If N < C, use a scalar array (for now)
If N > C, use an aggregate of 2 x pack<T,N/2>
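The three cases above can be sketched as a compile-time selection. This is only an illustration under stated assumptions: a 128-bit register is assumed, and storage_kind, native_cardinal, select_storage and registers_needed are hypothetical names, not Boost.SIMD identifiers.

```cpp
#include <cstddef>

enum class storage_kind { native, scalar_array, aggregate };

// C: how many T fit in one register, assuming a 128-bit (16-byte) register.
template <class T>
constexpr std::size_t native_cardinal() { return 16 / sizeof(T); }

// Which of the three storage strategies a pack<T,N> would pick.
template <class T, std::size_t N>
constexpr storage_kind select_storage()
{
  static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of 2");
  return N == native_cardinal<T>() ? storage_kind::native
       : N <  native_cardinal<T>() ? storage_kind::scalar_array
                                   : storage_kind::aggregate;
}

// Native registers behind n elements: an aggregate splits in two until n == c.
constexpr std::size_t registers_needed(std::size_t n, std::size_t c)
{
  return n > c ? 2 * registers_needed(n / 2, c) : (n == c ? 1 : 0);
}
```

Under this assumption a pack of 16 floats would end up as four native registers, reached through two levels of halving.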
Getting data into packs
pack<T,N> x(U v) : fill x with N copies of v
pack<T,N> x(U v...) : fill x with (v0,v1,...)
pack<T,N> x(T* ptr) : load N elements from aligned memory ptr
pack<T,N> x(It b, It e) : load N elements from the range [b,e)
Explicit Memory Load
load<T>(U* ptr [,Offset o] )
load<T>(mask_ptr<U> ptr [,Offset o] )
aligned_load<T>(U* ptr [,int o] )
aligned_load<T>(mask_ptr<U> ptr [,int o] )
Misaligned Loads
Load an unaligned address with static misalignment
Optimized to be faster than unaligned load
Explicit Memory Store
store<T>(U* ptr [,Offset o] )
store<T>(mask_ptr<U> ptr [,Offset o] )
aligned_store<T>(U* ptr [,int o] )
aligned_store<T>(mask_ptr<U> ptr [,int o] )
Supported operations on pack
Basic Operators
All operators are available, including mixed pack/scalar operands
No conversion or promotion
==, !=, <, <=, >, >= perform SIMD comparisons.
compare_equal, compare_less perform reductive comparisons.
Other properties
Models RandomAccessRange
p[i] returns a proxy to access the register's internal value
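The difference between SIMD comparisons and reductive comparisons can be sketched with a toy type. pack4 and mask4 below are hypothetical stand-ins, not the Boost.SIMD types, and the lexicographic meaning given to compare_less is an assumption made for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy 4-lane pack and mask; stand-ins for illustration only.
struct pack4 { std::array<float, 4> lanes; };
struct mask4 { std::array<bool, 4> lanes; };

// SIMD comparison: one boolean per lane.
inline mask4 operator<(pack4 const& a, pack4 const& b)
{
  mask4 m{};
  for (std::size_t i = 0; i < 4; ++i)
    m.lanes[i] = a.lanes[i] < b.lanes[i];
  return m;
}

// Reductive comparison: a single bool for the whole pack
// (assumed lexicographic here, which is what std::array's operator< does).
inline bool compare_less(pack4 const& a, pack4 const& b)
{
  return a.lanes < b.lanes;
}

// Usage sketch.
inline void compare_demo()
{
  pack4 const a{{1.f, 5.f, 3.f, 0.f}};
  pack4 const b{{2.f, 4.f, 3.f, 1.f}};
  mask4 const m = a < b;            // lanes: true, false, false, true
  assert(m.lanes[0] && !m.lanes[1] && !m.lanes[2] && m.lanes[3]);
  assert(compare_less(a, b));       // decided by the first differing lane
}
```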
Selection of available functions
saturated arithmetic
long multiplication
float/int conversion
round, floor, ceil, trunc
sqrt, cbrt
rounded division and
andnot, ornot
ror, rol
rshr, rshl
comparison to zero
negated comparisons
is_unord, is_nan,
is_odd, is_even
any, all, none
product, dot product
splatted reduction
{and, Shuffling, Permutation}1,0,2
Data reordering is the #1 technique in SIMD
Use cases : transpose, AoS/SoA transformations
Turn memory access into computations
Support for specific permutations
Support for arbitrary shuffling patterns
Basic permutations
reverse, broadcast
interleave, deinterleave
repeat, slide
runtime lookup
Arbitrary permutation of elements
Optimizable if known at compile-time
Available for one or two parameters
pack<float,4> a{1,2,3,4};
pack<float,4> b{10,20,30,40};
// r1 = [1 1 4 4]
auto r1 = shuffle<0,0,3,3>(a);
// r2 = [1 0 0 10]
auto r2 = shuffle<0,-1,-1,4>(a,b);
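To make the index convention of the example above concrete, here is a minimal sketch of the same semantics on plain arrays. It assumes that -1 produces a zero and that indices 4 to 7 select from the second input, as the r2 comment suggests; pack4, pick and toy_shuffle are hypothetical names and no real SIMD instruction is involved.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy stand-in for a 4-lane pack; not the Boost.SIMD type.
using pack4 = std::array<float, 4>;

// Pick one output lane: -1 yields zero, 0..3 reads a, 4..7 reads b.
inline float pick(int idx, pack4 const& a, pack4 const& b)
{
  assert(idx >= -1 && idx < 8);
  if (idx < 0) return 0.f;
  return idx < 4 ? a[static_cast<std::size_t>(idx)]
                 : b[static_cast<std::size_t>(idx - 4)];
}

// Two-input shuffle driven by compile-time indices.
template <int I0, int I1, int I2, int I3>
pack4 toy_shuffle(pack4 const& a, pack4 const& b)
{
  return pack4{{pick(I0, a, b), pick(I1, a, b), pick(I2, a, b), pick(I3, a, b)}};
}

// One-input form: reuse the two-input version with a as both sources.
template <int I0, int I1, int I2, int I3>
pack4 toy_shuffle(pack4 const& a)
{
  return toy_shuffle<I0, I1, I2, I3>(a, a);
}
```

Because the indices are template parameters, a real implementation is free to map each pattern to the cheapest instruction available, which is the point of the "optimizable if known at compile-time" remark.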
struct reverse_
{
  // I is the output index, C the number of elements
  template <class I, class C> struct apply
  : std::integral_constant<int, C::value - I::value - 1> {};
};
// res = [4 3 2 1]
pack<float> res = shuffle<reverse_>(a);
constexpr int mix_half(int i, int c)
{
  return i < c/2 ? i+c : i;
}
// res = [10 20 3 4]
pack<float> res = shuffle<pattern<mix_half>>(a,b);
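A small sketch shows how an index function such as mix_half turns into the result in the comment. It works on plain arrays and assumes that indices of 4 and above select from the second input; pack4 and toy_shuffle_pattern are hypothetical names used only here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using pack4 = std::array<float, 4>;   // toy stand-in, not the Boost.SIMD pack

// Same index function as on the slide: i is the output lane, c the lane count.
constexpr int mix_half(int i, int c) { return i < c / 2 ? i + c : i; }

// Apply an index function lane by lane; indices >= 4 select from b.
template <int (*F)(int, int)>
pack4 toy_shuffle_pattern(pack4 const& a, pack4 const& b)
{
  pack4 r{};
  for (int i = 0; i < 4; ++i)
  {
    int const idx = F(i, 4);
    assert(idx >= 0 && idx < 8);
    r[static_cast<std::size_t>(i)] = idx < 4 ? a[static_cast<std::size_t>(idx)]
                                             : b[static_cast<std::size_t>(idx - 4)];
  }
  return r;
}
```

For four lanes mix_half yields the indices 4, 5, 2, 3, so the first half comes from b and the second half from a.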
Integration with the Standard Library
Algorithms :
SIMD transform
SIMD reduce
Use generic functor/lambda for mixing scalar/SIMD
Ranges :
std::vector<float, simd::allocator<float>> v(N);
simd::transform( v.begin(), v.end()
               , [](auto const& p)
                 {
                   return p * 2.f;
                 }
               );
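Why a generic lambda matters can be sketched with a toy version of such an algorithm: the same callable is applied to whole packs and to leftover scalars. This is an assumption about how a chunked transform can work, not the Boost.SIMD implementation; pack4, toy_transform and doubled are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy 4-lane pack with just enough operators for the example.
struct pack4
{
  std::array<float, 4> lanes;
  friend pack4 operator*(pack4 p, float s)
  {
    for (auto& x : p.lanes) x *= s;
    return p;
  }
};

// Chunked transform: full groups of 4 go through f as a pack,
// leftover elements go through the same f as plain floats.
template <class F>
void toy_transform(std::vector<float>& v, F f)
{
  std::size_t i = 0;
  for (; i + 4 <= v.size(); i += 4)
  {
    pack4 p{{v[i], v[i + 1], v[i + 2], v[i + 3]}};
    p = f(p);
    for (std::size_t k = 0; k < 4; ++k) v[i + k] = p.lanes[k];
  }
  for (; i < v.size(); ++i) v[i] = f(v[i]);
  assert(i == v.size());
}

// Usage sketch: the generic lambda accepts both pack4 and float.
inline std::vector<float> doubled(std::vector<float> v)
{
  toy_transform(v, [](auto const& p) { return p * 2.f; });
  return v;
}
```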
std::vector<float, simd::allocator<float>> i(N), o(N);
auto x = simd::reduce(,
                     , 0.f
                     );
auto y = simd::reduce(,, 0.f
                     , [](auto&& a, auto&& e) { return a + e*e; }
                     , 0.f, simd::plus
                     );
Under the SIMD Hood
Performance !
Basic Functions
Single precision math functions (cycles/value)
Hardware : Core i7 SandyBridge, AVX
Function          Range          std   Scalar   SIMD
exp               [−10, 10]      46    38       7
log               [−10, −10]     42    37       5
asin              [−1, 1]        40    35       13
cos               [−20π, 20π]    66    47       6
restricted_(cos)  [−π/4, π/4]    32    9        1.3
Julia set generator
Generate a fractal image using the Julia function
Largely compute-bound
Challenge : Workload depends on pixel location
template <class T> auto julia(T const& a, T const& b)
{
  as_integer_t<T> res{0};
  std::size_t max_iter{0};
  T x{0}, y{0};
  do {
    auto x2 = x * x;
    auto y2 = y * y;
    // lanes still inside the radius-2 disc
    auto mask = x2 + y2 < 4;
    auto xy = 2 * x * y;
    x = x2 - y2 + a;
    y = xy + b;
    // increment the counter only where mask is true
    res = if_inc(mask, res);
  } while(any(mask) && max_iter++ < 256);
  return res;
}
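The challenge named earlier, a workload that depends on pixel location, shows up in this loop as lanes that finish at different times. A lane-by-lane emulation on plain arrays may make the role of mask, if_inc and any easier to follow; lanes4, count4 and toy_julia are hypothetical names and nothing here is vectorized.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using lanes4 = std::array<float, 4>;
using count4 = std::array<int, 4>;

// Every lane keeps computing until no lane is left inside the radius-2 disc,
// but only lanes that are still inside get their counter incremented.
inline count4 toy_julia(lanes4 const& a, lanes4 const& b)
{
  count4 res{};
  lanes4 x{}, y{};
  int iter = 0;
  bool any_live = false;
  do
  {
    any_live = false;
    for (std::size_t k = 0; k < 4; ++k)
    {
      float const x2 = x[k] * x[k];
      float const y2 = y[k] * y[k];
      bool const live = x2 + y2 < 4.f;     // this lane's bit of the mask
      float const xy = 2.f * x[k] * y[k];
      x[k] = x2 - y2 + a[k];
      y[k] = xy + b[k];
      if (live) ++res[k];                  // if_inc(mask, res)
      any_live = any_live || live;         // any(mask)
    }
  } while (any_live && iter++ < 256);
  return res;
}

// Usage sketch: lanes 0 and 2 never escape, lanes 1 and 3 escape at once.
inline void julia_demo()
{
  count4 const r = toy_julia({{0.f, 3.f, 0.f, 3.f}}, {{0.f, 0.f, 0.f, 0.f}});
  assert(r[0] > r[1]);
  assert(r[1] == r[3]);
}
```

The slowest lane sets the iteration count for the whole pack, so neighbouring pixels with very different escape times limit the achievable speed-up.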
Timing with Boost.SIMD and other solutions
From "An Evaluation of Current SIMD Programming Models for C++", Pohl et al., 2015
Interaction with Boost.Odeint
Coupled/Uncoupled Roessler system
Written by Mario Mulansky
Showcase effects of both cache and SIMD
Use Boost.ODEINT for the ODE system
Use Boost.SIMD to vectorize the system
Minimal disruption in the code
Overall 3x performance gain
See the whole code at https ://
template <class S, class D>
void operator()(const S& x_, D& dxdt_, double t) const
{
  auto x = boost::begin(x_);
  auto dxdt = boost::begin(dxdt_);
  const int N = boost::size(x_);
  // interior oscillators only: each one is coupled to both neighbours
  for (int j = 1; j < N/dim - 1; ++j)
  {
    const int i = j*dim;
    dxdt[i] = -1.0*x[i + 1] - x[i + 2] +
              m_d * (x[i - dim] + x[i + dim] - 2.0 * x[i]);
    dxdt[i + 1] = x[i] + m_a * x[i + 1];
    dxdt[i + 2] = m_b + x[i + 2] * (x[i] - m_c);
  }
}
// Scalar call
using state_type = std::vector<double>;
state_type x(N);
odeint::runge_kutta4<state_type> rk4;
odeint::integrate_const(rk4, roessler, x, 0.0, T, dt);
// Boost.SIMD call
using alloc_t = simd::allocator<double>;
using state_type = std::vector<pack<double>, alloc_t>;
state_type x(N / pack<double>::static_size);
odeint::runge_kutta4<state_type> rk4;
odeint::integrate_const(rk4, roessler, x, 0.0, T, dt);
High level SIMD in C++11/14
Designing a C++ library for low level performance primitives is
C++11/14 features play nice with SIMD intrinsics
SIMD-specific idioms map to modern C++ components
To be proposed for review this fall
Find us on https ://
Tests and feedback welcome
This talk would not have been feasible without
The bSIMD team
Lead Developer : Charly Chevalier
Developers : Jean-Thierry Lapresté, Guillaume Quintin
Tests and Doc : Alan Kelly, Kenny Peou
Our supporters
Tim Blenchman, our earliest adopter
Serge Guelton, for integrating Boost.SIMD into pythran
Mario Mulansky, for his work with Boost.SIMD & Boost.Odeint
Sylvain Jubertie, Ian Masliah, for testing Boost.SIMD in clever ways
Thanks for your attention !

