Designing C++ portable SIMD support
Joel Falcou
NumScale
CppCon 2016
NumScale in a few words
Our company
French start-up specialized in software performance
We sell C++ libraries to master modern hardware performance
Consulting & training on all things C++ or HPC
NumScale and C++
Member of the ISO C++ French National Body
Enthusiastic user & contributor to OSS projects
Involved in the European C++ community
What in the world is SIMD ?
Is SIMD that obscure?
Let’s have a poll
Who needs performance in their daily job?
Who knows about parallel programming?
Who knows about SIMD/multimedia extensions?
Who uses SIMD extensions like SSE, AVX, VMX or NEON?
Who has nightmares because of those?
What is SIMD? - A French cuisine approach
Some workload to process
A regular CPU
Single Instruction, Single Data processing
A SIMD enabled CPU
Single Instruction, Multiple Data processing
What is SIMD? - For real
[Figure: instruction, data and result streams for SISD versus SIMD execution]
Principles
Wide registers store N > 1 values.
Special instructions process those registers.
Code uses a blitter-like approach.
Benefits
Speed-up of N on cache-hot data
Avoid premature scale-out
Maximize FLOPS/Watt
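A minimal sketch of the SISD/SIMD contrast described above, assuming an x86 target with SSE (elsewhere the second function silently falls back to the scalar loop). The names add_scalar and add_simd are illustrative and not taken from the talk.

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h> // SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#define SKETCH_HAS_SSE 1
#endif

// SISD: each addition instruction produces one result.
void add_scalar(const float* a, const float* b, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

// SIMD: each addition instruction produces N = 4 results from 128-bit registers.
void add_simd(const float* a, const float* b, float* out, std::size_t n)
{
    std::size_t i = 0;
#ifdef SKETCH_HAS_SSE
    for (; i + 4 <= n; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);             // load 4 floats, no alignment required
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  // 4 additions in one instruction
    }
#endif
    for (; i < n; ++i)                               // scalar tail (whole loop without SSE)
        out[i] = a[i] + b[i];
}
```

Both functions give the same results; the ideal speed-up of N only shows when the data is already in cache and the loop is not memory-bound.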
1,001 flavors of SIMD
Intel x86
MMX (64 bits): int8, int16, int32
SSE (128 bits): float
SSE2 (128 bits): int8, int16, int32, int64, double
SSE3, SSSE3
SSE4a (AMD)
SSE4.1, SSE4.2
AVX (256 bits): float, double
AVX2 (256 bits): int8, int16, int32, int64
FMA3
FMA4, XOP (AMD)
AVX512 (512 bits): float, double, int32, int64
PowerPC
VMX (128 bits): int8, int16, int32, int64, float
VSX (128 bits): int8, int16, int32, int64, float, double
QPX (256 bits): double
ARM
VFP (64 bits): float, double
NEON (64 bits and 128 bits): double, float, int8, int16, int32, int64
The Many Ways to Vectorize
Implicit vectorization
Auto-Vectorization
Compiler hints
Explicit vectorization
Language extensions
SIMD Intrinsics libraries
Vector Intrinsics
Inline Assembly (let’s just say no right now)
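Implicit vectorization
Auto-vectorizer
Compile-time analysis of loop nests
May use special hints
Only safe transformations are applied
template <typename T>
void f(T* __restrict a, T* __restrict b, int size)
{
  // __restrict is a compiler extension (C++ has no restrict keyword): a and b must not alias
  // ivdep is a compiler-specific hint (GCC spells it "#pragma GCC ivdep")
  #pragma ivdep
  for (int i = 0; i < size; ++i)
    a[i] += b[i];
}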
OpenMP4
Principle
Flag a loop as one that must be vectorized
Support for reductions & SIMD functions
User is in charge of checking validity
template <typename T> T f(T* a, T* b, int size)
{
  T res = 0;
  // The user asserts that the loop is safe to vectorize; res is a sum reduction
  #pragma omp simd reduction(+:res)
  for (int i = 0; i < size; ++i)
  {
    a[i] += b[i];
    res += b[i];
  }
  return res;
}
SIMD Intrinsics Library
Principle
Wrap SIMD computations in a library
Support SIMD idioms with algorithms or other abstractions
Improve portability across compilers
Examples
Agner Fog’s x86 library
Vc, NOVA
gSIMD, Cyme
Boost.SIMD
Explicit usage of intrinsics: 32-bit integer multiplication
// NEON
return vmul_s32(a0, a1);  // 64-bit
return vmulq_s32(a0, a1); // 128-bit
// SSE4.1
return _mm_mullo_epi32(a0, a1);
// SSE2
return
_mm_or_si128(
  _mm_and_si128(
    _mm_mul_epu32(a0, a1),
    _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)
  )
, _mm_slli_si128(
    _mm_and_si128(
      _mm_mul_epu32( _mm_srli_si128(a0, 4)
                   , _mm_srli_si128(a1, 4)
                   )
    , _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)
    )
  , 4
  )
);
// Altivec
// reinterpret as u16
short0 = (__vector unsigned short)a0;
short1 = (__vector unsigned short)a1;
// shifting constant
shift_ = vec_splat_u32(-16);
sf = vec_rl(a1, shift_);
// Compute high part of the product
high = vec_msum( short0, (__vector unsigned short)sf
               , vec_splat_u32(0)
               );
// Complete by adding low part of the 16 bits product
return vec_add( vec_sl(high, shift_)
              , vec_mulo(short0, short1)
              );
Implicit vs Explicit SIMD
Implicit
Automatic dependency analysis (e.g. reductions).
Recognises idioms with data dependencies.
Non-inline functions are scalar.
Limited support for outer-loop vectorisation.
Relies on the compiler’s library of vectorizable patterns.
Explicit
No dependency analysis.
Recognises idioms without data dependencies.
Non-inline functions can be vectorised.
Outer loops can be vectorised.
May be more cross-compiler portable.
bSIMD and Boost.SIMD†
† Boost.SIMD is a candidate for acceptance as a Boost Library
From bSIMD to Boost.SIMD
bSIMD
NumScale closed source software for SIMD programming
Explicit SIMD library
Supports x86, PPC, ARM architectures
Provides domain-specific algorithms
Boost.SIMD
Open Source sub-part of bSIMD
Supports x86 and Power6
Provides std-like algorithms
The Boost.SIMD register abstraction
pack<T,N>
Usable as a regular Value Type
Wraps a block of contiguous N elements of type T
pack<T> picks the optimal N for current hardware
Constraints
T is a fundamental type
logical<T> is used to handle booleans
N must be a power of 2.
What if?
Let C be the number of elements of type T that fit in a native hardware register
If N == C, use the native register directly
If N < C, use a scalar array (for now)
If N > C, use an aggregate of 2 x pack<T,N/2>
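The three cases above can be sketched as a compile-time choice of storage. This is not the Boost.SIMD implementation: pack_sketch, storage_kind, native_cardinal and the tag types are hypothetical names, and a 128-bit register is assumed so that C = 16 / sizeof(T).

```cpp
#include <array>
#include <cstddef>
#include <type_traits>

// Assumption for this sketch: 128-bit registers, hence C = 16 / sizeof(T).
template <typename T>
struct native_cardinal : std::integral_constant<std::size_t, 16 / sizeof(T)> {};

struct native_tag {};    // N == C: one hardware register
struct emulated_tag {};  // N <  C: plain array of scalars
struct aggregate_tag {}; // N >  C: two packs of N/2 elements

template <typename T, std::size_t N>
using storage_kind =
    std::conditional_t<(N == native_cardinal<T>::value), native_tag,
    std::conditional_t<(N <  native_cardinal<T>::value), emulated_tag,
                                                         aggregate_tag>>;

template <typename T, std::size_t N, typename Kind = storage_kind<T, N>>
struct pack_sketch;

// N == C: a real library would hold the native register type here.
template <typename T, std::size_t N>
struct pack_sketch<T, N, native_tag>
{
    std::array<T, N> data;
    static constexpr int registers = 1;
};

// N < C: emulated with scalars, no full register is used.
template <typename T, std::size_t N>
struct pack_sketch<T, N, emulated_tag>
{
    std::array<T, N> data;
    static constexpr int registers = 0;
};

// N > C: recursively split into two halves until the native size is reached.
template <typename T, std::size_t N>
struct pack_sketch<T, N, aggregate_tag>
{
    static_assert((N & (N - 1)) == 0, "N must be a power of 2");
    pack_sketch<T, N / 2> lo, hi;
    static constexpr int registers = 2 * pack_sketch<T, N / 2>::registers;
};
```

With these assumptions, pack_sketch<float,16> is made of four native registers, while pack_sketch<float,2> falls back to a two-element scalar array.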
Getting data into packs
Constructors
pack<T,N> x(U v) : fill x with N copies of v
pack<T,N> x(U v...) : fill x with (v0,v1,...)
pack<T,N> x(T* ptr) : load N elements from aligned memory ptr
pack<T,N> x(It b, It e) : load N elements from the range [b,e)
Explicit Memory Load
load<T>(U* ptr [, Offset o])
load<T>(mask_ptr<U> ptr [, Offset o])
aligned_load<T>(U* ptr [, int o])
aligned_load<T>(mask_ptr<U> ptr [, int o])
Misaligned Loads
aligned_load<T,N>(ptr)
Load from an unaligned address whose misalignment N is known statically
Optimized to be faster than an unaligned load
Explicit Memory Store
store<T>(U* ptr [, Offset o])
store<T>(mask_ptr<U> ptr [, Offset o])
aligned_store<T>(U* ptr [, int o])
aligned_store<T>(mask_ptr<U> ptr [, int o])
Supported operations on pack
Basic Operators
All operators are available, with possible scalar mixing
No conversion or promotion
Comparisons
==, !=, <, <=, >, >= perform SIMD comparisons.
compare_equal, compare_less perform reductive comparisons.
Other properties
Models RandomAccessRange
p[i] returns a proxy to access the register's internal value
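To make the difference between a SIMD comparison and a reductive comparison concrete, here is a small model that uses std::array in place of a pack. All names (lanes_t, mask_t, lanes_less, all_equal, any_of_mask, all_of_mask) are hypothetical, and treating a reductive equality as "every lane is equal" is an assumption of this sketch rather than a statement about the library.

```cpp
#include <array>
#include <cstddef>

template <typename T, std::size_t N> using lanes_t = std::array<T, N>;
template <std::size_t N>             using mask_t  = std::array<bool, N>;

// SIMD-style comparison: one boolean per lane.
template <typename T, std::size_t N>
mask_t<N> lanes_less(const lanes_t<T, N>& a, const lanes_t<T, N>& b)
{
    mask_t<N> m{};
    for (std::size_t i = 0; i < N; ++i) m[i] = a[i] < b[i];
    return m;
}

// Reductions over a mask: collapse the per-lane booleans into a single bool.
template <std::size_t N>
bool any_of_mask(const mask_t<N>& m)
{
    for (bool b : m) if (b) return true;
    return false;
}

template <std::size_t N>
bool all_of_mask(const mask_t<N>& m)
{
    for (bool b : m) if (!b) return false;
    return true;
}

// Reductive comparison modeled as: true only if every lane compares equal.
template <typename T, std::size_t N>
bool all_equal(const lanes_t<T, N>& a, const lanes_t<T, N>& b)
{
    for (std::size_t i = 0; i < N; ++i)
        if (!(a[i] == b[i])) return false;
    return true;
}
```

The per-lane mask is what feeds selection and masked operations, whereas the single bool is what an ordinary if or while condition needs.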
Selection of available functions
Arithmetic
saturated arithmetic
long multiplication
float/int conversion
round, floor, ceil, trunc
sqrt, cbrt
hypot
average
random
min/max
rounded division and remainder
Bitwise
select
andnot, ornot
popcnt
ffs
ror, rol
rshr, rshl
twopower
IEEE
ilogb
frexp
ldexp
next/prev
ulpdist
exponent/mantissa
Predicates
comparison to zero
negated comparisons
is_unord, is_nan, is_invalid
is_odd, is_even
majority
Reduction
any, all, none
nbtrue
minimum/maximum
sum
product, dot product
SWAR
group/split
combine/slice
splatted reduction
cumsum
sort
shuffle
interleaving
deinterleaving
{and, Shuffling, Permutation}1,0,2
Principles
Data reordering is the #1 technique in SIMD
Use cases: transpose, AoS/SoA transformations
Turn memory access into computations
Support for specific permutations
Support for arbitrary shuffling patterns
Basic permutations
reverse, broadcast
interleave, deinterleave
repeat, slide
runtime lookup
Shuffle
Arbitrary permutation of elements
Optimizable if known at compile-time
Available for one or two parameters
pack<float,4> a{1,2,3,4};
pack<float,4> b{10,20,30,40};
// r1 = [1 1 4 4]
auto r1 = shuffle<0,0,3,3>(a);
// r2 = [1 0 0 10]
auto r2 = shuffle<0,-1,-1,4>(a,b);
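The index convention used in this example can be modeled in plain C++ with std::array standing in for pack. The sketch assumes, as the results above suggest, that indices 0..N-1 select from the first input, N..2N-1 from the second, and -1 produces zero; shuffle_sketch is a hypothetical name, not the library function.

```cpp
#include <array>
#include <cstddef>

// Two-input shuffle: index i < N picks a[i], index i >= N picks b[i - N], -1 yields zero.
template <int... Is, typename T, std::size_t N>
std::array<T, sizeof...(Is)> shuffle_sketch(const std::array<T, N>& a,
                                            const std::array<T, N>& b)
{
    auto pick = [&](int i) -> T {
        if (i < 0) return T(0);
        const std::size_t k = static_cast<std::size_t>(i);
        return k < N ? a[k] : b[k - N];
    };
    return {{ pick(Is)... }};
}

// One-input shuffle: indices are expected to stay in [0, N) or be -1.
template <int... Is, typename T, std::size_t N>
std::array<T, sizeof...(Is)> shuffle_sketch(const std::array<T, N>& a)
{
    return shuffle_sketch<Is...>(a, a);
}
```

Because the indices are template arguments, a real implementation can map each pattern to the cheapest hardware instruction at compile time; this model only reproduces the resulting values.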
struct reverse_
{
  template <class I, class C> struct apply
        : std::integral_constant<int, C::value - I::value - 1>
  {};
};
// res = [4 3 2 1]
pack<float> res = shuffle<reverse_>(a);
constexpr int mix_half(int i, int c)
{
  return i < c/2 ? i+c : i;
}
// res = [10 20 3 4]
pack<float> res = shuffle<pattern<mix_half>>(a,b);
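One way such a constexpr pattern could be turned into a compile-time index table is sketched below, again with std::array in place of pack. shuffle_pattern, shuffle_impl and reverse_lane are hypothetical names, and the pattern is assumed to return indices in [0, 2N).

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Same rule as on the slide: the first half of the lanes comes from b, the rest from a.
constexpr int mix_half(int i, int c) { return i < c / 2 ? i + c : i; }

// Another pattern, for illustration: lane i reads lane c - i - 1.
constexpr int reverse_lane(int i, int c) { return c - i - 1; }

template <int (*Pattern)(int, int), typename T, std::size_t N, std::size_t... I>
std::array<T, N> shuffle_impl(const std::array<T, N>& a, const std::array<T, N>& b,
                              std::index_sequence<I...>)
{
    // Pattern(i, N) is evaluated at compile time for every lane i.
    constexpr int idx[] = { Pattern(static_cast<int>(I), static_cast<int>(N))... };
    return {{ (static_cast<std::size_t>(idx[I]) < N ? a[idx[I]] : b[idx[I] - N])... }};
}

template <int (*Pattern)(int, int), typename T, std::size_t N>
std::array<T, N> shuffle_pattern(const std::array<T, N>& a, const std::array<T, N>& b)
{
    return shuffle_impl<Pattern>(a, b, std::make_index_sequence<N>{});
}
```

Since the index table is a constant expression, the pattern function costs nothing at run time; only the element moves remain.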
Integration with the Standard Library
Algorithms :
SIMD transform
SIMD reduce
Use generic functor/lambda for mixing scalar/SIMD
Allocators
Ranges :
boost::simd::input_range
boost::simd::output_range
boost::simd::aligned_input_range
boost::simd::aligned_output_range
boost::simd::segmented_input_range
boost::simd::segmented_output_range
std::vector<float, simd::allocator<float>> v(N);
simd::transform( v.begin(), v.end()
               , [](auto const& p)
                 {
                   return p * 2.f;
                 }
               );
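The reason a generic lambda is used here can be shown with a small model: the same callable is applied to a whole block of values in the main loop and to single scalars in the tail. pack4 and transform_sketch are hypothetical stand-ins written for this sketch, not Boost.SIMD types or functions.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a SIMD pack: 4 floats with an elementwise operator*.
struct pack4
{
    std::array<float, 4> v;
};

inline pack4 operator*(const pack4& p, float s)
{
    pack4 r;
    for (std::size_t i = 0; i < 4; ++i) r.v[i] = p.v[i] * s;
    return r;
}

// In-place transform: f receives a pack4 in the main loop and a float in the tail.
template <typename F>
void transform_sketch(float* first, float* last, F f)
{
    for (; last - first >= 4; first += 4)
    {
        pack4 p;
        for (std::size_t i = 0; i < 4; ++i) p.v[i] = first[i];   // load
        const pack4 r = f(p);                                    // SIMD call
        for (std::size_t i = 0; i < 4; ++i) first[i] = r.v[i];   // store
    }
    for (; first != last; ++first)
        *first = f(*first);                                      // scalar call
}
```

A lambda such as [](auto const& p) { return p * 2.f; } compiles for both argument types, which is what lets one functor cover the vectorized body and the leftover elements.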
std::vector<float, simd::allocator<float>> i(N), o(N);
auto x = simd::reduce( i.data(), i.data()+N
                     , 0.f
                     );
auto y = simd::reduce( i.data(), i.data()+N, 0.f
                     , [](auto&& a, auto&& e) { return a+e*e; }
                     , 0.f, simd::plus
                     );
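The roles of the extra arguments in the second call (an accumulation function, a neutral element and a combining operation) can be illustrated with a scalar model of a 4-lane reduction. reduce_sketch is a hypothetical function whose parameter order simply mirrors the call above; it is a sketch, not the library's algorithm.

```cpp
#include <array>
#include <cstddef>

// 4 independent partial results are accumulated lane by lane, then combined,
// and the leftover elements are folded in one at a time.
template <typename T, typename Accumulate, typename Combine>
T reduce_sketch(const T* first, const T* last, T init,
                Accumulate acc, T neutral, Combine combine)
{
    std::array<T, 4> lane;
    lane.fill(neutral);                       // each lane starts from the neutral element

    for (; last - first >= 4; first += 4)     // main loop: one step per lane
        for (std::size_t i = 0; i < 4; ++i)
            lane[i] = acc(lane[i], first[i]);

    T result = init;
    for (std::size_t i = 0; i < 4; ++i)       // horizontal step: merge the partial results
        result = combine(result, lane[i]);

    for (; first != last; ++first)            // scalar tail
        result = acc(result, *first);
    return result;
}
```

The neutral element is needed because every lane must start from a value that does not change the final result. Note that splitting the sum across lanes reorders floating-point additions, so the result may differ slightly from a strictly sequential loop.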
Under the SIMD Hood
Performance!
Basic Functions
Single precision math functions (cycles/value)
Hardware: Core i7 SandyBridge, AVX
Function          Range          std  Scalar  SIMD
exp               [−10, 10]       46      38     7
log               [−10, −10]      42      37     5
asin              [−1, 1]         40      35    13
cos               [−20π, 20π]     66      47     6
restricted_(cos)  [−π/4, π/4]     32       9   1.3
Julia set generator
Generate a fractal image using the Julia function
Largely compute-bound
Challenge : Workload depends on pixel location
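template <class T> auto julia(T const& a, T const& b)
{
  as_integer_t<T> res{0};
  std::size_t max_iter{0};
  T x{0}, y{0};
  decltype(x * x < 4) mask;    // declared outside the loop so the while condition can see it
  do {
    auto x2 = x * x;
    auto y2 = y * y;
    mask = x2 + y2 < 4;        // lanes that have not escaped yet
    auto xy = 2 * x * y;
    x = x2 - y2 + a;
    y = xy + b;
    res = if_inc(mask, res);   // count one more iteration for active lanes only
  } while(any(mask) && max_iter++ < 256);
  return res;
}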
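The challenge that the workload depends on pixel location can be seen in a scalar model of the same loop over 4 lanes: every lane keeps being computed until the slowest one escapes, and the mask only decides which counters still advance. julia_sketch, lane_count, floats and ints are hypothetical names, and the loop bound is simplified to at most 256 iterations.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t lane_count = 4;
using floats = std::array<float, lane_count>;
using ints   = std::array<int, lane_count>;

// Each lane iterates z <- z^2 + c with c = (a, b). A lane stops counting once
// |z|^2 >= 4, but the loop runs until no lane is active (or 256 iterations).
inline ints julia_sketch(const floats& a, const floats& b)
{
    ints res{};
    floats x{}, y{};
    bool any_active = true;
    for (int iter = 0; any_active && iter < 256; ++iter)
    {
        any_active = false;
        for (std::size_t i = 0; i < lane_count; ++i)
        {
            const float x2 = x[i] * x[i];
            const float y2 = y[i] * y[i];
            const bool mask = x2 + y2 < 4.f;   // is this lane still inside the radius?
            const float xy = 2.f * x[i] * y[i];
            x[i] = x2 - y2 + a[i];
            y[i] = xy + b[i];
            if (mask) ++res[i];                // models if_inc(mask, res)
            any_active = any_active || mask;   // models any(mask)
        }
    }
    return res;
}
```

When neighbouring pixels escape after very different iteration counts, most lanes do wasted work, which is why the SIMD gain on this kernel is below the ideal factor N.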
Timing w/ Boost.SIMD and other solutions
from "An Evaluation of Current SIMD Programming Models for C++", Pohl et al., 2015
Interaction with Boost.Odeint
Coupled/Uncoupled Roessler system
Written by Mario Mulansky
Showcases effects of both cache and SIMD
Uses Boost.Odeint for the ODE system
Uses Boost.SIMD to vectorize the system
Results
Minimal disruption in the code
Overall 3x performance gain
See the whole code at https://github.com/mariomulansky/olsos
template <class S, class D>
void operator()(const S& x_, D& dxdt_, double t) const
{
  auto x = boost::begin(x_);
  auto dxdt = boost::begin(dxdt_);
  const int N = boost::size(x_);
  for (int j = 1; j < N/dim - 1; ++j)
  {
    const int i = j*dim;
    dxdt[i] = -1.0*x[i + 1] - x[i + 2] +
              m_d * (x[i - dim] + x[i + dim] - 2.0 * x[i]);
    dxdt[i + 1] = x[i] + m_a * x[i + 1];
    dxdt[i + 2] = m_b + x[i + 2] * (x[i] - m_c);
  }
}
// Scalar call
using state_type = std::vector<double>;
state_type x(N);
odeint::runge_kutta4<state_type> rk4;
odeint::integrate_const(rk4, roessler, x, 0.0, T, dt);
// Boost.SIMD call
using alloc_t = simd::allocator<double>;
using state_type = std::vector<pack<double>, alloc_t>;
state_type x(N / pack<double>::static_size);
odeint::runge_kutta4<state_type> rk4;
odeint::integrate_const(rk4, roessler, x, 0.0, T, dt);
Conclusion
High level SIMD in C++11/14
Designing a C++ library for low level performance primitives is possible
C++11/14 features play nice with SIMD intrinsics
SIMD-specific idioms map to modern C++ components
Boost.SIMD
To be proposed for review this fall
Find us on https://github.com/numscale/boost.simd
Tests and feedback welcome
This talk would not have been feasible without
The bSIMD team
Lead Developer : Charly Chevalier
Developers : Jean-Thierry Lapresté, Guillaume Quintin
Tests and Doc : Alan Kelly, Kenny Peou
Our supporters
Tim Blenchman, our earliest adopter
Serge Guelton, for integrating Boost.SIMD into pythran
Mario Mulansky, for his work with Boost.SIMD & Boost.Odeint
Sylvain Jubertie, Ian Masliah, for testing Boost.SIMD in clever ways
Thanks for your attention!

More Related Content

What's hot (20)

Modern C++
Modern C++Modern C++
Modern C++
 
A look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
A look into the sanitizer family (ASAN & UBSAN) by Akul PillaiA look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
A look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
 
Computer Programming- Lecture 7
Computer Programming- Lecture 7Computer Programming- Lecture 7
Computer Programming- Lecture 7
 
Computer Programming- Lecture 10
Computer Programming- Lecture 10Computer Programming- Lecture 10
Computer Programming- Lecture 10
 
Computer Programming- Lecture 5
Computer Programming- Lecture 5 Computer Programming- Lecture 5
Computer Programming- Lecture 5
 
Computer Programming- Lecture 9
Computer Programming- Lecture 9Computer Programming- Lecture 9
Computer Programming- Lecture 9
 
Tiramisu をちょっと、味見してみました。
Tiramisu をちょっと、味見してみました。Tiramisu をちょっと、味見してみました。
Tiramisu をちょっと、味見してみました。
 
Lecture 12: Classes and Files
Lecture 12: Classes and FilesLecture 12: Classes and Files
Lecture 12: Classes and Files
 
Getting Started Cpp
Getting Started CppGetting Started Cpp
Getting Started Cpp
 
Verilog 語法教學
Verilog 語法教學 Verilog 語法教學
Verilog 語法教學
 
Quiz 9
Quiz 9Quiz 9
Quiz 9
 
Computer Programming- Lecture 6
Computer Programming- Lecture 6Computer Programming- Lecture 6
Computer Programming- Lecture 6
 
Verilog hdl
Verilog hdlVerilog hdl
Verilog hdl
 
Code Tuning
Code TuningCode Tuning
Code Tuning
 
Computer Programming- Lecture 4
Computer Programming- Lecture 4Computer Programming- Lecture 4
Computer Programming- Lecture 4
 
C++11 & C++14
C++11 & C++14C++11 & C++14
C++11 & C++14
 
EMBEDDED SYSTEMS 4&5
EMBEDDED SYSTEMS 4&5EMBEDDED SYSTEMS 4&5
EMBEDDED SYSTEMS 4&5
 
C++11
C++11C++11
C++11
 
C++11
C++11C++11
C++11
 
C++17 introduction - Meetup @EtixLabs
C++17 introduction - Meetup @EtixLabsC++17 introduction - Meetup @EtixLabs
C++17 introduction - Meetup @EtixLabs
 

Viewers also liked

'Embedding' a meta state machine
'Embedding' a meta state machine'Embedding' a meta state machine
'Embedding' a meta state machineemBO_Conference
 
A possible future of resource constrained software development
A possible future of resource constrained software developmentA possible future of resource constrained software development
A possible future of resource constrained software developmentemBO_Conference
 
standardese - a WIP next-gen Doxygen
standardese - a WIP next-gen Doxygenstandardese - a WIP next-gen Doxygen
standardese - a WIP next-gen DoxygenemBO_Conference
 
Data-driven HAL generation
Data-driven HAL generationData-driven HAL generation
Data-driven HAL generationemBO_Conference
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsemBO_Conference
 
Device-specific Clang Tooling for Embedded Systems
Device-specific Clang Tooling for Embedded SystemsDevice-specific Clang Tooling for Embedded Systems
Device-specific Clang Tooling for Embedded SystemsemBO_Conference
 
Cecti rodrigo lopez - buqueda del clique
Cecti   rodrigo lopez - buqueda del clique Cecti   rodrigo lopez - buqueda del clique
Cecti rodrigo lopez - buqueda del clique Roderik Lowenstein
 
Reto3 - Creación de insignias
Reto3 - Creación de insigniasReto3 - Creación de insignias
Reto3 - Creación de insigniasSabka RJ
 
ESO 2º UD 1 Estadística descriptiva (trabajo cooperativo)
ESO 2º UD 1 Estadística descriptiva (trabajo cooperativo)ESO 2º UD 1 Estadística descriptiva (trabajo cooperativo)
ESO 2º UD 1 Estadística descriptiva (trabajo cooperativo)Jose Gallegos
 
Elements of C++11
Elements of C++11Elements of C++11
Elements of C++11Uilian Ries
 
DuPage County Presentation: Maya Angelou and Speech Writing
DuPage County Presentation: Maya Angelou and Speech WritingDuPage County Presentation: Maya Angelou and Speech Writing
DuPage County Presentation: Maya Angelou and Speech WritingAmy Vujaklija
 
Mock Servers - Fake All the Things!
Mock Servers - Fake All the Things!Mock Servers - Fake All the Things!
Mock Servers - Fake All the Things!Atlassian
 

Viewers also liked (16)

'Embedding' a meta state machine
'Embedding' a meta state machine'Embedding' a meta state machine
'Embedding' a meta state machine
 
A possible future of resource constrained software development
A possible future of resource constrained software developmentA possible future of resource constrained software development
A possible future of resource constrained software development
 
standardese - a WIP next-gen Doxygen
standardese - a WIP next-gen Doxygenstandardese - a WIP next-gen Doxygen
standardese - a WIP next-gen Doxygen
 
Data-driven HAL generation
Data-driven HAL generationData-driven HAL generation
Data-driven HAL generation
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf Tools
 
Device-specific Clang Tooling for Embedded Systems
Device-specific Clang Tooling for Embedded SystemsDevice-specific Clang Tooling for Embedded Systems
Device-specific Clang Tooling for Embedded Systems
 
New pop.pptx.pptx.pptx.pptx
New pop.pptx.pptx.pptx.pptxNew pop.pptx.pptx.pptx.pptx
New pop.pptx.pptx.pptx.pptx
 
Sports in the world
Sports in the world Sports in the world
Sports in the world
 
Cecti rodrigo lopez - buqueda del clique
Cecti   rodrigo lopez - buqueda del clique Cecti   rodrigo lopez - buqueda del clique
Cecti rodrigo lopez - buqueda del clique
 
Tesis de-albahaca-2016-2017
Tesis de-albahaca-2016-2017Tesis de-albahaca-2016-2017
Tesis de-albahaca-2016-2017
 
Reto3 - Creación de insignias
Reto3 - Creación de insigniasReto3 - Creación de insignias
Reto3 - Creación de insignias
 
ESO 2º UD 1 Estadística descriptiva (trabajo cooperativo)
ESO 2º UD 1 Estadística descriptiva (trabajo cooperativo)ESO 2º UD 1 Estadística descriptiva (trabajo cooperativo)
ESO 2º UD 1 Estadística descriptiva (trabajo cooperativo)
 
Elements of C++11
Elements of C++11Elements of C++11
Elements of C++11
 
Scope Stack Allocation
Scope Stack AllocationScope Stack Allocation
Scope Stack Allocation
 
DuPage County Presentation: Maya Angelou and Speech Writing
DuPage County Presentation: Maya Angelou and Speech WritingDuPage County Presentation: Maya Angelou and Speech Writing
DuPage County Presentation: Maya Angelou and Speech Writing
 
Mock Servers - Fake All the Things!
Mock Servers - Fake All the Things!Mock Servers - Fake All the Things!
Mock Servers - Fake All the Things!
 

Similar to Designing C++ portable SIMD support

The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...Positive Hack Days
 
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 Unity Technologies
 
SIMD.pptx
SIMD.pptxSIMD.pptx
SIMD.pptxdk03006
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaFerdinand Jamitzky
 
Lcdf4 chap 03_p2
Lcdf4 chap 03_p2Lcdf4 chap 03_p2
Lcdf4 chap 03_p2ozgur_can
 
Georgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software securityGeorgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software securityDefconRussia
 
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerMarina Kolpakova
 
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra Umbra Software
 
How Triton can help to reverse virtual machine based software protections
How Triton can help to reverse virtual machine based software protectionsHow Triton can help to reverse virtual machine based software protections
How Triton can help to reverse virtual machine based software protectionsJonathan Salwan
 
Efficient JIT to 32-bit Arches
Efficient JIT to 32-bit ArchesEfficient JIT to 32-bit Arches
Efficient JIT to 32-bit ArchesNetronome
 
thu-blake-gdc-2014-final
thu-blake-gdc-2014-finalthu-blake-gdc-2014-final
thu-blake-gdc-2014-finalRobert Taylor
 
1 introduction to dsp processor 20140919
1 introduction to dsp processor 201409191 introduction to dsp processor 20140919
1 introduction to dsp processor 20140919Hans Kuo
 
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)Laurent Leturgez
 
Sudhir tms 320 f 2812
Sudhir tms 320 f 2812 Sudhir tms 320 f 2812
Sudhir tms 320 f 2812 vijaydeepakg
 
Everybody be cool, this is a ROPpery
Everybody be cool, this is a ROPperyEverybody be cool, this is a ROPpery
Everybody be cool, this is a ROPperyVincenzo Iozzo
 
Happy To Use SIMD
Happy To Use SIMDHappy To Use SIMD
Happy To Use SIMDWei-Ta Wang
 
Embedded system Design introduction _ Karakola
Embedded system Design introduction _ KarakolaEmbedded system Design introduction _ Karakola
Embedded system Design introduction _ KarakolaJohanAspro
 
Cryptography and secure systems
Cryptography and secure systemsCryptography and secure systems
Cryptography and secure systemsVsevolod Stakhov
 

Similar to Designing C++ portable SIMD support (20)

Joel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMDJoel Falcou, Boost.SIMD
Joel Falcou, Boost.SIMD
 
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
 
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019 Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
Intrinsics: Low-level engine development with Burst - Unite Copenhagen 2019
 
SIMD.pptx
SIMD.pptxSIMD.pptx
SIMD.pptx
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cuda
 
Lcdf4 chap 03_p2
Lcdf4 chap 03_p2Lcdf4 chap 03_p2
Lcdf4 chap 03_p2
 
Georgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software securityGeorgy Nosenko - An introduction to the use SMT solvers for software security
Georgy Nosenko - An introduction to the use SMT solvers for software security
 
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
 
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
 
How Triton can help to reverse virtual machine based software protections
How Triton can help to reverse virtual machine based software protectionsHow Triton can help to reverse virtual machine based software protections
How Triton can help to reverse virtual machine based software protections
 
Efficient JIT to 32-bit Arches
Efficient JIT to 32-bit ArchesEfficient JIT to 32-bit Arches
Efficient JIT to 32-bit Arches
 
thu-blake-gdc-2014-final
thu-blake-gdc-2014-finalthu-blake-gdc-2014-final
thu-blake-gdc-2014-final
 
1 introduction to dsp processor 20140919
1 introduction to dsp processor 201409191 introduction to dsp processor 20140919
1 introduction to dsp processor 20140919
 
Arduino: Arduino starter kit
Arduino: Arduino starter kitArduino: Arduino starter kit
Arduino: Arduino starter kit
 
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
Ukoug15 SIMD outside and inside Oracle 12c (12.1.0.2)
 
Sudhir tms 320 f 2812
Sudhir tms 320 f 2812 Sudhir tms 320 f 2812
Sudhir tms 320 f 2812
 
Everybody be cool, this is a ROPpery
Everybody be cool, this is a ROPperyEverybody be cool, this is a ROPpery
Everybody be cool, this is a ROPpery
 
Happy To Use SIMD
Happy To Use SIMDHappy To Use SIMD
Happy To Use SIMD
 
Embedded system Design introduction _ Karakola
Embedded system Design introduction _ KarakolaEmbedded system Design introduction _ Karakola
Embedded system Design introduction _ Karakola
 
Cryptography and secure systems
Cryptography and secure systemsCryptography and secure systems
Cryptography and secure systems
 

More from Joel Falcou

HDR Defence - Software Abstractions for Parallel Architectures
HDR Defence - Software Abstractions for Parallel ArchitecturesHDR Defence - Software Abstractions for Parallel Architectures
HDR Defence - Software Abstractions for Parallel ArchitecturesJoel Falcou
 
Lattice Boltzmann sur architecture multicoeurs vectorielle - Une approche de ...
Lattice Boltzmann sur architecture multicoeurs vectorielle - Une approche de ...Lattice Boltzmann sur architecture multicoeurs vectorielle - Une approche de ...
Lattice Boltzmann sur architecture multicoeurs vectorielle - Une approche de ...Joel Falcou
 
Automatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSELAutomatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSELJoel Falcou
 
Software Abstractions for Parallel Hardware
Software Abstractions for Parallel HardwareSoftware Abstractions for Parallel Hardware
Software Abstractions for Parallel HardwareJoel Falcou
 
(Costless) Software Abstractions for Parallel Architectures
(Costless) Software Abstractions for Parallel Architectures(Costless) Software Abstractions for Parallel Architectures
(Costless) Software Abstractions for Parallel ArchitecturesJoel Falcou
 
Designing Architecture-aware Library using Boost.Proto
Designing Architecture-aware Library using Boost.ProtoDesigning Architecture-aware Library using Boost.Proto
Designing Architecture-aware Library using Boost.ProtoJoel Falcou
 
Generative and Meta-Programming - Modern C++ Design for Parallel Computing
Designing C++ portable SIMD support

  • 1. Designing C++ portable SIMD support Joel Falcou NumScale CppCon 2016
  • 2. NumScale in a few words. Our company: French start-up specialized in software performance. We sell C++ libraries to master modern hardware performance. Consulting & training on all things C++ or HPC. NumScale and C++: Member of the ISO C++ French National Body. Enthusiastic user & contributor to OSS projects. Involved in the European C++ community.
  • 3. What in the world is SIMD?
  • 4. Is SIMD that obscure? Let’s have a poll. Who needs performance in their daily job?
  • 5. Is SIMD that obscure? (poll, continued) Who knows about parallel programming?
  • 6. Is SIMD that obscure? (poll, continued) Who knows about SIMD/multimedia extensions?
  • 7. Is SIMD that obscure? (poll, continued) Who uses SIMD extensions like SSE, AVX, VMX or NEON?
  • 8. Is SIMD that obscure? (poll, continued) Who has nightmares because of those?
  • 9. What is SIMD? A French cuisine approach: some workload to process.
  • 10. What is SIMD? A French cuisine approach: a regular CPU.
  • 11. What is SIMD? A French cuisine approach: Single Instruction, Single Data processing.
  • 12. What is SIMD? A French cuisine approach: a SIMD-enabled CPU.
  • 13. What is SIMD? A French cuisine approach: Single Instruction, Multiple Data processing.
  • 14. What is SIMD? For real. (Diagram: instructions, data and results, SISD vs SIMD.) Principles: Wide registers store N > 1 values. Special instructions process those registers. Code uses a blitter-like approach.
  • 15. What is SIMD? For real. (Diagram: instructions, data and results, SISD vs SIMD.) Benefits: Speed-up of N on cache-hot data. Avoid premature scale-out. Maximize FLOPS/Watt.
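The "wide register" principle can be mimicked in portable C++. The following is only a minimal sketch: `kLanes` and `add_arrays_chunked` are hypothetical names, and the inner loop merely stands in for what a single SIMD add instruction would do on a 128-bit register holding 4 floats.

```cpp
#include <cstddef>
#include <vector>

// Assumption: a 128-bit register holds 4 single-precision values.
constexpr std::size_t kLanes = 4;

// Adds b to a, walking the data one "register" (kLanes values) at a time.
// The inner loop stands in for a single SIMD add instruction.
void add_arrays_chunked(std::vector<float>& a, const std::vector<float>& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    const std::size_t vec_end = n - n % kLanes;  // last index covered by full chunks

    for (std::size_t i = 0; i < vec_end; i += kLanes)
        for (std::size_t lane = 0; lane < kLanes; ++lane)
            a[i + lane] += b[i + lane];

    // Scalar epilogue for the elements that do not fill a whole register.
    for (std::size_t i = vec_end; i < n; ++i)
        a[i] += b[i];
}
```

The epilogue loop is the part that real SIMD code also needs whenever the data size is not a multiple of the register width.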
  • 16. 1,001 flavors of SIMD
        Intel x86
          MMX: 64 bits; float, double
          SSE: 128 bits; float
          SSE2: 128 bits; int8, int16, int32, int64, double
          SSE3, SSSE3
          SSE4a (AMD)
          SSE4.1, SSE4.2
          AVX: 256 bits; float, double
          AVX2: 256 bits; int8, int16, int32, int64
          FMA3
          FMA4, XOP (AMD)
          AVX512: 512 bits; float, double, int32, int64
        PowerPC
          VMX: 128 bits; int8, int16, int32, int64, float
          VSX: 128 bits; int8, int16, int32, int64, float, double
          QPX: 512 bits; double
        ARM
          VFP: 64 bits; float, double
          NEON: 64 bits and 128 bits; double, float, int8, int16, int32, int64
  • 17. The Many Ways to Vectorize. Implicit vectorization: Auto-vectorization. Compiler hints. Explicit vectorization: Language extensions. SIMD intrinsics libraries. Vector intrinsics. Inline assembly.
  • 18. The Many Ways to Vectorize. Implicit vectorization: Auto-vectorization. Compiler hints. Explicit vectorization: Language extensions. SIMD intrinsics libraries. Vector intrinsics. Inline assembly (let’s just say no right now).
  • 19. Implicit vectorization. Auto-vectorizer: Compile-time analysis of the loop nest. May use special hints. Only safe transformations are applied.
        // `restrict` is a C99 keyword; most C++ compilers spell it `__restrict`.
        template <typename T>
        void f(T* __restrict a, T* __restrict b, int size)
        {
          // Compiler-specific hint: ignore assumed loop-carried dependencies.
          #pragma ivdep
          for (int i = 0; i < size; ++i)
            a[i] += b[i];
        }
  • 20. OpenMP4. Principle: Flag a loop as "must be vectorized". Support for reductions & SIMD functions. The user is in charge of checking validity.
        template <typename T>
        T f(T* a, T* b, int size)
        {
          T res = 0;
          // Vectorize the loop and combine the per-lane partial sums into res.
          #pragma omp simd reduction(+:res)
          for (int i = 0; i < size; ++i)
          {
            a[i] += b[i];
            res += b[i];
          }
          return res;
        }
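The "SIMD functions" part of the OpenMP 4 slide can be illustrated with `#pragma omp declare simd`, which asks the compiler to also emit a vector variant of a function so it can be called from a vectorized loop. This is a minimal sketch; `scale_and_shift` and `sum_scaled` are made-up names, and without an OpenMP flag (for example `-fopenmp-simd` on GCC or Clang) the pragmas are ignored and the code runs as plain scalar C++.

```cpp
#include <cstddef>

// Request a vector variant of this function in addition to the scalar one.
#pragma omp declare simd
inline float scale_and_shift(float x) { return 2.0f * x + 1.0f; }

// Applies scale_and_shift to every element and returns the sum of the results.
float sum_scaled(const float* in, float* out, std::size_t size) {
    float res = 0.0f;
    // The user asserts that iterations are independent apart from the reduction.
    #pragma omp simd reduction(+:res)
    for (std::size_t i = 0; i < size; ++i) {
        out[i] = scale_and_shift(in[i]);
        res += out[i];
    }
    return res;
}
```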
  • 21. SIMD Intrinsics Library. Principle: Wrap SIMD computations in a library. Support SIMD idioms with algorithms or other abstractions. Improve portability across compilers. Examples: Agner Fog’s x86 library. Vc, NOVA gSIMD, Cyme. Boost.SIMD.
  • 22. Explicit usage of intrinsics
        // NEON
        return vmul_s32(a0, a1);   // 64-bit
        return vmulq_s32(a0, a1);  // 128-bit
  • 23. Explicit usage of intrinsics
        // SSE4.1
        return _mm_mullo_epi32(a0, a1);
  • 24. Explicit usage of intrinsics
        // SSE2
        return _mm_or_si128(
                 _mm_and_si128( _mm_mul_epu32(a0, a1)
                              , _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0) )
               , _mm_slli_si128(
                   _mm_and_si128( _mm_mul_epu32( _mm_srli_si128(a0, 4)
                                               , _mm_srli_si128(a1, 4) )
                                , _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0) )
                 , 4 ) );
  • 25. Explicit usage of intrinsics
        // Altivec
        // reinterpret as u16
        short0 = (__vector unsigned short)a0;
        short1 = (__vector unsigned short)a1;
        // shifting constant
        shift = vec_splat_u32(-16);
        sf = vec_rl(a1, shift);
        // Compute high part of the product
        high = vec_msum(short0, (__vector unsigned short)sf, vec_splat_u32(0));
        // Complete by adding low part of the 16 bits product
        return vec_add(vec_sl(high, shift), vec_mulo(short0, short1));
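To see why the SSE2 version is so much longer than the SSE4.1 one, it helps to model it in scalar code. `_mm_mul_epu32` only multiplies lanes 0 and 2 and produces 64-bit products, so the odd lanes must be shifted into even positions, multiplied separately, masked and merged back. The sketch below is a plain C++ model of that sequence, not the intrinsics themselves; all names (`lanes4`, `mul_even_lanes`, `mullo_sse2_model`) are hypothetical.

```cpp
#include <array>
#include <cstdint>

using lanes4 = std::array<std::uint32_t, 4>;  // models a 128-bit register of 4 x u32
using wide2  = std::array<std::uint64_t, 2>;  // models the same register as 2 x u64

// Models _mm_mul_epu32: multiplies lanes 0 and 2 only, giving 64-bit products.
wide2 mul_even_lanes(const lanes4& a, const lanes4& b) {
    return {{ static_cast<std::uint64_t>(a[0]) * b[0],
              static_cast<std::uint64_t>(a[2]) * b[2] }};
}

// Models _mm_srli_si128(x, 4): shifts the whole register right by one 32-bit lane.
lanes4 shift_right_one_lane(const lanes4& x) {
    return {{ x[1], x[2], x[3], 0u }};
}

// Low 32 bits of each lane-wise product, built the SSE2 way.
lanes4 mullo_sse2_model(const lanes4& a, const lanes4& b) {
    const wide2 even = mul_even_lanes(a, b);
    const wide2 odd  = mul_even_lanes(shift_right_one_lane(a),
                                      shift_right_one_lane(b));
    const std::uint64_t low_mask = 0xffffffffu;  // the (0xffffffff, 0, ...) mask
    // Interleaving even/odd results plays the role of the shift-left plus OR.
    return {{ static_cast<std::uint32_t>(even[0] & low_mask),
              static_cast<std::uint32_t>(odd[0]  & low_mask),
              static_cast<std::uint32_t>(even[1] & low_mask),
              static_cast<std::uint32_t>(odd[1]  & low_mask) }};
}
```

Each lane of the result equals the wrapped 32-bit product `a[i] * b[i]`, which is what `_mm_mullo_epi32` delivers in a single instruction on SSE4.1.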
  • 26. Implicit vs Explicit SIMD. Implicit: Automatic dependency analysis (e.g. reductions). Recognises idioms with data dependencies. Non-inline functions are scalar. Limited support for outer-loop vectorisation. Relies on the compiler’s library of vectorizable patterns. Explicit: No dependency analysis. Recognises idioms without data dependencies. Non-inline functions can be vectorised. Outer loops can be vectorised. May be more portable across compilers.
  • 27. bSIMD and Boost.SIMD† († Boost.SIMD is a candidate for acceptance as a Boost Library)
  • 28. From bSIMD to Boost.SIMD. bSIMD: NumScale closed-source software for SIMD programming. Explicit SIMD library. Supports x86, PPC, ARM architectures. Provides domain-specific algorithms. Boost.SIMD: Open-source sub-part of bSIMD. Supports x86 and Power6. Provides std-like algorithms.
  • 29. The Boost.SIMD register abstraction. pack<T,N>: Usable as a regular value type. Wraps a block of N contiguous elements of type T. pack<T> picks the optimal N for the current hardware. Constraints: T is a fundamental type. logical<T> is used to handle booleans. N must be a power of 2.
  • 30. The Boost.SIMD register abstraction (continued). What if? Let C be the current hardware register size for type T. If N == C, use the native register directly. If N < C, use a scalar array (for now). If N > C, use an aggregate of 2 x pack<T,N/2>.
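The three-way "What if?" rule can be expressed as a compile-time selection. The sketch below is an illustration only and does not reproduce Boost.SIMD's implementation: the 16-byte register size, the tag types and `toy_storage` are all assumptions, and the "native" case is faked with a flat array since no real intrinsic type is used.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>

// Assumption for this sketch: the hardware has 128-bit (16-byte) registers.
constexpr std::size_t kRegisterBytes = 16;

// C in the slide: how many T fit in one native register.
template <class T>
constexpr std::size_t native_cardinal = kRegisterBytes / sizeof(T);

struct native_tag {};     // N == C: one native register
struct scalar_tag {};     // N <  C: plain scalar array
struct aggregate_tag {};  // N >  C: two half-sized packs

template <class T, std::size_t N>
using storage_kind =
    std::conditional_t<(N == native_cardinal<T>), native_tag,
    std::conditional_t<(N <  native_cardinal<T>), scalar_tag, aggregate_tag>>;

// Native and scalar cases: a flat array stands in for the real storage.
template <class T, std::size_t N, class Kind = storage_kind<T, N>>
struct toy_storage {
    std::array<T, N> data;
};

// N > C: recursively split into two halves until a native size is reached.
template <class T, std::size_t N>
struct toy_storage<T, N, aggregate_tag> {
    toy_storage<T, N / 2> lo, hi;
};
```

With these assumptions, `toy_storage<float, 16>` becomes a tree of four native-sized leaves, which is the shape the slide describes for N > C.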
  • 31. Getting data into packs. Constructors: pack<T,N> x(U v): fill x with N copies of v. pack<T,N> x(U v...): fill x with (v0,v1,...). pack<T,N> x(T* ptr): load N elements from aligned memory ptr. pack<T,N> x(It b, It e): load N elements from the range [b, e). Explicit memory load: load<T>(U* ptr [,Offset o]). load<T>(mask_ptr<U> ptr [,Offset o]). aligned_load<T>(U* ptr [,int o]). aligned_load<T>(mask_ptr<U> ptr [,int o]).
  • 32. Getting data into packs (continued). Misaligned loads: aligned_store<T,N>(ptr). Load an unaligned address with static misalignment. Optimized to be faster than an unaligned load.
  • 33. Getting data into packs (continued). Explicit memory store: store<T>(U* ptr [,Offset o]). store<T>(mask_ptr<U> ptr [,Offset o]). aligned_store<T>(U* ptr [,int o]). aligned_store<T>(mask_ptr<U> ptr [,int o]).
  • 34. Supported operations on pack. Basic operators: All operators are available, with possible scalar mixing. No conversion nor promotion. Comparisons: ==, !=, <, <=, >, >= perform SIMD comparisons. compare_equal, compare_less perform reductive comparisons. Other properties: Models RandomAccessRange. p[i] returns a proxy to access the register's internal value.
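The difference between a SIMD comparison (one boolean per lane) and a reductive comparison (a single boolean for the whole pack) is easy to show with a toy type. This is a sketch under stated assumptions: `toy_pack` and `toy_compare_equal` are hypothetical stand-ins, and modelling the reductive form as "all lanes equal" is my reading of the slide rather than a confirmed description of the library.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for pack<T,N>: just N values in an array.
template <class T, std::size_t N>
struct toy_pack { std::array<T, N> v; };

// Lane-wise addition of two packs.
template <class T, std::size_t N>
toy_pack<T, N> operator+(const toy_pack<T, N>& a, const toy_pack<T, N>& b) {
    toy_pack<T, N> r{};
    for (std::size_t i = 0; i < N; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

// Scalar mixing: the scalar must already have type T (no conversion, no promotion).
template <class T, std::size_t N>
toy_pack<T, N> operator+(const toy_pack<T, N>& a, T s) {
    toy_pack<T, N> r{};
    for (std::size_t i = 0; i < N; ++i) r.v[i] = a.v[i] + s;
    return r;
}

// SIMD comparison: yields one boolean per lane.
template <class T, std::size_t N>
toy_pack<bool, N> operator==(const toy_pack<T, N>& a, const toy_pack<T, N>& b) {
    toy_pack<bool, N> r{};
    for (std::size_t i = 0; i < N; ++i) r.v[i] = (a.v[i] == b.v[i]);
    return r;
}

// Reductive comparison (assumed semantics): a single bool, true if all lanes match.
template <class T, std::size_t N>
bool toy_compare_equal(const toy_pack<T, N>& a, const toy_pack<T, N>& b) {
    for (std::size_t i = 0; i < N; ++i)
        if (!(a.v[i] == b.v[i])) return false;
    return true;
}
```

Because the scalar overload deduces `T` from both arguments, `p + 1.f` compiles for a `toy_pack<float, 4>` while `p + 1` does not, which mirrors the "no conversion nor promotion" rule in spirit.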
  • 35. Selection of available functions. Arithmetic: saturated arithmetic, long multiplication, float/int conversion, round, floor, ceil, trunc, sqrt, cbrt, hypot, average, random, min/max, rounded division and remainder. Bitwise: select, andnot, ornot, popcnt, ffs, ror, rol, rshr, rshl, twopower. IEEE: ilogb, frexp, ldexp, next/prev, ulpdist, exponent/mantissa. Predicates: comparison to zero, negated comparisons, is_unord, is_nan, is_invalid, is_odd, is_even, majority.
  • 36. Selection of available functions. Reduction: any, all, none, nbtrue, minimum/maximum, sum, product, dot product. SWAR: group/split, combine/slice, splatted reduction, cumsum, sort, shuffle, interleaving, deinterleaving.
  • 37. {and, Shuffling, Permutation}_{1,0,2}. Principles: Data reordering is the #1 technique in SIMD. Use cases: transpose, AoS/SoA transformations. Turn memory accesses into computations. Support for specific permutations. Support for arbitrary shuffling patterns. Basic permutations: reverse, broadcast. interleave, deinterleave. repeat, slide. runtime lookup.
  • 38. {and, Shuffling, Permutation}_{1,0,2} (continued). Shuffle: Arbitrary permutation of elements. Optimizable if known at compile-time. Available for one or two parameters.
  • 39. {and, Shuffling, Permutation}_{1,0,2} (continued)
        pack<float,4> a{1,2,3,4};
        pack<float,4> b{10,20,30,40};
        // r1 = [1 1 4 4]
        auto r1 = shuffle<0,0,3,3>(a);
        // r2 = [1 0 0 10]
        auto r2 = shuffle<0,-1,-1,4>(a,b);
  • 40. {and, Shuffling, Permutation}_{1,0,2} (continued)
        struct reverse_
        {
          // I is the destination index, C the cardinal of the pack
          template <class I, class C>
          struct apply : std::integral_constant<int, C::value - I::value - 1>
          {};
        };
        // res = [4 3 2 1]
        pack<float> res = shuffle<reverse_>(a);
  • 41. {and, Shuffling, Permutation}_{1,0,2} (continued)
        // Take the first half from the second pack, the rest from the first one
        constexpr int mix_half(int i, int c)
        {
          return i < c/2 ? i+c : i;
        }
        // res = [10 20 3 4]
        pack<float> res = shuffle<pattern<mix_half>>(a,b);
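The index convention used in the shuffle slides (indices 0..N-1 select from the first pack, N..2N-1 from the second, and -1 produces a zero) can be emulated on plain arrays. This is a toy sketch, not the library's implementation: `toy_shuffle` and `pick_lane` are hypothetical names, and a real SIMD version would map the compile-time pattern to a hardware permute instruction.

```cpp
#include <array>
#include <cstddef>

// Picks one output lane: -1 gives zero, [0, N) reads a, [N, 2N) reads b.
template <class T, std::size_t N>
T pick_lane(int idx, const std::array<T, N>& a, const std::array<T, N>& b) {
    if (idx < 0) return T{0};
    const std::size_t i = static_cast<std::size_t>(idx);
    return i < N ? a[i] : b[i - N];
}

// Two-input shuffle with a compile-time index pattern.
template <int... Idx, class T, std::size_t N>
std::array<T, sizeof...(Idx)> toy_shuffle(const std::array<T, N>& a,
                                          const std::array<T, N>& b) {
    std::array<T, sizeof...(Idx)> r{{ pick_lane(Idx, a, b)... }};
    return r;
}

// One-input shuffle: the same pattern applied to a single pack.
template <int... Idx, class T, std::size_t N>
std::array<T, sizeof...(Idx)> toy_shuffle(const std::array<T, N>& a) {
    return toy_shuffle<Idx...>(a, a);
}
```

With `a = {1,2,3,4}` and `b = {10,20,30,40}`, `toy_shuffle<0,0,3,3>(a)` and `toy_shuffle<0,-1,-1,4>(a, b)` reproduce the `r1` and `r2` values given in the slide comments.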
  • 42. Integration with the Standard Library. Algorithms: SIMD transform. SIMD reduce. Use a generic functor/lambda for mixing scalar/SIMD. Allocators. Ranges: boost::simd::input_range, boost::simd::output_range, boost::simd::aligned_input_range, boost::simd::aligned_output_range, boost::simd::segmented_input_range, boost::simd::segmented_output_range.
  • 43. Integration with the Standard Library
        std::vector<float, simd::allocator<float>> v(N);
        simd::transform( v.begin(), v.end()
                       , [](auto const& p) { return p * 2.f; }
                       );
  • 44. Integration with the Standard Library
        std::vector<float, simd::allocator<float>> i(N), o(N);
        auto x = simd::reduce( i.data(), i.data()+N, 0.f );
        auto y = simd::reduce( i.data(), i.data()+N, 0.f
                             , [](auto&& a, auto&& e) { return a+e*e; }
                             , 0.f, simd::plus
                             );
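The point of passing a generic lambda is that the same callable is applied to whole packs in the main loop and to single scalars for the leftover elements. The sketch below shows that mechanism with toy types; `toy_pack` and `toy_transform` are hypothetical names and the in-place, pointer-based signature is an assumption made for brevity, not the library's interface.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for pack<T,N>.
template <class T, std::size_t N>
struct toy_pack { std::array<T, N> v; };

// Lane-wise multiplication by a scalar of the same type.
template <class T, std::size_t N>
toy_pack<T, N> operator*(const toy_pack<T, N>& p, T s) {
    toy_pack<T, N> r{};
    for (std::size_t i = 0; i < N; ++i) r.v[i] = p.v[i] * s;
    return r;
}

// In-place transform over [first, last), assuming first <= last:
// full chunks go through f as packs, the remainder goes through f as scalars.
template <std::size_t N, class T, class F>
void toy_transform(T* first, T* last, F f) {
    while (static_cast<std::size_t>(last - first) >= N) {
        toy_pack<T, N> p{};
        for (std::size_t i = 0; i < N; ++i) p.v[i] = first[i];  // "load"
        p = f(p);                                               // pack call
        for (std::size_t i = 0; i < N; ++i) first[i] = p.v[i];  // "store"
        first += N;
    }
    for (; first != last; ++first) *first = f(*first);          // scalar call
}
```

A call such as `toy_transform<4>(v.data(), v.data() + v.size(), [](auto const& x) { return x * 2.f; })` instantiates the lambda twice, once for `toy_pack<float, 4>` and once for `float`.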
  • 47. Basic Functions. Single precision math functions (cycles/value). Hardware: Core i7 SandyBridge, AVX.
        Function          Range          std   Scalar   SIMD
        exp               [−10, 10]      46    38       7
        log               [−10, −10]     42    37       5
        asin              [−1, 1]        40    35       13
        cos               [−20π, 20π]    66    47       6
        restricted_(cos)  [−π/4, π/4]    32    9        1.3
  • 48. Julia set generator. Generate a fractal image using the Julia function. Largely compute-bound. Challenge: workload depends on pixel location.
  • 49. Julia set generator
        template <class T>
        auto julia(T const& a, T const& b)
        {
          as_integer_t<T> res{0};
          std::size_t max_iter{0};
          T x{0}, y{0};
          do
          {
            auto x2 = x * x;
            auto y2 = y * y;
            // Lanes still inside the radius-2 disc
            auto mask = x2 + y2 < 4;
            auto xy = 2 * x * y;
            x = x2 - y2 + a;
            y = xy + b;
            // Increment the iteration count only for active lanes
            res = if_inc(mask, res);
          } while(any(mask) && max_iter++ < 256);
          return res;
        }
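The challenge named above, a workload that depends on pixel location, shows up as lanes that escape at different iterations while the loop keeps running until every lane is done. The sketch below emulates the same loop on 4 explicit lanes; `julia_lanes`, `toy_any` and `toy_if_inc` are hypothetical scalar stand-ins for the pack operations, written here only to make the masking behaviour visible.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 4;  // assumed pack width
using fpack = std::array<float, kLanes>;
using ipack = std::array<int, kLanes>;
using mpack = std::array<bool, kLanes>;

// True if at least one lane of the mask is set.
bool toy_any(const mpack& m) {
    for (bool b : m) if (b) return true;
    return false;
}

// Increments only the lanes selected by the mask.
ipack toy_if_inc(const mpack& m, ipack r) {
    for (std::size_t l = 0; l < kLanes; ++l) if (m[l]) ++r[l];
    return r;
}

// Same control flow as the julia() slide, one explicit loop over the lanes.
ipack julia_lanes(const fpack& a, const fpack& b) {
    ipack res{};
    fpack x{}, y{};
    mpack mask{};
    int iter = 0;
    do {
        for (std::size_t l = 0; l < kLanes; ++l) {
            const float x2 = x[l] * x[l];
            const float y2 = y[l] * y[l];
            mask[l] = x2 + y2 < 4.0f;   // lane still inside the radius-2 disc
            const float xy = 2.0f * x[l] * y[l];
            x[l] = x2 - y2 + a[l];
            y[l] = xy + b[l];
        }
        res = toy_if_inc(mask, res);    // escaped lanes stop counting
    } while (toy_any(mask) && iter++ < 256);
    return res;
}
```

Lanes that have already escaped keep being computed with their results masked out, so a pack costs as much as its slowest lane.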
  • 50. Julia set generator. Timing with Boost.SIMD and other solutions, from "An Evaluation of Current SIMD Programming Models for C++", Pohl et al., 2015.
  • 51. Interaction with Boost.Odeint. Coupled/uncoupled Roessler system: Written by Mario Mulansky. Showcases the effects of both cache and SIMD. Uses Boost.Odeint for the ODE system. Uses Boost.SIMD to vectorize the system. Results: Minimal disruption in the code. Overall 3x performance gain. See the whole code at https://github.com/mariomulansky/olsos
  • 52. Interaction with Boost.Odeint
        template <class S, class D>
        void operator()(const S& x_, D& dxdt_, double t) const
        {
          auto x    = boost::begin(x_);
          auto dxdt = boost::begin(dxdt_);
          const int N = boost::size(x_);
          for (int j = 1; j < N/dim - 1; ++j)
          {
            const int i = j*dim;
            dxdt[i]     = -1.0*x[i + 1] - x[i + 2]
                        + m_d * (x[i - dim] + x[i + dim] - 2.0 * x[i]);
            dxdt[i + 1] = x[i] + m_a * x[i + 1];
            dxdt[i + 2] = m_b + x[i + 2] * (x[i] - m_c);
          }
        }
  • 53. Interaction with Boost.Odeint
        // Scalar call
        using state_type = std::vector<double>;
        state_type x(N);
        odeint::runge_kutta4<state_type> rk4;
        odeint::integrate_const(rk4, roessler, x, 0.0, T, dt);

        // Boost.SIMD call
        using alloc_t = simd::allocator<double>;
        using state_type = vector<pack<double>, alloc_t>;
        state_type x(N / pack<double>::static_size);
        odeint::runge_kutta4<state_type> rk4;
        odeint::integrate_const(rk4, roessler, x, 0.0, T, dt);
  • 56. Conclusion. High-level SIMD in C++11/14: Designing a C++ library for low-level performance primitives is possible. C++11/14 features play nice with SIMD intrinsics. SIMD-specific idioms map to modern C++ components. Boost.SIMD: To be proposed for review this fall. Find us on https://github.com/numscale/boost.simd. Tests and feedback welcome.
  • 57. This talk would not have been feasible without: The bSIMD team. Lead developer: Charly Chevalier. Developers: Jean-Thierry Lapresté, Guillaume Quintin. Tests and doc: Alan Kelly, Kenny Peou. Our supporters: Tim Blenchman, our earliest adopter. Serge Guelton, for integrating Boost.SIMD into pythran. Mario Mulansky, for his work with Boost.SIMD & Boost.Odeint. Sylvain Jubertie, Ian Masliah, for testing Boost.SIMD in clever ways.
  • 58. Thanks for your attention!