Pragmatic Speedup with Boost.SIMD
Unlocked software performance
Joel Falcou
NumScale
24 February 2016
Challenges of SIMD programming
Parallelism is everywhere
The Obvious One
Multi-cores
Many-cores
Distributed systems
The Embedded One
Pipeline
Superscalar, out-of-order CPUs
SIMD instruction sets
What is SIMD?
[Figure: SISD versus SIMD; instructions are applied to data to produce results]
Principles
Single Instruction, Multiple Data
Each operation is applied to N values held in a single register of fixed size (128, 256 or 512 bits)
Can be up to N times faster than the regular scalar ALU
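The value of N follows directly from the register width and the element size. A minimal sketch, assuming 8-bit bytes; the helper name `lanes` is hypothetical and not part of any library:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit in one register of the given width.
template <class T>
std::size_t lanes(std::size_t register_bits)
{
    const std::size_t element_bits = 8 * sizeof(T);
    // The register width is expected to be a whole number of elements.
    assert(register_bits % element_bits == 0);
    return register_bits / element_bits;
}
```

With a 32-bit float, `lanes<float>(128)` gives 4, while `lanes<std::int8_t>(512)` gives 64, which is why the attainable speedup depends so strongly on the element type.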
Why use SIMD?
Speedups of 2x to 16x, which can be combined with other sources of parallelism
Reduces computing time without changing the infrastructure
Gives good results for all kinds of regular or less regular computing patterns
1001 flavors of SIMD
Intel x86
MMX: 64 bits; float, double
SSE: 128 bits; float
SSE2: 128 bits; int8, int16, int32, int64, double
SSE3, SSSE3
SSE4a (AMD)
SSE4.1, SSE4.2
AVX: 256 bits; float, double
AVX2: 256 bits; int8, int16, int32, int64
FMA3
FMA4, XOP (AMD)
MIC: 512 bits; float, double, int32, int64
PowerPC
AltiVec: 128 bits; int8, int16, int32, int64, float
Cell SPU and VSX: 128 bits; int8, int16, int32, int64, float, double
QPX: 512 bits; double
ARM
VFP: 64 bits; float, double
NEON: 64 bits and 128 bits; float, int8, int16, int32, int64
SIMD the good ol' way: int32 * int32 -> int32
// NEON
return vmul_s32(a0, a1);  // 64-bit
return vmulq_s32(a0, a1); // 128-bit
// SSE4.1
return _mm_mullo_epi32(a0, a1);
// SSE2
return
  _mm_or_si128(
    _mm_and_si128(
      _mm_mul_epu32(a0, a1),
      _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)
    )
  , _mm_slli_si128(
      _mm_and_si128(
        _mm_mul_epu32( _mm_srli_si128(a0, 4)
                     , _mm_srli_si128(a1, 4)
                     )
      , _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)
      )
    , 4
    )
  );
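The SSE2 sequence is easier to follow when emulated lane by lane in portable C++. The sketch below is an illustration only, not the library's code: `u32x4`, `mul_even` and `mullo_sse2_style` are hypothetical names, and `mul_even` stands in for `_mm_mul_epu32`, which multiplies lanes 0 and 2 into two 64-bit products.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using u32x4 = std::array<std::uint32_t, 4>;

// Stand-in for _mm_mul_epu32: full 32x32 -> 64-bit products of lanes 0 and 2.
inline std::array<std::uint64_t, 2> mul_even(u32x4 const& a, u32x4 const& b)
{
    return {{ std::uint64_t(a[0]) * b[0], std::uint64_t(a[2]) * b[2] }};
}

inline u32x4 mullo_sse2_style(u32x4 const& a, u32x4 const& b)
{
    // Products of the even lanes (0 and 2).
    const std::array<std::uint64_t, 2> even = mul_even(a, b);

    // Move lanes 1 and 3 down to positions 0 and 2, as _mm_srli_si128(x, 4) does,
    // then multiply them the same way.
    const u32x4 a_shifted = {{ a[1], a[2], a[3], 0 }};
    const u32x4 b_shifted = {{ b[1], b[2], b[3], 0 }};
    const std::array<std::uint64_t, 2> odd = mul_even(a_shifted, b_shifted);

    // Keep the low 32 bits of each product and interleave even and odd results.
    const std::uint64_t low = 0xffffffffu;
    return {{ std::uint32_t(even[0] & low), std::uint32_t(odd[0] & low)
            , std::uint32_t(even[1] & low), std::uint32_t(odd[1] & low) }};
}
```

Each lane of the result equals the 32-bit wraparound product of the corresponding input lanes, which is what the single SSE4.1 instruction `_mm_mullo_epi32` computes directly.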
// Altivec
// reinterpret as u16
short0 = (__vector unsigned short)a0;
short1 = (__vector unsigned short)a1;
// shifting constant
shift = vec_splat_u32(-16);
sf = vec_rl(a1, shift);
// Compute high part of the product
high = vec_msum( short0, (__vector unsigned short)sf
               , vec_splat_u32(0)
               );
// Complete by adding low part of the 16 bits product
return vec_add( vec_sl(high, shift)
              , vec_mulo(short0, short1)
              );
Isn't it the compiler's job?
Autovectorization
Issues with:
memory constraints
vectorizability of the code must be obvious
What about library functions?
The compiler may be confused or miss a critical clue
Conclusion
Explicit SIMD can guarantee the level of performance
Challenge: keeping multi-architecture SIMD code up to date
Our approach
High level abstraction
Designing a SIMD Domain-Specific Embedded Language (DSEL)
Abstracting SIMD registers as data block
High level optimisation at expression’s scope
Integration within C++
Make SIMD code generic and cross-hardware
Integration with the standard library
Use modern C++ idioms
Boost.SIMD†
† Boost.SIMD is a candidate for inclusion in Boost
SIMD abstraction
pack<T,N>
pack<T, N>: SIMD register of N elements of type T
pack<T>: same, with an optimal N for the current hardware
Behaves as a value of type T but applies operations to all its elements at once.
Constraints
T is a fundamental type
logical<T> is used to handle booleans
N must be a power of 2
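To make the abstraction concrete, here is a minimal sketch of a value type that behaves like a register of N elements. It is a portable emulation for illustration: `toy_pack` is a hypothetical name, it stores its lanes in a `std::array` instead of a hardware register, and it does not reproduce the Boost.SIMD interface.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

template <class T, std::size_t N>
struct toy_pack
{
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of 2");

    std::array<T, N> lanes;

    // Broadcast one scalar to every lane.
    explicit toy_pack(T value) { lanes.fill(value); }

    T  operator[](std::size_t i) const { assert(i < N); return lanes[i]; }
    T& operator[](std::size_t i)       { assert(i < N); return lanes[i]; }
};

// One operator call adds all N lanes; the result stays in type T.
template <class T, std::size_t N>
toy_pack<T, N> operator+(toy_pack<T, N> a, toy_pack<T, N> const& b)
{
    for (std::size_t i = 0; i < N; ++i) a[i] = static_cast<T>(a[i] + b[i]);
    return a;
}
```

User code then reads like scalar code: `toy_pack<float, 4> c = a + b;` adds four floats with a single expression. A real implementation would map the loop onto one hardware instruction.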
Operations on pack
Basic Operators
All language operators are available: pack<T> ⊕ pack<T>, pack<T> ⊕ T, T ⊕ pack<T>
No conversion or promotion, though:
uint8_t(255) + uint8_t(1) = uint8_t(0)
Comparisons
==, !=, <, <=, > and >= perform SIMD comparisons.
compare_equal, compare_less return a single boolean.
Other properties
Models RandomAccessFusionSequence and RandomAccessRange
p[i] returns a proxy to access the register's internal value
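The two behaviors above (no integral promotion, and lane-wise versus whole-register comparison) can be sketched with plain arrays. This is an emulation under stated assumptions: `u8x16`, `add`, `equal_lanes` and `all_equal` are hypothetical names chosen for the example, not Boost.SIMD functions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using u8x16  = std::array<std::uint8_t, 16>;
using mask16 = std::array<bool, 16>;

// Lane-wise addition that stays in uint8_t, so 255 + 1 wraps to 0 in every lane.
// (Scalar C++ would promote both operands to int and produce 256.)
inline u8x16 add(u8x16 a, u8x16 const& b)
{
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = static_cast<std::uint8_t>(a[i] + b[i]);
    return a;
}

// Lane-wise ==: one boolean per lane, like a SIMD comparison.
inline mask16 equal_lanes(u8x16 const& a, u8x16 const& b)
{
    mask16 m{};
    for (std::size_t i = 0; i < a.size(); ++i) m[i] = (a[i] == b[i]);
    return m;
}

// Whole-register comparison: a single boolean, in the spirit of compare_equal.
inline bool all_equal(u8x16 const& a, u8x16 const& b)
{
    return a == b;
}
```

The distinction matters in control flow: a lane-wise mask feeds selection operations, whereas a single boolean can drive an ordinary `if`.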
Memory access
Loading and Storing
aligned_load / aligned_store and load / store
Support for statically known misalignment
Support for conditional and sparse access (depending on the hardware)
Examples
aligned_load< pack<T, N> >(p, i) loads a pack from the aligned address p + i.
aligned_load< pack<float> >(0x10, 0)
Main memory: ... 0D 0E 0F 10 11 12 13 14 15 16 17 18 ...
Loaded pack: [10 11 12 13]
aligned_load< pack<T, N>, Offset >(p, i) loads a pack from the address p + i, where p + i - Offset is aligned.
aligned_load< pack<float>, 2 >(0x10, 2)
Main memory: ... 0D 0E 0F 10 11 12 13 14 15 16 17 18 ...
Loaded pack: [12 13 14 15]
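The contract of a load with statically known misalignment can be sketched as follows. This is a portable emulation, assuming the reading given by the diagrams (data is read from `p + i`, and `p + i - Offset` must be aligned); `toy_aligned_load` is a hypothetical name and the copy loop stands in for the hardware load.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

template <std::size_t N, std::ptrdiff_t Offset = 0>
std::array<float, N> toy_aligned_load(float const* p, std::ptrdiff_t i)
{
    float const* src = p + i;

    // Precondition: stepping back by the known misalignment gives an address
    // aligned on the size of the whole register.
    assert(reinterpret_cast<std::uintptr_t>(src - Offset) % (N * sizeof(float)) == 0);

    std::array<float, N> out;
    for (std::size_t k = 0; k < N; ++k) out[k] = src[k];
    return out;
}
```

Because `Offset` is a compile-time constant, a real implementation can pick the cheapest instruction sequence for that particular misalignment instead of falling back to a generic unaligned load.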
Shuffle and Swizzle
General Principles
Elements of a SIMD register can be permuted by the hardware
Turns complex memory accesses into computations
Provided by the shuffle function
Examples:
// a = [ 1 2 3 4 ]
pack<float> a = enumerate< pack<float> >(1);
// b = [ 10 11 12 13 ]
pack<float> b = enumerate< pack<float> >(10);
// res = [ 4 12 0 10 ]
pack<float> res = shuffle<3,6,-1,4>(a,b);
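The index convention used in this example can be emulated in a few lines: indices 0 to 3 select from the first operand, 4 to 7 from the second, and -1 produces a zero. The sketch assumes four-element registers; `f32x4`, `pick` and `toy_shuffle` are hypothetical names.

```cpp
#include <array>
#include <cassert>

using f32x4 = std::array<float, 4>;

// Select one output lane from the compile-time index I.
template <int I>
float pick(f32x4 const& a, f32x4 const& b)
{
    static_assert(I >= -1 && I < 8, "index must be -1 or in [0, 8)");
    // I & 3 keeps the array index in range for every branch.
    return I < 0 ? 0.0f : (I < 4 ? a[I & 3] : b[I & 3]);
}

template <int I0, int I1, int I2, int I3>
f32x4 toy_shuffle(f32x4 const& a, f32x4 const& b)
{
    return {{ pick<I0>(a, b), pick<I1>(a, b), pick<I2>(a, b), pick<I3>(a, b) }};
}
```

Since the pattern is known at compile time, an implementation is free to map each pattern to the best permutation instruction available on the target.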
struct reverse_
{
  template <class I, class C>
  struct apply : mpl::int_<C::value - I::value - 1> {};
};
// res = [n n-1 ... 2 1]
pack<float> res = shuffle<reverse_>(a);
Integration with the STL
Algorithms :
SIMD transform
SIMD fold
Use a polymorphic functor or lambda to mix scalar and SIMD code
Iterators :
Provide SIMD-aware traversals
boost::simd::(aligned_)(input/output_)iterator
boost::simd::direct_output_iterator
boost::simd::shifted_iterator
std::vector<float, simd::allocator<float> > v(N);
simd::transform( v.begin(), v.end()
               , [](auto const& p)
                 {
                   return p * 2.f;
                 }
               );
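Why the functor has to be polymorphic becomes clear once the traversal is written out: the bulk of the range is processed one register at a time, and the leftover elements are processed as scalars with the same callable. The sketch below is a portable illustration of that idea, not the library's algorithm; `toy_chunk`, `times_two` and `toy_transform` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a register of N floats.
template <std::size_t N>
struct toy_chunk { std::array<float, N> lanes; };

template <std::size_t N>
toy_chunk<N> operator*(toy_chunk<N> c, float s)
{
    for (float& x : c.lanes) x *= s;
    return c;
}

// Polymorphic functor: the same body works for float and for toy_chunk<N>.
struct times_two
{
    template <class X>
    X operator()(X const& x) const { return x * 2.f; }
};

template <std::size_t N, class F>
void toy_transform(std::vector<float>& data, F f)
{
    std::size_t i = 0;
    // Main body: N elements per call.
    for (; i + N <= data.size(); i += N) {
        toy_chunk<N> c;
        for (std::size_t k = 0; k < N; ++k) c.lanes[k] = data[i + k];
        c = f(c);
        for (std::size_t k = 0; k < N; ++k) data[i + k] = c.lanes[k];
    }
    // Tail: remaining elements, one scalar at a time.
    for (; i < data.size(); ++i) data[i] = f(data[i]);
}
```

With ten elements and `N = 4`, two chunks go through the register path and the last two elements go through the scalar path, all with the single `times_two` functor.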
// Averages the three values presented by the iterator
struct average
{
  template <class T>
  typename T::value_type operator()(T const& t) const
  {
    typename T::value_type d(1./3);
    return (t[0] + t[1] + t[2]) * d;
  }
};

std::vector<float, simd::allocator<float> > in(N), o(N);
std::transform( simd::shifted_iterator<3>(in.begin())
              , simd::shifted_iterator<3>(in.end())
              , simd::aligned_output_begin(o.begin())
              , average()
              );
Hardware Optimisations
Problem:
Most SIMD hardware supports fused operations like fma.
Those optimisations must remain transparent to the user
We use Expression Templates so that Boost.SIMD optimizes those patterns automatically.
Examples:
a * b + c becomes fma(a, b, c)
a + b * c becomes fma(b, c, a)
!(a < b) becomes is_nle(a, b)
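The rewriting of `a * b + c` into a fused operation can be illustrated with a very small expression template on scalars. This is a sketch of the general technique under simplifying assumptions, not Boost.SIMD's implementation: `value` and `product` are hypothetical types, and `std::fma` stands in for the hardware instruction.

```cpp
#include <cassert>
#include <cmath>

struct value { float x; };

// A multiplication that has not been evaluated yet.
struct product
{
    float a, b;
    // Evaluate as a plain product when no addition follows.
    operator value() const { return value{a * b}; }
};

// a * b only records its operands.
inline product operator*(value l, value r) { return product{l.x, r.x}; }

// product + value and value + product are recognized and fused.
inline value operator+(product p, value c) { return value{std::fma(p.a, p.b, c.x)}; }
inline value operator+(value c, product p) { return value{std::fma(p.a, p.b, c.x)}; }
inline value operator+(product p, product q) { return value{std::fma(p.a, p.b, q.a * q.b)}; }

// Ordinary addition when there is nothing to fuse.
inline value operator+(value l, value r) { return value{l.x + r.x}; }
```

The user still writes `a * b + c`; overload resolution selects the fused form because the multiplication returns a `product` rather than a `value`.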
Supported Hardware
Open Source
Intel SSE2-4, AVX
PowerPC VMX
Proprietary
ARM Neon
Intel AVX2, XOP, FMA3, FMA4
Intel MIC
Other functions ...
Arithmetic
saturated arithmetic
long multiplication
float/int conversion
round, floor, ceil, trunc
sqrt, cbrt
hypot
average
random
min/max
rounded division and remainder
Bitwise
select
andnot, ornot
popcnt
ffs
ror, rol
rshr, rshl
twopower
IEEE
ilogb, frexp
ldexp
next/prev
ulpdist
Predicates
comparison to zero
negated comparisons
is_unord, is_nan,
is_invalid
is_odd, is_even
majority
Reduction and SWAR operations
Reduction
any, all
nbtrue
minimum/maximum,
posmin/posmax
sum
product, dot product
SWAR
group/split
splatted reduction
cumsum
sort
Performance!
Basic Functions
Single-precision trigonometric functions
Hardware: Core i7 Sandy Bridge, AVX; timings in cycles per value
Function   Range          std   Scalar   SIMD
exp        [−10, 10]      46    38       7
log        [−10, −10]     42    37       5
asin       [−1, 1]        40    35       13
cos        [−20π, 20π]    66    47       6
fast_cos   [−π/4, π/4]    32    9        1.3
Julia set generator
Generate a fractal image using the Julia function
Purely compute-bound
Challenge : Workload depends on pixel location
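template <class T> typename meta::as_integer<T>::type
julia(T const& a, T const& b)
{
  // per-lane iteration counters, starting at zero
  typename meta::as_integer<T>::type iter(0);
  std::size_t i = 0;
  // starting point z0 = 0 (assumed; the slide leaves x and y uninitialized)
  T x(0), y(0);
  bool active;
  do {
    T x2 = x * x;
    T y2 = y * y;
    T xy = T(2) * x * y;
    x = x2 - y2 + a;
    y = xy + b;
    // lanes that are still inside the radius-2 disc
    auto mask = x2 + y2 < T(4);
    iter = selinc(mask, iter);
    active = any(mask);
  } while(active && i++ < 256);
  return iter;
}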
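The workload imbalance mentioned earlier shows up clearly in a lane-by-lane emulation of this kernel: the loop only stops once every lane has escaped, and each lane's counter is incremented only while that lane is still inside. The sketch assumes a zero starting point and a fixed iteration cap; `toy_julia` is a hypothetical name, and the inner loop plays the role of one SIMD instruction sequence.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

template <std::size_t N>
std::array<int, N> toy_julia( std::array<float, N> const& a
                            , std::array<float, N> const& b
                            , int max_iter )
{
    assert(max_iter >= 0);
    std::array<float, N> x{}, y{};   // assumed starting point z0 = 0
    std::array<int, N> iter{};       // per-lane counters

    for (int i = 0; i < max_iter; ++i) {
        bool any_active = false;     // plays the role of any(mask)
        for (std::size_t k = 0; k < N; ++k) {
            const float x2 = x[k] * x[k];
            const float y2 = y[k] * y[k];
            const float xy = 2.f * x[k] * y[k];
            x[k] = x2 - y2 + a[k];
            y[k] = xy + b[k];
            // Conditional increment, like selinc(mask, iter).
            if (x2 + y2 < 4.f) { ++iter[k]; any_active = true; }
        }
        if (!any_active) break;      // stop only when every lane has escaped
    }
    return iter;
}
```

A register holding one slow pixel and several fast ones therefore costs as much as the slow pixel alone, which limits the speedup on fractal boundaries.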
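Julia set generator
[Figure: bar chart of cycles per element (cpe, axis from 0 to 800) against image size (256 x 256, 512 x 512, 1024 x 1024, 2048 x 2048) for scalar SSE2, simd SSE2, simd AVX and simd AVX2. Speedup labels, in extraction order: x2.93, x2.99, x3.02, x3.03; x6.64, x6.94, x6.09, x6.16; x6.52, x6.81, x5.97, x6.05]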
Motion Detection
Based on Manzanera’s Sigma Delta algorithm
Background subtraction
Pixel intensity modeled as Gaussian distributions
Challenge : Low arithmetic intensity
template <typename T>
T sigma_delta(T& bkg, T const& frm, T& var)
{
  // move the background one step toward the current frame
  bkg = selinc(bkg < frm, seldec(bkg > frm, bkg));
  T dif = dist(bkg, frm);
  T mul = muls(dif, 3);
  // move the variance one step toward 3 * dif where the pixel changed
  var = if_else( dif != T(0)
               , selinc(var < mul, seldec(var > mul, var))
               , var
               );
  return if_zero_else_one( dif < var );
}
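For reference, the same update written for a single 8-bit pixel makes each SIMD primitive explicit. This is a scalar sketch under assumptions: pixels are taken to be `std::uint8_t`, `toward` and `sigma_delta_pixel` are hypothetical names, and the saturated multiply is emulated with `std::min`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// One step of +1 or -1 toward the target: selinc(v < t, seldec(v > t, v)).
inline std::uint8_t toward(std::uint8_t value, std::uint8_t target)
{
    if (value < target) return static_cast<std::uint8_t>(value + 1);
    if (value > target) return static_cast<std::uint8_t>(value - 1);
    return value;
}

inline std::uint8_t sigma_delta_pixel(std::uint8_t& bkg, std::uint8_t frm, std::uint8_t& var)
{
    bkg = toward(bkg, frm);

    // dist: absolute difference between background and frame.
    const int dif = bkg > frm ? bkg - frm : frm - bkg;

    // muls: multiplication saturated to the 8-bit range.
    const std::uint8_t mul = static_cast<std::uint8_t>(std::min(3 * dif, 255));

    // The variance only moves where the pixel differs from the background.
    if (dif != 0) var = toward(var, mul);

    // if_zero_else_one: 0 where dif < var (background), 1 otherwise (motion).
    return dif < var ? 0 : 1;
}
```

Every step is a comparison, an increment or a selection, with almost no arithmetic per byte loaded, which is the low arithmetic intensity noted as the challenge for this benchmark.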
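Motion Detection
[Figure: bar chart of frames per second (FPS, axis from 0 to 3·10^4) against frame size (480 x 640 @5, 600 x 800 @5, 1024 x 1280 @5, 1080 x 1920 @5, 2160 x 3840 @5) for scalar SSE2, simd SSE2, simd AVX and simd AVX2. Speedup labels, in extraction order: x3.80, x6.63, x3.64, x5.63, x3.50, x6.69, x6.53, x5.75, x5.74, x5.71, x26.26, x26.96, x19.05, x17.19, x10.81]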
Sparse Tridiagonal Solver
Solve Ax = b with sparse A and multiple x
Application in fluid mechanics
Challenge: using SIMD despite being sparse
Solution: shuffle for local densification
Conclusion
Boost.SIMD
We can design a C++ library for low level performance primitives
A lot of success with a lot of different applications
Find us on https://github.com/jfalcou/nt2
Tests, comments and feedback welcome
Soon
Rewrite before submission to Boost is in progress
More architectures to come!
Thanks for your attention!

Александр Гранин, Функциональная 'Жизнь': параллельные клеточные автоматы и к...
 
Александр Фокин, Рефлексия в C++
Александр Фокин, Рефлексия в C++Александр Фокин, Рефлексия в C++
Александр Фокин, Рефлексия в C++
 

Pragmatic Speedup with Boost.SIMD Unlocked software performance

  • 1. Pragmatic Speedup with Boost.SIMD. Unlocked software performance. Joel Falcou, NumScale, 24 February 2016
  • 3. Parallelism is everywhere. The obvious kind: multi-cores, many-cores, distributed systems. The embedded kind: pipelines, superscalar out-of-order CPUs, SIMD instruction sets.
  • 5. What is SIMD? (Figure: instructions, data, and results under SISD versus SIMD.) Principles: Single Instruction, Multiple Data. Each operation is applied to N values held in a single register of fixed size (128, 256, or 512 bits). Can be up to N times faster than the regular ALU.
  • 6. Why use SIMD? Speedups of x2 to x16 that can be combined with other sources of parallelism. Reduces computing time without changing the infrastructure. Gives great results for all kinds of regular or less regular computing patterns.
  • 7. 1001 flavors of SIMD. Intel x86: MMX (64 bits: float, double); SSE (128 bits: float); SSE2 (128 bits: int8, int16, int32, int64, double); SSE3, SSSE3; SSE4a (AMD); SSE4.1, SSE4.2; AVX (256 bits: float, double); AVX2 (256 bits: int8, int16, int32, int64); FMA3; FMA4, XOP (AMD); MIC (512 bits: float, double, int32, int64). PowerPC: AltiVec (128 bits: int8, int16, int32, int64, float); Cell SPU and VSX (128 bits: int8, int16, int32, int64, float, double); QPX (512 bits: double). ARM: VFP (64 bits: float, double); NEON (64 bits and 128 bits: float, int8, int16, int32, int64).
  • 8. SIMD the good ol' way, int32 * int32 -> int32. NEON: return vmul_s32(a0, a1); for 64-bit registers, return vmulq_s32(a0, a1); for 128-bit registers.
  • 9. SIMD the good ol' way, int32 * int32 -> int32. SSE4.1: return _mm_mullo_epi32(a0, a1);
  • 10. SIMD the good ol' way, int32 * int32 -> int32. SSE2: return _mm_or_si128(_mm_and_si128(_mm_mul_epu32(a0, a1), _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)), _mm_slli_si128(_mm_and_si128(_mm_mul_epu32(_mm_srli_si128(a0, 4), _mm_srli_si128(a1, 4)), _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)), 4));
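The SSE2 sequence on slide 10 is hard to read in flattened form, so here is a portable scalar sketch of what it computes. It relies on the documented behaviour of _mm_mul_epu32 (only lanes 0 and 2 are multiplied, each yielding a 64-bit product) and of the byte shifts. The names lanes4, mul_even_u32, shift_down_one_lane and mullo_model are hypothetical and exist only for this illustration; no intrinsics are used.

```cpp
#include <array>
#include <cstdint>

// Hypothetical scalar model of a 128-bit register holding four 32-bit lanes.
using lanes4 = std::array<std::uint32_t, 4>;

// Models _mm_mul_epu32: only lanes 0 and 2 are multiplied,
// each producing a full 64-bit product.
inline std::array<std::uint64_t, 2> mul_even_u32(lanes4 const& a, lanes4 const& b)
{
    return {{ std::uint64_t(a[0]) * b[0], std::uint64_t(a[2]) * b[2] }};
}

// Models _mm_srli_si128(x, 4): every lane moves down by one position,
// so the odd lanes land where mul_even_u32 can reach them.
inline lanes4 shift_down_one_lane(lanes4 const& x)
{
    return {{ x[1], x[2], x[3], 0u }};
}

// Models the whole sequence: multiply even lanes, multiply odd lanes,
// keep the low 32 bits of each product (the and-mask on the slide),
// then interleave the results (the shift-left and the or on the slide).
inline lanes4 mullo_model(lanes4 const& a, lanes4 const& b)
{
    auto even = mul_even_u32(a, b);
    auto odd  = mul_even_u32(shift_down_one_lane(a), shift_down_one_lane(b));
    return {{ static_cast<std::uint32_t>(even[0]), static_cast<std::uint32_t>(odd[0]),
              static_cast<std::uint32_t>(even[1]), static_cast<std::uint32_t>(odd[1]) }};
}
```

The low 32 bits of an unsigned product equal those of the signed product, which is why an unsigned multiply is enough to implement int32 * int32 -> int32.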
  • 11. SIMD the good ol' way, int32 * int32 -> int32. AltiVec: /* reinterpret as u16 */ short0 = (__vector unsigned short)a0; short1 = (__vector unsigned short)a1; /* shifting constant */ shift_ = vec_splat_u32(-16); sf = vec_rl(a1, shift_); /* compute the high part of the product */ high = vec_msum(short0, (__vector unsigned short)sf, vec_splat_u32(0)); /* complete by adding the low part of the 16-bit product */ return vec_add(vec_sl(high, shift_), vec_mulo(short0, short1));
  • 12. Isn't it the compiler's job? Autovectorization has issues: memory constraints; the vectorizability of the code must be obvious; what about library functions? The compiler may be confused or miss a critical clue. Conclusion: explicit SIMD can guarantee the level of performance. Challenge: keeping multi-architecture SIMD code up to date.
  • 13. Our approach. High-level abstraction: designing a SIMD Domain-Specific Embedded Language (DSEL); abstracting SIMD registers as data blocks; high-level optimisation at expression scope. Integration within C++: make SIMD code generic and cross-hardware; integrate with the standard library; use modern C++ idioms.
  • 14. Boost.SIMD † Boost.SIMD is a candidate for inclusion in the Boost
  • 15. SIMD abstraction: pack<T,N>. pack<T, N> is a SIMD register of N elements of type T; pack<T> is the same with an optimal N for the current hardware. It behaves as a value of type T but applies each operation to all its elements at once. Constraints: T is a fundamental type; logical<T> is used to handle booleans; N must be a power of 2.
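To make the "behaves as a value of type T" idea on slide 15 concrete, here is a minimal toy model. toy_pack is a hypothetical name and this is not the Boost.SIMD implementation: it stores lanes in a std::array and loops over them, where the real library would map to a hardware register.

```cpp
#include <array>
#include <cstddef>

// Toy stand-in for pack<T, N> (hypothetical, for illustration only).
template <class T, std::size_t N>
struct toy_pack
{
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of 2");

    std::array<T, N> lane;

    // pack + pack: the operation is applied to every lane at once.
    friend toy_pack operator+(toy_pack const& a, toy_pack const& b)
    {
        toy_pack r{};
        for (std::size_t i = 0; i < N; ++i)
            r.lane[i] = static_cast<T>(a.lane[i] + b.lane[i]);
        return r;
    }

    // pack * scalar: the scalar is combined with every lane.
    friend toy_pack operator*(toy_pack const& a, T s)
    {
        toy_pack r{};
        for (std::size_t i = 0; i < N; ++i)
            r.lane[i] = static_cast<T>(a.lane[i] * s);
        return r;
    }
};
```

With this sketch, (a + b) * 2.f reads like scalar code while touching all N lanes, which is the property the slide describes.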
  • 16. Operations on pack. Basic operators: all language operators are available: pack<T> ⊕ pack<T>, pack<T> ⊕ T, T ⊕ pack<T>. No conversion or promotion though: uint8_t(255) + uint8_t(1) = uint8_t(0). Comparisons: ==, !=, <, <=, > and >= perform SIMD comparisons; compare_equal and compare_less return a single boolean. Other properties: models RandomAccessFusionSequence and RandomAccessRange; p[i] returns a proxy to access the register's internal value.
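Two points on slide 16 are easy to miss: lane arithmetic stays in the lane type (so 255 + 1 wraps to 0 for uint8_t), and operator== yields one result per lane while compare_equal yields a single boolean. The scalar sketch below models both. The names u8x4, mask4, add_lanes, lanewise_equal and all_equal are hypothetical and are not Boost.SIMD identifiers.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using u8x4  = std::array<std::uint8_t, 4>;  // hypothetical 4-lane uint8_t pack
using mask4 = std::array<bool, 4>;          // hypothetical per-lane boolean result

// Lane-wise addition without promotion: plain C++ promotes to int,
// so the cast back to uint8_t is what reproduces the wrap-around.
inline u8x4 add_lanes(u8x4 const& a, u8x4 const& b)
{
    u8x4 r{};
    for (std::size_t i = 0; i < r.size(); ++i)
        r[i] = static_cast<std::uint8_t>(a[i] + b[i]);
    return r;
}

// Models the SIMD operator==: one boolean per lane.
inline mask4 lanewise_equal(u8x4 const& a, u8x4 const& b)
{
    mask4 m{};
    for (std::size_t i = 0; i < m.size(); ++i)
        m[i] = (a[i] == b[i]);
    return m;
}

// Models a compare_equal-style reduction: one boolean for the whole pack.
inline bool all_equal(u8x4 const& a, u8x4 const& b)
{
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i]) return false;
    return true;
}
```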
  • 17. Memory access. Loading and storing: (aligned_load/store) and (load/store). Support for statically known misalignment. Support for conditional and sparse access (depending on the hardware).
  • 18. Memory access (continued). Example: aligned_load< pack<T, N> >(p, i) loads a pack from the aligned address p + i. (Figure: main memory cells labelled 0D to 18; aligned_load<pack<float>>(0x10, 0) returns cells 10 11 12 13.)
  • 19. Memory access (continued). Example: aligned_load< pack<T, N>, Offset >(p, i) loads a pack from p + i, where p + i - Offset is an aligned address. (Figure: main memory cells labelled 0D to 18; aligned_load<pack<float>, 2>(0x10, 2) returns cells 12 13 14 15.)
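The two figures on slides 18 and 19 can be reproduced with a small scalar model. This is only a sketch of the addressing rule as read from the slides (the data comes from p + i, and p + i - Offset is the address that must be aligned); load_model and is_aligned_for are hypothetical names, and a real implementation would use aligned vector loads and shuffles instead of a copy loop.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Checks that an address is aligned for a register of N floats.
template <std::size_t N>
bool is_aligned_for(float const* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % (N * sizeof(float)) == 0;
}

// Scalar model of aligned_load< pack<float, N>, Offset >(p, i).
// Assumed precondition, taken from the slide: p + i - Offset is aligned.
template <std::size_t N, std::ptrdiff_t Offset = 0>
std::array<float, N> load_model(float const* p, std::ptrdiff_t i)
{
    float const* aligned_base = p + i - Offset;  // the address hardware can load from directly
    std::array<float, N> r{};
    for (std::size_t k = 0; k < N; ++k)
        r[k] = aligned_base[Offset + static_cast<std::ptrdiff_t>(k)];  // i.e. p[i + k]
    return r;
}
```

Knowing Offset at compile time is what lets a library pick a fixed pair of aligned loads plus a fixed permutation, rather than falling back to a slower unaligned load.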
  • 20. Shuffle and swizzle. General principles: elements of a SIMD register can be permuted by the hardware; this turns complex memory accesses into computations; provided by the shuffle function. Example: /* a = [ 1 2 3 4 ] */ pack<float> a = enumerate< pack<float> >(1); /* b = [ 10 11 12 13 ] */ pack<float> b = enumerate< pack<float> >(10); /* res = [ 4 12 0 10 ] */ pack<float> res = shuffle<3, 6, -1, 4>(a, b);
  • 21. Shuffle and swizzle (continued). Example with a metafunction: struct reverse_ { template <class I, class C> struct apply : mpl::int_<C::value - I::value - 1> {}; }; /* res = [ n n-1 ... 2 1 ] */ pack<float> res = shuffle<reverse_>(a);
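The index convention used by shuffle<3, 6, -1, 4>(a, b) on slide 20 can be inferred from its result [4 12 0 10]: indices 0..N-1 pick from a, indices N..2N-1 pick from b, and -1 produces zero. The sketch below implements that reading on std::array; shuffle_model is a hypothetical name, and a real implementation would select a hardware permute instruction at compile time.

```cpp
#include <array>
#include <cstddef>

// Scalar model of a two-input shuffle with compile-time indices.
// Index convention (inferred from the slide's example):
//   0 .. N-1   selects a[index]
//   N .. 2N-1  selects b[index - N]
//   -1         yields zero
template <int... Idx, class T, std::size_t N>
std::array<T, sizeof...(Idx)> shuffle_model(std::array<T, N> const& a,
                                            std::array<T, N> const& b)
{
    auto pick = [&](int i) -> T {
        if (i < 0) return T(0);
        std::size_t u = static_cast<std::size_t>(i);
        return u < N ? a[u] : b[u - N];
    };
    return {{ pick(Idx)... }};
}
```

Because the indices are template arguments, the permutation is fixed at compile time, which matches the slide's point that a shuffle turns an irregular memory access pattern into a register computation.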
  • 22. Integration with the STL. Algorithms: SIMD transform, SIMD fold. Use a polymorphic functor or lambda to mix scalar and SIMD code.
  • 23. Integration with the STL (continued). Iterators provide SIMD-aware traversals: boost::simd::(aligned_)(input/output_)iterator, boost::simd::direct_output_iterator, boost::simd::shifted_iterator.
  • 24. Integration with the STL (continued). std::vector<float, simd::allocator<float> > v(N); simd::transform(v.begin(), v.end(), [](auto const& p) { return p * 2.f; });
  • 25. Integration with the STL (continued). struct average { template <class T> typename T::value_type operator()(T const& t) const { typename T::value_type d(1./3); return (t[0] + t[1] + t[2]) * d; } }; std::vector<float, simd::allocator<float> > in(N), o(N); std::transform(simd::shifted_iterator<3>(in.begin()), simd::shifted_iterator<3>(in.end()), simd::aligned_output_begin(o.begin()), average());
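Slides 22 to 24 ask for a polymorphic functor or lambda so that one callable serves both scalar and SIMD arguments. One plausible reason, and this is an assumption rather than something the slides state, is that a SIMD transform handles whole packs with the pack overload and the leftover elements with the scalar overload. The sketch below models that split; chunk and transform_model are hypothetical names, and the inner loops stand in for real vector loads and stores. It needs C++14 for the generic lambda.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a pack of N floats.
template <std::size_t N>
struct chunk { std::array<float, N> lane; };

// Lane-wise chunk * scalar, so that "p * 2.f" is valid for chunks too.
template <std::size_t N>
chunk<N> operator*(chunk<N> c, float s)
{
    for (auto& v : c.lane) v *= s;
    return c;
}

// Model of an in-place SIMD transform: whole chunks first, then a scalar tail.
// F must be callable with both chunk<N> and float (a polymorphic callable).
template <std::size_t N, class F>
void transform_model(std::vector<float>& v, F f)
{
    std::size_t i = 0;
    for (; i + N <= v.size(); i += N) {
        chunk<N> c{};
        for (std::size_t k = 0; k < N; ++k) c.lane[k] = v[i + k];  // models a vector load
        c = f(c);                                                  // SIMD call
        for (std::size_t k = 0; k < N; ++k) v[i + k] = c.lane[k];  // models a vector store
    }
    for (; i < v.size(); ++i)
        v[i] = f(v[i]);                                            // scalar call on the tail
}
```

Calling transform_model<4>(v, [](auto const& p) { return p * 2.f; }) on a vector of 10 floats runs the lambda on two chunks and then on two scalars, with the same source expression in both cases.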
  • 26. Hardware optimisations. Problem: most SIMD hardware supports fused operations such as fma, and those optimisations must remain transparent to the user. We use expression templates so that Boost.SIMD optimizes those patterns automatically. Examples: a * b + c becomes fma(a, b, c); a + b * c becomes fma(b, c, a); !(a < b) becomes is_nle(a, b).
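The rewriting on slide 26 can be illustrated with a very small expression-template sketch: operator* returns an unevaluated product, and operator+ recognises that product and emits a single fused multiply-add. The types value and product are hypothetical and scalar; Boost.SIMD's actual machinery is more general and is not reproduced here. std::fma is the standard C++ function.

```cpp
#include <cmath>

struct value   { double v; };         // toy stand-in for a pack
struct product { double lhs, rhs; };  // unevaluated a * b

// a * b is not computed yet; it is only recorded.
inline product operator*(value a, value b) { return {a.v, b.v}; }

// a * b + c collapses into one fused operation with a single rounding.
inline value operator+(product p, value c) { return { std::fma(p.lhs, p.rhs, c.v) }; }

// c + a * b is recognised as the same pattern.
inline value operator+(value c, product p) { return { std::fma(p.lhs, p.rhs, c.v) }; }
```

The user still writes a * b + c; the fused form is selected by overload resolution at compile time. Besides saving an instruction, the fused form rounds only once, so it can return a nonzero residual where a separately rounded product would cancel to zero.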
  • 27. Supported hardware. Open source: Intel SSE2-4, AVX; PowerPC VMX. Proprietary: ARM NEON; Intel AVX2, XOP, FMA3, FMA4; Intel MIC.
  • 28. Other functions. Arithmetic: saturated arithmetic; long multiplication; float/int conversion; round, floor, ceil, trunc; sqrt, cbrt; hypot; average; random; min/max; rounded division and remainder. Bitwise: select; andnot, ornot; popcnt; ffs; ror, rol; rshr, rshl; twopower. IEEE: ilogb, frexp; ldexp; next/prev; ulpdist. Predicates: comparison to zero; negated comparisons; is_unord, is_nan, is_invalid; is_odd, is_even; majority.
  • 29. Reduction and SWAR operations. Reduction: any, all; nbtrue; minimum/maximum, posmin/posmax; sum; product, dot product. SWAR: group/split; splatted reduction; cumsum; sort.
  • 31. Basic functions: single-precision trigonometric and exponential functions. Hardware: Core i7 Sandy Bridge, AVX. Results in cycles per value.
        Function   Range           std   Scalar   SIMD
        exp        [−10, 10]       46    38       7
        log        [−10, −10]      42    37       5
        asin       [−1, 1]         40    35       13
        cos        [−20π, 20π]     66    47       6
        fast_cos   [−π/4, π/4]     32    9        1.3
  • 32. Julia set generator. Generates a fractal image using the Julia function. Purely compute-bound. Challenge: the workload depends on pixel location.
  • 33. Julia set generator (code). template <class T> typename meta::as_integer<T>::type julia(T const& a, T const& b) { typename meta::as_integer<T>::type iter(0); T x(0), y(0); /* zero start values are an assumption; the slide leaves them uninitialized */ for (std::size_t i = 0; i <= 256; ++i) { T x2 = x * x; T y2 = y * y; T xy = T(2) * x * y; x = x2 - y2 + a; y = xy + b; auto mask = x2 + y2 < T(4); iter = selinc(mask, iter); if (!any(mask)) break; } return iter; }
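The challenge named on slide 32, a workload that depends on pixel location, shows up in the kernel as a per-lane mask: every lane keeps computing until no lane is still inside the radius, and only the lanes whose mask is true increment their counter. The scalar sketch below models a 4-lane version of that control flow. julia_model, fpack and ipack are hypothetical names, the start values of zero are an assumption, and the inner loop stands in for real SIMD operations.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t lanes = 4;
using fpack = std::array<float, lanes>;  // hypothetical 4-lane float pack
using ipack = std::array<int, lanes>;    // hypothetical 4-lane int pack

inline ipack julia_model(fpack const& a, fpack const& b, int max_iter = 256)
{
    fpack x{}, y{};  // assumed to start from zero
    ipack iter{};
    for (int i = 0; i < max_iter; ++i) {
        bool any_active = false;                    // models any(mask)
        for (std::size_t k = 0; k < lanes; ++k) {
            float x2 = x[k] * x[k];
            float y2 = y[k] * y[k];
            float xy = 2.f * x[k] * y[k];
            x[k] = x2 - y2 + a[k];
            y[k] = xy + b[k];
            bool mask = x2 + y2 < 4.f;              // lane still inside the radius?
            if (mask) ++iter[k];                    // models selinc(mask, iter)
            any_active = any_active || mask;
        }
        if (!any_active) break;                     // stop only when every lane has escaped
    }
    return iter;
}
```

A pack that mixes a slow pixel with fast ones runs as long as the slow pixel needs, which is one reason the measured speedup can stay below the lane count.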
  • 34. Julia set generator (benchmark figure). Cycles per element (cpe) against image size (256 x 256, 512 x 512, 1024 x 1024, 2048 x 2048) for four series: scalar SSE2, simd SSE2, simd AVX, simd AVX2. Speedup labels as printed: x2.93, x2.99, x3.02, x3.03, x6.64, x6.94, x6.09, x6.16, x6.52, x6.81, x5.97, x6.05.
  • 35. Motion detection. Based on Manzanera's Sigma-Delta algorithm. Background subtraction. Pixel intensity modeled as Gaussian distributions. Challenge: low arithmetic intensity.
  • 36. Motion detection (code). template <typename T> T sigma_delta(T& bkg, T const& frm, T& var) { bkg = selinc(bkg < frm, seldec(bkg > frm, bkg)); T dif = dist(bkg, frm); T mul = muls(dif, 3); var = if_else(dif != T(0), selinc(var < mul, seldec(var > mul, var)), var); return if_zero_else_one(dif < var); }
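For readers unfamiliar with the helper functions on slide 36, here is a scalar, single-pixel sketch of the same update. It assumes the following meanings, which are inferred from the names and are not confirmed by the slides: selinc and seldec increment or decrement where a condition holds, dist is an absolute difference, muls is a saturated multiplication, and if_zero_else_one returns 0 where the condition holds and 1 elsewhere. sigma_delta_model is a hypothetical name.

```cpp
#include <algorithm>
#include <cstdint>

// Scalar model of the Sigma-Delta update for one 8-bit pixel.
// Returns 1 when the pixel is classified as moving, 0 otherwise.
inline std::uint8_t sigma_delta_model(std::uint8_t& bkg, std::uint8_t frm, std::uint8_t& var)
{
    // Background estimate moves one step towards the current frame.
    if (bkg < frm)      ++bkg;   // selinc(bkg < frm, ...)
    else if (bkg > frm) --bkg;   // seldec(bkg > frm, ...)

    // Absolute difference between background and frame (dist).
    std::uint8_t dif = static_cast<std::uint8_t>(bkg > frm ? bkg - frm : frm - bkg);

    // Saturated multiplication by 3 (muls): clamp at 255 instead of wrapping.
    std::uint8_t mul = static_cast<std::uint8_t>(std::min(255, dif * 3));

    // Variance estimate moves one step towards 3 * dif, only where dif != 0.
    if (dif != 0) {
        if (var < mul)      ++var;
        else if (var > mul) --var;
    }

    // if_zero_else_one(dif < var): 0 for background, 1 for motion.
    return dif < var ? std::uint8_t(0) : std::uint8_t(1);
}
```

Each pixel needs only a few comparisons and unit steps, which is the low arithmetic intensity that slide 35 calls out: the kernel tends to be limited by memory traffic rather than by computation.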
  • 37. Motion detection (benchmark figure). Frames per second (FPS, axis scaled by 10^4) against frame size (480 x 640, 600 x 800, 1024 x 1280, 1080 x 1920, 2160 x 3840, each marked @5) for four series: scalar SSE2, simd SSE2, simd AVX, simd AVX2. Speedup labels as printed: x3.80, x6.63, x3.64, x5.63, x3.50, x6.69, x6.53, x5.75, x5.74, x5.71, x26.26, x26.96, x19.05, x17.19, x10.81.
  • 38. Sparse tridiagonal solver. Solve Ax = b with sparse A and multiple x. Application in fluid mechanics. Challenge: using SIMD despite the sparsity. Solution: shuffle for local densification.
  • 39. Sparse tridiagonal solver.
  • 41. Conclusion. Boost.SIMD: we can design a C++ library for low-level performance primitives; a lot of success with a lot of different applications; find us on https://github.com/jfalcou/nt2; tests, comments and feedback welcome. Soon: a rewrite before submission to Boost is in progress; more architectures to come!
  • 42. Thanks for your attention!