EXPLOITING VECTORIZATION WITH ISPC
Roberto A. Vitillo (LBNL)
8/20/2013 ATLAS FSTF Meeting
DESIGN LEVELS
Architecture
Algorithms & Data Structures
Code
System Software
Hardware
• Independent changes on different
levels cause speedups to multiply
• Clock speed free lunch is over
‣ hardware parallelism is increasing
• How can parallelism be exploited at
the bottom layers?
SIMD
WHY IT MATTERS
• Single Instruction, Multiple Data:
‣ processor throughput is increased
by handling multiple data elements in
parallel
‣ exploiting SIMD is becoming
increasingly important on
Xeons (see AVX-512)
‣ exploiting SIMD is mandatory to
achieve reasonable performance
on the Xeon Phi
Source: “Computer Architecture, A Quantitative Approach”
SIMD IN ATLAS
• Approaches:
1. vectorized linear algebra libraries
2. intrinsics & wrappers (a.k.a. gobbledygook)
3. autovectorization (hard to maintain when it works)
4. language extensions for vectorization
‣ OpenMP 4 (wait a few years and retry)
‣ array annotations in Cilk Plus (future GCC releases)
Up to now, Eigen & friends were the only
reliable tools to exploit vectorization in
ATLAS
MEET ISPC
• The Intel SPMD Program Compiler (ISPC) compiles a C-based language extended with “single
program, multiple data” (SPMD) constructs
• An ISPC program describes the behavior of a single program instance
‣ even though a “gang” of them is in reality being executed
‣ gang size is usually no more than 2-4x the native SIMD width of the machine
• For CUDA aficionados
‣ an ISPC program instance is similar to a CUDA thread
‣ an ISPC gang is similar to a CUDA warp
SPMD PARADIGM
Execution of an SPMD program with a gang size of 4
• Observations:
‣ diverging control flow reduces the
utilization of vector instructions
‣ vectorization adds masking
overhead
HELLO WORLD
export void simple(uniform float vin[], uniform float vout[],
                   uniform int count) {
    foreach (index = 0 ... count) {
        float v = vin[index];
        if (v < 3.)
            v = v * v;
        else
            v = sqrt(v);
        vout[index] = v;
    }
}
simple.ispc, compiled with ispc
#include <stdio.h>
#include "simple.h"
int main() {
    float vin[16], vout[16];
    for (int i = 0; i < 16; ++i)
        vin[i] = i;
    simple(vin, vout, 16);
    for (int i = 0; i < 16; ++i)
        printf("%d: simple(%f) = %f\n", i, vin[i], vout[i]);
}
main.c, compiled with GCC
• export: makes the function available to be called from application code
• uniform: a uniform variable is shared among all program instances
• a variable without the uniform qualifier (a.k.a. a varying variable) is private to each program instance
• foreach expresses a parallel computation
• the ispc function is called like any other function from the C/C++ application
DEBUGGING SUPPORT
BEYOND GDB
foreach(k = 0 ... 6){
    int i = k * 7;
    print("%\n", i);
    double* dR = &P[i];
    double* dA = &P[i+3];
    ...
}
Prints [0, 7, 14, 21, 28, 35, ((42)), ((49))]
Values in double parentheses belong to inactive program instances
gang size of 8
DEBUGGING SUPPORT
PERFORMANCE WARNINGS
export void foo(uniform float * uniform A, uniform int n){
    foreach(i = 0 ... n){
        A[i*8] *= A[i*8];
    }
}
MATRIX MULTIPLICATION
EXPLOITING HORIZONTAL VECTORIZATION WITH SMALL MATRICES
[Bar chart: ISPC vs GCC DGEMM and ISPC vs GCC SGEMM speedups; bar labels 6.4 and 3.3]
inline void mxm(uniform float * uniform A,
                uniform float * uniform B,
                uniform float * uniform C,
                uniform int M,
                uniform int N,
                uniform int K,
                uniform int nmat,
                int idx)
{
    for(uniform int i = 0; i < M; i++){
        for(uniform int j = 0; j < N; j++){
            float sum = 0;
            for(uniform int k = 0; k < K; k++){
                sum += A[i*K*nmat + k*nmat + idx] * B[k*N*nmat + j*nmat + idx];
            }
            C[i*N*nmat + j*nmat + idx] = sum;
        }
    }
}
export void gemm(uniform float * uniform A,
                 uniform float * uniform B,
                 uniform float * uniform C,
                 uniform int M,
                 uniform int N,
                 uniform int K,
                 uniform int nmat)
{
    foreach(i = 0 ... nmat){
        mxm(A, B, C, M, N, K, nmat, i);
    }
}
xGEMM 5x5 speedup over 1000 matrices (GCC 4.8 -O3, Ivy Bridge)
MATRIX MULTIPLICATION
EXPLOITING HORIZONTAL VECTORIZATION WITH SMALL MATRICES
[Bar chart: ISPC vs GCC DGEMM and ISPC vs GCC SGEMM speedups; bar labels 7.6 and 3.7]
xGEMM 5x5 speedup over 1000 matrices (GCC 4.8 -O3, Ivy Bridge)
aligned memory!
KALMAN FILTER
TRACKING
• The Kalman filter method is intended
for finding the optimum estimation r of
an unknown vector r^t according to the
measurements m_k, k = 1...n, of the vector
r^t.
• Plenty of linear algebra operations so
it’s a good use case for vectorization.
• Caveats:
‣ tracks have different numbers of hits
(use sorting)
‣ a hit can be 1- or 2-dimensional
(serialize branching)
KALMAN FILTER
100 EVENTS, ~100 TRACKS WITH ~10 HITS EACH
[Bar chart: KalmanFilter speedup (double precision), Ivy Bridge. Series: ISPC vs scalar GSL with AoS to SoA conversion; ISPC vs scalar GSL assuming data is preconverted. Bar labels: 5.3 and 3.4]
export void startFilter(uniform KalmanFilter * uniform filter,
                        uniform KalmanFilterParameter * uniform param){
    foreach(i = 0 ... filter->ntracks){
        filterTrack(filter, param, i);
    }
}
inline void filterTrack(uniform KalmanFilter * uniform filter,
                        uniform KalmanFilterParameter * uniform param,
                        int i){
    ...
    for(uniform int h = 0; h < param->max_hit_count; h++){
        if(h >= param->hit_count[i])
            continue;
        predictState(filter, param, h, i);
        predictCovariance(filter, param, h, i);
        if(param->hits[h].is2Dim[i]){
            ...
            correctGain2D(filter, i);
            correctState2D(filter, i);
            correctCovariance2D(filter, i);
        }else{
            ...
            correctGain1D(filter, i);
            correctState1D(filter, i);
            correctCovariance1D(filter, i);
        }
    }
    ...
}
WHAT ABOUT OPENCL?
FIRST IMPRESSIONS
• Intel’s OpenCL embeds an implicit vectorization module which has some similarities with ISPC but...
‣ ISPC warns the user if an inefficient data access pattern is detected
‣ in ISPC the programmer can specify which code has to be executed serially and which has to be
vectorized (for vs foreach loop)
‣ variables can be declared as uniform or varying in ISPC
‣ ISPC supports lightweight kernel calls, while in OpenCL an API call to a driver must be made
• OpenCL has native support for the Xeon Phi
‣ ISPC will support the Xeon Phi natively when LLVM does
• Porting code from ISPC to OpenCL and vice versa is relatively easy
‣ OpenCL comes with some boilerplate code though
‣ composing task parallelism with TBB works with both ISPC and OpenCL
‣ but ISPC is easier to compose with an arbitrary task scheduler, while OpenCL requires device fission
WHAT ABOUT OPENCL?
FIRST IMPRESSIONS
__kernel void gemm(__global float *A,
                   __global float *B,
                   __global float *C,
                   const int M,
                   const int N,
                   const int K,
                   const int num){
    const int idx = get_global_id(0);
    if(idx >= num)
        return;
    for(int i = 0; i < M; i++){
        for(int j = 0; j < N; j++){
            float sum = 0;
            for(int k = 0; k < K; k++){
                sum += A[i*K*num + k*num + idx] * B[k*N*num + j*num + idx];
            }
            C[i*N*num + j*num + idx] = sum;
        }
    }
}
AVX
.LBB0_8:
    movslq  %esi, %rsi
    vmovups (%rdi,%rsi,4), %xmm2
    movslq  %ebp, %rbp
    vmovups (%rcx,%rbp,4), %xmm3
    vmulps  %xmm2, %xmm3, %xmm2
    vaddps  %xmm2, %xmm1, %xmm1
    addl    %edx, %ebp
    addl    %eax, %esi
    decl    %r11d
    jne     .LBB0_8
Why is it not using 256 bit registers?
AVX 2
.LBB0_8:
    movslq  %r11d, %r11
    vmovups (%r14,%r11,4), %ymm2
    movslq  %esi, %rsi
    vmovups (%r12,%rsi,4), %ymm3
    vmulps  %ymm2, %ymm3, %ymm2
    vaddps  %ymm2, %ymm1, %ymm1
    addl    %edx, %esi
    addl    %r13d, %r11d
    decl    %r10d
    jne     .LBB0_8
Bottom Line: too early for numbers
FINAL THOUGHTS
THE VECTORIZATION FINAL SOLUTION?
• ISPC is by far the best option I have seen to exploit vectorization
‣ it gives the programmer an abstract machine model to reason about
‣ it warns about inefficient memory accesses (aka gathers and scatters)
• In other words, it’s not going to get much easier than that but...
‣ full C++ support would be a welcome addition
‣ linear algebra operations need to be reimplemented
• ISPC is stable, extremely well documented and open source
• For more information visit http://ispc.github.io
BACKUP
MEMORY MATTERS
• Addressing modes
‣ SSEx and AVX1 support neither strided access patterns nor gather-scatter accesses, which forces
the compiler to generate scalar instructions
‣ Even when “fancy” access patterns are supported (e.g. IMCI and AVX2) a penalty is paid
‣ Convert your arrays of structures to structures of arrays
• Memory alignment
‣ Unaligned memory access may generate scalar instructions for otherwise vectorizable code
‣ Vectorized loads from unaligned memory suffer from a penalty
‣ The penalty decreased over the years and may become negligible in the future
‣ Align data in memory to the width of the SIMD unit if possible
19

Contenu connexe

Tendances

A "Do-It-Yourself" Specification Language with BeepBeep 3 (Talk @ Dagstuhl 2017)
A "Do-It-Yourself" Specification Language with BeepBeep 3 (Talk @ Dagstuhl 2017)A "Do-It-Yourself" Specification Language with BeepBeep 3 (Talk @ Dagstuhl 2017)
A "Do-It-Yourself" Specification Language with BeepBeep 3 (Talk @ Dagstuhl 2017)Sylvain Hallé
 
Bartosz Milewski, “Re-discovering Monads in C++”
Bartosz Milewski, “Re-discovering Monads in C++”Bartosz Milewski, “Re-discovering Monads in C++”
Bartosz Milewski, “Re-discovering Monads in C++”Platonov Sergey
 
Basic c++ programs
Basic c++ programsBasic c++ programs
Basic c++ programsharman kaur
 
Конверсия управляемых языков в неуправляемые
Конверсия управляемых языков в неуправляемыеКонверсия управляемых языков в неуправляемые
Конверсия управляемых языков в неуправляемыеPlatonov Sergey
 
Address/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerAddress/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerPlatonov Sergey
 
C++totural file
C++totural fileC++totural file
C++totural filehalaisumit
 
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java script
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java scriptCodiLime Tech Talk - Grzegorz Rozdzialik: What the java script
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java scriptCodiLime
 
Transition graph using free monads and existentials
Transition graph using free monads and existentialsTransition graph using free monads and existentials
Transition graph using free monads and existentialsAlexander Granin
 
ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...
ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...
ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...Muhammad Ulhaque
 
Operator overloading2
Operator overloading2Operator overloading2
Operator overloading2zindadili
 
A look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
A look into the sanitizer family (ASAN & UBSAN) by Akul PillaiA look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
A look into the sanitizer family (ASAN & UBSAN) by Akul PillaiCysinfo Cyber Security Community
 
High performance web programming with C++14
High performance web programming with C++14High performance web programming with C++14
High performance web programming with C++14Matthieu Garrigues
 
Rainer Grimm, “Functional Programming in C++11”
Rainer Grimm, “Functional Programming in C++11”Rainer Grimm, “Functional Programming in C++11”
Rainer Grimm, “Functional Programming in C++11”Platonov Sergey
 
PHP in 2018 - Q1 - AFUP Limoges
PHP in 2018 - Q1 - AFUP LimogesPHP in 2018 - Q1 - AFUP Limoges
PHP in 2018 - Q1 - AFUP Limoges✅ William Pinaud
 
C++17 introduction - Meetup @EtixLabs
C++17 introduction - Meetup @EtixLabsC++17 introduction - Meetup @EtixLabs
C++17 introduction - Meetup @EtixLabsStephane Gleizes
 

Tendances (20)

Rcpp11 useR2014
Rcpp11 useR2014Rcpp11 useR2014
Rcpp11 useR2014
 
A "Do-It-Yourself" Specification Language with BeepBeep 3 (Talk @ Dagstuhl 2017)
A "Do-It-Yourself" Specification Language with BeepBeep 3 (Talk @ Dagstuhl 2017)A "Do-It-Yourself" Specification Language with BeepBeep 3 (Talk @ Dagstuhl 2017)
A "Do-It-Yourself" Specification Language with BeepBeep 3 (Talk @ Dagstuhl 2017)
 
Bartosz Milewski, “Re-discovering Monads in C++”
Bartosz Milewski, “Re-discovering Monads in C++”Bartosz Milewski, “Re-discovering Monads in C++”
Bartosz Milewski, “Re-discovering Monads in C++”
 
Basic c++ programs
Basic c++ programsBasic c++ programs
Basic c++ programs
 
C++ via C#
C++ via C#C++ via C#
C++ via C#
 
Конверсия управляемых языков в неуправляемые
Конверсия управляемых языков в неуправляемыеКонверсия управляемых языков в неуправляемые
Конверсия управляемых языков в неуправляемые
 
Address/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerAddress/Thread/Memory Sanitizer
Address/Thread/Memory Sanitizer
 
C++totural file
C++totural fileC++totural file
C++totural file
 
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java script
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java scriptCodiLime Tech Talk - Grzegorz Rozdzialik: What the java script
CodiLime Tech Talk - Grzegorz Rozdzialik: What the java script
 
C++ tutorial
C++ tutorialC++ tutorial
C++ tutorial
 
Transition graph using free monads and existentials
Transition graph using free monads and existentialsTransition graph using free monads and existentials
Transition graph using free monads and existentials
 
ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...
ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...
ECSE 221 - Introduction to Computer Engineering - Tutorial 1 - Muhammad Ehtas...
 
Operator overloading2
Operator overloading2Operator overloading2
Operator overloading2
 
A look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
A look into the sanitizer family (ASAN & UBSAN) by Akul PillaiA look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
A look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
 
Day 1
Day 1Day 1
Day 1
 
C++11
C++11C++11
C++11
 
High performance web programming with C++14
High performance web programming with C++14High performance web programming with C++14
High performance web programming with C++14
 
Rainer Grimm, “Functional Programming in C++11”
Rainer Grimm, “Functional Programming in C++11”Rainer Grimm, “Functional Programming in C++11”
Rainer Grimm, “Functional Programming in C++11”
 
PHP in 2018 - Q1 - AFUP Limoges
PHP in 2018 - Q1 - AFUP LimogesPHP in 2018 - Q1 - AFUP Limoges
PHP in 2018 - Q1 - AFUP Limoges
 
C++17 introduction - Meetup @EtixLabs
C++17 introduction - Meetup @EtixLabsC++17 introduction - Meetup @EtixLabs
C++17 introduction - Meetup @EtixLabs
 

Similaire à Exploiting vectorization with ISPC

Beyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisBeyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisFastly
 
Vectorization on x86: all you need to know
Vectorization on x86: all you need to knowVectorization on x86: all you need to know
Vectorization on x86: all you need to knowRoberto Agostino Vitillo
 
C programming language tutorial
C programming language tutorial C programming language tutorial
C programming language tutorial javaTpoint s
 
How to add an optimization for C# to RyuJIT
How to add an optimization for C# to RyuJITHow to add an optimization for C# to RyuJIT
How to add an optimization for C# to RyuJITEgor Bogatov
 
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...Andrey Karpov
 
MuVM: Higher Order Mutation Analysis Virtual Machine for C
MuVM: Higher Order Mutation Analysis Virtual Machine for CMuVM: Higher Order Mutation Analysis Virtual Machine for C
MuVM: Higher Order Mutation Analysis Virtual Machine for CSusumu Tokumoto
 
How to Adopt Modern C++17 into Your C++ Code
How to Adopt Modern C++17 into Your C++ CodeHow to Adopt Modern C++17 into Your C++ Code
How to Adopt Modern C++17 into Your C++ CodeMicrosoft Tech Community
 
How to Adopt Modern C++17 into Your C++ Code
How to Adopt Modern C++17 into Your C++ CodeHow to Adopt Modern C++17 into Your C++ Code
How to Adopt Modern C++17 into Your C++ CodeMicrosoft Tech Community
 
All I know about rsc.io/c2go
All I know about rsc.io/c2goAll I know about rsc.io/c2go
All I know about rsc.io/c2goMoriyoshi Koizumi
 
Chapter 7 functions (c)
Chapter 7 functions (c)Chapter 7 functions (c)
Chapter 7 functions (c)hhliu
 
golang_getting_started.pptx
golang_getting_started.pptxgolang_getting_started.pptx
golang_getting_started.pptxGuy Komari
 
Oh Crap, I Forgot (Or Never Learned) C! [CodeMash 2010]
Oh Crap, I Forgot (Or Never Learned) C! [CodeMash 2010]Oh Crap, I Forgot (Or Never Learned) C! [CodeMash 2010]
Oh Crap, I Forgot (Or Never Learned) C! [CodeMash 2010]Chris Adamson
 
Beyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisBeyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisC4Media
 

Similaire à Exploiting vectorization with ISPC (20)

Beyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisBeyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic Analysis
 
Vectorization on x86: all you need to know
Vectorization on x86: all you need to knowVectorization on x86: all you need to know
Vectorization on x86: all you need to know
 
Code optimization
Code optimization Code optimization
Code optimization
 
Code optimization
Code optimization Code optimization
Code optimization
 
C programming language tutorial
C programming language tutorial C programming language tutorial
C programming language tutorial
 
functions
functionsfunctions
functions
 
How to add an optimization for C# to RyuJIT
How to add an optimization for C# to RyuJITHow to add an optimization for C# to RyuJIT
How to add an optimization for C# to RyuJIT
 
Blazing Fast Windows 8 Apps using Visual C++
Blazing Fast Windows 8 Apps using Visual C++Blazing Fast Windows 8 Apps using Visual C++
Blazing Fast Windows 8 Apps using Visual C++
 
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...
PVS-Studio 5.00, a solution for developers of modern resource-intensive appl...
 
MuVM: Higher Order Mutation Analysis Virtual Machine for C
MuVM: Higher Order Mutation Analysis Virtual Machine for CMuVM: Higher Order Mutation Analysis Virtual Machine for C
MuVM: Higher Order Mutation Analysis Virtual Machine for C
 
How to Adopt Modern C++17 into Your C++ Code
How to Adopt Modern C++17 into Your C++ CodeHow to Adopt Modern C++17 into Your C++ Code
How to Adopt Modern C++17 into Your C++ Code
 
How to Adopt Modern C++17 into Your C++ Code
How to Adopt Modern C++17 into Your C++ CodeHow to Adopt Modern C++17 into Your C++ Code
How to Adopt Modern C++17 into Your C++ Code
 
Vectorization in ATLAS
Vectorization in ATLASVectorization in ATLAS
Vectorization in ATLAS
 
All I know about rsc.io/c2go
All I know about rsc.io/c2goAll I know about rsc.io/c2go
All I know about rsc.io/c2go
 
Cpp tutorial
Cpp tutorialCpp tutorial
Cpp tutorial
 
Chapter 7 functions (c)
Chapter 7 functions (c)Chapter 7 functions (c)
Chapter 7 functions (c)
 
golang_getting_started.pptx
golang_getting_started.pptxgolang_getting_started.pptx
golang_getting_started.pptx
 
Oh Crap, I Forgot (Or Never Learned) C! [CodeMash 2010]
Oh Crap, I Forgot (Or Never Learned) C! [CodeMash 2010]Oh Crap, I Forgot (Or Never Learned) C! [CodeMash 2010]
Oh Crap, I Forgot (Or Never Learned) C! [CodeMash 2010]
 
CppTutorial.ppt
CppTutorial.pptCppTutorial.ppt
CppTutorial.ppt
 
Beyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic AnalysisBeyond Breakpoints: A Tour of Dynamic Analysis
Beyond Breakpoints: A Tour of Dynamic Analysis
 

Plus de Roberto Agostino Vitillo (12)

Telemetry Onboarding
Telemetry OnboardingTelemetry Onboarding
Telemetry Onboarding
 
Growing a Data Pipeline for Analytics
Growing a Data Pipeline for AnalyticsGrowing a Data Pipeline for Analytics
Growing a Data Pipeline for Analytics
 
Telemetry Datasets
Telemetry DatasetsTelemetry Datasets
Telemetry Datasets
 
Growing a SQL Query
Growing a SQL QueryGrowing a SQL Query
Growing a SQL Query
 
Telemetry Onboarding
Telemetry OnboardingTelemetry Onboarding
Telemetry Onboarding
 
All you need to know about Statistics
All you need to know about StatisticsAll you need to know about Statistics
All you need to know about Statistics
 
Spark meets Telemetry
Spark meets TelemetrySpark meets Telemetry
Spark meets Telemetry
 
Sharing C++ objects in Linux
Sharing C++ objects in LinuxSharing C++ objects in Linux
Sharing C++ objects in Linux
 
Performance tools developments
Performance tools developmentsPerformance tools developments
Performance tools developments
 
GOoDA tutorial
GOoDA tutorialGOoDA tutorial
GOoDA tutorial
 
Callgraph analysis
Callgraph analysisCallgraph analysis
Callgraph analysis
 
Inter-process communication on steroids
Inter-process communication on steroidsInter-process communication on steroids
Inter-process communication on steroids
 

Dernier

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 

Dernier (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

Exploiting vectorization with ISPC

  • 1. EXPLOITINGVECTORIZATION WITH ISPC Roberto A.Vitillo (LBNL) 8/20/2013 ATLAS FSTF Meeting 1
  • 2. DESIGN LEVELS
    Levels: Architecture, Algorithms & Data Structures, Code, System Software, Hardware
    • Independent changes on different levels cause speedups to multiply
    • The clock-speed free lunch is over
      ‣ hardware parallelism is increasing
    • How can parallelism be exploited at the bottom layers?
  • 3. SIMD: WHY IT MATTERS
    • Single Instruction, Multiple Data:
      ‣ processor throughput is increased by handling multiple data elements in parallel
      ‣ exploiting SIMD is becoming increasingly important on Xeons (see AVX-512)
      ‣ exploiting SIMD is mandatory to achieve reasonable performance on the Xeon Phi
    Source: “Computer Architecture, A Quantitative Approach”
  • 4. SIMD IN ATLAS
    • Approaches:
      1. vectorized linear algebra libraries
      2. intrinsics & wrappers (a.k.a. gobbledygook)
      3. autovectorization (hard to maintain when it works)
      4. language extensions for vectorization
         ‣ OpenMP 4 (wait a few years and retry)
         ‣ array annotations in Cilk Plus (future GCC releases)
    • Up to now Eigen & friends were the only reliable tools to exploit vectorization in ATLAS
  • 5. MEET ISPC
    • Intel SPMD Program Compiler (ISPC) extends a C-based language with “single program, multiple data” (SPMD) constructs
    • An ISPC program describes the behavior of a single program instance
      ‣ even though a “gang” of them is in reality being executed
      ‣ gang size is usually no more than 2-4x the native SIMD width of the machine
    • For CUDA aficionados
      ‣ an ISPC program instance is similar to a CUDA thread
      ‣ an ISPC gang is similar to a CUDA warp
  • 6. SPMD PARADIGM
    Figure: execution of an SPMD program with a gang size of 4
    • Observations:
      ‣ diverging control flow reduces the utilization of vector instructions
      ‣ vectorization adds masking overhead
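The two observations on the SPMD slide can be made concrete with a scalar emulation of one gang. The sketch below is not ISPC output; it is a hypothetical C model (the name `gang_branch` and the branch bodies are assumptions chosen for illustration) in which every lane evaluates both sides of a branch and a per-lane mask selects the result, which is roughly what masked vector code has to do when control flow diverges.

```c
#define GANG_SIZE 4 /* assumed gang size, as in the slide's figure */

/* Hypothetical scalar emulation of one gang executing
 *     if (v < 3) v = v * v; else v = v + 1;
 * Both sides are computed for every lane; the mask picks the result.
 * Returns how many lanes were active in the "then" side. */
int gang_branch(const float in[GANG_SIZE], float out[GANG_SIZE])
{
    int active_then = 0;
    for (int lane = 0; lane < GANG_SIZE; ++lane) {
        int mask = in[lane] < 3.0f;             /* execution mask bit for this lane */
        float then_val = in[lane] * in[lane];   /* computed even if the lane is masked off */
        float else_val = in[lane] + 1.0f;       /* computed even if the lane is masked off */
        out[lane] = mask ? then_val : else_val; /* masked blend of the two results */
        active_then += mask;
    }
    return active_then;
}
```

When the returned count is neither 0 nor `GANG_SIZE`, the gang has diverged: both sides were paid for, but each side produced useful results for only some of the lanes.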
  • 7. HELLO WORLD
    simple.ispc, compiled with ispc:

        export void simple(uniform float vin[], uniform float vout[],
                           uniform int count) {
            foreach (index = 0 ... count) {
                float v = vin[index];
                if (v < 3.)
                    v = v * v;
                else
                    v = sqrt(v);
                vout[index] = v;
            }
        }

    main.c, compiled with GCC:

        #include <stdio.h>
        #include "simple.h"

        int main() {
            float vin[16], vout[16];
            for (int i = 0; i < 16; ++i)
                vin[i] = i;
            simple(vin, vout, 16);
            for (int i = 0; i < 16; ++i)
                printf("%d: simple(%f) = %f\n", i, vin[i], vout[i]);
        }

    Annotations:
    • export makes the function available to be called from application code
    • a uniform variable is shared among program instances
    • each program instance has a private instance of a non-uniform variable (a.k.a. varying variable)
    • foreach expresses a parallel computation
    • the ispc function is called like any other function from the C/C++ application
  • 8. DEBUGGING SUPPORT: BEYOND GDB

        foreach(k = 0 ... 6){
            int i = k * 7;
            print("%\n", i);
            double* dR = &P[i];
            double* dA = &P[i+3];
            ...
        }

    Prints [0, 7, 14, 21, 28, 35, ((42)), ((49))]
    • With a gang size of 8, the values in double parentheses, ((42)) and ((49)), belong to inactive program instances
  • 9. DEBUGGING SUPPORT: PERFORMANCE WARNINGS

        export void foo(uniform float * uniform A, uniform int n){
            foreach(i = 0 ... n){
                A[i*8] *= A[i*8];
            }
        }
  • 10. MATRIX MULTIPLICATION: EXPLOITING HORIZONTAL VECTORIZATION WITH SMALL MATRICES

        inline void mxm(uniform float * uniform A, uniform float * uniform B,
                        uniform float * uniform C, uniform int M, uniform int N,
                        uniform int K, uniform int nmat, int idx) {
            for(uniform int i = 0; i < M; i++){
                for(uniform int j = 0; j < N; j++){
                    float sum = 0;
                    for(uniform int k = 0; k < K; k++){
                        sum += A[i*K*nmat + k*nmat + idx] * B[k*N*nmat + j*nmat + idx];
                    }
                    C[i*N*nmat + j*nmat + idx] = sum;
                }
            }
        }

        export void gemm(uniform float * uniform A, uniform float * uniform B,
                         uniform float * uniform C, uniform int M, uniform int N,
                         uniform int K, uniform int nmat) {
            foreach(i = 0 ... nmat){
                mxm(A, B, C, M, N, K, nmat, i);
            }
        }

    Chart: xGEMM 5x5 speedup over 1000 matrices (GCC 4.8 -O3, Ivy Bridge); series: ISPC vs GCC DGEMM, ISPC vs GCC SGEMM; bar values: 6.4, 3.3
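The indexing in the kernel above, `A[i*K*nmat + k*nmat + idx]`, implies that the same (row, column) entry of all `nmat` matrices is stored contiguously, so that one gang can process several matrices side by side. The following C sketch shows one way to pack ordinary row-major matrices into that interleaved layout, plus a scalar reference multiplication for checking results. The helper names `batch_index`, `pack_matrix` and `gemm_ref` are hypothetical and not part of the original slides.

```c
#include <stddef.h>

/* Offset of entry (row, col) of matrix number idx in the interleaved batch:
 * (row*ncols + col)*nmat + idx, i.e. the slide's i*K*nmat + k*nmat + idx. */
static size_t batch_index(int row, int col, int ncols, int nmat, int idx)
{
    return ((size_t)row * ncols + col) * nmat + idx;
}

/* Copy one row-major rows x cols matrix into slot idx of the batch. */
void pack_matrix(const float *src, float *batch,
                 int rows, int cols, int nmat, int idx)
{
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            batch[batch_index(r, c, cols, nmat, idx)] = src[r * cols + c];
}

/* Scalar reference: C = A * B for every matrix of the batch
 * (A is M x K, B is K x N, C is M x N, all interleaved). */
void gemm_ref(const float *A, const float *B, float *C,
              int M, int N, int K, int nmat)
{
    for (int idx = 0; idx < nmat; ++idx)
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < N; ++j) {
                float sum = 0.0f;
                for (int k = 0; k < K; ++k)
                    sum += A[batch_index(i, k, K, nmat, idx)] *
                           B[batch_index(k, j, N, nmat, idx)];
                C[batch_index(i, j, N, nmat, idx)] = sum;
            }
}
```

With this layout, consecutive values of `idx` touch consecutive memory addresses, which is the access pattern a `foreach` over `nmat` can turn into plain vector loads and stores.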
  • 11. MATRIX MULTIPLICATION: EXPLOITING HORIZONTAL VECTORIZATION WITH SMALL MATRICES
    Same mxm/gemm kernel as on slide 10, now with aligned memory!
    Chart: xGEMM 5x5 speedup over 1000 matrices (GCC 4.8 -O3, Ivy Bridge); series: ISPC vs GCC DGEMM, ISPC vs GCC SGEMM; bar values: 7.6, 3.7
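The slide does not show how the aligned buffers were allocated. One possible approach on the host side, sketched here under the assumption of a C11 toolchain and a 32-byte alignment (the width of a 256-bit AVX register), is `aligned_alloc`. The names `SIMD_ALIGN`, `alloc_aligned_floats` and `is_simd_aligned` are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define SIMD_ALIGN 32 /* assumption: 32 bytes, one 256-bit AVX register */

/* Allocate room for `count` floats on a SIMD_ALIGN boundary.
 * C11 aligned_alloc expects the size to be a multiple of the alignment,
 * so the byte count is rounded up. Release the buffer with free(). */
float *alloc_aligned_floats(size_t count)
{
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + SIMD_ALIGN - 1) / SIMD_ALIGN * SIMD_ALIGN;
    if (rounded == 0)
        rounded = SIMD_ALIGN;
    return aligned_alloc(SIMD_ALIGN, rounded);
}

/* Returns 1 if the pointer sits on a SIMD_ALIGN boundary, 0 otherwise. */
int is_simd_aligned(const void *p)
{
    return ((uintptr_t)p % SIMD_ALIGN) == 0;
}
```

For the benchmark on the slide this would mean allocating `A`, `B` and `C` with something like `alloc_aligned_floats(5 * 5 * 1000)` before packing the matrices.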
  • 12. KALMAN FILTER TRACKING
    • The Kalman filter method is intended for finding the optimum estimation r of an unknown vector rt according to the measurements mk, k=1...n, of the vector rt.
    • It involves plenty of linear algebra operations, so it is a good use case for vectorization.
    • Caveats:
      ‣ tracks have different numbers of hits (use sorting)
      ‣ a hit can be 1- or 2-dimensional (serialize branching)
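The "use sorting" caveat can be illustrated on the host side: if tracks are ordered by hit count before being handed to the vectorized filter, the tracks that end up in the same gang have similar hit counts, so fewer lanes sit idle while the longest track of the gang is still being processed. The sketch below assumes a hypothetical `TrackInfo` record; the slides do not show the actual data structures.

```c
#include <stdlib.h>

/* Hypothetical per-track record, for illustration only. */
typedef struct {
    int track_id;
    int hit_count;
} TrackInfo;

/* qsort comparator: larger hit counts first. */
static int by_hit_count_desc(const void *a, const void *b)
{
    const TrackInfo *x = a;
    const TrackInfo *y = b;
    return (y->hit_count > x->hit_count) - (y->hit_count < x->hit_count);
}

/* Reorder tracks so that neighbours have similar hit counts. */
void sort_tracks_by_hits(TrackInfo *tracks, size_t n)
{
    qsort(tracks, n, sizeof *tracks, by_hit_count_desc);
}
```

The `track_id` field keeps the original position, so results can be written back in the input order after filtering.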
  • 13. KALMAN FILTER: 100 EVENTS, ~100 TRACKS WITH ~10 HITS EACH

        export void startFilter(uniform KalmanFilter * uniform filter,
                                uniform KalmanFilterParameter * uniform param){
            foreach(i = 0 ... filter->ntracks){
                filterTrack(filter, param, i);
            }
        }

        inline void filterTrack(uniform KalmanFilter * uniform filter,
                                uniform KalmanFilterParameter * uniform param,
                                int i){
            ...
            for(uniform int h = 0; h < param->max_hit_count; h++){
                if(h >= param->hit_count[i])
                    continue;
                predictState(filter, param, h, i);
                predictCovariance(filter, param, h, i);
                if(param->hits[h].is2Dim[i]){
                    ...
                    correctGain2D(filter, i);
                    correctState2D(filter, i);
                    correctCovariance2D(filter, i);
                }else{
                    ...
                    correctGain1D(filter, i);
                    correctState1D(filter, i);
                    correctCovariance1D(filter, i);
                }
            }
            ...
        }

    Chart: KalmanFilter speedup (double precision), Ivy Bridge; series: ISPC vs scalar GSL with AoS to SoA conversion, ISPC vs scalar GSL assuming data is preconverted; bar values: 5.3, 3.4
  • 14. WHAT ABOUT OPENCL? FIRST IMPRESSIONS
    • Intel’s OpenCL embeds an implicit vectorization module which has some similarities with ISPC, but...
      ‣ ISPC warns the user if an inefficient data access pattern is detected
      ‣ in ISPC the programmer can specify which code is executed serially and which is vectorized (for vs foreach loop)
      ‣ variables can be declared as uniform or varying in ISPC
      ‣ ISPC supports lightweight kernel calls, while in OpenCL an API call to a driver must be made
    • OpenCL has native support for the Xeon Phi
      ‣ ISPC will support the Xeon Phi natively when LLVM does
    • Porting code from ISPC to OpenCL and vice versa is relatively easy
      ‣ OpenCL comes with some boilerplate code though
      ‣ task-parallel composition with TBB works with both ISPC and OpenCL
      ‣ but ISPC is easier to compose with an arbitrary task scheduler, while OpenCL requires device fission
  • 15. WHAT ABOUT OPENCL? FIRST IMPRESSIONS

        __kernel void gemm(__global float *A, __global float *B, __global float *C,
                           const int M, const int N, const int K, const int num){
            const int idx = get_global_id(0);
            if(idx >= num)
                return;
            for(int i = 0; i < M; i++){
                for(int j = 0; j < N; j++){
                    float sum = 0;
                    for(int k = 0; k < K; k++){
                        sum += A[i*K*num + k*num + idx] * B[k*N*num + j*num + idx];
                    }
                    C[i*N*num + j*num + idx] = sum;
                }
            }
        }

    AVX:

        .LBB0_8:
            movslq  %esi, %rsi
            vmovups (%rdi,%rsi,4), %xmm2
            movslq  %ebp, %rbp
            vmovups (%rcx,%rbp,4), %xmm3
            vmulps  %xmm2, %xmm3, %xmm2
            vaddps  %xmm2, %xmm1, %xmm1
            addl    %edx, %ebp
            addl    %eax, %esi
            decl    %r11d
            jne     .LBB0_8

    AVX2:

        .LBB0_8:
            movslq  %r11d, %r11
            vmovups (%r14,%r11,4), %ymm2
            movslq  %esi, %rsi
            vmovups (%r12,%rsi,4), %ymm3
            vmulps  %ymm2, %ymm3, %ymm2
            vaddps  %ymm2, %ymm1, %ymm1
            addl    %edx, %esi
            addl    %r13d, %r11d
            decl    %r10d
            jne     .LBB0_8

    Why is it not using 256-bit registers?
    Bottom line: too early for numbers
  • 16. FINAL THOUGHTS: THE VECTORIZATION FINAL SOLUTION?
    • ISPC is by far the best option I have seen to exploit vectorization
      ‣ it gives the programmer an abstract machine model to reason about
      ‣ it warns about inefficient memory accesses (a.k.a. gathers and scatters)
    • In other words, it’s not going to get much easier than that, but...
      ‣ full C++ support would be a welcome addition
      ‣ linear algebra operations need to be reimplemented
    • ISPC is stable, extremely well documented and open source
    • For more information visit http://ispc.github.io
  • 19. MEMORY MATTERS
    • Addressing modes
      ‣ SSEx and AVX1 do not support strided access patterns or gather-scatter accesses, which forces the compiler to generate scalar instructions
      ‣ Even when “fancy” access patterns are supported (e.g. IMCI and AVX2), a penalty is paid
      ‣ Convert your arrays of structures to structures of arrays
    • Memory alignment
      ‣ Unaligned memory access may generate scalar instructions for otherwise vectorizable code
      ‣ Vectorized loads from unaligned memory suffer from a penalty
      ‣ The penalty has decreased over the years and may become negligible in the future
      ‣ Align data in memory to the width of the SIMD unit if possible
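The advice to convert arrays of structures (AoS) to structures of arrays (SoA) can be sketched in plain C. The types `PointAoS` and `PointsSoA` and the function `aos_to_soa` are hypothetical names used only for illustration; the idea is that after the conversion each field lives in its own contiguous array, so a vector load can fetch the same field of several consecutive elements without a strided or gather access.

```c
#include <stddef.h>

/* Array-of-structures element: the fields of one point are adjacent,
 * so the x values of consecutive points are 3 floats apart (strided). */
typedef struct {
    float x, y, z;
} PointAoS;

/* Structure of arrays: each field is stored contiguously. */
typedef struct {
    float *x;
    float *y;
    float *z;
} PointsSoA;

/* Copy n points from the AoS layout into caller-provided SoA buffers,
 * each of which must hold at least n floats. */
void aos_to_soa(const PointAoS *in, PointsSoA out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out.x[i] = in[i].x;
        out.y[i] = in[i].y;
        out.z[i] = in[i].z;
    }
}
```

The conversion itself costs time, which is why the Kalman filter slide reports separate speedups with the conversion included and with the data assumed to be preconverted.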