SlideShare une entreprise Scribd logo
1  sur  55
Télécharger pour lire hors ligne
Software Abstractions for Parallel Architectures 
Joel Falcou 
LRI - CNRS - INRIA 
HDR Thesis Defense 
12/01/2014
The Paradigm Change in Science 
From Experiments to Simulations 
 Simulations is now an integral 
part of the Scientic Method 
 Scientic Computing enables 
larger, faster, more accurate 
Research 
 Fast Simulation is Time Travel as 
scientic results are now more 
readily available 
Local Galaxy Cluster Simulation - Illustris project 
Computing is rst and foremost a mainstream science tool 
2 of 41
The Paradigm Change in Science 
The Parallel Hell 
 Heat Wall: Growing cores 
instead of GHz 
 Hierarchical and heterogeneous 
parallel systems are the norm 
 The Free Lunch is over as 
hardware complexity rises faster 
than the average developer skills 
Local Galaxy Cluster Simulation - Illustris project 
The real challenge in HPC is the Expressiveness/Efficiency War 
2 of 41
The Expressiveness/Efficiency War 
Single Core Era 
Performance 
Expressiveness 
C/Fort. 
C++ 
Java 
Multi-Core/SIMD Era 
Performance 
Sequential 
Expressiveness 
SIMD 
Threads 
Heterogenous Era 
Performance 
Sequential 
Expressiveness 
GPU 
Phi 
SIMD 
Threads 
Distributed 
As parallel systems complexity grows, the expressiveness gap turns into an ocean 
3 of 41
Designing tools for Scientic Computing 
Objectives 
1. Be non-disruptive 
2. Domain driven optimizations 
3. Provide intuitive API for the user 
4. Support a wide architectural landscape 
5. Be efficient 
4 of 41
Designing tools for Scientic Computing 
Objectives 
1. Be non-disruptive 
2. Domain driven optimizations 
3. Provide intuitive API for the user 
4. Support a wide architectural landscape 
5. Be efficient 
Our Approach 
 Design tools as C++ libraries (1) 
 Design these libraries as Domain Specic Embedded Languages (DSEL) (2+3) 
 Use Parallel Programming Abstractions as parallel components (4) 
 Use Generative Programming to deliver performance (5) 
4 of 41
Talk Layout 
Introduction 
Abstractions  Efficiency 
Experimental Results 
Conclusion 
5 of 41
Why Parallel Programming Models ? 
Limits of regular tools 
 Unstructured parallelism is error-prone 
 Low level parallel tools are non-composable 
 Contribute to the Expressiveness Gap 
6 of 41
Why Parallel Programming Models ? 
Limits of regular tools 
 Unstructured parallelism is error-prone 
 Low level parallel tools are non-composable 
 Contribute to the Expressiveness Gap 
Available Models 
 Performance centric: P-RAM, LOG-P, BSP 
 Pattern centric: Futures, Skeletons 
 Data centric: HTA, PGAS 
6 of 41
Why Parallel Programming Models ? 
Limits of regular tools 
 Unstructured parallelism is error-prone 
 Low level parallel tools are non-composable 
 Contribute to the Expressiveness Gap 
Available Models 
 Performance centric: P-RAM, LOG-P, BSP 
 Pattern centric: Futures, Skeletons 
 Data centric: HTA, PGAS 
6 of 41
Bulk Synchronous Parallelism [Valiant, McColl 90] 
Principles 
 Machine Model 
 Execution Model 
 Analytic Cost Model 
C 
o 
m 
p 
u 
t 
e 
B 
a 
r 
r 
i 
e 
r 
C 
o 
m 
m 
Wmax h.g 
P0 
P1 
P2 
P3 
Superstep T Superstep T+1 
Wmax h.g L 
BSP Execution Model 
7 of 41
Bulk Synchronous Parallelism [Valiant, McColl 90] 
Advantages 
 Simple set of primitives 
 Implementable on any 
kind of hardware 
 Possibility to reason about 
BSP programs 
C 
o 
m 
p 
u 
t 
e 
B 
a 
r 
r 
i 
e 
r 
C 
o 
m 
m 
Wmax h.g 
P0 
P1 
P2 
P3 
Superstep T Superstep T+1 
Wmax h.g L 
BSP Execution Model 
7 of 41
Parallel Skeletons [Cole 89] 
Principles 
 There are patterns in parallel applications 
 Those patterns can be generalized in Skeletons 
 Applications are assembled as a combination of such patterns 
Functional point of view 
 Skeletons are Higher-Order Functions 
 Skeletons support a compositionnal semantic 
 Applications become composition of state-less functions 
8 of 41
Parallel Skeletons [Cole 89] 
Principles 
 There are patterns in parallel applications 
 Those patterns can be generalized in Skeletons 
 Applications are assembled as a combination of such patterns 
Classical Skeletons 
 Data parallel: map, fold, scan 
 Task parallel: par, pipe, farm 
 More complex: Distribuable Homomorphism, Divide  Conquer, … 
8 of 41
Relevance to our Objectives 
Why using Parallel Skeletons ? 
 Write code independant of parallel programming minutiae 
 Composability supports hierarchical architectures 
 Code is scalable and easy to maintain 
Why using BSP ? 
 Cost model guide development 
 Few primitives mean that intellectual burden is low 
 Good medium for developping skeletons 
How to ensure performance of those models’ implementations ? 
9 of 41
Domain Specic Embedded Languages 
Domain Specic Languages 
 Non-Turing complete declarative languages 
 Solve a single type of problems 
 Express what to do instead of how to do it 
 E.g: SQL, M, M, … 
From DSL to DSEL [Abrahams 2004] 
 A DSL incorporates domain-specic notation, constructs, and abstractions as 
fundamental design considerations. 
 A Domain Specic Embedded Languages (DSEL) is simply a library that meets the 
same criteria 
 Generative Programming is one way to design such libraries 
10 of 41
Generative Programming [Eisenecker 97] 
Domain Specific 
Application Description 
Generative Component Concrete Application 
Translator 
Parametric 
Sub-components 
11 of 41
Meta-programming as a Tool 
Denition 
Meta-programming is the writing of computer programs that analyse, transform and 
generate other programs (or themselves) as their data. 
Meta-programmable Languages 
 metaOCAML : runtime code generation via code quoting 
 Template Haskell : compile-time code generation via templates 
 C++ : compile-time code generation via templates 
C++ meta-programming 
 Relies on the Turing-complete C++  sub-language 
 Handles types and integral constants at compile-time 
  classes and functions act as code quoting 
12 of 41
The Expression Templates Idiom 
Principles 
 Relies on extensive operator 
overloading 
 Carries semantic information 
around code fragment 
 Introduces DSLs without 
disrupting dev. chain 
matrix x(h,w),a(h,w),b(h,w); 
x = cos(a) + (b*a); 
exprassign 
,exprmatrix 
,exprplus 
, exprcos 
,exprmatrix 
 
, exprmultiplies 
,exprmatrix 
,exprmatrix 
 
(x,a,b); 
+ 
= 
cos * 
a b a 
x 
#pragma omp parallel for 
for(int j=0;jh;++j) 
{ 
for(int i=0;iw;++i) 
{ 
x(j,i) = cos(a(j,i)) 
+ ( b(j,i) 
* a(j,i) 
); 
} 
} 
Arbitrary Transforms applied 
on the meta-AST 
General Principles of Expression Templates 
13 of 41
The Expression Templates Idiom 
Advantages 
 Generic implementation becomes 
self-aware of optimizations 
 API abstraction level is arbitrary 
high 
 Accessible through high-level 
tools like B.P 
matrix x(h,w),a(h,w),b(h,w); 
x = cos(a) + (b*a); 
exprassign 
,exprmatrix 
,exprplus 
, exprcos 
,exprmatrix 
 
, exprmultiplies 
,exprmatrix 
,exprmatrix 
 
(x,a,b); 
+ 
= 
cos * 
a b a 
x 
#pragma omp parallel for 
for(int j=0;jh;++j) 
{ 
for(int i=0;iw;++i) 
{ 
x(j,i) = cos(a(j,i)) 
+ ( b(j,i) 
* a(j,i) 
); 
} 
} 
Arbitrary Transforms applied 
on the meta-AST 
General Principles of Expression Templates 
13 of 41
Our Contributions 
Our Strategy 
 Applies DSEL generation techniques to parallel programming 
 Maintains low cost of abstractions through meta-programming 
 Maintains abstraction level via modern library design 
Our contributions 
Tools Pub. Scope Applications 
Quaff ParCo’06 MPI Skeletons Real-time 3D reconstruction 
SkellPU PACT’08 Skeleton on Cell BE Real-time Image processing 
BSP++ IJPP’12 MPI/OpenMP BSP Bioinformatics, Model Checking 
NT2 JPDC’14 Data Parallel Matlab Fluid Dynamics, Vision 
14 of 41
Example of BSP++ Application 
Khaled Hamidouche PHD 2008-2011 in collab. with Univ. Brasilia 
BSP Smith  Waterman 
 SW computes DNA sequences alignment 
 BSP++ implementation was written once and run on 7 different hardwares 
 Efficiency of 95+% even on 6000 cores super-computer 
Platform MaxSize # Elements Speedup GCUPs 
cluster (MPI) 1,072,950 128 cores 73x 6.53 
cluster (MPI/OpenMP) 1,072,950 128 cores 116x 10.41 
OpenMP 1,072,950 16 cores 16x 0.40 
CellBE 85,603 8 SPEs — 0.14 
cluster of CellBEs 85,603 24 SPEs (8:24) 2.8x 0.37 
Hopper(MPI) 5,303,436 3072 cores 260x 3.09 
Hopper(MPI+OpenMP) 24,894,269 6144 cores 5664x 15,5 
15 of 41
Second Look at our Contributions 
Development Limitations 
 DSELs are mostly tied to the domain model 
 Architecture support is often an afterthought 
 Extensibility is difficult as many refactoring are required per architecture 
 Example : No proper way to support GPUs with those implementation techniques 
16 of 41
Second Look at our Contributions 
Development Limitations 
 DSELs are mostly tied to the domain model 
 Architecture support is often an afterthought 
 Extensibility is difficult as many refactoring are required per architecture 
 Example : No proper way to support GPUs with those implementation techniques 
Proposed Method 
 Extends Generative Programming to take this architecture into account 
 Provides an architecture description DSEL 
 Integrates this description in the code generation process 
16 of 41
Architecture Aware Generative Programming 
17 of 41
Software refactoring 
Tools Issues Changes 
Quaff Raw skeletons API Re-engineered as part of NT2 
SkellPU Too architecture specic Re-engineered as part of NT2 
BSP++ Integration issues Integrate hybrid code generation 
NT2 Not easily extendable Integrate Quaff Skeleton models 
Boost.SIMD - Side product of NT2 restructuration 
Conclusion 
 Skeletons are ne as parallel middleware 
 Model based abstractions are not high level enough 
 For low level architectures, the simplest model is often the best 
18 of 41
Boost.SIMD 
Pierre Estérie PHD 2010-2014 
Principles 
 Provides simple C++ API over SIMD 
extensions 
 Supports every Intel, PPC and ARM 
instructions sets 
 Fully integrates with modern C++ 
idioms 
Sparse Tridiagonal Solver - collaboration with M. Baboulin and Y. wang 
19 of 41
Talk Layout 
Introduction 
Abstractions  Efficiency 
Experimental Results 
Conclusion 
20 of 41
The Numerical Template Toolbox 
Pierre Estérie PHD 2010-2014 
NT2 as a Scientic Computing Library 
 Provides a simple, M-like interface for users 
 Provides high-performance computing entities and primitives 
 Is easily extendable 
Components 
 Uses Boost.SIMD for in-core optimizations 
 Uses recursive parallel skeletons 
 Supports task parallelism through Futures 
21 of 41
The Numerical Template Toolbox 
Principles 
 tableT,S is a simple, multidimensional array object that exactly 
mimics M array behavior and functionalities 
 500+ functions usable directly either on table or on any scalar values 
as in M 
22 of 41
The Numerical Template Toolbox 
Principles 
 tableT,S is a simple, multidimensional array object that exactly 
mimics M array behavior and functionalities 
 500+ functions usable directly either on table or on any scalar values 
as in M 
How does it works 
 Take a .m le, copy to a .cpp le 
22 of 41
The Numerical Template Toolbox 
Principles 
 tableT,S is a simple, multidimensional array object that exactly 
mimics M array behavior and functionalities 
 500+ functions usable directly either on table or on any scalar values 
as in M 
How does it works 
 Take a .m le, copy to a .cpp le 
 Add #include nt2/nt2.hpp and do cosmetic changes 
22 of 41
The Numerical Template Toolbox 
Principles 
 tableT,S is a simple, multidimensional array object that exactly 
mimics M array behavior and functionalities 
 500+ functions usable directly either on table or on any scalar values 
as in M 
How does it works 
 Take a .m le, copy to a .cpp le 
 Add #include nt2/nt2.hpp and do cosmetic changes 
 Compile the le and link with libnt2.a 
22 of 41
NT2 - From M to C++ 
M code 
A1 = 1 : 1 0 0 0 ; 
A2 = A1 + randn ( size ( A1 ) ) ; 
X = lu ( A1 * A1 ’) ; 
rms = sqrt ( sum ( sqr ( A1 (:) - A2 (:) ) ) / numel ( A1 ) ) ; 
NT2 code 
table  double  A1 = _ (1. ,1000.) ; 
table  double  A2 = A1 + randn ( size ( A1 ) ) ; 
table  double  X = lu ( m t i m e s ( A1 , trans ( A1 ) ) ; 
d o u b l e rms = sqrt ( sum ( sqr ( A1 ( _ ) - A2 ( _ ) ) ) / numel ( A1 ) ) ; 
23 of 41
Parallel Skeletons extraction process 
A = B / sum(C+D); 
= 
A = 
B sum 
+ 
C D 
fold 
transform 
24 of 41
Parallel Skeletons extraction process 
A = B / sum(C+D); 
; ; 
= 
A = 
B sum 
+ 
C D 
fold 
transform 
= 
tmp sum 
+ 
C D 
fold 
) 
= 
A = 
B tmp 
transform 
25 of 41
From data to task parallelism 
Antoine Tran Tan PHD, 2012-2015 
Limits of the fork-join model 
 Synchronization cost due to implicit barriers 
 Under-exploitation of potential parallelism 
 Poor data locality and no inter-statement optimization 
26 of 41
From data to task parallelism 
Antoine Tran Tan PHD, 2012-2015 
Limits of the fork-join model 
 Synchronization cost due to implicit barriers 
 Under-exploitation of potential parallelism 
 Poor data locality and no inter-statement optimization 
Skeletons from the Future 
 Adapt current skeletons for taskication 
 Use Futures ( or HPX) to automatically pipeline 
 Derive a dependency graph between statements 
26 of 41
Parallel Skeletons extraction process - Take 2 
A = B / sum(C+D); 
; ; 
= 
tmp sum 
+ 
C D 
fold 
= 
A = 
B tmp 
transform 
27 of 41
Parallel Skeletons extraction process - Take 2 
A = B / sum(C+D); 
fold 
= 
= 
tmp(3) sum(3) 
+ 
tmp(1) sum 
+ 
C(:; 3) D(:; 3) 
= 
spawnertransform,OpenMP 
= 
= 
A(:; 3) = 
A(:; 2) = 
A(:; 1) = 
B(:; 3) tmp(3) 
B(:; 2) tmp(2) 
transform 
C(:; 1) D(:; 1) 
fold 
= 
tmp(2) sum(2) 
+ 
C(:; 2) D(:; 2) 
B(:; 1) tmp(1) 
transform 
workerfold,simd 
workertransform,simd 
spawnertransform,OpenMP 
; 
28 of 41
Motion Detection 
Lacassagne et al., ICIP 2009 
 Sigma-Delta algorithm based on background substraction 
 Use local gaussian model of lightness variation to detect motion 
 Challenge: Very low arithmetic density 
 Challenge: Integer-based implementation with small range 
29 of 41
Motion Detection 
table  char  s i g m a _ d e l t a ( table  char  b a c k g r o u n d 
, table  char  const  frame 
, table  char  v a r i a n c e 
) 
{ 
// E s t i m a t e Raw M o v e m e n t 
b a c k g r o u n d = s e l i n c ( b a c k g r o u n d  frame 
, s e l d e c ( b a c k g r o u n d  frame , b a c k g r o u n d ) 
) ; 
table  char  diff = dist ( background , frame ) ; 
// C o m p u t e Local V a r i a n c e 
table  char  sig3 = muls ( diff ,3) ; 
var = i f _ e l s e ( diff != 0 
, s e l i n c ( v a r i a n c e  sig3 
, s e l d e c ( var  sig3 , v a r i a n c e ) 
) 
, v a r i a n c e 
) ; 
// G e n e r a t e M o v e m e n t Label 
r e t u r n i f _ z e r o _ e l s e _ o n e ( diff  v a r i a n c e ) ; 
} 
30 of 41
Motion Detection 
18 
16 
14 
12 
10 
8 
6 
4 
2 
0 
512x512 1024x1024 
cycles/element 
Image Size (N x N) 
x6.8 
x14.8 
x16.5 
x2.1 
x3.6 
x6.7 
x15.3 
x18 
x2.3 
x3.99 
x10.8 
x10.8 
SCALAR 
HALF CORE 
FULL CORE 
SIMD 
JRTIP2008 
SIMD + HALF CORE 
SIMD + FULL CORE 
31 of 41
Black and Scholes Option Pricing 
NT2 Code 
table  float  b l a c k s c h o l e s ( table  float  const  Sa , table  float  const  Xa 
, table  float  const  Ta 
, table  float  const  ra , table  float  const  va 
) 
{ 
table  float  da = sqrt ( Ta ) ; 
table  float  d1 = log ( Sa / Xa ) + ( sqr ( va ) *0.5 f + ra ) * Ta /( va * da ) ; 
table  float  d2 = d1 - va * da ; 
r e t u r n Sa * n o r m c d f ( d1 ) - Xa * exp ( - ra * Ta ) * n o r m c d f ( d2 ) ; 
} 
32 of 41
Black and Scholes Option Pricing 
NT2 Code with loop fusion 
table  float  b l a c k s c h o l e s ( table  float  const  Sa , table  float  const  Xa 
, table  float  const  Ta 
, table  float  const  ra , table  float  const  va 
) 
{ 
// P r e a l l o c a t e t e m p o r a r y t a b l e s 
table  float  da ( e x t e n t ( Ta ) ) , d1 ( e x t e n t ( Ta ) ) , d2 ( e x t e n t ( Ta ) ) , R ( e x t e n t ( Ta ) ) ; 
// tie merge loop nest and i n c r e a s e cache l o c a l i t y 
tie ( da , d1 , d2 , R ) = tie ( sqrt ( Ta ) 
, log ( Sa / Xa ) + ( sqr ( va ) *0.5 f + ra ) * Ta /( va * da ) 
, d1 - va * da 
, Sa * n o r m c d f ( d1 ) - Xa * exp ( - ra * Ta ) * n o r m c d f ( d2 ) 
) ; 
r e t u r n R ; 
} 
32 of 41
Black and Scholes Option Pricing 
Performance 
1000000 
150 
100 
50 
0 
x1.89 
x2.91 
x5.58 
x6.30 
Size 
cycle/value 
scalar 
SSE2 
AVX2 
SSE2, 4 cores 
AVX2, 4 cores 
33 of 41
Black and Scholes Option Pricing 
Performance with loop fusion/futurisation 
1000000 
150 
100 
50 
0 
x2.27 
x4.13 
x8.05 
x11.12 
Size 
cycle/value 
scalar 
SSE2 
AVX2 
SSE2, 4 cores 
AVX2, 4 cores 
34 of 41
LU Decomposition 
Algorithm 
A00 
A10 A01 A02 
A20 A11 
A21 
A12 
A22 A11 
A21 A12 
A22 
A22 
step 1 
step 2 
step 3 
step 4 
step 5 
step 6 
step 7 
DGETRF 
DGESSM 
DTSTRF 
DSSSSM 
35 of 41
LU Decomposition 
Performance 
0 10 20 30 40 50 
100 
50 
0 
Number of cores 
Median GFLOPS 
8000  8000 LU decomposition 
NT2 
Intel MKL 
36 of 41
Talk Layout 
Introduction 
Abstractions  Efficiency 
Experimental Results 
Conclusion 
37 of 41
Conclusion 
Parallel Computing for Scientist 
 Software Libraries built as Generic and Generative components can solve a large 
chunk of parallelism related problems while being easy to use. 
 Like regular language, DSEL needs informations about the hardware system 
 Integrating hardware descriptions as Generic components increases tools portability 
and re-targetability 
38 of 41
Conclusion 
Parallel Computing for Scientist 
 Software Libraries built as Generic and Generative components can solve a large 
chunk of parallelism related problems while being easy to use. 
 Like regular language, DSEL needs informations about the hardware system 
 Integrating hardware descriptions as Generic components increases tools portability 
and re-targetability 
Our Achievements 
 A new method for parallel software development 
 Efficient libraries working on large subset of hardware 
 High level of performances across a wide application spectrum 
38 of 41
Works in Progress 
Application to Accelerators 
 Exploration of proper skeleton implementation on GPUs 
 Adaptation of Future based code generator 
 In progress with Ian Masliah’ PHD thesis 
Parallelism within C++ 
 SIMD as part of the standard library 
 Proposal N3571 for standard SIMD computation 
 Interoperability with current parallel model of C++ 
39 of 41
Perspectives 
DSEL as C++ rst class idiom 
 Build partial evaluation into the language 
 Ease transition between regular and meta C++ 
 Mid-term Prospect: metaOCAML like quoting for C++ 
DSEL and compilers relationship 
 C++ DSEL hits a limit on their applicability 
 Compilers often lack high level informations for proper optimization 
 Mid-term Prospect: Hybrid library/compiler approaches for DSEL 
40 of 41
Thanks for your attention

Contenu connexe

Tendances (20)

Matlab Basic Tutorial
Matlab Basic TutorialMatlab Basic Tutorial
Matlab Basic Tutorial
 
Storage classes, linkage & memory management
Storage classes, linkage & memory managementStorage classes, linkage & memory management
Storage classes, linkage & memory management
 
Matlab
MatlabMatlab
Matlab
 
Algorithm & data structure lec2
Algorithm & data structure lec2Algorithm & data structure lec2
Algorithm & data structure lec2
 
Glimpses of C++0x
Glimpses of C++0xGlimpses of C++0x
Glimpses of C++0x
 
Archi Modelling
Archi ModellingArchi Modelling
Archi Modelling
 
Introduction Of C++
Introduction Of C++Introduction Of C++
Introduction Of C++
 
Matlab ppt
Matlab pptMatlab ppt
Matlab ppt
 
C Programming Tutorial - www.infomtec.com
C Programming Tutorial - www.infomtec.comC Programming Tutorial - www.infomtec.com
C Programming Tutorial - www.infomtec.com
 
Fundamentals of matlab
Fundamentals of matlabFundamentals of matlab
Fundamentals of matlab
 
Data types in c++
Data types in c++ Data types in c++
Data types in c++
 
2. data, operators, io
2. data, operators, io2. data, operators, io
2. data, operators, io
 
C Programming - Refresher - Part III
C Programming - Refresher - Part IIIC Programming - Refresher - Part III
C Programming - Refresher - Part III
 
Matlab introduction
Matlab introductionMatlab introduction
Matlab introduction
 
Introduction to C++
Introduction to C++ Introduction to C++
Introduction to C++
 
Introduction to matlab
Introduction to matlabIntroduction to matlab
Introduction to matlab
 
MATLAB INTRODUCTION
MATLAB INTRODUCTIONMATLAB INTRODUCTION
MATLAB INTRODUCTION
 
Matlab ppt
Matlab pptMatlab ppt
Matlab ppt
 
Introduction to matlab
Introduction to matlabIntroduction to matlab
Introduction to matlab
 
Deep C
Deep CDeep C
Deep C
 

En vedette

Lattice Boltzmann sur architecture multicoeurs vectorielle - Une approche de ...
Lattice Boltzmann sur architecture multicoeurs vectorielle - Une approche de ...Lattice Boltzmann sur architecture multicoeurs vectorielle - Une approche de ...
Lattice Boltzmann sur architecture multicoeurs vectorielle - Une approche de ...Joel Falcou
 
Tools for Meta-Programming
Tools for Meta-ProgrammingTools for Meta-Programming
Tools for Meta-Programmingelliando dias
 
Creative coding in art education -Fads presentation
Creative coding in art education -Fads presentationCreative coding in art education -Fads presentation
Creative coding in art education -Fads presentationTomi Dufva
 
Machine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative ModelsMachine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative Modelsbutest
 
Creative Coding in Interaction Design with Tim Stutts
Creative Coding in Interaction Design with Tim StuttsCreative Coding in Interaction Design with Tim Stutts
Creative Coding in Interaction Design with Tim StuttsFITC
 
Summary - Transformational-Generative Theory
Summary - Transformational-Generative TheorySummary - Transformational-Generative Theory
Summary - Transformational-Generative TheoryMarielis VI
 
Transformational-Generative Grammar
Transformational-Generative GrammarTransformational-Generative Grammar
Transformational-Generative GrammarRuth Ann Llego
 
Deep structure and surface structure
Deep structure and surface structureDeep structure and surface structure
Deep structure and surface structureAsif Ali Raza
 
Generative Design and Design Hacking
Generative Design and Design HackingGenerative Design and Design Hacking
Generative Design and Design HackingDesignit
 

En vedette (10)

Lattice Boltzmann sur architecture multicoeurs vectorielle - Une approche de ...
Lattice Boltzmann sur architecture multicoeurs vectorielle - Une approche de ...Lattice Boltzmann sur architecture multicoeurs vectorielle - Une approche de ...
Lattice Boltzmann sur architecture multicoeurs vectorielle - Une approche de ...
 
Tools for Meta-Programming
Tools for Meta-ProgrammingTools for Meta-Programming
Tools for Meta-Programming
 
Creative coding in art education -Fads presentation
Creative coding in art education -Fads presentationCreative coding in art education -Fads presentation
Creative coding in art education -Fads presentation
 
Geometria Projetiva
Geometria ProjetivaGeometria Projetiva
Geometria Projetiva
 
Machine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative ModelsMachine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative Models
 
Creative Coding in Interaction Design with Tim Stutts
Creative Coding in Interaction Design with Tim StuttsCreative Coding in Interaction Design with Tim Stutts
Creative Coding in Interaction Design with Tim Stutts
 
Summary - Transformational-Generative Theory
Summary - Transformational-Generative TheorySummary - Transformational-Generative Theory
Summary - Transformational-Generative Theory
 
Transformational-Generative Grammar
Transformational-Generative GrammarTransformational-Generative Grammar
Transformational-Generative Grammar
 
Deep structure and surface structure
Deep structure and surface structureDeep structure and surface structure
Deep structure and surface structure
 
Generative Design and Design Hacking
Generative Design and Design HackingGenerative Design and Design Hacking
Generative Design and Design Hacking
 

Similaire à Software Abstractions for Parallel Hardware

Source-to-source transformations: Supporting tools and infrastructure
Source-to-source transformations: Supporting tools and infrastructureSource-to-source transformations: Supporting tools and infrastructure
Source-to-source transformations: Supporting tools and infrastructurekaveirious
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systemsdairsie
 
Track A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMTrack A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMchiportal
 
Software engineering
Software engineeringSoftware engineering
Software engineeringFahe Em
 
Software engineering
Software engineeringSoftware engineering
Software engineeringFahe Em
 
Re-implementing Thrift using MDE
Re-implementing Thrift using MDERe-implementing Thrift using MDE
Re-implementing Thrift using MDESina Madani
 
Prograamção Paralela Baseada em Componentes (Introduzido o Modelo #)
Prograamção Paralela Baseada em Componentes (Introduzido o Modelo #)Prograamção Paralela Baseada em Componentes (Introduzido o Modelo #)
Prograamção Paralela Baseada em Componentes (Introduzido o Modelo #)Heron Carvalho
 
Simon Brown: Software Architecture as Code at I T.A.K.E. Unconference 2015
Simon Brown: Software Architecture as Code at I T.A.K.E. Unconference 2015Simon Brown: Software Architecture as Code at I T.A.K.E. Unconference 2015
Simon Brown: Software Architecture as Code at I T.A.K.E. Unconference 2015Mozaic Works
 
05 Lecture - PARALLEL Programming in C ++.pdf
05 Lecture - PARALLEL Programming in C ++.pdf05 Lecture - PARALLEL Programming in C ++.pdf
05 Lecture - PARALLEL Programming in C ++.pdfalivaisi1
 
Overview of VS2010 and .NET 4.0
Overview of VS2010 and .NET 4.0Overview of VS2010 and .NET 4.0
Overview of VS2010 and .NET 4.0Bruce Johnson
 
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...jsvetter
 
Concurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsConcurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsCSCJournals
 
A DSEL for Addressing the Problems Posed by Parallel Architectures
A DSEL for Addressing the Problems Posed by Parallel ArchitecturesA DSEL for Addressing the Problems Posed by Parallel Architectures
A DSEL for Addressing the Problems Posed by Parallel ArchitecturesJason Hearne-McGuiness
 
Compiler gate question key
Compiler gate question keyCompiler gate question key
Compiler gate question keyArthyR3
 
.NET Core, ASP.NET Core Course, Session 3
.NET Core, ASP.NET Core Course, Session 3.NET Core, ASP.NET Core Course, Session 3
.NET Core, ASP.NET Core Course, Session 3aminmesbahi
 
LIFT: A Legacy InFormation retrieval Tool
LIFT: A Legacy InFormation retrieval ToolLIFT: A Legacy InFormation retrieval Tool
LIFT: A Legacy InFormation retrieval ToolKellyton Brito
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowKaxil Naik
 
Aggregate Programming in Scala
Aggregate Programming in ScalaAggregate Programming in Scala
Aggregate Programming in ScalaRoberto Casadei
 

Similaire à Software Abstractions for Parallel Hardware (20)

Source-to-source transformations: Supporting tools and infrastructure
Source-to-source transformations: Supporting tools and infrastructureSource-to-source transformations: Supporting tools and infrastructure
Source-to-source transformations: Supporting tools and infrastructure
 
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
 
Track A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBMTrack A-Compilation guiding and adjusting - IBM
Track A-Compilation guiding and adjusting - IBM
 
Software engineering
Software engineeringSoftware engineering
Software engineering
 
Software engineering
Software engineeringSoftware engineering
Software engineering
 
Introduction To MDD
Introduction To MDDIntroduction To MDD
Introduction To MDD
 
Re-implementing Thrift using MDE
Re-implementing Thrift using MDERe-implementing Thrift using MDE
Re-implementing Thrift using MDE
 
Prograamção Paralela Baseada em Componentes (Introduzido o Modelo #)
Prograamção Paralela Baseada em Componentes (Introduzido o Modelo #)Prograamção Paralela Baseada em Componentes (Introduzido o Modelo #)
Prograamção Paralela Baseada em Componentes (Introduzido o Modelo #)
 
Simon Brown: Software Architecture as Code at I T.A.K.E. Unconference 2015
Simon Brown: Software Architecture as Code at I T.A.K.E. Unconference 2015Simon Brown: Software Architecture as Code at I T.A.K.E. Unconference 2015
Simon Brown: Software Architecture as Code at I T.A.K.E. Unconference 2015
 
05 Lecture - PARALLEL Programming in C ++.pdf
05 Lecture - PARALLEL Programming in C ++.pdf05 Lecture - PARALLEL Programming in C ++.pdf
05 Lecture - PARALLEL Programming in C ++.pdf
 
Overview of VS2010 and .NET 4.0
Overview of VS2010 and .NET 4.0Overview of VS2010 and .NET 4.0
Overview of VS2010 and .NET 4.0
 
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
Exploring Emerging Technologies in the Extreme Scale HPC Co-Design Space with...
 
Concurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsConcurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core Processors
 
A DSEL for Addressing the Problems Posed by Parallel Architectures
A DSEL for Addressing the Problems Posed by Parallel ArchitecturesA DSEL for Addressing the Problems Posed by Parallel Architectures
A DSEL for Addressing the Problems Posed by Parallel Architectures
 
Compiler gate question key
Compiler gate question keyCompiler gate question key
Compiler gate question key
 
.NET Core, ASP.NET Core Course, Session 3
.NET Core, ASP.NET Core Course, Session 3.NET Core, ASP.NET Core Course, Session 3
.NET Core, ASP.NET Core Course, Session 3
 
LIFT: A Legacy InFormation retrieval Tool
LIFT: A Legacy InFormation retrieval ToolLIFT: A Legacy InFormation retrieval Tool
LIFT: A Legacy InFormation retrieval Tool
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
 
Aq4301224227
Aq4301224227Aq4301224227
Aq4301224227
 
Aggregate Programming in Scala
Aggregate Programming in ScalaAggregate Programming in Scala
Aggregate Programming in Scala
 

Dernier

%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrainmasabamasaba
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfonteinmasabamasaba
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024VictoriaMetrics
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...masabamasaba
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfkalichargn70th171
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationJuha-Pekka Tolvanen
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyviewmasabamasaba
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...masabamasaba
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareJim McKeeth
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...masabamasaba
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfonteinmasabamasaba
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastPapp Krisztián
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2
 

Dernier (20)

%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 

Software Abstractions for Parallel Hardware

  • 1. Software Abstractions for Parallel Architectures Joel Falcou LRI - CNRS - INRIA HDR Thesis Defense 12/01/2014
  • 2. The Paradigm Change in Science From Experiments to Simulations Simulations is now an integral part of the Scientic Method Scientic Computing enables larger, faster, more accurate Research Fast Simulation is Time Travel as scientic results are now more readily available Local Galaxy Cluster Simulation - Illustris project Computing is rst and foremost a mainstream science tool 2 of 41
  • 3. The Paradigm Change in Science The Parallel Hell Heat Wall: Growing cores instead of GHz Hierarchical and heterogeneous parallel systems are the norm The Free Lunch is over as hardware complexity rises faster than the average developer skills Local Galaxy Cluster Simulation - Illustris project The real challenge in HPC is the Expressiveness/Efficiency War 2 of 41
  • 4. The Expressiveness/Efficiency War Single Core Era Performance Expressiveness C/Fort. C++ Java Multi-Core/SIMD Era Performance Sequential Expressiveness SIMD Threads Heterogenous Era Performance Sequential Expressiveness GPU Phi SIMD Threads Distributed As parallel systems complexity grows, the expressiveness gap turns into an ocean 3 of 41
  • 5. Designing tools for Scientic Computing Objectives 1. Be non-disruptive 2. Domain driven optimizations 3. Provide intuitive API for the user 4. Support a wide architectural landscape 5. Be efficient 4 of 41
  • 6. Designing tools for Scientic Computing Objectives 1. Be non-disruptive 2. Domain driven optimizations 3. Provide intuitive API for the user 4. Support a wide architectural landscape 5. Be efficient Our Approach Design tools as C++ libraries (1) Design these libraries as Domain Specic Embedded Languages (DSEL) (2+3) Use Parallel Programming Abstractions as parallel components (4) Use Generative Programming to deliver performance (5) 4 of 41
  • 7. Talk Layout Introduction Abstractions Efficiency Experimental Results Conclusion 5 of 41
  • 8. Why Parallel Programming Models ? Limits of regular tools Unstructured parallelism is error-prone Low level parallel tools are non-composable Contribute to the Expressiveness Gap 6 of 41
  • 9. Why Parallel Programming Models ? Limits of regular tools Unstructured parallelism is error-prone Low level parallel tools are non-composable Contribute to the Expressiveness Gap Available Models Performance centric: P-RAM, LOG-P, BSP Pattern centric: Futures, Skeletons Data centric: HTA, PGAS 6 of 41
  • 10. Why Parallel Programming Models ? Limits of regular tools Unstructured parallelism is error-prone Low level parallel tools are non-composable Contribute to the Expressiveness Gap Available Models Performance centric: P-RAM, LOG-P, BSP Pattern centric: Futures, Skeletons Data centric: HTA, PGAS 6 of 41
  • 11. Bulk Synchronous Parallelism [Valiant, McColl 90] Principles Machine Model Execution Model Analytic Cost Model C o m p u t e B a r r i e r C o m m Wmax h.g P0 P1 P2 P3 Superstep T Superstep T+1 Wmax h.g L BSP Execution Model 7 of 41
  • 12. Bulk Synchronous Parallelism [Valiant, McColl 90] Advantages Simple set of primitives Implementable on any kind of hardware Possibility to reason about BSP programs C o m p u t e B a r r i e r C o m m Wmax h.g P0 P1 P2 P3 Superstep T Superstep T+1 Wmax h.g L BSP Execution Model 7 of 41
  • 13. Parallel Skeletons [Cole 89] Principles There are patterns in parallel applications Those patterns can be generalized in Skeletons Applications are assembled as a combination of such patterns Functional point of view Skeletons are Higher-Order Functions Skeletons support a compositionnal semantic Applications become composition of state-less functions 8 of 41
  • 14. Parallel Skeletons [Cole 89] Principles There are patterns in parallel applications Those patterns can be generalized in Skeletons Applications are assembled as a combination of such patterns Classical Skeletons Data parallel: map, fold, scan Task parallel: par, pipe, farm More complex: Distribuable Homomorphism, Divide Conquer, … 8 of 41
  • 15. Relevance to our Objectives Why using Parallel Skeletons ? Write code independant of parallel programming minutiae Composability supports hierarchical architectures Code is scalable and easy to maintain Why using BSP ? Cost model guide development Few primitives mean that intellectual burden is low Good medium for developping skeletons How to ensure performance of those models’ implementations ? 9 of 41
  • 16. Domain Specic Embedded Languages Domain Specic Languages Non-Turing complete declarative languages Solve a single type of problems Express what to do instead of how to do it E.g: SQL, M, M, … From DSL to DSEL [Abrahams 2004] A DSL incorporates domain-specic notation, constructs, and abstractions as fundamental design considerations. A Domain Specic Embedded Languages (DSEL) is simply a library that meets the same criteria Generative Programming is one way to design such libraries 10 of 41
  • 17. Generative Programming [Eisenecker 97] Domain Specific Application Description Generative Component Concrete Application Translator Parametric Sub-components 11 of 41
  • 18. Meta-programming as a Tool Denition Meta-programming is the writing of computer programs that analyse, transform and generate other programs (or themselves) as their data. Meta-programmable Languages metaOCAML : runtime code generation via code quoting Template Haskell : compile-time code generation via templates C++ : compile-time code generation via templates C++ meta-programming Relies on the Turing-complete C++  sub-language Handles types and integral constants at compile-time  classes and functions act as code quoting 12 of 41
  • 19. The Expression Templates Idiom Principles Relies on extensive operator overloading Carries semantic information around code fragment Introduces DSLs without disrupting dev. chain matrix x(h,w),a(h,w),b(h,w); x = cos(a) + (b*a); exprassign ,exprmatrix ,exprplus , exprcos ,exprmatrix , exprmultiplies ,exprmatrix ,exprmatrix (x,a,b); + = cos * a b a x #pragma omp parallel for for(int j=0;jh;++j) { for(int i=0;iw;++i) { x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) ); } } Arbitrary Transforms applied on the meta-AST General Principles of Expression Templates 13 of 41
  • 20. The Expression Templates Idiom Advantages Generic implementation becomes self-aware of optimizations API abstraction level is arbitrary high Accessible through high-level tools like B.P matrix x(h,w),a(h,w),b(h,w); x = cos(a) + (b*a); exprassign ,exprmatrix ,exprplus , exprcos ,exprmatrix , exprmultiplies ,exprmatrix ,exprmatrix (x,a,b); + = cos * a b a x #pragma omp parallel for for(int j=0;jh;++j) { for(int i=0;iw;++i) { x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) ); } } Arbitrary Transforms applied on the meta-AST General Principles of Expression Templates 13 of 41
  • 21. Our Contributions Our Strategy Applies DSEL generation techniques to parallel programming Maintains low cost of abstractions through meta-programming Maintains abstraction level via modern library design Our contributions Tools Pub. Scope Applications Quaff ParCo’06 MPI Skeletons Real-time 3D reconstruction SkellPU PACT’08 Skeleton on Cell BE Real-time Image processing BSP++ IJPP’12 MPI/OpenMP BSP Bioinformatics, Model Checking NT2 JPDC’14 Data Parallel Matlab Fluid Dynamics, Vision 14 of 41
  • 22. Example of BSP++ Application Khaled Hamidouche PHD 2008-2011 in collab. with Univ. Brasilia BSP Smith Waterman SW computes DNA sequences alignment BSP++ implementation was written once and run on 7 different hardwares Efficiency of 95+% even on 6000 cores super-computer Platform MaxSize # Elements Speedup GCUPs cluster (MPI) 1,072,950 128 cores 73x 6.53 cluster (MPI/OpenMP) 1,072,950 128 cores 116x 10.41 OpenMP 1,072,950 16 cores 16x 0.40 CellBE 85,603 8 SPEs — 0.14 cluster of CellBEs 85,603 24 SPEs (8:24) 2.8x 0.37 Hopper(MPI) 5,303,436 3072 cores 260x 3.09 Hopper(MPI+OpenMP) 24,894,269 6144 cores 5664x 15,5 15 of 41
  • 23. Second Look at our Contributions Development Limitations DSELs are mostly tied to the domain model Architecture support is often an afterthought Extensibility is difficult as many refactoring are required per architecture Example : No proper way to support GPUs with those implementation techniques 16 of 41
  • 24. Second Look at our Contributions Development Limitations DSELs are mostly tied to the domain model Architecture support is often an afterthought Extensibility is difficult as many refactoring are required per architecture Example : No proper way to support GPUs with those implementation techniques Proposed Method Extends Generative Programming to take this architecture into account Provides an architecture description DSEL Integrates this description in the code generation process 16 of 41
  • 25. Architecture Aware Generative Programming 17 of 41
  • 26. Software refactoring Tools Issues Changes Quaff Raw skeletons API Re-engineered as part of NT2 SkellPU Too architecture specic Re-engineered as part of NT2 BSP++ Integration issues Integrate hybrid code generation NT2 Not easily extendable Integrate Quaff Skeleton models Boost.SIMD - Side product of NT2 restructuration Conclusion Skeletons are ne as parallel middleware Model based abstractions are not high level enough For low level architectures, the simplest model is often the best 18 of 41
  • 27. Boost.SIMD Pierre Estérie PHD 2010-2014 Principles Provides simple C++ API over SIMD extensions Supports every Intel, PPC and ARM instructions sets Fully integrates with modern C++ idioms Sparse Tridiagonal Solver - collaboration with M. Baboulin and Y. wang 19 of 41
  • 28. Talk Layout Introduction Abstractions Efficiency Experimental Results Conclusion 20 of 41
  • 29. The Numerical Template Toolbox Pierre Estérie PHD 2010-2014 NT2 as a Scientic Computing Library Provides a simple, M-like interface for users Provides high-performance computing entities and primitives Is easily extendable Components Uses Boost.SIMD for in-core optimizations Uses recursive parallel skeletons Supports task parallelism through Futures 21 of 41
  • 30. The Numerical Template Toolbox Principles tableT,S is a simple, multidimensional array object that exactly mimics M array behavior and functionalities 500+ functions usable directly either on table or on any scalar values as in M 22 of 41
  • 31. The Numerical Template Toolbox Principles tableT,S is a simple, multidimensional array object that exactly mimics M array behavior and functionalities 500+ functions usable directly either on table or on any scalar values as in M How does it works Take a .m le, copy to a .cpp le 22 of 41
  • 32. The Numerical Template Toolbox Principles tableT,S is a simple, multidimensional array object that exactly mimics M array behavior and functionalities 500+ functions usable directly either on table or on any scalar values as in M How does it works Take a .m le, copy to a .cpp le Add #include nt2/nt2.hpp and do cosmetic changes 22 of 41
  • 33. The Numerical Template Toolbox Principles tableT,S is a simple, multidimensional array object that exactly mimics M array behavior and functionalities 500+ functions usable directly either on table or on any scalar values as in M How does it works Take a .m le, copy to a .cpp le Add #include nt2/nt2.hpp and do cosmetic changes Compile the le and link with libnt2.a 22 of 41
  • 34. NT2 - From M to C++ M code A1 = 1 : 1 0 0 0 ; A2 = A1 + randn ( size ( A1 ) ) ; X = lu ( A1 * A1 ’) ; rms = sqrt ( sum ( sqr ( A1 (:) - A2 (:) ) ) / numel ( A1 ) ) ; NT2 code table double A1 = _ (1. ,1000.) ; table double A2 = A1 + randn ( size ( A1 ) ) ; table double X = lu ( m t i m e s ( A1 , trans ( A1 ) ) ; d o u b l e rms = sqrt ( sum ( sqr ( A1 ( _ ) - A2 ( _ ) ) ) / numel ( A1 ) ) ; 23 of 41
  • 35. Parallel Skeletons extraction process A = B / sum(C+D); = A = B sum + C D fold transform 24 of 41
  • 36. Parallel Skeletons extraction process A = B / sum(C+D); ; ; = A = B sum + C D fold transform = tmp sum + C D fold ) = A = B tmp transform 25 of 41
  • 37. From data to task parallelism Antoine Tran Tan PHD, 2012-2015 Limits of the fork-join model Synchronization cost due to implicit barriers Under-exploitation of potential parallelism Poor data locality and no inter-statement optimization 26 of 41
  • 38. From data to task parallelism Antoine Tran Tan PHD, 2012-2015 Limits of the fork-join model Synchronization cost due to implicit barriers Under-exploitation of potential parallelism Poor data locality and no inter-statement optimization Skeletons from the Future Adapt current skeletons for taskication Use Futures ( or HPX) to automatically pipeline Derive a dependency graph between statements 26 of 41
  • 39. Parallel Skeletons extraction process - Take 2 A = B / sum(C+D); ; ; = tmp sum + C D fold = A = B tmp transform 27 of 41
  • 40. Parallel Skeletons extraction process - Take 2 A = B / sum(C+D); fold = = tmp(3) sum(3) + tmp(1) sum + C(:; 3) D(:; 3) = spawnertransform,OpenMP = = A(:; 3) = A(:; 2) = A(:; 1) = B(:; 3) tmp(3) B(:; 2) tmp(2) transform C(:; 1) D(:; 1) fold = tmp(2) sum(2) + C(:; 2) D(:; 2) B(:; 1) tmp(1) transform workerfold,simd workertransform,simd spawnertransform,OpenMP ; 28 of 41
  • 41. Motion Detection Lacassagne et al., ICIP 2009 Sigma-Delta algorithm based on background substraction Use local gaussian model of lightness variation to detect motion Challenge: Very low arithmetic density Challenge: Integer-based implementation with small range 29 of 41
  • 42. Motion Detection table char s i g m a _ d e l t a ( table char b a c k g r o u n d , table char const frame , table char v a r i a n c e ) { // E s t i m a t e Raw M o v e m e n t b a c k g r o u n d = s e l i n c ( b a c k g r o u n d frame , s e l d e c ( b a c k g r o u n d frame , b a c k g r o u n d ) ) ; table char diff = dist ( background , frame ) ; // C o m p u t e Local V a r i a n c e table char sig3 = muls ( diff ,3) ; var = i f _ e l s e ( diff != 0 , s e l i n c ( v a r i a n c e sig3 , s e l d e c ( var sig3 , v a r i a n c e ) ) , v a r i a n c e ) ; // G e n e r a t e M o v e m e n t Label r e t u r n i f _ z e r o _ e l s e _ o n e ( diff v a r i a n c e ) ; } 30 of 41
  • 43. Motion Detection 18 16 14 12 10 8 6 4 2 0 512x512 1024x1024 cycles/element Image Size (N x N) x6.8 x14.8 x16.5 x2.1 x3.6 x6.7 x15.3 x18 x2.3 x3.99 x10.8 x10.8 SCALAR HALF CORE FULL CORE SIMD JRTIP2008 SIMD + HALF CORE SIMD + FULL CORE 31 of 41
  • 44. Black and Scholes Option Pricing NT2 Code table float b l a c k s c h o l e s ( table float const Sa , table float const Xa , table float const Ta , table float const ra , table float const va ) { table float da = sqrt ( Ta ) ; table float d1 = log ( Sa / Xa ) + ( sqr ( va ) *0.5 f + ra ) * Ta /( va * da ) ; table float d2 = d1 - va * da ; r e t u r n Sa * n o r m c d f ( d1 ) - Xa * exp ( - ra * Ta ) * n o r m c d f ( d2 ) ; } 32 of 41
  • 45. Black and Scholes Option Pricing NT2 Code with loop fusion table float b l a c k s c h o l e s ( table float const Sa , table float const Xa , table float const Ta , table float const ra , table float const va ) { // P r e a l l o c a t e t e m p o r a r y t a b l e s table float da ( e x t e n t ( Ta ) ) , d1 ( e x t e n t ( Ta ) ) , d2 ( e x t e n t ( Ta ) ) , R ( e x t e n t ( Ta ) ) ; // tie merge loop nest and i n c r e a s e cache l o c a l i t y tie ( da , d1 , d2 , R ) = tie ( sqrt ( Ta ) , log ( Sa / Xa ) + ( sqr ( va ) *0.5 f + ra ) * Ta /( va * da ) , d1 - va * da , Sa * n o r m c d f ( d1 ) - Xa * exp ( - ra * Ta ) * n o r m c d f ( d2 ) ) ; r e t u r n R ; } 32 of 41
  • 46. Black and Scholes Option Pricing Performance 1000000 150 100 50 0 x1.89 x2.91 x5.58 x6.30 Size cycle/value scalar SSE2 AVX2 SSE2, 4 cores AVX2, 4 cores 33 of 41
  • 47. Black and Scholes Option Pricing Performance with loop fusion/futurisation 1000000 150 100 50 0 x2.27 x4.13 x8.05 x11.12 Size cycle/value scalar SSE2 AVX2 SSE2, 4 cores AVX2, 4 cores 34 of 41
  • 48. LU Decomposition Algorithm A00 A10 A01 A02 A20 A11 A21 A12 A22 A11 A21 A12 A22 A22 step 1 step 2 step 3 step 4 step 5 step 6 step 7 DGETRF DGESSM DTSTRF DSSSSM 35 of 41
  • 49. LU Decomposition Performance 0 10 20 30 40 50 100 50 0 Number of cores Median GFLOPS 8000 8000 LU decomposition NT2 Intel MKL 36 of 41
  • 50. Talk Layout Introduction Abstractions Efficiency Experimental Results Conclusion 37 of 41
  • 51. Conclusion Parallel Computing for Scientist Software Libraries built as Generic and Generative components can solve a large chunk of parallelism related problems while being easy to use. Like regular language, DSEL needs informations about the hardware system Integrating hardware descriptions as Generic components increases tools portability and re-targetability 38 of 41
  • 52. Conclusion Parallel Computing for Scientist Software Libraries built as Generic and Generative components can solve a large chunk of parallelism related problems while being easy to use. Like regular language, DSEL needs informations about the hardware system Integrating hardware descriptions as Generic components increases tools portability and re-targetability Our Achievements A new method for parallel software development Efficient libraries working on large subset of hardware High level of performances across a wide application spectrum 38 of 41
  • 53. Works in Progress Application to Accelerators Exploration of proper skeleton implementation on GPUs Adaptation of Future based code generator In progress with Ian Masliah’ PHD thesis Parallelism within C++ SIMD as part of the standard library Proposal N3571 for standard SIMD computation Interoperability with current parallel model of C++ 39 of 41
  • 54. Perspectives DSEL as C++ rst class idiom Build partial evaluation into the language Ease transition between regular and meta C++ Mid-term Prospect: metaOCAML like quoting for C++ DSEL and compilers relationship C++ DSEL hits a limit on their applicability Compilers often lack high level informations for proper optimization Mid-term Prospect: Hybrid library/compiler approaches for DSEL 40 of 41
  • 55. Thanks for your attention