Automatic Task-based Code Generation for High Performance DSEL 
Antoine Tran Tan Joel Falcou Daniel Etiemble Hartmut Kaiser 
Université de Paris Sud - LRI - ParSys - Inria Postale 
Ste||ar Group - LSU 
NumScale SAS 
Meeting C++ 2014 
1 of 28
The Expressiveness/Efficiency War 
[Figure: three Performance vs. Expressiveness plots.
 Single Core Era: C/Fortran, C++, Java.
 Multi-Core/SIMD Era: Sequential, SIMD, Threads.
 Heterogeneous Era: Sequential, GPU, Phi, SIMD, Threads, Distributed.]
As parallel systems complexity grows, the expressiveness gap turns into an ocean 
2 of 28
Designing tools for Scientic Computing 
Challenges 
1. Be non-disruptive 
2. Domain driven optimizations 
3. Provide intuitive API for the user 
4. Support a wide architectural landscape 
3 of 28
Designing tools for Scientic Computing 
Challenges 
1. Be non-disruptive 
2. Domain driven optimizations 
3. Provide intuitive API for the user 
4. Support a wide architectural landscape 
Our Approach 
 Design tools as C++ libraries (1) 
 Design them as Domain Specic Embedded Language (DSEL) (2+3) 
 Use Generative Programming to deliver performance (4) 
3 of 28
Designing tools for Scientic Computing 
Challenges 
1. Be non-disruptive 
2. Domain driven optimizations 
3. Provide intuitive API for the user 
4. Support a wide architectural landscape 
Objectives of the Talk 
 Presents one of our such desgined tools 
 Expose some limitations of the DSEL medium 
 Propose a different take on such tools 
3 of 28
Outline 
Introduction 
The NT2 Library 
Redesign for task parallelism 
Benchmarks 
Conclusion and Perspectives 
4 of 28
NT2: The Numerical Template Toolbox
NT2 as a Scientific Computing Library
- Provides a simple, Matlab-like DSEL
- Provides high-performance computing entities and primitives
- Is easily extendable
Components
- Uses Boost.SIMD for in-core optimizations
- Uses recursive parallel skeletons
- Supports task parallelism through Futures
5 of 28
The Numerical Template Toolbox
Principles
- table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities
- 500+ functions usable directly either on table or on any scalar values, as in Matlab
How does it work
- Take a .m file, copy it to a .cpp file
- Add #include <nt2/nt2.hpp> and do cosmetic changes
- Compile the file and link with libnt2.a
6 of 28
NT2 - From M to C++ 
M code 
A1 = 1 : 1 0 0 0 ; 
A2 = A1 + randn ( size ( A1 ) ) ; 
X = lu ( A1 * A1 ’) ; 
rms = sqrt ( sum ( sqr ( A1 (:) - A2 (:) ) ) / numel ( A1 ) ) ; 
NT2 code 
table  double  A1 = _ (1. ,1000.) ; 
table  double  A2 = A1 + randn ( size ( A1 ) ) ; 
table  double  X = lu ( m t i m e s ( A1 , trans ( A1 ) ) ; 
d o u b l e rms = sqrt ( sum ( sqr ( A1 ( _ ) - A2 ( _ ) ) ) / numel ( A1 ) ) ; 
7 of 28
Domain Specic Embedded Languages 
Domain Specic Languages 
 Non-Turing complete declarative languages 
 Solve a single type of problems 
 Express what to do instead of how to do it 
 E.g : SQL, M, M, … 
From DSL to DSEL [Abrahams 2004] 
 A DSL incorporates domain-specic notation, constructs, and abstractions as 
fundamental design considerations. 
 A Domain Specic Embedded Languages (DSEL) is simply a library that meets the 
same criteria 
 Generative Programming is one way to design such libraries 
8 of 28
Meta-programming as a Tool 
Denition 
Meta-programming is the writing of computer programs that analyse, transform and 
generate other programs (or themselves) as their data. 
Meta-programmable Languages 
 metaOCAML : runtime code generation via code quoting 
 Template Haskell : compile-time code generation via templates 
 C++ : compile-time code generation via templates 
C++ meta-programming 
 Relies on the Turing-complete C++ Template sub-language 
 Handles types and integral constants at compile-time 
 Template classes and functions act as code quoting 
9 of 28
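To make the "types and integral constants at compile-time" point concrete, here is a minimal sketch in standard C++. The names factorial, wider_t and runtime_factorial are illustrative only and are not part of NT2.

```cpp
#include <cassert>
#include <type_traits>

// Compile-time computation on an integral constant:
// the compiler instantiates factorial<5>, factorial<4>, ... down to factorial<0>.
template <unsigned long long N>
struct factorial
    : std::integral_constant<unsigned long long, N * factorial<N - 1>::value> {};

template <>
struct factorial<0> : std::integral_constant<unsigned long long, 1> {};

// Compile-time computation on types: select the wider of two types.
template <typename A, typename B>
using wider_t = typename std::conditional<(sizeof(A) >= sizeof(B)), A, B>::type;

// Plain run-time version, for comparison with the compile-time one.
inline unsigned long long runtime_factorial(unsigned n) {
    assert(n <= 20 && "21! does not fit in 64 bits");
    unsigned long long r = 1;
    for (unsigned i = 2; i <= n; ++i) r *= i;
    return r;
}

// Both checks are resolved entirely by the compiler.
static_assert(factorial<5>::value == 120, "evaluated at compile time");
static_assert(std::is_same<wider_t<char, double>, double>::value,
              "type selected at compile time");
```

The template version costs nothing at run time: factorial<5>::value is a constant in the generated code, whereas runtime_factorial(5) executes a loop.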
The Expression Templates Idiom
Principles
- Relies on extensive operator overloading
- Carries semantic information around code fragments
- Introduces DSLs without disrupting the development chain
matrix x(h,w), a(h,w), b(h,w);
x = cos(a) + (b*a);
expr<assign
    ,expr<matrix&>
    ,expr<plus
         ,expr<cos
              ,expr<matrix&>
              >
         ,expr<multiplies
              ,expr<matrix&>
              ,expr<matrix&>
              >
         >
    >(x,a,b);
[Figure: AST of the statement: "=" at the root with children x and "+"; "+" has children cos(a) and b * a]
Arbitrary transforms applied on the meta-AST
#pragma omp parallel for
for(int j=0;j<h;++j)
{
  for(int i=0;i<w;++i)
  {
    x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );
  }
}
General Principles of Expression Templates
10 of 28
The Expression Templates Idiom
Advantages
- Generic implementation becomes self-aware of optimizations
- API abstraction level is arbitrarily high
- Accessible through high-level tools like Boost.Proto
10 of 28
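The idiom can be sketched in a few dozen lines of plain C++. This is a hand-rolled illustration on 1-D vectors, not NT2 or Boost.Proto code: the namespace et and the types expr, vec, binary and cos_node are all assumptions made for the example. Operators build a tree of lightweight nodes, and the assignment evaluates that tree in a single loop.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace et {

// CRTP base: tags a type as a node of the expression tree.
template <typename E>
struct expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Terminal node holding the data.
struct vec : expr<vec> {
    std::vector<double> data;
    explicit vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning a tree evaluates it element by element: one loop, no temporaries.
    template <typename E>
    vec& operator=(const expr<E>& e) {
        const E& tree = e.self();
        assert(tree.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data[i] = tree[i];
        return *this;
    }
};

struct plus_ { static double apply(double a, double b) { return a + b; } };
struct mul_  { static double apply(double a, double b) { return a * b; } };

// Binary node: stores references to its children, computes lazily.
template <typename Op, typename L, typename R>
struct binary : expr<binary<Op, L, R>> {
    const L& l;
    const R& r;
    binary(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return Op::apply(l[i], r[i]); }
    std::size_t size() const { return l.size(); }
};

// Unary node for cos.
template <typename A>
struct cos_node : expr<cos_node<A>> {
    const A& a;
    explicit cos_node(const A& a_) : a(a_) {}
    double operator[](std::size_t i) const { return std::cos(a[i]); }
    std::size_t size() const { return a.size(); }
};

template <typename L, typename R>
binary<plus_, L, R> operator+(const expr<L>& l, const expr<R>& r) {
    return binary<plus_, L, R>(l.self(), r.self());
}

template <typename L, typename R>
binary<mul_, L, R> operator*(const expr<L>& l, const expr<R>& r) {
    return binary<mul_, L, R>(l.self(), r.self());
}

template <typename A>
cos_node<A> cos(const expr<A>& a) { return cos_node<A>(a.self()); }

}  // namespace et
```

With et::vec a(4), b(4, 2.0), x(4), the statement x = et::cos(a) + (b * a); has the type binary<plus_, cos_node<vec>, binary<mul_, vec, vec>> on its right-hand side. Because that type encodes the whole expression, the library is free to pick an evaluation strategy (a fused loop here, a parallel loop in the slide).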
Why Parallel Programming Models?
Limits of regular tools
- Unstructured parallelism is error-prone
- Low level parallel tools are non-composable
- Contribute to the Expressiveness Gap
Available Models
- Performance centric: P-RAM, LogP, BSP
- Pattern centric: Futures, Skeletons
- Data centric: HTA, PGAS
11 of 28
Parallel Skeletons [Cole 89]
Principles
- There are patterns in parallel applications
- Those patterns can be generalized in Skeletons
- Applications are assembled as a combination of such patterns
Functional point of view
- Skeletons are Higher-Order Functions
- Skeletons support a compositional semantics
- Applications become compositions of state-less functions
Classical Skeletons
- Data parallel: map, fold, scan
- Task parallel: par, pipe, farm
- More complex: Distributable Homomorphism, Divide & Conquer, …
12 of 28
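The "skeletons are higher-order functions" view can be illustrated with sequential stand-ins for the three data-parallel skeletons. The names map_skel, fold_skel, scan_skel and sum_of_squares are hypothetical; a real skeleton library would hide a parallel implementation behind the same signatures.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// map: apply f to every element independently.
template <typename T, typename F>
std::vector<T> map_skel(const std::vector<T>& in, F f) {
    std::vector<T> out;
    out.reserve(in.size());
    for (const T& x : in) out.push_back(f(x));
    return out;
}

// fold: reduce all elements with op, starting from init.
template <typename T, typename Op>
T fold_skel(const std::vector<T>& in, T init, Op op) {
    return std::accumulate(in.begin(), in.end(), init, op);
}

// scan: inclusive prefix reduction with op.
template <typename T, typename Op>
std::vector<T> scan_skel(const std::vector<T>& in, Op op) {
    std::vector<T> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin(), op);
    return out;
}

// An application as a composition of state-less functions: fold(+) after map(square).
inline int sum_of_squares(const std::vector<int>& v) {
    return fold_skel(map_skel(v, [](int x) { return x * x; }), 0, std::plus<int>());
}
```

Since the user-supplied functions carry no state, the skeleton alone decides how the work is split, which is what makes the composition safe to parallelize.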
Parallel Skeletons: The heart of NT2
Parallel Skeletons in NT2
- transform: Apply an n-ary function over a subset of data
- fold: Perform an n-ary reduction over a subset of data
- scan: Perform an n-ary prefix reduction over a subset of data
Tied to families of loop nests
- Elementwise: call to transform
  - Can be nested with elementwise operations
- Reduction & scan: respectively call to fold or scan
  - Successive reductions or prefix scans are not nestable
  - Can contain a nested elementwise loop nest
13 of 28
Parallel Skeletons extraction process
A = B / sum(C+D);
[Figure: the AST of the statement is split at the reduction. The fold skeleton evaluates tmp = sum(C+D), then the transform skeleton evaluates A = B / tmp.]
15 of 28
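The split shown above can be written out by hand. The sketch below is a simplification under stated assumptions: table is a hypothetical 1-D stand-in for an NT2 table, so sum yields a single scalar, and the function names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using table = std::vector<double>;  // hypothetical 1-D stand-in for an NT2 table

// fold skeleton: tmp = sum(C + D).
// The elementwise C + D is nested inside the reduction loop, so no C + D temporary exists.
inline double fold_sum_of_add(const table& C, const table& D) {
    assert(C.size() == D.size());
    double tmp = 0.0;
    for (std::size_t i = 0; i < C.size(); ++i) tmp += C[i] + D[i];
    return tmp;
}

// transform skeleton: A = B / tmp, a purely elementwise loop.
inline table transform_div(const table& B, double tmp) {
    table A(B.size());
    for (std::size_t i = 0; i < B.size(); ++i) A[i] = B[i] / tmp;
    return A;
}

// A = B / sum(C + D) becomes two loop nests with tmp in between.
inline table evaluate(const table& B, const table& C, const table& D) {
    const double tmp = fold_sum_of_add(C, D);  // first loop nest (reduction)
    return transform_div(B, tmp);              // second loop nest (elementwise)
}
```

The transform cannot start before the fold has finished, so the two loop nests are separated by a synchronization point.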
The limits
Limits of Expression Templates
- Synchronization cost due to implicit barriers
- Under-exploitation of potential parallelism (e.g. concurrency, pipeline)
- Poor data locality and no inter-statement optimization
Actions!
- Upgrade NT2 to enable task parallelism
- Adapt current skeletons for taskification
- Derive a dependency graph between statements
16 of 28
Futures programming model
future<int> f1 = async(foo1);
...
int result = f1.get();
[Figure: sequence diagram between Client, Future and Promise. The client calls async(), which creates the promise (new()) and returns the future. A get() on the future is "not ready" until the promise calls set(), after which it returns the value.]
17 of 28
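The same protocol exists in the C++ standard library, which makes for a small runnable sketch. The slides do not say which future implementation they use, so std::future, std::promise and std::async are substitutions here, and foo1 is a placeholder task.

```cpp
#include <cassert>
#include <chrono>
#include <future>

inline int foo1() { return 42; }  // placeholder task

// Client side: launch the task, keep working, then ask for the value.
inline int run_async_then_get() {
    std::future<int> f1 = std::async(std::launch::async, foo1);
    // ... the client is free to do unrelated work here ...
    return f1.get();  // blocks until the task has stored its result
}

// The promise/future pair made explicit, without any thread:
// the future is not ready until the promise stores a value.
inline bool promise_protocol() {
    std::promise<int> p;
    std::future<int> f = p.get_future();
    const bool not_ready =
        f.wait_for(std::chrono::seconds(0)) == std::future_status::timeout;
    p.set_value(7);  // the "set()" arrow of the diagram
    return not_ready && f.get() == 7;
}
```

A future can be read only once: after get() returns, the std::future no longer refers to a shared state.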
Sequential composition
future<int> f1 = async(foo1);
future<int> f2 = f1.then(foo2);
[Figure: sequence diagram between Client, Future1, Future2 and Promise. async() creates the promise and returns future1; then() on future1 creates and returns future2; once the promise sets future1, the continuation runs and sets future2.]
18 of 28
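Standard C++ std::future has no then() member, so the sketch below emulates sequential composition with a free function. The helper then and the tasks foo1 and foo2 are assumptions for illustration. This emulation parks a thread on get(), which a task runtime with native continuations would avoid.

```cpp
#include <cassert>
#include <future>
#include <utility>

// Hypothetical helper: run `continuation` on the value of `f` once it is available.
template <typename T, typename F>
auto then(std::future<T> f, F continuation)
    -> std::future<decltype(continuation(std::declval<T>()))> {
    return std::async(
        std::launch::async,
        [](std::future<T> in, F k) { return k(in.get()); },
        std::move(f), std::move(continuation));
}

inline int foo1() { return 20; }          // placeholder first task
inline int foo2(int x) { return x + 1; }  // placeholder continuation

inline int run_chain() {
    std::future<int> f1 = std::async(std::launch::async, foo1);
    std::future<int> f2 = then(std::move(f1), foo2);  // foo2 runs after foo1
    return f2.get();
}
```

The client only ever waits on f2; the dependency between foo1 and foo2 is carried by the futures rather than by an explicit barrier.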
Parallel composition
With a fixed number of futures:
typedef future<int> F;
F f1 = async(foo1);
F f2 = async(foo2);
future<tuple<F,F>> f3 = when_all(f1, f2);
With a container of futures:
typedef future<int> F;
vector<F> join;
join.push_back(async(foo1));
join.push_back(async(foo2));
future<vector<F>> f3 = when_all(join);
Chaining a continuation on the join:
F f3 = when_all(join).then(foo3);
[Figure: foo1 and foo2 run in parallel; foo3 runs once both have completed.]
19 of 28
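A fork-join of this shape can be approximated with the standard library alone. The helper when_all_values is hypothetical and differs from the when_all on the slide: it yields the values rather than a vector of ready futures. foo1, foo2 and foo3 are placeholder tasks.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: a future that becomes ready once every input future is ready,
// and that carries their values.
template <typename T>
std::future<std::vector<T>> when_all_values(std::vector<std::future<T>> fs) {
    return std::async(
        std::launch::async,
        [](std::vector<std::future<T>> in) -> std::vector<T> {
            std::vector<T> out;
            out.reserve(in.size());
            for (auto& f : in) out.push_back(f.get());
            return out;
        },
        std::move(fs));
}

inline int foo1() { return 1; }  // placeholder independent task
inline int foo2() { return 2; }  // placeholder independent task
inline int foo3(const std::vector<int>& v) {  // placeholder join task
    return std::accumulate(v.begin(), v.end(), 0);
}

inline int fork_join() {
    std::vector<std::future<int>> join;
    join.push_back(std::async(std::launch::async, foo1));  // fork
    join.push_back(std::async(std::launch::async, foo2));  // fork
    std::vector<int> results = when_all_values(std::move(join)).get();  // join
    return foo3(results);
}
```

foo1 and foo2 have no dependency on each other, so they may overlap; only foo3 waits for both.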
Parallel Skeletons extraction process
A = B / sum(C+D);
[Figure: the fold and transform skeletons are split into per-column tasks. spawner<transform,Futures> nodes spawn the tasks; for each column i in 1..3, a worker<fold,simd> computes tmp(i) from C(:,i) and D(:,i), and a worker<transform,simd> computes A(:,i) from B(:,i) and tmp(i).]
21 of 28
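The per-column decomposition can be sketched with standard futures. This is an illustration of the idea, not NT2's generated code: matrix is a hypothetical column-major stand-in, and column_task and evaluate_by_columns are invented names. Each column's fold and transform form one task, so no global barrier separates the reduction from the division.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

using column = std::vector<double>;
using matrix = std::vector<column>;  // hypothetical column-major stand-in for a table

// One task per column: fold worker, then transform worker.
inline column column_task(const column& b, const column& c, const column& d) {
    assert(b.size() == c.size() && c.size() == d.size());
    double tmp = 0.0;  // fold: tmp = sum(c + d)
    for (std::size_t i = 0; i < c.size(); ++i) tmp += c[i] + d[i];
    column a(b.size());  // transform: a = b / tmp
    for (std::size_t i = 0; i < b.size(); ++i) a[i] = b[i] / tmp;
    return a;
}

// Spawner: launch one asynchronous task per column, then collect the results in order.
inline matrix evaluate_by_columns(const matrix& B, const matrix& C, const matrix& D) {
    assert(B.size() == C.size() && C.size() == D.size());
    std::vector<std::future<column>> tasks;
    for (std::size_t j = 0; j < B.size(); ++j) {
        tasks.push_back(std::async(std::launch::async, column_task,
                                   std::cref(B[j]), std::cref(C[j]), std::cref(D[j])));
    }
    matrix A;
    A.reserve(tasks.size());
    for (auto& t : tasks) A.push_back(t.get());
    return A;
}
```

Column 1 can be dividing while column 3 is still reducing, which is the kind of overlap a single fold followed by a single transform cannot express.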
Black and Scholes Option Pricing 
NT2 Code 
table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta
                         , table<float> const& ra, table<float> const& va
                         )
{
  table<float> da = sqrt(Ta);
  // d1 = (ln(S/X) + (r + v^2/2) T) / (v sqrt(T)): the whole numerator is divided
  table<float> d1 = (log(Sa/Xa) + (sqr(va)*0.5f + ra)*Ta) / (va*da);
  table<float> d2 = d1 - va*da;
  return Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2);
}
22 of 28
Black and Scholes Option Pricing 
NT2 Code - Manual Loop Fusion
table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta
                         , table<float> const& ra, table<float> const& va
                         )
{
  // Preallocate temporary tables
  table<float> da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));
  // tie merges the loop nests and increases cache locality
  tie(da, d1, d2, R) = tie( sqrt(Ta)
                          , (log(Sa/Xa) + (sqr(va)*0.5f + ra)*Ta) / (va*da)
                          , d1 - va*da
                          , Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2)
                          );
  return R;
}
23 of 28
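For reference, the per-element computation can be written as a scalar function in standard C++. This is a sketch of the textbook closed-form call price, not the benchmark's scalar baseline; the names normcdf and blackscholes_call and the use of std::erfc for the normal CDF are choices made for this example.

```cpp
#include <cassert>
#include <cmath>

// Standard normal cumulative distribution function, expressed with erfc.
inline double normcdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// European call price for spot S, strike X, maturity T, rate r, volatility v.
inline double blackscholes_call(double S, double X, double T, double r, double v) {
    assert(S > 0.0 && X > 0.0 && T > 0.0 && v > 0.0);
    const double da = std::sqrt(T);
    const double d1 = (std::log(S / X) + (v * v * 0.5 + r) * T) / (v * da);
    const double d2 = d1 - v * da;
    return S * normcdf(d1) - X * std::exp(-r * T) * normcdf(d2);
}
```

For S = X = 100, T = 1, r = 0.05 and v = 0.2, the function returns approximately 10.45, the usual textbook check value. The table version applies the same arithmetic to every element, which is why it maps onto elementwise loop nests.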
Black and Scholes performance for size 256M 
[Figure: bar chart of execution time in cycles/element per implementation: scalar 581, SIMD 140, NT2 23]
24 of 28
TiledLU algorithm 
[Figure: tiled LU factorization of a 3x3 tile matrix (A00 to A22) in 7 steps, using the kernels DGETRF, DGESSM, DTSTRF and DSSSSM]
25 of 28
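The order in which the four kernels touch the tiles can be sketched by enumerating the calls of a right-looking tiled LU, following the usual PLASMA-style loop nest. The sketch only lists tasks and performs no arithmetic; tile_task and tiled_lu_tasks are hypothetical names, and the slide's grouping into 7 steps is not reproduced.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One kernel call: kernel name, panel index k, and the tile (i, j) it updates.
struct tile_task {
    std::string kernel;
    int k, i, j;
};

// Enumerate the kernel calls of a tiled LU on an nt x nt grid of tiles.
inline std::vector<tile_task> tiled_lu_tasks(int nt) {
    assert(nt > 0);
    std::vector<tile_task> tasks;
    for (int k = 0; k < nt; ++k) {
        tasks.push_back({"DGETRF", k, k, k});          // factor the diagonal tile
        for (int j = k + 1; j < nt; ++j)
            tasks.push_back({"DGESSM", k, k, j});      // update tiles right of the diagonal
        for (int i = k + 1; i < nt; ++i) {
            tasks.push_back({"DTSTRF", k, i, k});      // factor a tile below the diagonal
            for (int j = k + 1; j < nt; ++j)
                tasks.push_back({"DSSSSM", k, i, j});  // update the trailing tile (i, j)
        }
    }
    return tasks;
}
```

For a 3x3 grid this yields 14 kernel calls. Many of them are independent (for example the DGESSM calls of one panel), which is what a future-based dependency graph can exploit.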
LU performance for Problem Size 8000 
[Figure: performance in GFlop/s (0 to 150) versus number of cores (0 to 50) for NT2, Plasma and Intel MKL]
26 of 28
Conclusion and Perspectives
Conclusion
- Extraction of asynchrony from DSEL statements
- Combination of Future-based task management with parallel skeletons
- Efficiency for coarse grain tasks
Perspectives
- Integration of application/memory aware cost models
- Generalization of Futures for distributed systems & GPGPUs
- Follow us on http://www.github.com/MetaScale/nt2
27 of 28
Thanks for your attention 
28 of 28

Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
Keeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository worldKeeping your build tool updated in a multi repository world
Keeping your build tool updated in a multi repository world
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
2024-04-09 - From Complexity to Clarity - AWS Summit AMS.pdf
 
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News UpdateVictoriaMetrics Q1 Meet Up '24 - Community & News Update
VictoriaMetrics Q1 Meet Up '24 - Community & News Update
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Ronisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited CatalogueRonisha Informatics Private Limited Catalogue
Ronisha Informatics Private Limited Catalogue
 
Patterns for automating API delivery. API conference
Patterns for automating API delivery. API conferencePatterns for automating API delivery. API conference
Patterns for automating API delivery. API conference
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
Tech Tuesday Slides - Introduction to Project Management with OnePlan's Work ...
 
Zer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdfZer0con 2024 final share short version.pdf
Zer0con 2024 final share short version.pdf
 

Automatic Task-based Code Generation for High Performance DSEL

  • 1. Automatic Task-based Code Generation for High Performance DSEL. Antoine Tran Tan, Joel Falcou, Daniel Etiemble, Hartmut Kaiser. Université de Paris Sud - LRI - ParSys - Inria Postale; Ste||ar Group - LSU; NumScale SAS. Meeting C++ 2014.
  • 2. The Expressiveness/Efficiency War. [Figure: three performance-versus-expressiveness charts. Single Core Era: C/Fortran, C++, Java. Multi-Core/SIMD Era: Sequential, SIMD, Threads. Heterogeneous Era: Sequential, SIMD, Threads, GPU, Phi, Distributed.] As the complexity of parallel systems grows, the expressiveness gap turns into an ocean.
  • 3. Designing tools for Scientific Computing. Challenges: 1. Be non-disruptive; 2. Domain-driven optimizations; 3. Provide an intuitive API for the user; 4. Support a wide architectural landscape.
  • 4. Designing tools for Scientific Computing. Challenges: 1. Be non-disruptive; 2. Domain-driven optimizations; 3. Provide an intuitive API for the user; 4. Support a wide architectural landscape. Our Approach: design tools as C++ libraries (1); design them as Domain Specific Embedded Languages (DSEL) (2+3); use Generative Programming to deliver performance (4).
  • 5. Designing tools for Scientific Computing. Challenges: 1. Be non-disruptive; 2. Domain-driven optimizations; 3. Provide an intuitive API for the user; 4. Support a wide architectural landscape. Objectives of the Talk: present one of our tools designed this way; expose some limitations of the DSEL medium; propose a different take on such tools.
  • 6. Outline: Introduction; The NT2 Library; Redesign for task parallelism; Benchmarks; Conclusion and Perspectives.
  • 8. NT2: The Numerical Template Toolbox. NT2 as a Scientific Computing Library: provides a simple, Matlab-like DSEL; provides high-performance computing entities and primitives; is easily extendable. Components: uses Boost.SIMD for in-core optimizations; uses recursive parallel skeletons; supports task parallelism through Futures.
  • 9. The Numerical Template Toolbox. Principles: table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities; 500+ functions are usable directly either on table or on any scalar values, as in Matlab.
  • 10. The Numerical Template Toolbox. Principles: table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities; 500+ functions are usable directly either on table or on any scalar values, as in Matlab. How does it work: take a .m file and copy it to a .cpp file.
  • 11. The Numerical Template Toolbox. Principles: table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities; 500+ functions are usable directly either on table or on any scalar values, as in Matlab. How does it work: take a .m file and copy it to a .cpp file; add #include <nt2/nt2.hpp> and make cosmetic changes.
  • 12. The Numerical Template Toolbox. Principles: table<T,S> is a simple, multidimensional array object that exactly mimics Matlab array behavior and functionalities; 500+ functions are usable directly either on table or on any scalar values, as in Matlab. How does it work: take a .m file and copy it to a .cpp file; add #include <nt2/nt2.hpp> and make cosmetic changes; compile the file and link with libnt2.a.
  • 13. NT2 - From M to C++ M code A1 = 1 : 1 0 0 0 ; A2 = A1 + randn ( size ( A1 ) ) ; X = lu ( A1 * A1 ’) ; rms = sqrt ( sum ( sqr ( A1 (:) - A2 (:) ) ) / numel ( A1 ) ) ; NT2 code table double A1 = _ (1. ,1000.) ; table double A2 = A1 + randn ( size ( A1 ) ) ; table double X = lu ( m t i m e s ( A1 , trans ( A1 ) ) ; d o u b l e rms = sqrt ( sum ( sqr ( A1 ( _ ) - A2 ( _ ) ) ) / numel ( A1 ) ) ; 7 of 28
  • 14. Domain Specic Embedded Languages Domain Specic Languages Non-Turing complete declarative languages Solve a single type of problems Express what to do instead of how to do it E.g : SQL, M, M, … From DSL to DSEL [Abrahams 2004] A DSL incorporates domain-specic notation, constructs, and abstractions as fundamental design considerations. A Domain Specic Embedded Languages (DSEL) is simply a library that meets the same criteria Generative Programming is one way to design such libraries 8 of 28
  • 15. Meta-programming as a Tool Denition Meta-programming is the writing of computer programs that analyse, transform and generate other programs (or themselves) as their data. Meta-programmable Languages metaOCAML : runtime code generation via code quoting Template Haskell : compile-time code generation via templates C++ : compile-time code generation via templates C++ meta-programming Relies on the Turing-complete C++ Template sub-language Handles types and integral constants at compile-time Template classes and functions act as code quoting 9 of 28
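As a small illustration of handling integral constants at compile time, the sketch below computes a factorial through template recursion. The names `Factorial` and `factorial_of_5` are invented for this example and are not part of NT2.

```cpp
#include <cstddef>

// Primary template: the recursion step, evaluated by the compiler.
template <std::size_t N>
struct Factorial {
    static constexpr std::size_t value = N * Factorial<N - 1>::value;
};

// Full specialization: the base case that stops the recursion.
template <>
struct Factorial<0> {
    static constexpr std::size_t value = 1;
};

// The result is an integral constant, so it can be checked at compile time.
static_assert(Factorial<5>::value == 120, "5! must be 120");

// Hypothetical helper exposing the constant at run time.
constexpr std::size_t factorial_of_5() { return Factorial<5>::value; }
```

No run-time loop is executed here: the compiler instantiates `Factorial<5>` down to `Factorial<0>` and folds the product into a constant.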
  • 16. The Expression Templates Idiom. Principles: relies on extensive operator overloading; carries semantic information around code fragments; introduces DSLs without disrupting the development chain.
      User code:
        matrix x(h,w), a(h,w), b(h,w);
        x = cos(a) + (b*a);
      Resulting expression type:
        expr<assign, expr<matrix>, expr<plus, expr<cos, expr<matrix> >, expr<multiplies, expr<matrix>, expr<matrix> > > >(x,a,b);
      [Figure: AST of the statement, with = at the root over x and +, and + over cos(a) and b*a]
      Generated code, after arbitrary transforms applied on the meta-AST:
        #pragma omp parallel for
        for(int j=0;j<h;++j)
        {
          for(int i=0;i<w;++i)
          {
            x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );
          }
        }
      Caption: General Principles of Expression Templates.
  • 17. The Expression Templates Idiom. Advantages: the generic implementation becomes self-aware of optimizations; the API abstraction level is arbitrarily high; accessible through high-level tools like Boost.Proto.
      User code:
        matrix x(h,w), a(h,w), b(h,w);
        x = cos(a) + (b*a);
      Resulting expression type:
        expr<assign, expr<matrix>, expr<plus, expr<cos, expr<matrix> >, expr<multiplies, expr<matrix>, expr<matrix> > > >(x,a,b);
      [Figure: AST of the statement, with = at the root over x and +, and + over cos(a) and b*a]
      Generated code, after arbitrary transforms applied on the meta-AST:
        #pragma omp parallel for
        for(int j=0;j<h;++j)
        {
          for(int i=0;i<w;++i)
          {
            x(j,i) = cos(a(j,i)) + ( b(j,i) * a(j,i) );
          }
        }
      Caption: General Principles of Expression Templates.
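The idiom can be sketched in a few dozen lines of plain C++. The code below is a minimal, hand-written illustration and not NT2's or Boost.Proto's implementation: the namespace `et_sketch` and the types `Expr`, `Matrix`, `Plus`, `Times` and `Cos` are assumptions made for this example, and the matrix is stored as a flat vector. Overloaded operators only build a tree of lightweight nodes; the single loop runs when the tree is assigned.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace et_sketch {

// CRTP base: tags a type as an expression node and recovers its real type.
template <class E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Terminal node: owns the data.
struct Matrix : Expr<Matrix> {
    std::vector<double> data;
    explicit Matrix(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning an expression walks the whole tree inside one fused loop.
    template <class E>
    Matrix& operator=(const Expr<E>& e) {
        const E& tree = e.self();
        assert(tree.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data[i] = tree[i];
        return *this;
    }
};

// Inner nodes only store references to their operands; nothing is computed
// until operator[] is called from the assignment loop.
template <class L, class R>
struct Plus : Expr<Plus<L, R>> {
    const L& l; const R& r;
    Plus(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

template <class L, class R>
struct Times : Expr<Times<L, R>> {
    const L& l; const R& r;
    Times(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](std::size_t i) const { return l[i] * r[i]; }
    std::size_t size() const { return l.size(); }
};

template <class A>
struct Cos : Expr<Cos<A>> {
    const A& a;
    explicit Cos(const A& a_) : a(a_) {}
    double operator[](std::size_t i) const { return std::cos(a[i]); }
    std::size_t size() const { return a.size(); }
};

// Overloads that build the tree instead of computing temporaries.
template <class L, class R>
Plus<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Plus<L, R>(l.self(), r.self());
}
template <class L, class R>
Times<L, R> operator*(const Expr<L>& l, const Expr<R>& r) {
    return Times<L, R>(l.self(), r.self());
}
template <class A>
Cos<A> cos(const Expr<A>& a) { return Cos<A>(a.self()); }

} // namespace et_sketch
```

With these definitions, `x = cos(a) + (b * a);` on three `et_sketch::Matrix` objects has the static type `Plus<Cos<Matrix>, Times<Matrix, Matrix>>` on its right-hand side and evaluates in one pass without intermediate matrices. Because nodes hold references, the expression should be consumed in the statement that builds it rather than stored with `auto`.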
  • 18. Why Parallel Programming Models? Limits of regular tools: unstructured parallelism is error-prone; low-level parallel tools are non-composable; both contribute to the Expressiveness Gap.
  • 19. Why Parallel Programming Models? Limits of regular tools: unstructured parallelism is error-prone; low-level parallel tools are non-composable; both contribute to the Expressiveness Gap. Available Models: performance centric (P-RAM, LogP, BSP); pattern centric (Futures, Skeletons); data centric (HTA, PGAS).
  • 21. Parallel Skeletons [Cole 89]. Principles: there are patterns in parallel applications; those patterns can be generalized in Skeletons; applications are assembled as a combination of such patterns. Functional point of view: Skeletons are Higher-Order Functions; Skeletons support compositional semantics; applications become compositions of stateless functions.
  • 22. Parallel Skeletons [Cole 89]. Principles: there are patterns in parallel applications; those patterns can be generalized in Skeletons; applications are assembled as a combination of such patterns. Classical Skeletons: data parallel (map, fold, scan); task parallel (par, pipe, farm); more complex (Distributable Homomorphism, Divide & Conquer, …).
  • 23. Parallel Skeletons: the heart of NT2. Parallel Skeletons in NT2: transform applies an n-ary function over a subset of data; fold performs an n-ary reduction over a subset of data; scan performs an n-ary prefix reduction over a subset of data.
  • 24. Parallel Skeletons: the heart of NT2. Parallel Skeletons in NT2: transform applies an n-ary function over a subset of data; fold performs an n-ary reduction over a subset of data; scan performs an n-ary prefix reduction over a subset of data. Tied to families of loop nests. Elementwise: call to transform; can be nested with elementwise operations. Reduction and scan: calls to fold and scan respectively; successive reductions or prefix scans are not nestable; can contain a nested elementwise loop nest.
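To make the three skeletons concrete, here are sequential analogues written with standard algorithms. This is only a sketch of their semantics: the function names `transform_add`, `fold_sum` and `scan_sum` are invented for the example, and NT2's own skeletons are parallel and generic rather than fixed to addition.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// transform: apply a binary function elementwise over the data.
std::vector<double> transform_add(const std::vector<double>& a,
                                  const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::vector<double> out(a.size());
    std::transform(a.begin(), a.end(), b.begin(), out.begin(),
                   std::plus<double>());
    return out;
}

// fold: reduce the data to a single value.
double fold_sum(const std::vector<double>& a) {
    return std::accumulate(a.begin(), a.end(), 0.0);
}

// scan: inclusive prefix reduction, out[i] = a[0] + ... + a[i].
std::vector<double> scan_sum(const std::vector<double>& a) {
    std::vector<double> out(a.size());
    std::partial_sum(a.begin(), a.end(), out.begin());
    return out;
}
```

Each function corresponds to one family of loop nests: `transform_add` has no dependency between iterations, while `fold_sum` and `scan_sum` carry a value from one iteration to the next, which is why they need dedicated parallel implementations.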
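  • 25. Parallel Skeletons extraction process
      Statement:
        A = B / sum(C+D);
      [Figure: AST of the statement; the sum(C+D) subtree is tagged fold and the enclosing A = B / ... tree is tagged transform]
  • 26. Parallel Skeletons extraction process
      Statement:
        A = B / sum(C+D);
      [Figure: AST of the statement; the sum(C+D) subtree is tagged fold and the enclosing A = B / ... tree is tagged transform]
      The statement is split into two loop nests:
        tmp = sum(C+D);    (fold)
        A = B / tmp;       (transform)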
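The split can be written out by hand as two loop nests. The sketch below assumes flat vectors, so that `sum` reduces to a single scalar; the function name `divide_by_sum` is hypothetical and the loops are what a reader might expect the skeletons to expand to, not NT2's generated code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<double> divide_by_sum(const std::vector<double>& B,
                                  const std::vector<double>& C,
                                  const std::vector<double>& D) {
    assert(B.size() == C.size() && C.size() == D.size());

    // Loop nest 1 (fold): tmp = sum(C + D). The elementwise addition is
    // nested inside the reduction, so no temporary vector for C + D exists.
    double tmp = 0.0;
    for (std::size_t i = 0; i < C.size(); ++i) tmp += C[i] + D[i];

    // The transform below cannot start before the fold has finished:
    // this is the implicit barrier between the two skeletons.

    // Loop nest 2 (transform): A = B / tmp.
    std::vector<double> A(B.size());
    for (std::size_t i = 0; i < B.size(); ++i) A[i] = B[i] / tmp;
    return A;
}
```

For example, `divide_by_sum({6, 12}, {1, 2}, {1, 2})` first folds `C + D` to 6 and then returns `{1, 2}`.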
  • 27. The limits. Limits of Expression Templates: synchronization cost due to implicit barriers; under-exploitation of potential parallelism (e.g. concurrency, pipeline); poor data locality and no inter-statement optimization.
  • 28. The limits. Limits of Expression Templates: synchronization cost due to implicit barriers; under-exploitation of potential parallelism (e.g. concurrency, pipeline); poor data locality and no inter-statement optimization. Actions: upgrade NT2 to enable task parallelism; adapt current skeletons for taskification; derive a dependency graph between statements.
  • 30. Futures programming model
        future<int> f1 = async(foo1);
      [Sequence diagram between Client, Future and Promise; messages: async(), new(), returns future, set()]
  • 31. Futures programming model
        future<int> f1 = async(foo1);
        ...
        int result = f1.get();
      [Sequence diagram between Client, Future and Promise; messages: async(), new(), returns future, set()]
  • 32. Futures programming model
        future<int> f1 = async(foo1);
        ...
        int result = f1.get();
      [Sequence diagram between Client, Future and Promise; messages: async(), new(), returns future, set(), get(), returns value]
  • 33. Futures programming model
        future<int> f1 = async(foo1);
        ...
        int result = f1.get();
      [Sequence diagram between Client, Future and Promise; messages: async(), new(), returns future, set(), get(), not ready, returns value]
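The same pattern can be tried with the C++11 standard library alone. In the sketch below, `foo1` is a placeholder task and `run_foo1_async` a hypothetical wrapper; the slides use unqualified `future` and `async`, which may come from another runtime than `std`.

```cpp
#include <future>

// Placeholder task standing in for the slide's foo1.
int foo1() { return 42; }

int run_foo1_async() {
    // async() starts foo1 on another thread and immediately returns a
    // future that will eventually hold its result.
    std::future<int> f1 = std::async(std::launch::async, foo1);

    // ... the client is free to do unrelated work here ...

    // get() blocks while the value is not ready, then returns it.
    int result = f1.get();
    return result;
}
```

The promise side of the diagram is hidden inside `std::async`: it creates the shared state, runs the task and sets the value that `get()` later retrieves.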
  • 34. Sequential composition
        future<int> f1 = async(foo1);
      [Sequence diagram between Client, Future1 and Promise; messages: async(), new(), returns future1, set()]
  • 35. Sequential composition
        future<int> f1 = async(foo1);
        future<int> f2 = f1.then(foo2);
      [Sequence diagram between Client, Future1 and Promise; messages: async(), new(), returns future1, set()]
  • 36. Sequential composition
        future<int> f1 = async(foo1);
        future<int> f2 = f1.then(foo2);
      [Sequence diagram between Client, Future1, Future2 and Promise; messages: async(), new(), returns future1, set(), set(), then(), returns future2, new()]
  • 37. Sequential composition
        future<int> f1 = async(foo1);
        future<int> f2 = f1.then(foo2);
      [Sequence diagram between Client, Future1, Future2 and Promise; messages: async(), new(), returns future1, set(), new(), then(), set(), returns future2]
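`std::future` has no `then` member, so the sketch below emulates sequential composition with a free function. The helper `then` is an assumption written for this example: it passes the value of the first future to the continuation, which may differ from the convention of the runtime used in the talk. `foo1`, `foo2` and `run_chain` are placeholders. It requires C++14.

```cpp
#include <future>
#include <utility>

// Hypothetical helper: returns a new future whose task waits for `f`
// and then applies `cont` to its value.
template <class T, class F>
auto then(std::future<T> f, F cont) {
    return std::async(std::launch::async,
        [f = std::move(f), cont = std::move(cont)]() mutable {
            return cont(f.get());   // runs only once f is ready
        });
}

// Placeholder tasks.
int foo1() { return 20; }
int foo2(int x) { return 2 * x + 2; }

int run_chain() {
    std::future<int> f1 = std::async(std::launch::async, foo1);
    // f2 depends on f1; the client thread is not blocked by this call.
    std::future<int> f2 = then(std::move(f1), foo2);
    return f2.get();
}
```

This version dedicates a waiting thread to each continuation, which is acceptable for a demonstration; a task runtime would instead schedule `foo2` when `f1` becomes ready.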
  • 38. Parallel composition
        typedef future<int> F;
        F f1 = async(foo1);
        F f2 = async(foo2);
      [Task graph: foo1 and foo2 side by side]
  • 39. Parallel composition
        typedef future<int> F;
        F f1 = async(foo1);
        F f2 = async(foo2);
        future<tuple<F,F>> f3 = when_all(f1, f2);
      [Task graph: foo1 and foo2 side by side]
  • 40. Parallel composition
        typedef future<int> F;
        vector<F> join;
        join.push_back(async(foo1));
        join.push_back(async(foo2));
      [Task graph: foo1 and foo2 side by side]
  • 41. Parallel composition
        typedef future<int> F;
        vector<F> join;
        join.push_back(async(foo1));
        join.push_back(async(foo2));
        future<vector<F>> f3 = when_all(join);
      [Task graph: foo1 and foo2 side by side]
  • 42. Parallel composition
        typedef future<int> F;
        vector<F> join;
        join.push_back(async(foo1));
        join.push_back(async(foo2));
        F f3 = when_all(join).then(foo3);
      [Task graph: foo1 and foo2 feeding foo3]
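The standard library does not provide `when_all` either, so this sketch substitutes a small stand-in. `when_all_values` is a hypothetical function that resolves to the collected values rather than to a vector of futures as on the slides, and `foo3` is applied after `get()` instead of through `then`. All task names are placeholders. It requires C++14.

```cpp
#include <future>
#include <numeric>
#include <utility>
#include <vector>

typedef std::future<int> F;

// Hypothetical stand-in for when_all: a future that becomes ready once
// every input future is ready, holding their values in order.
std::future<std::vector<int>> when_all_values(std::vector<F> join) {
    return std::async(std::launch::async,
        [join = std::move(join)]() mutable {
            std::vector<int> values;
            values.reserve(join.size());
            for (F& f : join) values.push_back(f.get());
            return values;
        });
}

// Placeholder tasks: foo1 and foo2 are independent, foo3 joins them.
int foo1() { return 1; }
int foo2() { return 2; }
int foo3(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0);
}

int run_join() {
    std::vector<F> join;
    join.push_back(std::async(std::launch::async, foo1));  // runs concurrently
    join.push_back(std::async(std::launch::async, foo2));  // with this one
    std::future<std::vector<int>> all = when_all_values(std::move(join));
    return foo3(all.get());
}
```

`foo1` and `foo2` have no dependency on each other and may overlap in time; only `foo3` has to wait for both.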
  • 43. Parallel Skeletons extraction process
      Statement:
        A = B / sum(C+D);
      The statement is split into two loop nests:
        tmp = sum(C+D);    (fold)
        A = B / tmp;       (transform)
  • 44. Parallel Skeletons extraction process
      Statement:
        A = B / sum(C+D);
      [Figure: task graph for three columns. Two spawner<transform,Futures> nodes launch the work; for each column j = 1..3, a worker<fold,simd> computes tmp(j) = sum(C(:,j) + D(:,j)) and a worker<transform,simd> then computes A(:,j) = B(:,j) / tmp(j)]
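The task graph above can be imitated with standard futures, one fold task and one dependent transform task per column. This is a sketch under several assumptions: `Mat` is a minimal column-major matrix invented for the example, `divide_by_column_sums` is a hypothetical name, the workers are plain scalar loops rather than SIMD kernels, and one thread per task stands in for a real task scheduler. It requires C++14.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Minimal column-major matrix (hypothetical helper type).
struct Mat {
    std::size_t rows, cols;
    std::vector<double> v;
    Mat(std::size_t r, std::size_t c, double x = 0.0)
        : rows(r), cols(c), v(r * c, x) {}
    double& at(std::size_t i, std::size_t j) { return v[j * rows + i]; }
    double at(std::size_t i, std::size_t j) const { return v[j * rows + i]; }
};

// A(:,j) = B(:,j) / sum(C(:,j) + D(:,j)) for every column j.
void divide_by_column_sums(Mat& A, const Mat& B, const Mat& C, const Mat& D) {
    assert(A.rows == B.rows && B.rows == C.rows && C.rows == D.rows);
    assert(A.cols == B.cols && B.cols == C.cols && C.cols == D.cols);

    std::vector<std::future<void>> transforms;
    for (std::size_t j = 0; j < A.cols; ++j) {
        // Fold task: tmp(j) = sum(C(:,j) + D(:,j)).
        std::future<double> tmp = std::async(std::launch::async, [&C, &D, j] {
            double s = 0.0;
            for (std::size_t i = 0; i < C.rows; ++i) s += C.at(i, j) + D.at(i, j);
            return s;
        });
        // Transform task: waits only for the fold of its own column, so
        // columns proceed independently instead of meeting at one barrier.
        transforms.push_back(std::async(std::launch::async,
            [&A, &B, j, tmp = std::move(tmp)]() mutable {
                const double s = tmp.get();
                for (std::size_t i = 0; i < A.rows; ++i)
                    A.at(i, j) = B.at(i, j) / s;
            }));
    }
    for (auto& t : transforms) t.get();  // wait for the whole statement
}
```

Each transform task writes a distinct column of `A`, so no synchronization beyond the per-column future is needed.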
  • 46. Black and Scholes Option Pricing. NT2 Code:
        table<float> blackscholes( table<float> const Sa, table<float> const Xa
                                 , table<float> const Ta
                                 , table<float> const ra, table<float> const va
                                 )
        {
          table<float> da = sqrt(Ta);
          table<float> d1 = log(Sa/Xa) + (sqr(va)*0.5f + ra)*Ta/(va*da);
          table<float> d2 = d1 - va*da;
          return Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2);
        }
  • 47. Black and Scholes Option Pricing. NT2 Code - Manual Loop Fusion:
        table<float> blackscholes( table<float> const Sa, table<float> const Xa
                                 , table<float> const Ta
                                 , table<float> const ra, table<float> const va
                                 )
        {
          // Preallocate temporary tables
          table<float> da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));
          // tie merges loop nests and increases cache locality
          tie(da, d1, d2, R) = tie( sqrt(Ta)
                                  , log(Sa/Xa) + (sqr(va)*0.5f + ra)*Ta/(va*da)
                                  , d1 - va*da
                                  , Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2)
                                  );
          return R;
        }
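One way to picture what the fused version computes is a single loop in which `da`, `d1` and `d2` stay in registers for each element. The sketch below is a plain scalar C++ rendering under that assumption, not the code NT2 generates: `blackscholes_fused` and this `normcdf` are written for the example, `std::vector<float>` replaces `table<float>`, and the arithmetic is transcribed from the slide as written.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Standard normal CDF expressed with the complementary error function.
inline float normcdf(float x) {
    return 0.5f * std::erfc(-x / std::sqrt(2.0f));
}

std::vector<float> blackscholes_fused(const std::vector<float>& Sa,
                                      const std::vector<float>& Xa,
                                      const std::vector<float>& Ta,
                                      const std::vector<float>& ra,
                                      const std::vector<float>& va) {
    const std::size_t n = Sa.size();
    assert(Xa.size() == n && Ta.size() == n && ra.size() == n && va.size() == n);

    std::vector<float> R(n);
    // One loop nest for the four statements: each input element is loaded
    // once and the intermediates never go through memory.
    for (std::size_t i = 0; i < n; ++i) {
        const float da = std::sqrt(Ta[i]);
        const float d1 = std::log(Sa[i] / Xa[i])
                       + (va[i] * va[i] * 0.5f + ra[i]) * Ta[i] / (va[i] * da);
        const float d2 = d1 - va[i] * da;
        R[i] = Sa[i] * normcdf(d1)
             - Xa[i] * std::exp(-ra[i] * Ta[i]) * normcdf(d2);
    }
    return R;
}
```

For a single option with Sa = Xa = 100, Ta = 1, ra = 0.05 and va = 0.2, the function returns approximately 10.45.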
  • 48. Black and Scholes performance for size 256M. [Bar chart: execution time in cycles/element per implementation; implementations: scalar, SIMD, NT2; bar labels: 581, 140, 23]
  • 49. TiledLU algorithm. [Figure: tiled LU factorization of a 3x3 tile matrix (A00 to A22) over steps 1 to 7; kernels: DGETRF, DGESSM, DTSTRF, DSSSSM]
  • 50. LU performance for Problem Size 8000. [Line chart: performance in GFlop/s (0 to 150) versus number of cores (0 to 50); series: NT2, Plasma, Intel MKL]
  • 52. Conclusion and Perspectives. Conclusion: extraction of asynchrony from DSEL statements; combination of Future-based task management with parallel skeletons; efficiency for coarse-grain tasks.
  • 53. Conclusion and Perspectives. Conclusion: extraction of asynchrony from DSEL statements; combination of Future-based task management with parallel skeletons; efficiency for coarse-grain tasks. Perspectives: integration of application/memory-aware cost models; generalization of Futures for distributed systems and GPGPUs. Follow us on http://www.github.com/MetaScale/nt2
  • 54. Thanks for your attention