Providing high-level tools for parallel programming while sustaining a high level of performance is a challenge that techniques like Domain Specific Embedded Languages try to solve. In previous work, we investigated the design of such a DSEL – NT2 – providing a Matlab-like syntax for parallel numerical computations inside a C++ library.
The main issue addressed here is how the limitations of classical DSEL generation and multithreaded code generation can be overcome.
Automatic Task-based Code Generation for High Performance DSEL
1. Automatic Task-based Code Generation for High Performance DSEL
Antoine Tran Tan, Joel Falcou, Daniel Etiemble, Hartmut Kaiser
Université de Paris Sud - LRI - ParSys - Inria Postale
Ste||ar Group - LSU
NumScale SAS
Meeting C++ 2014
1 of 28
2. The Expressiveness/Efficiency War
Single Core Era
Performance
Expressiveness
C/Fort.
C++
Java
Multi-Core/SIMD Era
Performance
Sequential
Expressiveness
SIMD
Threads
Heterogenous Era
Performance
Sequential
Expressiveness
GPU
Phi
SIMD
Threads
Distributed
As parallel systems complexity grows, the expressiveness gap turns into an ocean
2 of 28
4. Designing tools for Scientific Computing
Challenges
1. Be non-disruptive
2. Perform domain-driven optimizations
3. Provide an intuitive API for the user
4. Support a wide architectural landscape
Our Approach
Design tools as C++ libraries (1)
Design them as Domain Specific Embedded Languages (DSELs) (2+3)
Use Generative Programming to deliver performance (4)
3 of 28
5. Designing tools for Scientific Computing
Challenges
1. Be non-disruptive
2. Perform domain-driven optimizations
3. Provide an intuitive API for the user
4. Support a wide architectural landscape
Objectives of the Talk
Present one such tool we designed
Expose some limitations of the DSEL medium
Propose a different take on such tools
3 of 28
6. Outline
Introduction
The NT2 Library
Redesign for task parallelism
Benchmarks
Conclusion and Perspectives
4 of 28
8. NT2 : The Numerical Template Toolbox
NT2 as a Scientific Computing Library
Provides a simple, Matlab-like DSEL
Provides high-performance computing entities and primitives
Is easily extendable
Components
Uses Boost.SIMD for in-core optimizations
Uses recursive parallel skeletons
Supports task parallelism through Futures
5 of 28
9. The Numerical Template Toolbox
Principles
table<T,S> is a simple, multidimensional array object that exactly
mimics Matlab array behavior and functionality
500+ functions usable directly on table or on any scalar value,
as in Matlab
6 of 28
12. The Numerical Template Toolbox
Principles
table<T,S> is a simple, multidimensional array object that exactly
mimics Matlab array behavior and functionality
500+ functions usable directly on table or on any scalar value,
as in Matlab
How does it work
Take a .m file, copy it to a .cpp file
Add #include <nt2/nt2.hpp> and make cosmetic changes
Compile the file and link with libnt2.a
6 of 28
13. NT2 - From Matlab to C++
Matlab code
A1 = 1:1000;
A2 = A1 + randn(size(A1));
X = lu(A1*A1');
rms = sqrt( sum( sqr( A1(:) - A2(:) ) ) / numel(A1) );
NT2 code
table<double> A1 = _(1., 1000.);
table<double> A2 = A1 + randn(size(A1));
table<double> X = lu( mtimes(A1, trans(A1)) );
double rms = sqrt( sum( sqr( A1(_) - A2(_) ) ) / numel(A1) );
7 of 28
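For readers without NT2 at hand, the last statement above can be sketched in plain C++. The function name and the use of std::vector are illustrative, not part of NT2:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar equivalent of:
//   rms = sqrt( sum( sqr( A1(:) - A2(:) ) ) / numel(A1) )
double rms(const std::vector<double>& a1, const std::vector<double>& a2)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < a1.size(); ++i)
    {
        double d = a1[i] - a2[i];   // A1(:) - A2(:)
        sum += d * d;               // sum(sqr(...))
    }
    return std::sqrt(sum / static_cast<double>(a1.size()));
}
```

NT2 generates essentially this loop, plus SIMD and threading, from the one-line expression.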
14. Domain Specific Embedded Languages
Domain Specific Languages
Non-Turing-complete declarative languages
Solve a single class of problems
Express what to do instead of how to do it
E.g. : SQL, Matlab, …
From DSL to DSEL [Abrahams 2004]
A DSL incorporates domain-specific notation, constructs, and abstractions as
fundamental design considerations.
A Domain Specific Embedded Language (DSEL) is simply a library that meets the
same criteria
Generative Programming is one way to design such libraries
8 of 28
15. Meta-programming as a Tool
Definition
Meta-programming is the writing of computer programs that analyse, transform and
generate other programs (or themselves) as their data.
Meta-programmable Languages
MetaOCaml : runtime code generation via code quoting
Template Haskell : compile-time code generation via templates
C++ : compile-time code generation via templates
C++ meta-programming
Relies on the Turing-complete C++ Template sub-language
Handles types and integral constants at compile-time
Template classes and functions act as code quoting
9 of 28
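As a minimal illustration of the last point, a template class can act as a compile-time function: the classic factorial example, evaluated entirely by the compiler.

```cpp
// Compile-time factorial: each instantiation "calls" the template
// with a smaller argument; the specialization for 0 stops recursion.
template <unsigned N>
struct factorial
{
    static const unsigned value = N * factorial<N - 1>::value;
};

template <>
struct factorial<0>
{
    static const unsigned value = 1;
};

// Evaluated during compilation, no runtime cost
static_assert(factorial<5>::value == 120, "computed by the compiler");
```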
16. The Expression Templates Idiom
Principles
Relies on extensive operator overloading
Carries semantic information around code fragments
Introduces DSLs without disrupting the development chain
matrix x(h,w),a(h,w),b(h,w);
x = cos(a) + (b*a);
expr<assign
    ,expr<matrix>
    ,expr<plus
         ,expr<cos,expr<matrix>>
         ,expr<multiplies
              ,expr<matrix>
              ,expr<matrix>>>>
(x,a,b);
(AST of x = cos(a) + b*a)
#pragma omp parallel for
for(int j=0;j<h;++j)
{
for(int i=0;i<w;++i)
{
x(j,i) = cos(a(j,i))
+ ( b(j,i)
* a(j,i)
);
}
}
Arbitrary Transforms applied
on the meta-AST
General Principles of Expression Templates
10 of 28
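A stripped-down, 1-D sketch of the idiom (toy types, not NT2's actual implementation): operators build a typed AST, and assignment triggers a single fused evaluation loop with no temporary arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct vec;  // forward declaration

// One AST node type; nesting plus_expr<plus_expr<...>,...> encodes the tree
template <typename L, typename R>
struct plus_expr
{
    L const& l;
    R const& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

struct vec
{
    std::vector<double> data;
    explicit vec(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assignment from any expression: a single loop evaluates the whole AST
    template <typename E>
    vec& operator=(E const& e)
    {
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

// operator+ does no arithmetic: it only records the operation in the type
template <typename L, typename R>
plus_expr<L, R> operator+(L const& l, R const& r) { return {l, r}; }
```

`a + b + c` builds `plus_expr<plus_expr<vec,vec>,vec>` at compile time; the transform applied at assignment is where a real library inserts SIMD or OpenMP code.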
17. The Expression Templates Idiom
Advantages
Generic implementations become self-aware of optimizations
API abstraction level is arbitrarily high
Accessible through high-level tools like Boost.Proto
10 of 28
19. Why Parallel Programming Models ?
Limits of regular tools
Unstructured parallelism is error-prone
Low level parallel tools are non-composable
Contribute to the Expressiveness Gap
Available Models
Performance centric : PRAM, LogP, BSP
Pattern centric : Futures, Skeletons
Data centric : HTA, PGAS
11 of 28
21. Parallel Skeletons [Cole 89]
Principles
There are patterns in parallel applications
Those patterns can be generalized in Skeletons
Applications are assembled as a combination of such patterns
Functional point of view
Skeletons are Higher-Order Functions
Skeletons support a compositional semantics
Applications become compositions of stateless functions
12 of 28
22. Parallel Skeletons [Cole 89]
Principles
There are patterns in parallel applications
Those patterns can be generalized in Skeletons
Applications are assembled as a combination of such patterns
Classical Skeletons
Data parallel : map, fold, scan
Task parallel : par, pipe, farm
More complex : Distributable Homomorphism, Divide & Conquer, …
12 of 28
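A sequential sketch of the functional view, with standard algorithms standing in for parallel implementations: map and fold are higher-order functions, and the application is just their composition. Names and signatures here are illustrative, not Cole's or NT2's API.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// map skeleton: apply a stateless function to every element
template <typename F>
std::vector<double> map(F f, std::vector<double> in)
{
    std::transform(in.begin(), in.end(), in.begin(), f);
    return in;
}

// fold skeleton: reduce the elements with a binary function
template <typename F>
double fold(F f, double init, const std::vector<double>& in)
{
    return std::accumulate(in.begin(), in.end(), init, f);
}
```

Sum-of-squares is then `fold(plus, 0, map(sqr, v))`: a composition of two patterns, each of which a parallel runtime may implement independently.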
24. Parallel Skeletons : The heart of NT2
Parallel Skeletons in NT2
transform : Apply an n-ary function over a subset of data
fold : Perform an n-ary reduction over a subset of data
scan : Perform an n-ary prefix reduction over a subset of data
Tied to families of loop nests
Elementwise : call to transform
Can be nested with elementwise operations
Reduction/scan : call to fold or scan respectively
Successive reductions or prefix scans are not nestable
Can contain a nested elementwise loop nest
13 of 28
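The sequential reference semantics of the three skeletons can be sketched with standard algorithms. The function names are illustrative; NT2's versions dispatch to the parallel loop nests described above.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// transform: elementwise application (here: squaring)
std::vector<int> transform_sqr(std::vector<int> in)
{
    std::transform(in.begin(), in.end(), in.begin(),
                   [](int x) { return x * x; });
    return in;
}

// fold: reduction to a single value (here: sum)
int fold_sum(const std::vector<int>& in)
{
    return std::accumulate(in.begin(), in.end(), 0);
}

// scan: prefix reduction, out[i] = in[0] + ... + in[i]
std::vector<int> scan_sum(std::vector<int> in)
{
    std::partial_sum(in.begin(), in.end(), in.begin());
    return in;
}
```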
26. Parallel Skeletons extraction process
A = B / sum(C+D);
(AST decomposition: the fold node sum(C+D) is split out as its own statement, tmp = sum(C+D), handled by the fold skeleton; the remaining elementwise statement A = B / tmp is handled by the transform skeleton.)
15 of 28
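In plain C++, the extracted code for this example amounts to two loop nests, sketched here with std::vector standing in for table:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two loop nests for A = B / sum(C+D): the reduction cannot be fused
// into the elementwise loop, so it is evaluated first.
std::vector<double> eval(const std::vector<double>& B,
                         const std::vector<double>& C,
                         const std::vector<double>& D)
{
    // fold phase: tmp = sum(C + D)
    double tmp = 0.0;
    for (std::size_t i = 0; i < C.size(); ++i) tmp += C[i] + D[i];

    // transform phase: A = B / tmp
    std::vector<double> A(B.size());
    for (std::size_t i = 0; i < B.size(); ++i) A[i] = B[i] / tmp;
    return A;
}
```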
28. The limits
Limits of Expression Templates
Synchronization cost due to implicit barriers
Under-exploitation of potential parallelism (e.g. concurrency, pipelining)
Poor data locality and no inter-statement optimization
Actions !
Upgrade NT2 to enable task parallelism
Adapt current skeletons for taskification
Derive a dependency graph between statements
16 of 28
29. Outline
Introduction
The NT2 Library
Redesign for task parallelism
Benchmarks
Conclusion and Perspectives
16 of 28
30. Futures programming model
future<int> f1 = async(foo1);
(Sequence diagram: the Client calls async(), which new()s a Promise and immediately returns a Future; the Promise later set()s the value.)
17 of 28
33. Futures programming model
future<int> f1 = async(foo1);
...
int result = f1.get();
(Sequence diagram: get() blocks while the value is not ready and returns it once the Promise has set() it.)
17 of 28
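The protocol above maps directly onto standard C++ futures (HPX exposes a compatible interface); a minimal sketch:

```cpp
#include <cassert>
#include <future>

int foo1() { return 42; }

// The client obtains a future immediately; get() blocks until the
// promise side has set the value (here, until the task returns).
int run()
{
    std::future<int> f1 = std::async(std::launch::async, foo1);
    // ... the client keeps working here while foo1 runs ...
    return f1.get();   // synchronize and fetch the value
}
```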
36. Sequential composition
future<int> f1 = async(foo1);
future<int> f2 = f1.then(foo2);
(Sequence diagram: foo1 has already set() its value when then() is called; Future2 is new()ed and returned immediately.)
18 of 28
37. Sequential composition
future<int> f1 = async(foo1);
future<int> f2 = f1.then(foo2);
(Sequence diagram: then() is called before foo1 has set() its value; Future2 is returned at once and completes when the Promise set()s.)
18 of 28
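std::future has no then() member (HPX and the Concurrency TS provide one). As a sketch under that caveat, continuation chaining can be emulated, wastefully, by a hypothetical helper that dedicates a thread to waiting on the predecessor:

```cpp
#include <cassert>
#include <future>
#include <utility>

// Naive then(): runs the continuation in a new task that blocks on
// the predecessor. A real runtime (e.g. HPX) schedules this without
// burning a thread on the wait.
template <typename T, typename F>
auto then(std::future<T> f, F cont)
    -> std::future<decltype(cont(std::declval<T>()))>
{
    return std::async(std::launch::async,
        [](std::future<T> f, F cont) { return cont(f.get()); },
        std::move(f), std::move(cont));
}
```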
39. Parallel composition
typedef future<int> F;
F f1 = async(foo1);
F f2 = async(foo2);
future<tuple<F,F>> f3 = when_all(f1, f2);
(Diagram: foo1 and foo2 run concurrently.)
19 of 28
41. Parallel composition
typedef future<int> F;
vector<F> join;
join.push_back(async(foo1));
join.push_back(async(foo2));
future<vector<F>> f3 = when_all(join);
(Diagram: foo1 and foo2 run concurrently.)
19 of 28
42. Parallel composition
typedef future<int> F;
vector<F> join;
join.push_back(async(foo1));
join.push_back(async(foo2));
F f3 = when_all(join).then(foo3);
(Diagram: foo1 and foo2 run concurrently; foo3 runs once both complete.)
19 of 28
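Standard C++ also lacks when_all (HPX provides it). A naive emulation for a vector of integer futures, used much like the slide's code; the helper name and its restriction to int are assumptions of this sketch:

```cpp
#include <cassert>
#include <future>
#include <utility>
#include <vector>

// Joins a group of futures: a task waits on each one and exposes all
// the results through a single future.
std::future<std::vector<int>> when_all_ints(std::vector<std::future<int>> fs)
{
    return std::async(std::launch::async,
        [](std::vector<std::future<int>> fs)
        {
            std::vector<int> out;
            for (auto& f : fs) out.push_back(f.get());
            return out;
        },
        std::move(fs));
}
```

Chaining a continuation on the joined future then gives the foo3-after-foo1-and-foo2 pattern from the slide.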
45. Outline
Introduction
The NT2 Library
Redesign for task parallelism
Benchmarks
Conclusion and Perspectives
21 of 28
46. Black and Scholes Option Pricing
NT2 Code
table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta
                         , table<float> const& ra, table<float> const& va
                         )
{
  table<float> da = sqrt(Ta);
  table<float> d1 = ( log(Sa/Xa) + (sqr(va)*0.5f + ra)*Ta ) / (va*da);
  table<float> d2 = d1 - va*da;
  return Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2);
}
22 of 28
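For reference, a scalar version of the same pricing formula, with normcdf expressed via std::erfc since <cmath> has no normal CDF; this is the standard Black-Scholes call price, not NT2 code:

```cpp
#include <cassert>
#include <cmath>

// Standard normal cumulative distribution function
double normcdf(double x)
{
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Black-Scholes call price for spot S, strike X, maturity T,
// rate r, volatility v
double blackscholes(double S, double X, double T, double r, double v)
{
    double d  = std::sqrt(T);
    double d1 = (std::log(S / X) + (v * v * 0.5 + r) * T) / (v * d);
    double d2 = d1 - v * d;
    return S * normcdf(d1) - X * std::exp(-r * T) * normcdf(d2);
}
```

The NT2 version applies exactly this arithmetic elementwise over whole tables, which is why it decomposes into a handful of transform loop nests.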
47. Black and Scholes Option Pricing
NT2 Code - Manual Loop Fusion
table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta
                         , table<float> const& ra, table<float> const& va
                         )
{
  // Preallocate temporary tables
  table<float> da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));
  // tie merges the loop nests and increases cache locality
  tie(da, d1, d2, R) = tie( sqrt(Ta)
                          , ( log(Sa/Xa) + (sqr(va)*0.5f + ra)*Ta ) / (va*da)
                          , d1 - va*da
                          , Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2)
                          );
  return R;
}
23 of 28
48. Black and Scholes performance for size 256M
(Bar chart of execution time in cycles/element per implementation: scalar = 581, SIMD = 140, NT2 = 23.)
24 of 28
50. LU performance for Problem Size 8000
(Plot of performance in GFlop/s versus number of cores, up to 50: NT2 compared against Plasma and Intel MKL.)
26 of 28
51. Outline
Introduction
The NT2 Library
Redesign for task parallelism
Benchmarks
Conclusion and Perspectives
26 of 28
53. Conclusion and Perspectives
Conclusion
Extraction of asynchrony from DSEL statements
Combination of Future-based task management with parallel skeletons
Efficiency for coarse grain tasks
Perspectives
Integration of application/memory aware cost models
Generalization of Futures for distributed systems and GPGPUs
Follow us on http://www.github.com/MetaScale/nt2
27 of 28