Providing high-level tools for parallel programming while sustaining a high level of performance is a challenge that techniques like Domain Specific Embedded Languages try to solve. In previous work, we investigated the design of such a DSEL – NT2 – providing a Matlab-like syntax for parallel numerical computations inside a C++ library.
The main issue addressed here is how the limitations of classical DSEL generation and multithreaded code generation can be overcome.
Automatic Task-based Code Generation for High Performance DSEL
1. Automatic Task-based Code Generation for High Performance DSEL
Antoine Tran Tan, Joel Falcou, Daniel Etiemble, Hartmut Kaiser
Université de Paris Sud - LRI - ParSys - Inria Postale
Ste||ar Group - LSU
NumScale SAS
Meeting C++ 2014
1 of 28
2. The Expressiveness/Efficiency War
Single Core Era
Performance
Expressiveness
C/Fort.
C++
Java
Multi-Core/SIMD Era
Performance
Sequential
Expressiveness
SIMD
Threads
Heterogenous Era
Performance
Sequential
Expressiveness
GPU
Phi
SIMD
Threads
Distributed
As parallel systems complexity grows, the expressiveness gap turns into an ocean
2 of 28
4. Designing tools for Scientific Computing
Challenges
1. Be non-disruptive
2. Perform domain-driven optimizations
3. Provide an intuitive API for the user
4. Support a wide architectural landscape
Our Approach
Design tools as C++ libraries (1)
Design them as Domain Specific Embedded Languages (DSELs) (2+3)
Use Generative Programming to deliver performance (4)
3 of 28
5. Designing tools for Scientific Computing
Challenges
1. Be non-disruptive
2. Perform domain-driven optimizations
3. Provide an intuitive API for the user
4. Support a wide architectural landscape
Objectives of the Talk
Present one such tool we designed
Expose some limitations of the DSEL medium
Propose a different take on such tools
3 of 28
6. Outline
Introduction
The NT2 Library
Redesign for task parallelism
Benchmarks
Conclusion and Perspectives
4 of 28
8. NT2 : The Numerical Template Toolbox
NT2 as a Scientific Computing Library
Provides a simple, Matlab-like DSEL
Provides high-performance computing entities and primitives
Is easily extendable
Components
Uses Boost.SIMD for in-core optimizations
Uses recursive parallel skeletons
Supports task parallelism through Futures
5 of 28
9. The Numerical Template Toolbox
Principles
table<T,S> is a simple, multidimensional array object that exactly
mimics Matlab array behavior and functionality
500+ functions usable directly on table or on any scalar value,
as in Matlab
6 of 28
12. The Numerical Template Toolbox
Principles
table<T,S> is a simple, multidimensional array object that exactly
mimics Matlab array behavior and functionality
500+ functions usable directly on table or on any scalar value,
as in Matlab
How does it work
Take a .m file, copy it to a .cpp file
Add #include <nt2/nt2.hpp> and make cosmetic changes
Compile the file and link with libnt2.a
6 of 28
13. NT2 - From Matlab to C++
Matlab code
A1 = 1:1000;
A2 = A1 + randn(size(A1));
X = lu(A1*A1');
rms = sqrt( sum( sqr( A1(:) - A2(:) ) ) / numel(A1) );
NT2 code
table<double> A1 = _(1., 1000.);
table<double> A2 = A1 + randn(size(A1));
table<double> X = lu( mtimes(A1, trans(A1)) );
double rms = sqrt( sum( sqr( A1(_) - A2(_) ) ) / numel(A1) );
7 of 28
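For readers without NT2 at hand, the last statement above can be sketched in plain C++. The function name and the use of std::vector are illustrative, not part of NT2:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar equivalent of:
//   rms = sqrt( sum( sqr( A1(:) - A2(:) ) ) / numel(A1) )
double rms(const std::vector<double>& a1, const std::vector<double>& a2)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < a1.size(); ++i)
    {
        double d = a1[i] - a2[i];   // A1(:) - A2(:)
        sum += d * d;               // sum(sqr(...))
    }
    return std::sqrt(sum / static_cast<double>(a1.size()));
}
```

NT2 generates essentially this loop, plus SIMD and threading, from the one-line expression.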
14. Domain Specific Embedded Languages
Domain Specific Languages
Non-Turing-complete declarative languages
Solve a single class of problems
Express what to do instead of how to do it
E.g. : SQL, Matlab, …
From DSL to DSEL [Abrahams 2004]
A DSL incorporates domain-specific notation, constructs, and abstractions as
fundamental design considerations.
A Domain Specific Embedded Language (DSEL) is simply a library that meets the
same criteria
Generative Programming is one way to design such libraries
8 of 28
15. Meta-programming as a Tool
Definition
Meta-programming is the writing of computer programs that analyse, transform and
generate other programs (or themselves) as their data.
Meta-programmable Languages
MetaOCaml : runtime code generation via code quoting
Template Haskell : compile-time code generation via templates
C++ : compile-time code generation via templates
C++ meta-programming
Relies on the Turing-complete C++ Template sub-language
Handles types and integral constants at compile-time
Template classes and functions act as code quoting
9 of 28
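As a minimal illustration of the last point, a template class can act as a compile-time function: the classic factorial example, evaluated entirely by the compiler.

```cpp
// Compile-time factorial: each instantiation "calls" the template
// with a smaller argument; the specialization for 0 stops recursion.
template <unsigned N>
struct factorial
{
    static const unsigned value = N * factorial<N - 1>::value;
};

template <>
struct factorial<0>
{
    static const unsigned value = 1;
};

// Evaluated during compilation, no runtime cost
static_assert(factorial<5>::value == 120, "computed by the compiler");
```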
16. The Expression Templates Idiom
Principles
Relies on extensive operator overloading
Carries semantic information around code fragments
Introduces DSLs without disrupting the development chain
matrix x(h,w),a(h,w),b(h,w);
x = cos(a) + (b*a);
expr<assign
    ,expr<matrix>
    ,expr<plus
         ,expr<cos,expr<matrix>>
         ,expr<multiplies
              ,expr<matrix>
              ,expr<matrix>>>>
(x,a,b);
(AST of x = cos(a) + b*a)
#pragma omp parallel for
for(int j=0;j<h;++j)
{
for(int i=0;i<w;++i)
{
x(j,i) = cos(a(j,i))
+ ( b(j,i)
* a(j,i)
);
}
}
Arbitrary Transforms applied
on the meta-AST
General Principles of Expression Templates
10 of 28
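A stripped-down, 1-D sketch of the idiom (toy types, not NT2's actual implementation): operators build a typed AST, and assignment triggers a single fused evaluation loop with no temporary arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct vec;  // forward declaration

// One AST node type; nesting plus_expr<plus_expr<...>,...> encodes the tree
template <typename L, typename R>
struct plus_expr
{
    L const& l;
    R const& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

struct vec
{
    std::vector<double> data;
    explicit vec(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assignment from any expression: a single loop evaluates the whole AST
    template <typename E>
    vec& operator=(E const& e)
    {
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

// operator+ does no arithmetic: it only records the operation in the type
template <typename L, typename R>
plus_expr<L, R> operator+(L const& l, R const& r) { return {l, r}; }
```

`a + b + c` builds `plus_expr<plus_expr<vec,vec>,vec>` at compile time; the transform applied at assignment is where a real library inserts SIMD or OpenMP code.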
17. The Expression Templates Idiom
Advantages
Generic implementations become self-aware of optimizations
API abstraction level is arbitrarily high
Accessible through high-level tools like Boost.Proto
10 of 28
19. Why Parallel Programming Models ?
Limits of regular tools
Unstructured parallelism is error-prone
Low level parallel tools are non-composable
Contribute to the Expressiveness Gap
Available Models
Performance centric : PRAM, LogP, BSP
Pattern centric : Futures, Skeletons
Data centric : HTA, PGAS
11 of 28
21. Parallel Skeletons [Cole 89]
Principles
There are patterns in parallel applications
Those patterns can be generalized in Skeletons
Applications are assembled as a combination of such patterns
Functional point of view
Skeletons are Higher-Order Functions
Skeletons support a compositional semantics
Applications become compositions of stateless functions
12 of 28
22. Parallel Skeletons [Cole 89]
Principles
There are patterns in parallel applications
Those patterns can be generalized in Skeletons
Applications are assembled as a combination of such patterns
Classical Skeletons
Data parallel : map, fold, scan
Task parallel : par, pipe, farm
More complex : Distributable Homomorphism, Divide & Conquer, …
12 of 28
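A sequential sketch of the functional view, with standard algorithms standing in for parallel implementations: map and fold are higher-order functions, and the application is just their composition. Names and signatures here are illustrative, not Cole's or NT2's API.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// map skeleton: apply a stateless function to every element
template <typename F>
std::vector<double> map(F f, std::vector<double> in)
{
    std::transform(in.begin(), in.end(), in.begin(), f);
    return in;
}

// fold skeleton: reduce the elements with a binary function
template <typename F>
double fold(F f, double init, const std::vector<double>& in)
{
    return std::accumulate(in.begin(), in.end(), init, f);
}
```

Sum-of-squares is then `fold(plus, 0, map(sqr, v))`: a composition of two patterns, each of which a parallel runtime may implement independently.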
24. Parallel Skeletons : The heart of NT2
Parallel Skeletons in NT2
transform : Apply an n-ary function over a subset of data
fold : Perform an n-ary reduction over a subset of data
scan : Perform an n-ary prefix reduction over a subset of data
Tied to families of loop nests
Elementwise : call to transform
Can be nested with elementwise operations
Reduction/scan : call to fold or scan respectively
Successive reductions or prefix scans are not nestable
Can contain a nested elementwise loop nest
13 of 28
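The sequential reference semantics of the three skeletons can be sketched with standard algorithms. The function names are illustrative; NT2's versions dispatch to the parallel loop nests described above.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// transform: elementwise application (here: squaring)
std::vector<int> transform_sqr(std::vector<int> in)
{
    std::transform(in.begin(), in.end(), in.begin(),
                   [](int x) { return x * x; });
    return in;
}

// fold: reduction to a single value (here: sum)
int fold_sum(const std::vector<int>& in)
{
    return std::accumulate(in.begin(), in.end(), 0);
}

// scan: prefix reduction, out[i] = in[0] + ... + in[i]
std::vector<int> scan_sum(std::vector<int> in)
{
    std::partial_sum(in.begin(), in.end(), in.begin());
    return in;
}
```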
26. Parallel Skeletons extraction process
A = B / sum(C+D);
(AST decomposition: the fold node sum(C+D) is split out as its own statement, tmp = sum(C+D), handled by the fold skeleton; the remaining elementwise statement A = B / tmp is handled by the transform skeleton.)
15 of 28
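In plain C++, the extracted code for this example amounts to two loop nests, sketched here with std::vector standing in for table:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two loop nests for A = B / sum(C+D): the reduction cannot be fused
// into the elementwise loop, so it is evaluated first.
std::vector<double> eval(const std::vector<double>& B,
                         const std::vector<double>& C,
                         const std::vector<double>& D)
{
    // fold phase: tmp = sum(C + D)
    double tmp = 0.0;
    for (std::size_t i = 0; i < C.size(); ++i) tmp += C[i] + D[i];

    // transform phase: A = B / tmp
    std::vector<double> A(B.size());
    for (std::size_t i = 0; i < B.size(); ++i) A[i] = B[i] / tmp;
    return A;
}
```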
28. The limits
Limits of Expression Templates
Synchronization cost due to implicit barriers
Under-exploitation of potential parallelism (e.g. concurrency, pipelining)
Poor data locality and no inter-statement optimization
Actions !
Upgrade NT2 to enable task parallelism
Adapt current skeletons for taskification
Derive a dependency graph between statements
16 of 28
29. Outline
Introduction
The NT2 Library
Redesign for task parallelism
Benchmarks
Conclusion and Perspectives
16 of 28
30. Futures programming model
future<int> f1 = async(foo1);
(Sequence diagram: the Client calls async(), which new()s a Promise and immediately returns a Future; the Promise later set()s the value.)
17 of 28
33. Futures programming model
future<int> f1 = async(foo1);
...
int result = f1.get();
(Sequence diagram: get() blocks while the value is not ready and returns it once the Promise has set() it.)
17 of 28
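The protocol above maps directly onto standard C++ futures (HPX exposes a compatible interface); a minimal sketch:

```cpp
#include <cassert>
#include <future>

int foo1() { return 42; }

// The client obtains a future immediately; get() blocks until the
// promise side has set the value (here, until the task returns).
int run()
{
    std::future<int> f1 = std::async(std::launch::async, foo1);
    // ... the client keeps working here while foo1 runs ...
    return f1.get();   // synchronize and fetch the value
}
```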
36. Sequential composition
future<int> f1 = async(foo1);
future<int> f2 = f1.then(foo2);
(Sequence diagram: foo1 has already set() its value when then() is called; Future2 is new()ed and returned immediately.)
18 of 28
37. Sequential composition
future<int> f1 = async(foo1);
future<int> f2 = f1.then(foo2);
(Sequence diagram: then() is called before foo1 has set() its value; Future2 is returned at once and completes when the Promise set()s.)
18 of 28
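std::future has no then() member (HPX and the Concurrency TS provide one). As a sketch under that caveat, continuation chaining can be emulated, wastefully, by a hypothetical helper that dedicates a thread to waiting on the predecessor:

```cpp
#include <cassert>
#include <future>
#include <utility>

// Naive then(): runs the continuation in a new task that blocks on
// the predecessor. A real runtime (e.g. HPX) schedules this without
// burning a thread on the wait.
template <typename T, typename F>
auto then(std::future<T> f, F cont)
    -> std::future<decltype(cont(std::declval<T>()))>
{
    return std::async(std::launch::async,
        [](std::future<T> f, F cont) { return cont(f.get()); },
        std::move(f), std::move(cont));
}
```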
39. Parallel composition
typedef future<int> F;
F f1 = async(foo1);
F f2 = async(foo2);
future<tuple<F,F>> f3 = when_all(f1, f2);
(Diagram: foo1 and foo2 run concurrently.)
19 of 28
41. Parallel composition
typedef future<int> F;
vector<F> join;
join.push_back(async(foo1));
join.push_back(async(foo2));
future<vector<F>> f3 = when_all(join);
(Diagram: foo1 and foo2 run concurrently.)
19 of 28
42. Parallel composition
typedef future<int> F;
vector<F> join;
join.push_back(async(foo1));
join.push_back(async(foo2));
F f3 = when_all(join).then(foo3);
(Diagram: foo1 and foo2 run concurrently; foo3 runs once both complete.)
19 of 28
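Standard C++ also lacks when_all (HPX provides it). A naive emulation for a vector of integer futures, used much like the slide's code; the helper name and its restriction to int are assumptions of this sketch:

```cpp
#include <cassert>
#include <future>
#include <utility>
#include <vector>

// Joins a group of futures: a task waits on each one and exposes all
// the results through a single future.
std::future<std::vector<int>> when_all_ints(std::vector<std::future<int>> fs)
{
    return std::async(std::launch::async,
        [](std::vector<std::future<int>> fs)
        {
            std::vector<int> out;
            for (auto& f : fs) out.push_back(f.get());
            return out;
        },
        std::move(fs));
}
```

Chaining a continuation on the joined future then gives the foo3-after-foo1-and-foo2 pattern from the slide.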
45. Outline
Introduction
The NT2 Library
Redesign for task parallelism
Benchmarks
Conclusion and Perspectives
21 of 28
46. Black and Scholes Option Pricing
NT2 Code
table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta
                         , table<float> const& ra, table<float> const& va
                         )
{
  table<float> da = sqrt(Ta);
  table<float> d1 = ( log(Sa/Xa) + (sqr(va)*0.5f + ra)*Ta ) / (va*da);
  table<float> d2 = d1 - va*da;
  return Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2);
}
22 of 28
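For reference, a scalar version of the same pricing formula, with normcdf expressed via std::erfc since <cmath> has no normal CDF; this is the standard Black-Scholes call price, not NT2 code:

```cpp
#include <cassert>
#include <cmath>

// Standard normal cumulative distribution function
double normcdf(double x)
{
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Black-Scholes call price for spot S, strike X, maturity T,
// rate r, volatility v
double blackscholes(double S, double X, double T, double r, double v)
{
    double d  = std::sqrt(T);
    double d1 = (std::log(S / X) + (v * v * 0.5 + r) * T) / (v * d);
    double d2 = d1 - v * d;
    return S * normcdf(d1) - X * std::exp(-r * T) * normcdf(d2);
}
```

The NT2 version applies exactly this arithmetic elementwise over whole tables, which is why it decomposes into a handful of transform loop nests.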
47. Black and Scholes Option Pricing
NT2 Code - Manual Loop Fusion
table<float> blackscholes( table<float> const& Sa, table<float> const& Xa
                         , table<float> const& Ta
                         , table<float> const& ra, table<float> const& va
                         )
{
  // Preallocate temporary tables
  table<float> da(extent(Ta)), d1(extent(Ta)), d2(extent(Ta)), R(extent(Ta));
  // tie merges the loop nests and increases cache locality
  tie(da, d1, d2, R) = tie( sqrt(Ta)
                          , ( log(Sa/Xa) + (sqr(va)*0.5f + ra)*Ta ) / (va*da)
                          , d1 - va*da
                          , Sa*normcdf(d1) - Xa*exp(-ra*Ta)*normcdf(d2)
                          );
  return R;
}
23 of 28
48. Black and Scholes performance for size 256M
(Bar chart of execution time in cycles/element per implementation: scalar = 581, SIMD = 140, NT2 = 23.)
24 of 28
50. LU performance for Problem Size 8000
(Plot of performance in GFlop/s versus number of cores, up to 50: NT2 compared against Plasma and Intel MKL.)
26 of 28
51. Outline
Introduction
The NT2 Library
Redesign for task parallelism
Benchmarks
Conclusion and Perspectives
26 of 28
53. Conclusion and Perspectives
Conclusion
Extraction of asynchrony from DSEL statements
Combination of Future-based task management with parallel skeletons
Efficiency for coarse grain tasks
Perspectives
Integration of application/memory aware cost models
Generalization of Futures for distributed systems and GPGPUs
Follow us on http://www.github.com/MetaScale/nt2
27 of 28