ESAIM: PROCEEDINGS, Vol. ?, 2013, 1-10
Editors: Will be set by the publisher
COMPILATION ANALYSIS, PERFORMANCE ANALYSIS, SCALABILITY USING
SCALASCA WITH FEEL++ SCIENTIFIC APPLICATIONS∗, ∗∗
Jussara Marandola1
Résumé. ...
Abstract. This paper presents the following contributions: a compilation analysis of the Feel++ language, presenting one example, representative of the state of the art in mesh manipulation, that provides 1D, 2D and 3D models (triangle, rectangle, tetrahedron, cube) for the Laplacian code. The focus is the analysis of the compilation options during execution (mpirun, MPI execution with two and four processors). As a first step, we show the importance of performance analysis to compare Feel++ scientific applications, either using a Feel++ Time class or introducing Scalasca instrumentation, in order to obtain the CPU allocation time and the throughput with respect to the number of threads and the scalability of processors on clusters.
1. Introduction
In this paper, we introduce the compilation executions and performance analysis of Feel++, a Finite Element Embedded Language in C++. Feel++ is a C++ library for arbitrary order Galerkin methods (e.g. finite and spectral element methods), continuous or discontinuous, in 1D, 2D and 3D. It includes many features such as geometries in 1D, 2D and 3D and lower topological dimensions, 1D (curve) in 2D and 3D or 2D (surface) in 3D. Through its support of Gmsh for mesh generation and Paraview for post-processing, we can analyze the models by visualization.
Designed as a library to solve problems arising from partial differential equations (PDEs) through generalized Galerkin methods [13], Feel++ handles the complexity of differential models and the implementation of state-of-the-art robust numerical methods: a clear language to express problems specialized to a type of equation, e.g. Navier-Stokes or linear elasticity models, and finally a wealth of solution algorithms. The latest advances in Feel++ describe the mesh data structure as well as the mesh entities (elements, faces, edges, nodes) and the algebraic representations. In a parallel context, mesh entities are indexed by the MPI process id.
Feel++ relies on MPI for parallel computations and the Application class initialises the MPI environment. The makefile of a Feel++ project enables the MPI mode option, which offers parallel computation through the mpirun command.
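For readers who are not familiar with MPI, the following minimal sketch in plain MPI (illustrative only, not Feel++ code) shows what this initialisation provides to each process launched by mpirun: a process rank, used for instance to index mesh entities, and the total number of processes.

// Minimal plain-MPI sketch: what an MPI-aware application environment
// must do before and after the parallel computation.
#include <mpi.h>
#include <iostream>

int main( int argc, char** argv )
{
    MPI_Init( &argc, &argv );                // started once per process by mpirun

    int rank = 0, size = 1;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );  // process id, used e.g. to index mesh entities
    MPI_Comm_size( MPI_COMM_WORLD, &size );  // number of processes given by mpirun -np N

    std::cout << "process " << rank << " of " << size << std::endl;

    MPI_Finalize();
    return 0;
}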
The main goal is the introduction of the compilation and execution in the linear algebra environment, providing all mesh options, displaying the models through Gmsh or Paraview, presenting the base concepts of performance analysis (equations, metrics and a Time class in Feel++) and, finally, introducing the use of Scalasca to analyze the scalability and performance reports on Feel++.
∗ Thanks : Laboratoire Jean Kuntzmann, Université Joseph Fourier Grenoble 1, BP53 38041 Grenoble Cedex 9, France, e-mail :
christophe.prudhomme@ujf-grenoble.fr
∗∗ Thanks : Embedded Real Time Systems Foundation Laboratory (LaSTRE), CEA LIST, CEA-Saclay Nano-INNOV PC172,
F91191 Gif-sur-Yvette cedex, France, e-mail : stephane.louise@cea.fr
1 jussara@nerim.net
© EDP Sciences, SMAI 2013
Scalasca [6] is a performance toolset that has been specifically designed to analyze parallel application execution behavior on large-scale systems. It offers an incremental performance analysis procedure that integrates runtime summaries with in-depth studies of concurrent behavior via event tracing, adopting a strategy of successively refined measurement configurations.
We will focus our case study on the linear algebra environment, especially the standard formulation: the Laplacian case. Feel++ supports three different linear algebra environments, which we shall call backends: Gmm, PETSc and Trilinos. Regarding the function space definitions, several types of polynomials (P) are used: Lagrange, Legendre, Dubiner, Crouzeix-Raviart, Raviart-Thomas. Feel++ also supports modal bases, e.g. Legendre or Dubiner [12], as well as finite elements (FE) following the standard definition, set in [4], as a triplet (κ, P, Σ) where κ is a convex, P the polynomial space and Σ the dual space.
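As a concrete instance of this classical definition (a standard example, not specific to Feel++), the lowest-order Lagrange element on the reference triangle reads
$$ \kappa = \{(x,y) : x \ge 0,\; y \ge 0,\; x+y \le 1\}, \qquad P = \mathbb{P}_1(\kappa) = \operatorname{span}\{1, x, y\}, \qquad \Sigma = \{\sigma_i : p \mapsto p(a_i),\; i = 1,2,3\}, $$
where a_1, a_2, a_3 are the vertices of κ.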
The remainder of this paper is organized as follows.
In Section 2, we introduce the analysis of execution time for parallel algorithms as well as the concepts of speedup, execution time components and efficiency, and finally the importance of Amdahl's law in the performance analysis of graphics processors. The advantages of the most common model, message passing (MPI), in the context of the Feel++ programming environment during compilation are also discussed. Section 3 describes the standard formulation of the Laplacian with strong and weak Dirichlet conditions, explaining the compilation options driven by the following parameters: hsize and shape. The latter provides different polynomials and dimensions according to fine and coarse grain (represented by the mesh size). In Section 4, we present an overview of the current version of Scalasca in terms of instrumentation and measurement; the layered model of the Scalasca architecture, the analysis configuration and the MPI sample ctest-pomp-mpi are described, after which we show the Laplacian case using Scalasca. Finally, Section 5 concludes with the results of the setup experiments using Scalasca, the contributions and limitations of Scalasca with Feel++, the advances of the compilation options with the mesh data structure and, finally, the performance analysis of the programming environment in terms of serial/parallel execution time of the code and the use of the toolset for instrumentation and measurement.
2. Performance Analysis
This paper presents the Scalasca toolset for scalable performance analysis of large-scale parallel applications; it offers a basis to examine the effectiveness of parallel performance tools. The main goal is to optimize real applications (benchmarks and the Laplacian case) by understanding the barriers to high performance and predicting the possible improvement. Performance tools usually provide metrics about the parallelization of a program.
The first objective, related to the optimization of parallel algorithms in the Feel++ language, is the analysis of execution time, communication time and topology view (in terms of the analysis of processes on a Cartesian grid) for parallel algorithms using MPI programming on the codes and compilation options. We use the message passing model because of its advantage in scalability, since it runs on distributed-memory multiprocessors; most large-scale parallel machines use the common message passing model, MPI [10].
In this context, one of the challenges [8] faced by application developers for high performance parallel systems is the relatively difficult programming environment. Depending on the parallel machine, high performance can be obtained with programming models such as OpenMP [11], threads, data parallelism or automatic parallelization. Recent performance tools support hybrid programming (MPI/OpenMP) to optimize the application.
Recent studies on graphics performance analysis [9] using Amdahl's law are very important for designing future graphics processors as well as special-purpose processors. Amdahl's law is the most important formulation describing why increasing the number of processors does not lead to a proportional increase in performance: the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. The model of [9] can be written in the form
$$ T = T_0 + (T_1 - T_0)\,\frac{f_1}{f}, \qquad (1) $$
where T1 is the measured execution time at frequency f1, T0 is the non-scaled time, and f is the frequency at which T is predicted.
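As an illustration with made-up numbers, suppose T1 = 10 s is measured at f1 = 1 GHz and the non-scaled part is T0 = 2 s. Running at f = 2 GHz, the model predicts
$$ T = 2 + (10 - 2)\cdot\frac{1}{2} = 6 \ \text{s}, $$
i.e. doubling the frequency yields a speedup of only 10/6 ≈ 1.67 instead of 2, because the non-scaled fraction T0 does not benefit from the faster mode.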
Amdahl's law is very important in performance analysis with regard to scalability, that is, the increasing number of processors, in terms of measures of computation time (throughput and execution time) and communication time. The analysis of scalability requires many factors to measure performance, involving the concepts of speedup and efficiency.
The speedup measures how much faster a parallel code runs than its sequential counterpart; it is the ratio of the sequential execution time to the parallel execution time. We distinguish the following execution time components:
• Inherently sequential computations: σ(n)
• Potentially parallel computations: ϕ(n)
• Communication operations and other redundant computations: κ(n, p)
Regarding the speedup, ψ(n, p) for a problem of size n on p processors is bounded by
$$ \psi(n,p) \;\le\; \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)}. \qquad (2) $$
Finally, the efficiency can be measured through processor utilization, the ratio of the speedup to the number of processors used; consequently, the efficiency ε(n, p) for a problem of size n on p processors is bounded by
$$ \varepsilon(n,p) \;\le\; \frac{\sigma(n) + \varphi(n)}{p\,\bigl(\sigma(n) + \varphi(n)/p + \kappa(n,p)\bigr)}. \qquad (3) $$
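As a minimal sketch (with made-up values for σ(n), ϕ(n) and κ(n, p); these are not measurements from the Laplacian case), the two bounds can be evaluated as follows:

// Illustrative only: evaluate the speedup and efficiency bounds (2)-(3)
// for made-up values of sigma(n), phi(n) and kappa(n,p).
#include <cstdio>
#include <initializer_list>

int main()
{
    const double sigma = 1.0;    // inherently sequential time
    const double phi   = 99.0;   // potentially parallel time
    const double kappa = 0.5;    // communication/redundant time per run

    for ( int p : { 2, 4, 8, 16 } )
    {
        double psi = ( sigma + phi ) / ( sigma + phi / p + kappa );  // bound (2)
        double eps = psi / p;                                        // bound (3)
        std::printf( "p = %2d  speedup <= %6.2f  efficiency <= %4.2f\n", p, psi, eps );
    }
    return 0;
}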
3. Laplacian Case
The Feel++ documentation sources list all the samples. For the Laplacian case, there are the following examples: the Laplacian with weak Dirichlet conditions (problem in 2D), the full Laplacian, and the Laplacian with Lagrange multipliers. All codes are found in this directory:
~/Devel/feelpp/doc/manual/tutorial$
We analyzed the Laplacian with weak Dirichlet conditions, laplacian.cpp. The variational formulation of a problem is also called the weak formulation; the key idea is to introduce a test function and to integrate by parts. As a mathematical formulation example, we would like to solve the following problem.
Problem 1: find u such that
$$ -\Delta u = f \quad \text{in } \Omega = [-1,1]^2, \qquad (4) $$
with
$$ f = 2\pi^2 g, \qquad (5) $$
where g is the exact solution
$$ g = \sin(\pi x)\cos(\pi y). \qquad (6) $$
The following boundary conditions apply:
$$ u = g \ \text{ on } x = \pm 1, \qquad \frac{\partial u}{\partial n} = 0 \ \text{ on } y = \pm 1. \qquad (7) $$
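Indeed, a direct computation (a one-line check added here for completeness) confirms that g is the exact solution:
$$ -\Delta g = -\left(\partial_{xx} + \partial_{yy}\right)\sin(\pi x)\cos(\pi y) = 2\pi^2 \sin(\pi x)\cos(\pi y) = 2\pi^2 g = f, $$
and on y = ±1 we have ∂g/∂n = ±∂g/∂y = ∓π sin(πx) sin(πy) = 0, so the homogeneous Neumann condition holds.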
The alternative mathematical formulation handles the Dirichlet condition weakly and hence provides a uniform treatment for all types of boundary conditions: the formulation that treats the Dirichlet boundary condition weakly extends naturally to Neumann and Robin conditions. Following a similar development as in the previous section, the problem reads
$$ \int_\Omega \nabla u \cdot \nabla v \;+\; \int_{x=\pm 1} \Bigl( -\frac{\partial u}{\partial n}\,v \;-\; u\,\frac{\partial v}{\partial n} \;+\; \frac{\mu}{h}\,u\,v \Bigr) \;=\; \int_\Omega f\,v \;+\; \int_{x=\pm 1} \Bigl( -\,g\,\frac{\partial v}{\partial n} \;+\; \frac{\mu}{h}\,g\,v \Bigr), \qquad (8) $$
where v denotes the test function, n the outward unit normal, h the local mesh size and μ the penalisation parameter (the penaldir option below).
3.1. Execution Options and Feel++ Implementation
In the C++ code, the Laplacian class defines, through the makeOptions function, all the options available at execution time. This function, a routine that returns the list of options, uses the Boost program options library. The data returned is used as an argument of a Feel Application subclass. In our experiment, we focused on hsize (the characteristic mesh size), shape (the shape of the domain, either hypercube or simplex), dim (the dimension: 1D, 2D or 3D) and finally weakdir (whether the user wants to use the weak Dirichlet condition).
First we define the options; the main lines of the Feel++ makeOptions function are:
#include <feel/feel.hpp>
/** use Feel namespace */
using namespace Feel;

inline
po::options_description
makeOptions()
{
    po::options_description laplacianoptions( "Laplacian options" );
    laplacianoptions.add_options()
        ( "hsize", po::value<double>()->default_value( 0.5 ), "mesh size" )
        ( "shape", Feel::po::value<std::string>()->default_value( "hypercube" ), "shape of the domain (either simplex or hypercube)" )
        ( "nu", po::value<double>()->default_value( 1 ), "grad.grad coefficient" )
        ( "weakdir", po::value<int>()->default_value( 1 ), "use weak Dirichlet condition" )
        ( "penaldir", Feel::po::value<double>()->default_value( 10 ),
          "penalisation parameter for the weak boundary Dirichlet formulation" )
        ( "exact1D", po::value<std::string>()->default_value( "sin(2*Pi*x)" ), "exact 1D solution" )
        ( "exact2D", po::value<std::string>()->default_value( "sin(2*Pi*x)*cos(2*Pi*y)" ), "exact 2D solution" )
        ( "exact3D", po::value<std::string>()->default_value( "sin(2*Pi*x)*cos(2*Pi*y)*cos(2*Pi*z)" ), "exact 3D solution" )
        ( "rhs1D", po::value<std::string>()->default_value( "" ), "right hand side 1D" )
        ( "rhs2D", po::value<std::string>()->default_value( "" ), "right hand side 2D" )
        ( "rhs3D", po::value<std::string>()->default_value( "" ), "right hand side 3D" )
        ;
    return laplacianoptions.add( Feel::feel_options() );
}
Initially, the geometric dimension of the problem is set to Dim=2. In the public part of the Laplacian class, we list several types:
• Numerical type: double
• Geometric entity type composing the mesh, here Simplex in dimension Dim of order 1
• Mesh type
• Approximation function space type
• Exporter factory type
The Laplacian() constructor retrieves the mesh size and the shape from the command-line options, as defined below:
Laplacian()
    :
    super(),
    meshSize( this->vm()["hsize"].template as<double>() ),
    shape( this->vm()["shape"].template as<std::string>() )
{
}
The execution is defined by the run member function, parametrised by template<int Dim, int Order>:
template<int Dim, int Order>
void Laplacian<Dim,Order>::run()
{
    LOG(INFO) << "--------------------------------\n";
    LOG(INFO) << "Execute Laplacian<" << Dim << ">\n";
    Environment::changeRepository( boost::format( "doc/manual/tutorial/%1%/%2%-%3%/P%4%/h_%5%/" )
                                   % this->about().appName()
                                   % shape
                                   % Dim
                                   % Order
                                   % meshSize );
    mesh_ptrtype mesh = createGMSHMesh( _mesh=new mesh_type,
                                        _desc=domain( _name=( boost::format( "%1%-%2%" ) % shape % Dim ).str(),
                                                      _usenames=true,
                                                      _shape=shape,
                                                      _h=meshSize,
                                                      _xmin=-1,
                                                      _ymin=-1 ) );
    ...
}
The parameter _h in the function domain allows changing the characteristic mesh size to M_meshsize, which is given for example on the command line using the Application framework. createGMSHMesh is a function which generates a mesh file (.msh) automatically from a description file (.geo for example) via the _desc parameter, and stores the generated mesh into the mesh parameter allocated when calling the function.
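The assembly of the weak formulation (8) is not reproduced in this report; the following fragment is only an illustrative sketch based on the Feel++ variational DSL, where Xh, u, v, f, g, nu, penaldir and the face marker "Dirichlet" are placeholders, and the exact keyword names may differ between Feel++ versions:

// Illustrative sketch of assembling formulation (8) with the Feel++ DSL (not the paper's code).
auto a = form2( _test=Xh, _trial=Xh );                       // bilinear form
a  = integrate( _range=elements( mesh ),
                _expr=nu*gradt( u )*trans( grad( v ) ) );     // grad-grad term
a += integrate( _range=markedfaces( mesh, "Dirichlet" ),      // weak Dirichlet (Nitsche) terms
                _expr= -( gradt( u )*N() )*id( v )
                       -( grad( v )*N() )*idt( u )
                       + penaldir*idt( u )*id( v )/hFace() );

auto l = form1( _test=Xh );                                   // linear form
l  = integrate( _range=elements( mesh ), _expr=f*id( v ) );
l += integrate( _range=markedfaces( mesh, "Dirichlet" ),
                _expr= -g*( grad( v )*N() ) + penaldir*g*id( v )/hFace() );

a.solve( _rhs=l, _solution=u );                               // solve a(u,v) = l(v)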
To compile laplacian.cpp, we first invoke the compiler with make:
~/Devel/feel.opt$ make feelpp_doc_laplacian
To run the Laplacian executable, we must invoke the command line in this directory:
~/Devel/feel.opt/doc/manual/tutorial$
To run in parallel, we used the mpirun command line to run the Laplacian executable on N processors. The Laplacian was executed with 2 and 4 processors, but we had not implemented the Time class in Feel++ to measure the CPU allocation time. One of the main limitations is therefore the time measurement during execution in terms of processor scalability (increasing the number of processors N).
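In the absence of a dedicated Time class, a first wall-clock measurement can be obtained by bracketing the run with MPI_Wtime; the following is a minimal sketch (illustrative only, not the Feel++ implementation):

// Minimal wall-clock timing sketch: how the Laplacian run could be timed
// without a dedicated Time class.
#include <mpi.h>
#include <iostream>

int main( int argc, char** argv )
{
    MPI_Init( &argc, &argv );

    double t0 = MPI_Wtime();      // wall-clock time on this process

    // ... run the Laplacian case here (mesh generation, assembly, solve) ...

    double local = MPI_Wtime() - t0, tmax = 0.0;
    // the slowest process determines the parallel execution time
    MPI_Reduce( &local, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD );

    int rank = 0;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    if ( rank == 0 )
        std::cout << "elapsed time: " << tmax << " s" << std::endl;

    MPI_Finalize();
    return 0;
}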
At this point of our experiment, we defined the execution parameters related to the mesh structure:
• hsize : size of the mesh.
• shape : hypercube or simplex shape.
• dim : selects the 2D or 3D model.
• nu : gradient coefficient, set to nu=1.
• weakdir : use the weak Dirichlet condition.
The MPI command line was defined as
% mpirun -np 4 ./feelpp_doc_laplacian
Considering these parameters, each execution produces a mesh whose model is determined by the shape and the dimension, as summarized below:
Shape      Dim   Model
Simplex    2D    Triangle
Simplex    3D    Tetrahedron
Hypercube  2D    Rectangle
Hypercube  3D    Cube
Figure 1. Paraview Visualization of the Hypercube in 2D
Figure 2. Paraview Visualization of the Hypercube in 3D
For all simulations, we set the mesh size to 0.1. Below, we list the command lines for these parameters, running with 4 processors.
% mpirun -np 4 ./feelpp_doc_laplacian --hsize=0.1 --shape=simplex --dim=2 --nu=1 --weakdir=1
% mpirun -np 4 ./feelpp_doc_laplacian --hsize=0.1 --shape=simplex --dim=3 --nu=1 --weakdir=1
% mpirun -np 4 ./feelpp_doc_laplacian --hsize=0.1 --shape=hypercube --dim=2 --nu=1 --weakdir=1
% mpirun -np 4 ./feelpp_doc_laplacian --hsize=0.1 --shape=hypercube --dim=3 --nu=1 --weakdir=1
After running the simulations, note that Feel++ essentially supports the Gmsh mesh file format, but we used Paraview for visualization; Feel++ also provides some classes to manipulate Gmsh .geo files and generate .msh files. All Gmsh and Paraview files generated by the execution can be visualized from the Feel output directory:
~/feel/doc/manual/tutorial/laplacian$ ls
hypercube-2 hypercube-3 simplex-2 simplex-3
To visualize the hypercube meshes in 2D and 3D with mesh size 0.1 and 4 processors, we invoked Paraview:
~/feel/doc/manual/tutorial/laplacian/hypercube-2/P2/h_0.1/np_4$ paraview laplacian-4.sos
~/feel/doc/manual/tutorial/laplacian/hypercube-3/P2/h_0.1/np_4$ paraview laplacian-4.sos
To visualize the simplex meshes, triangle (2D) and tetrahedron (3D), with mesh size 0.1 and 4 processors, we invoked Paraview:
~/feel/doc/manual/tutorial/laplacian/simplex-2/P2/h_0.1/np_4$ paraview laplacian-4.sos
~/feel/doc/manual/tutorial/laplacian/simplex-3/P2/h_0.1/np_4$ paraview laplacian-4.sos
Figure 3. Paraview Visualization of Triangle in 2D
Figure 4. Paraview Visualization of Tetrahedron in 3D
4. Introduction to Scalasca
In the context of numerical methods and algorithms for high performance computing, the HPC community requires powerful and robust performance-analysis tools that make the optimization of parallel applications possible. Such tools are effective and efficient not only in helping to improve the scalability characteristics of scientific codes, but also in allowing experts to concentrate on the underlying science rather than spending a major fraction of their time tuning their application for a particular machine.
In this section, we introduce Scalasca, an open-source performance-analysis toolset developed at the Jülich Supercomputing Centre in cooperation with the University of Tennessee as the successor of KOJAK [14], specifically designed for use on large-scale systems including IBM Blue Gene and Cray XT, as well as for small and medium-scale HPC platforms.
The Scalasca architecture [6] defines the basic analysis workflow. The layered model comprises the instrumentation, the measurement system, the trace analysis, the trace utilities and the report tools. It is divided into the following stages: preparation (insertion of probes), execution (event measurement and collection) and post-mortem (event summarization and analysis, and finally presentation).
First, the target application must be instrumented, that is, probes must be inserted into the code that carry out the measurements. This can happen at different levels, including source code, object code, or library. Before running the instrumented executable on the parallel machine, the user can choose between generating a runtime summary report or an event trace. Figure 5 shows Scalasca's performance-analysis workflow.
Scalasca relates to the following research topics regarding performance analysis tools and techniques [1]:
• Profile analysis: summary of aggregated metrics (per function/call path and/or per process/thread) and tools that can generate and/or present such profiles (event traces). Examples: gprof, mpiP, ompP, Scalasca, TAU, Vampir, ...
Figure 5. Performance-analysis workflow. When tracing is enabled, each process generates a trace file containing records for its process-local events. It is generally recommended to optimize the instrumentation based on a previously generated summary report. During the analysis, Scalasca searches for wait states and, at the end, provides the result in a structure similar to the summary report.
• Time-line analysis: visual representation of the space/time sequence of events; requires an execution trace. Examples: Vampir, Paraver, JumpShot, Intel TAC, Sun Studio, ...
• Pattern analysis: search for event sequences characteristic of inefficiencies; can be done manually (via visual time-lines) or automatically (via KOJAK, Scalasca, Periscope, ...).
In the state of the art related to productivity tools, the following tools are very important in the life cycle of technology integration:
• KCachegrind : Callgraph-based cache simulation analysis.
• Marmot/MUST : MPI correctness checking.
• PAPI : Interfacing to hardware performance counters.
• Periscope : Automatic analysis via an on-line distributed search.
• SCALASCA : Large-scale parallel performance analysis.
• TAU : Integrated parallel performance system.
• Vampir/VampirTrace : Event tracing and graphical trace visualization and analysis.
In the next section, we explain the measurement and analysis configuration for the CTEST package running with MPI, with emphasis on the performance analysis.
4.1. Setup Experiment
The Scalasca measurement system that gets linked with the instrumented application executable can be configured via environment variables or configuration files to specify whether runtime summaries and/or event traces should be collected, along with optional hardware counter metrics.
In the introduction of Section 4 we presented the performance-analysis workflow based on the following steps: preparation, measurement, analysis, examination, optimization. The preparation step prepares the application and inserts extra code (probes). The measurement step collects the data relevant to the execution performance analysis. The analysis step computes the metrics and identifies the performance problems. The examination step presents the results in an accessible form. Finally, the optimization step modifies the code to eliminate or reduce the performance problems.
The performance analysis comprises the following steps:
The perforrmance analysis presents the following steps :
• 0 : reference preparation for validation
• 1 : Program instrumentation : skin
• 2 : Summary measurement collection  analysis : scan [ -s]
• 3 : Summary analysis report examination : square
• 4 : Summary experiment scoring : scan -t
• 5 : Event trace collection  analysis : scan -t
• 6 : Event trace analysis report examination : square
The samples are available in MPI, OpenMP and hybrid OpenMP/MPI execution modes. In our setup experiment, we show the CTEST sample with POMP-MPI execution. In this case, the Scalasca measurement system provides the configuration files (Makefile and Makefile.defs), which means that the ctest.c code is compiled through the scalasca -instrument command. Prefixing the compile/link commands in the Makefile definitions (config/make.def) with the Scalasca instrumenter is the essential step.
First, you must load the Scalasca module; then you can run scalasca without arguments to get brief usage information, as shown below.
% module load scalasca
scalasca /1.4.2 loaded
~/scalasca-1.4.2$ scalasca
Scalasca 1.4.2
Toolset for scalable performance analysis of large-scale parallel applications
usage: scalasca [-v][-n] {action}
1. prepare application objects and executable for measurement:
scalasca -instrument compile-or-link-command # skin
2. run application under control of measurement system:
scalasca -analyze application-launch-command # scan
3. interactively explore measurement analysis report:
scalasca -examine experiment-archive|report # square
-v: enable verbose commentary
-n: show actions without taking them
-h: show quick reference guide (only)
The Scalasca example Makefile provides the following PREP definitions, which explain how to set up the instrumentation of routines by the compiler:
• Default instrumentation of routines by the compiler: PREP = scalasca -instrument.
• Build without any instrumentation: PREP = scalasca -instrument -mode=none.
• Manual EPIK instrumentation (only): PREP_EPIK = $(PREP) -user -comp=all.
• Manual POMP instrumentation (only): PREP_POMP = $(PREP) -pomp -comp=all.
In our case, the PREP instrumentation is set in the Makefile (which includes Makefile.defs). By default, the PREP macro is not set and no instrumentation is performed for a regular production build. You must specify a PREP value in the Makefile or on the make command line, e.g.:
% make PREP="scalasca -instrument"
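As an example, a hypothetical Makefile fragment for the ctest-pomp-mpi target would simply prefix the usual MPI compile/link command with the instrumenter (target and flags follow the example shown later in this section):

# Hypothetical Makefile fragment (illustrative only)
PREP      = scalasca -instrument
PREP_POMP = $(PREP) -pomp -comp=all

ctest-pomp-mpi: ctest-pomp.c
	$(PREP_POMP) mpicc -m32 ctest-pomp.c -o ctest-pomp-mpi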
Before the instrumentation step, we describe the configuration of the setup experiment. A few details must be set up and checked.
The Scalasca executables are installed in ~/home/bin; check that this directory is in the PATH:
% echo $PATH
To configure the PATH:
% export PATH=$PATH:~/home/bin
To modify Makefile.defs for Linux GNU (standard configuration):
PREFIX = /opt/scalasca  →  PREFIX = /home/bin
OPARI2DIR = $(PREFIX)
We must also modify MPILIB (-lmpich) by editing the MPI settings:
#MPILIB = -lmpich
MPILIB =
PMPILIB = -lpmpich
This setup experiment describes the CTEST sample. CTEST is a simple toy C program that can be used to test the instrumentation and measurement features of the toolset. We use the ctest-pomp-mpi variant. The example is found in:
% ~/Documents/scalasca-1.4.2/example
Regarding the Scalasca toolset components, the CTEST experiment involves the following:
• Program source (with compiler / instrumenter) → instrumented executable.
• Application + measurement library → trace analysis.
• Application + measurement library → summary analysis.
The Makefile provides the command line of the Scalasca instrumenter, scalasca -instrument.
We used the following performance analysis steps.
Scalasca instrumenter (prepare application objects and executable for measurement):
% scalasca -instrument compile-or-link-command
Scalasca measurement collector and analyzer (run the application under control of the measurement system):
% scalasca -analyze application-launch-command
Scalasca analysis report examiner (post-process and explore the measurement analysis report):
% scalasca -examine experiment-archive|report
In the example directory, there are two CTEST variants: ctest-pomp-mpi and ctest-epik-mpi. The versions of these files that include manual instrumentation, using the EPIK API or POMP directives, carry -epik or -pomp in their names. For ctest-pomp-mpi, we invoke the compiler with make:
~/scalasca-1.4.2/example$ make ctest-pomp-mpi
scalasca -instrument -pomp -comp=all mpicc -m32 ctest-pomp.c -o ctest-pomp-mpi
INFO: Instrumented executable for MPI measurement
Next, we run the application executable under scalasca -analyze with 4 threads and 4 processes:
~/scalasca-1.4.2/example$ OMP_NUM_THREADS=4 scalasca -analyze -t mpiexec -np 4 ctest-pomp-mpi
S=C=A=N: Scalasca 1.4.2 trace collection and analysis
S=C=A=N: ./epik_ctest-pomp-mpi_4_trace experiment archive
S=C=A=N: Wed Jan 9 17:15:45 2013: Collect start
/usr/bin/mpiexec -np 4 ctest-pomp-mpi
Max. memory usage : 5.715MB
Total processing time : 0.044s
SCAN: Wed Jan 9 17:15:50 2013: Analyze done (status=0) 2s
Warning: 0.035MB of analyzed trace data retained in ./epik_ctest-pomp-mpi_4_trace/ELG!
SCAN: ./epik_ctest-pomp-mpi_4_trace complete.
Running the ctest-pomp-mpi executable creates the experiment archive directory, which contains the scout.cube file:
cd epik_ctest-pomp-mpi_4_trace
The measurement archive directory ultimately contains a copy of the execution output (epik.log), together with:
• a record of the measurement configuration (epik.conf);
• the basic analysis report that was collated after measurement (epitome.cube);
• the complete analysis report produced during post-processing (summary.cube.gz).
To visualize the results, we execute the command:
~/scalasca-1.4.2/example/epik_ctest-pomp-mpi_4_trace$ cube3 scout.cube
Figure 6. Analysis report presentation
The CUBE tool shipped with Scalasca 1.4.2 is a parallel program analysis report exploration tool that provides libraries for XML report reading and writing, algebra utilities for report processing and a GUI for interactive analysis exploration (requiring Qt4). It represents the values (severity matrix) along three hierarchical axes:
• Performance property (metric)
• Call-tree path (program location)
• System location (process/thread)
Figure 6 shows the visualization of the analysis presentation and exploration. The metric tree provides the main information about time (CPU allocation time, which includes the time allocated for idle threads) and visits: the reported time is 0.12 and the number of visits is 1716; all values are in nanoseconds. In the call tree, we observed a bug in the generation of the trace for the analysis of MPI communication. The system tree presents the Linux cluster and the machine used for the execution (@marie) with 4 processes.
Finally, we execute the command scalasca -examine:
~/scalasca-1.4.2/example$ scalasca -examine epik_ctest-pomp-mpi_4_trace
Archive ./epik_ctest-pomp-mpi_4_trace
INFO: Post-processing runtime summarization report...
INFO: Post-processing trace analysis report...
INFO: Displaying ./epik_ctest-pomp-mpi_4_trace/trace.cube.gz...
The scalasca -examine command displays the file trace.cube.gz; the preview visualization is shown with CUBE. Figure 7 shows the trace report.
In the last example, we used the ctest-epik.c program to test the manual instrumentation with the EPIK API [7] and the measurement features of the toolset. Scalasca provides several possibilities to instrument user application code: besides the automatic compiler-based instrumentation, it provides manual instrumentation using the EPIK API.
For MPI library calls, the instrumentation is accomplished using the standard MPI profiling interface PMPI. To enable it, the application program has to be linked against the EPIK measurement library plus MPI-specific libraries.
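As a minimal sketch of manual EPIK user instrumentation (the region name and placement are hypothetical, not taken from ctest-epik.c; see the EPIK user API in [7]), compiled with -DEPIK as in the ctest-epik-mpi example:

/* Illustrative sketch of manual EPIK user instrumentation. */
#include <mpi.h>
#include "epik_user.h"   /* EPIK user instrumentation API */

int main( int argc, char** argv )
{
    MPI_Init( &argc, &argv );

    EPIK_USER_REG( r_work, "work_loop" );   /* register a user region (hypothetical name) */
    EPIK_USER_START( r_work );              /* enter the instrumented region */
    /* ... computational kernel to be measured ... */
    EPIK_USER_END( r_work );                /* leave the instrumented region */

    MPI_Finalize();
    return 0;
}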
We then run make ctest-epik-mpi to perform the instrumentation:
~/scalasca-1.4.2/example$ make ctest-epik-mpi
mpicc -m32 -DEPIK `kconfig --32 --cflags` ctest-epik.c \
-o ctest-epik-mpi `kconfig --mpi --32 --libs`
Next, we run the application executable with scalasca -analyze:
~/Documents/jussara/scalasca-1.4.2/example$ scalasca -analyze -t mpiexec -np 4 ctest-epik-mpi
Figure 7. Examination Trace Report
scalasca -analyze -t mpiexec -np 4 ctest-epik-mpi
S=C=A=N: Scalasca 1.4.2 trace collection and analysis
S=C=A=N: ./epik_ctest-epik-mpi_4_trace experiment archive
S=C=A=N: Fri Jan 11 10:10:35 2013: Collect start
/usr/bin/mpiexec -np 4 ctest-epik-mpi
...
Max. memory usage : 5.711MB
Total processing time : 0.585s
S=C=A=N: Fri Jan 11 10:10:41 2013: Analyze done (status=0) 2s
Warning: 0.020MB of analyzed trace data retained in ./epik_ctest-epik-mpi_4_trace/ELG!
S=C=A=N: ./epik_ctest-epik-mpi_4_trace complete.
In the last step, we execute the command scalasca -examine:
~/Documents/jussara/scalasca-1.4.2/example$ scalasca -examine epik_ctest-epik-mpi_4_trace/
Archive ./epik_ctest-epik-mpi_4_trace
INFO: Post-processing runtime summarization report...
INFO: Post-processing trace analysis report...
INFO: Displaying ./epik_ctest-epik-mpi_4_trace/trace.cube.gz...
Figure 8 presents the post-processing runtime summarization report.
Finally, we explain the first window of the examination report, the system metric tree. In this report, we analyzed the following topics: execution time, synchronizations, communications, bytes transferred and computational imbalance.
• Time : total CPU allocation time.
• Visits : number of visits.
• Synchronizations : number of point-to-point/collective synchronization operations.
• Communications : number of point-to-point communication operations.
• Bytes transferred : number of bytes transferred in communication operations.
• Computational imbalance : computational load imbalance heuristic (overload and underload).
Figure 8. Examination Trace Report
The metric tree reports 0.19% CPU time, 1164 visits, 48 collective synchronization operations, 30 point-to-point send operations and 30 point-to-point receive operations.
4.2. Laplacian Instrumentation
5. Conclusion
References
[1] Brian. Scalable performance analysis of large-scale parallel applications. http://www2.fz-juelich.de. [Online ; accessed 8-
January-2013].
[2] Markus Geimer and Brian Wylie. Scalable performance analysis of large-scale parallel applications. www.training.prace-ri.eu/uploads/tx.../Scalasca_Overview_01.pdf. [Online; accessed 9-January-2013].
[3] Markus Geimer and Brian Wylie. Tutorial Exercise NPB-MZ-MPI/BT - PRACE training. www.training.prace-ri.eu/uploads/tx.../MPIBTExercise_BG.pdf. [Online; accessed 9-January-2013].
[4] Philippe G. Ciarlet. The Finite Element Method for Elliptic Problems. SIAM, 1st edition, 2002.
[5] Marcus Geimer. Introduction to Scalasca. http://www.linksceem.eu/ls2/images/stories/Introduction_to_Scalasca.pdf.
[Online ; accessed 9-January-2013].
[6] Markus Geimer, Felix Wolf, Brian J. N. Wylie, Erika Ábrahám, Daniel Becker, and Bernd Mohr. The Scalasca performance toolset architecture. Concurr. Comput.: Pract. Exper., 22(6):702-719, April 2010.
[7] User Guide. Scalable Automatic Performance Analysis. http://www2.fz-juelich.de/jsc/datapool/scalasca/UserGuide.pdf. [Online; accessed 9-January-2013].
[8] Parry Husbands, Costin Iancu, and Katherine Yelick. A performance analysis of the Berkeley UPC compiler. In Proceedings of the 17th Annual International Conference on Supercomputing, ICS '03, pages 63-73, New York, NY, USA, 2003. ACM.
[9] J. Issa and S. Figueira. Graphics performance analysis using Amdahl's law. In Performance Evaluation of Computer and Telecommunication Systems (SPECTS), 2010 International Symposium on, pages 127-132, July 2010.
[10] MPI. The Message Passing Interface (MPI) standard. http://www-unix.mcs.anl.gov/mpi. [Online ; accessed 7-January-2013].
[11] OpenMP. OpenMP : Simple, Portable, Scalable SMP Programming. http://www.openmp.org. [Online ; accessed 7-January-
2013].
[12] Christophe Prud'Homme, Vincent Chabannes, Vincent Doyeux, Mourad Ismail, Abdoulaye Samake, and Gonçalo Pena.
Feel++ : A Computational Framework for Galerkin Methods and Advanced Numerical Methods. ESAIM Proceedings, page 27,
December 2012. ISLE/CHPID.
[13] Christophe Prud'Homme, Vincent Chabannes, Vincent Doyeux, Mourad Ismail, Abdoulaye Samake, Gonçalo Pena, Cécile Daversin, and Christophe Trophime. Advances in Feel++: a domain specific embedded language in C++ for partial differential equations. In Eccomas'12 - European Congress on Computational Methods in Applied Sciences and Engineering, Vienna, Austria, September 2012. MS404-1 Automation of computational modeling by advanced software tools and techniques FRAE/RB4FASTSIM, ISLE/CHPID.
[14] F. Wolf and B. Mohr. Automatic performance analysis of hybrid MPI/OpenMP applications. In Parallel, Distributed and Network-Based Processing, 2003. Proceedings. Eleventh Euromicro Conference on, pages 13-22, February 2003.

Contenu connexe

Tendances

aMCfast: Automation of Fast NLO Computations for PDF fits
aMCfast: Automation of Fast NLO Computations for PDF fitsaMCfast: Automation of Fast NLO Computations for PDF fits
aMCfast: Automation of Fast NLO Computations for PDF fitsjuanrojochacon
 
A time study in numerical methods programming
A time study in numerical methods programmingA time study in numerical methods programming
A time study in numerical methods programmingGlen Alleman
 
D. Vulcanov: Symbolic Computation Methods in Cosmology and General Relativity...
D. Vulcanov: Symbolic Computation Methods in Cosmology and General Relativity...D. Vulcanov: Symbolic Computation Methods in Cosmology and General Relativity...
D. Vulcanov: Symbolic Computation Methods in Cosmology and General Relativity...SEENET-MTP
 
Advanced property tracking Industrial Modeling Framework
Advanced property tracking Industrial Modeling FrameworkAdvanced property tracking Industrial Modeling Framework
Advanced property tracking Industrial Modeling FrameworkAlkis Vazacopoulos
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...NECST Lab @ Politecnico di Milano
 
A Load-Balanced Parallelization of AKS Algorithm
A Load-Balanced Parallelization of AKS AlgorithmA Load-Balanced Parallelization of AKS Algorithm
A Load-Balanced Parallelization of AKS AlgorithmTELKOMNIKA JOURNAL
 
Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check  Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check IJECEIAES
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and AlgorithmDhaval Kaneria
 
Contention - Aware Scheduling (a different approach)
Contention - Aware Scheduling (a different approach)Contention - Aware Scheduling (a different approach)
Contention - Aware Scheduling (a different approach)Dimos Raptis
 
High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)
High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)
High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)Enrique Monzo Solves
 
Parallel programming in modern world .net technics shared
Parallel programming in modern world .net technics   sharedParallel programming in modern world .net technics   shared
Parallel programming in modern world .net technics sharedIT Weekend
 

Tendances (19)

Matrix Multiplication Report
Matrix Multiplication ReportMatrix Multiplication Report
Matrix Multiplication Report
 
Chap12 slides
Chap12 slidesChap12 slides
Chap12 slides
 
aMCfast: Automation of Fast NLO Computations for PDF fits
aMCfast: Automation of Fast NLO Computations for PDF fitsaMCfast: Automation of Fast NLO Computations for PDF fits
aMCfast: Automation of Fast NLO Computations for PDF fits
 
A time study in numerical methods programming
A time study in numerical methods programmingA time study in numerical methods programming
A time study in numerical methods programming
 
Model checking
Model checkingModel checking
Model checking
 
D. Vulcanov: Symbolic Computation Methods in Cosmology and General Relativity...
D. Vulcanov: Symbolic Computation Methods in Cosmology and General Relativity...D. Vulcanov: Symbolic Computation Methods in Cosmology and General Relativity...
D. Vulcanov: Symbolic Computation Methods in Cosmology and General Relativity...
 
Advanced property tracking Industrial Modeling Framework
Advanced property tracking Industrial Modeling FrameworkAdvanced property tracking Industrial Modeling Framework
Advanced property tracking Industrial Modeling Framework
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...
 
A Load-Balanced Parallelization of AKS Algorithm
A Load-Balanced Parallelization of AKS AlgorithmA Load-Balanced Parallelization of AKS Algorithm
A Load-Balanced Parallelization of AKS Algorithm
 
Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check  Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and Algorithm
 
Chap10 slides
Chap10 slidesChap10 slides
Chap10 slides
 
Flex ch
Flex chFlex ch
Flex ch
 
Contention - Aware Scheduling (a different approach)
Contention - Aware Scheduling (a different approach)Contention - Aware Scheduling (a different approach)
Contention - Aware Scheduling (a different approach)
 
Fulltext
FulltextFulltext
Fulltext
 
Chap4 slides
Chap4 slidesChap4 slides
Chap4 slides
 
High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)
High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)
High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)
 
Chap9 slides
Chap9 slidesChap9 slides
Chap9 slides
 
Parallel programming in modern world .net technics shared
Parallel programming in modern world .net technics   sharedParallel programming in modern world .net technics   shared
Parallel programming in modern world .net technics shared
 

En vedette

cemracs_vivabrain_slides
cemracs_vivabrain_slidescemracs_vivabrain_slides
cemracs_vivabrain_slidesJussara F.M.
 
presentation_cemracs2012
presentation_cemracs2012presentation_cemracs2012
presentation_cemracs2012Jussara F.M.
 
Catalogue ANGLAIS DÉLICES
Catalogue ANGLAIS DÉLICESCatalogue ANGLAIS DÉLICES
Catalogue ANGLAIS DÉLICESSylvie Schmit
 
بررسی روشهای مسیریابی شبکه های فرصت طلبانه
بررسی روشهای مسیریابی شبکه های فرصت طلبانهبررسی روشهای مسیریابی شبکه های فرصت طلبانه
بررسی روشهای مسیریابی شبکه های فرصت طلبانهabedin753
 
Gestion operativa en instituciones-educativas
Gestion operativa en instituciones-educativasGestion operativa en instituciones-educativas
Gestion operativa en instituciones-educativasFrank Ruiz
 

En vedette (11)

cemracs_vivabrain_slides
cemracs_vivabrain_slidescemracs_vivabrain_slides
cemracs_vivabrain_slides
 
2015_J_Kilmister_Resume
2015_J_Kilmister_Resume2015_J_Kilmister_Resume
2015_J_Kilmister_Resume
 
SoC-2012-pres-2
SoC-2012-pres-2SoC-2012-pres-2
SoC-2012-pres-2
 
presentation_cemracs2012
presentation_cemracs2012presentation_cemracs2012
presentation_cemracs2012
 
Catalogue ANGLAIS DÉLICES
Catalogue ANGLAIS DÉLICESCatalogue ANGLAIS DÉLICES
Catalogue ANGLAIS DÉLICES
 
Recruiting in the New Economy GENERIC
Recruiting in the New Economy GENERICRecruiting in the New Economy GENERIC
Recruiting in the New Economy GENERIC
 
بررسی روشهای مسیریابی شبکه های فرصت طلبانه
بررسی روشهای مسیریابی شبکه های فرصت طلبانهبررسی روشهای مسیریابی شبکه های فرصت طلبانه
بررسی روشهای مسیریابی شبکه های فرصت طلبانه
 
Antartica
AntarticaAntartica
Antartica
 
Gestion operativa en instituciones-educativas
Gestion operativa en instituciones-educativasGestion operativa en instituciones-educativas
Gestion operativa en instituciones-educativas
 
Human Capital Analytics 2.2016
Human Capital Analytics 2.2016Human Capital Analytics 2.2016
Human Capital Analytics 2.2016
 
MML New Resume
MML New ResumeMML New Resume
MML New Resume
 

Similaire à Rapport_Cemracs2012

Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP IJCSEIT Journal
 
cis97003
cis97003cis97003
cis97003perfj
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Rusif Eyvazli
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and AlgorithmDhaval Kaneria
 
Cupdf.com introduction to-data-structures-and-algorithm
Cupdf.com introduction to-data-structures-and-algorithmCupdf.com introduction to-data-structures-and-algorithm
Cupdf.com introduction to-data-structures-and-algorithmTarikuDabala1
 
Complier design
Complier design Complier design
Complier design shreeuva
 
Taking r to its limits. 70+ tips
Taking r to its limits. 70+ tipsTaking r to its limits. 70+ tips
Taking r to its limits. 70+ tipsIlya Shutov
 
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...IRJET Journal
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterSudhang Shankar
 
Performance measures
Performance measuresPerformance measures
Performance measuresDivya Tiwari
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving SystemsPRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving SystemsNECST Lab @ Politecnico di Milano
 

Similaire à Rapport_Cemracs2012 (20)

Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
 
cis97003
cis97003cis97003
cis97003
 
DATA STRUCTURE.pdf
DATA STRUCTURE.pdfDATA STRUCTURE.pdf
DATA STRUCTURE.pdf
 
DATA STRUCTURE
DATA STRUCTUREDATA STRUCTURE
DATA STRUCTURE
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and Algorithm
 
Cupdf.com introduction to-data-structures-and-algorithm
Cupdf.com introduction to-data-structures-and-algorithmCupdf.com introduction to-data-structures-and-algorithm
Cupdf.com introduction to-data-structures-and-algorithm
 
Complier design
Complier design Complier design
Complier design
 
Taking r to its limits. 70+ tips
Taking r to its limits. 70+ tipsTaking r to its limits. 70+ tips
Taking r to its limits. 70+ tips
 
Matopt
MatoptMatopt
Matopt
 
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
 
Oct.22nd.Presentation.Final
Oct.22nd.Presentation.FinalOct.22nd.Presentation.Final
Oct.22nd.Presentation.Final
 
Tutorial
TutorialTutorial
Tutorial
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC cluster
 
Performance measures
Performance measuresPerformance measures
Performance measures
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving SystemsPRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
 
KMAP PAPER (1)
KMAP PAPER (1)KMAP PAPER (1)
KMAP PAPER (1)
 
cug2011-praveen
cug2011-praveencug2011-praveen
cug2011-praveen
 
final
finalfinal
final
 

Rapport_Cemracs2012

  • 1. ESAIM: PROCEEDINGS, Vol. ?, 2013, 1-10 Editors: Will be set by the publisher COMPILATION ANALYSIS, PERFORMANCE ANALYSIS, SCALABILITY USING SCALASCA WITH FEEL++ SCIENTIFIC APPLICATIONS∗, ∗∗ Jussara Marandola1 Résumé. ... Abstract. This paper presents the following contributions: the compilation analysis on Feel++ lan- guage presenting one example as state of art in mesh manipulation to provide 1D, cube in 2D, tetrahe- dron, cube in 3D models of Laplacian code. The focus was analyzed the compilation options during the execution - mpirun, mpi execution with duo, four processors. In rst step, we showed the importance to realize the performance analyze to compare the feel++ scientic application using Feel++ TIME class or introducing Scalasca instrumentation to get the CPU time allocation and throughput by scalability in terms of numbers of threads and scalability of processors for clusters. 1. Introduction In this paper, we introduce the compilation executions and performance analysis on Feel++, Finite Element Embedded Language in C++. Feel++ is a C++ library for arbitrary order Galerkin methods (e.g. nite and spectral element methods ) continuous or discontinuous in 1D 2D and 3D. It include many features such as geometries 1D, 2D, 3D and lower topological dimension 1D(curve) in 2D and 3D or 2D(surfacee) in 3D. Through supporting Gmsh for mesh generation and Paraview for post-processing we can analyze the models by visualization. Thinking in terms of libraries to solve problems arising from partial dierential equations(PDEs) through generalized Galerkin methods [13], Feel++ provides the complexity of dierential models and implementation of state of the art robust numerical methods : a language clear to express problems specialized in a type of equation, e.g Navier-Stokes or linear elasticity models and nally a wealth of solution algorithm. The last advances in Feel++ describes the mesh data structure as well mesh entities(elements, faces, edges, nodes) and algebrics representations. Mesh entities are indexed by process id by MPI in parallel context. Feel ++ relies on MPI for parallel computations and the class Application initialises the MPI environment. The makele of Feel++ project enable the option MPI mode that oers the parallel computation with mpirun command. The main goal will be the introduction of compilation executation in linear algebra environment, providing all mesh options, displaying the models through gmesh or paraview, the base concepts of performance analysis (equations, metrics and Time class into Feel++) and nally to introduce the use of Scalasca to analyze the ∗ Thanks : Laboratoire Jean Kuntzmann, Université Joseph Fourier Grenoble 1, BP53 38041 Grenoble Cedex 9, France, e-mail : christophe.prudhomme@ujf-grenoble.fr ∗∗ Thanks : Embedded Real Time Systems Foundation Laboratory (LaSTRE), CEA LIST, CEA-Saclay Nano-INNOV PC172, F91191 Gif-sur-Yvette cedex, France, e-mail : stephane.louise@cea.fr 1 jussara@nerim.net c EDP Sciences, SMAI 2013
  • 2. 2 ESAIM: PROCEEDINGS scalability performance reports on Feel++. Scalasca [6] is a performance toolset that has been specically designed to analyze parallel application execution behavior on large-scale systems. It oers an incremental performance anlaysis procedure that integrates runtime summaries with in-depth studies of current behavior via event tracing, adopting a strategy of successively rened measurement congurations. We will focus our case study on linear algebra environment, specially standard formulation : the laplacian case. Feel ++ supports three dierent linear algebra environments that we shall call backends such as Gmm, Petsc4, Trilinos5. Regarding the fucntion spaces denition, several types of the polynomials (P) are used as following : Lagrange, Legendre, Dubiner, Crouzeix-Raviart, Raviart-Thomas. It supports also modal basis, e.g. Legendre or Dubiner [12], as well as nite elements (FE) following the standard denition, set in [4], as a triplet (κ, P, Σ) where κ is a convex, P the polynomial space and Σ the dual space. Remainder of this paper is organized as follows. Section 2, we will introduce the analysis of execution time for parallel algorithms as well concepts related with speedup, execution time components and eciency and nally, the importance of Amdahl's law in performance analysis of graphics processors. The advantage of the most common model (MPI), message passing in the context of programming environment of Feel++ during compilation will be discuted also. In the Section 3 will be to describe the standard formulation of laplacian without strong and weak Dirichlet conditions explaining the compilation options by executing with the following ordering : hzise and shape. The last one provides dierents polynomials dimensions according ne and coarse grain (represented by size). In Section 4, we present the overview of current version of Scalasca in terms of instrumentation and measurement. The layered model of the Scalasca architecture, the analysis conguration and the MPI sample ctest-pomp-mpi will be described. After that we will show the laplacian case using Scalasca. And nally, we present the Section 5 concluding all results of setup experiment using Scalasca, the contributions and limitations of Scalasca with Feel++, advances of compilation options with mesh data structure and nally, the performance analysis on programming environment in terms of serial/parallel time execution of code or use of toolset for instrumentation and measurement. 2. Performance Analysis This paper presents the Scalasca toolset for scalable performance analysis of large-scale parallel applications. It oers a basis to examine the eectiveness of parallel performance tools. The main goal is to optimize the real applications (benchmarks and laplacian case) understing the barriers to high performance and predict improvement. The performance tools usually provides a program metrics parallelization. The rst objective related with optimization of parallel algorithms on Feel++ language will be the analysis of execution time, communication time and topology view (in terms of analysis of processes in cartesian grid) for parallel algorithm using MPI programming on codes and compilation options. We use the message passing model because it has the advantage in scalability, since it runs on distributed memory multiprocessors. For most of large-scale parallel machines use the common model message passing - MPI [10]. 
In this context, one of the challenges [8], faced by the application developers for high performance parallel systems is the relatively diculty programming environment. Depending of parallel machine, the high perfor- mance could be obtained by relative programming models such as OpenMP [11], threads, data parallelism or automatic parallelization. The recent performance tools are using the hybrid programming (MPI/OpenMP) to optimize your application. The recent studies about graphics performance analysis [9] using Amdahl's Law is very important for design- ing future graphics processors as well specic purpose-processors. The Amadahl's Law is the most important formulation that describes the number of processors does not lead to a proportional increase in performance. The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. The denition can be written in the form of T = T0 + (T1 − T0) f1 f , (1) where T1 : is the measured execution time at frequency f1 ; T0 : is the non-scale time ;
  • 3. ESAIM: PROCEEDINGS 3 The Amadhal's Law is very important at performance analysis regarding the scalability, it means the in- creasing number of processors, in terms of measures of computation time (throughput and execution time) and communication time. Analysis of scalability requires many factors to measure the performance that involves concepts of speedup and eciency. The Speedup measures how one parallel code can run faster than the sequential counterpart and ratio of sequential execution time to parallel execution time. We can explain the following execution time components : • Inherently sequential computations : σ(n) • Pontentially parallel computations : ϕ(n) • Communication operations and other repeat computations : κ(n, p) Regarding Speedup, ψ(n) solves a problem of size n on p processors limited by ψ(n, p) ≤ σ(n) + ϕ(n) σ(n) + ϕ(n) p + κ(n, p) , (2) Finally, the eciency could be measured through processor utilization, ratio of speedup to number of pro- cessors used, consequently, eciency e n,p for problem of size n on p processors is given by ε(n, p) ≤ σ(n) + ϕ(n) p × (σ(n) + σ(n) p + κ(n, p)) , (3) 3. Laplacian Case In Feel++ document source, we list all samples. In Laplacian case, there are the following examples : laplacian with dirichlet condition weakly (problem in 2D), laplacian full, and laplacian lagrange multiplier. All codes are found in this directory : ~/Devel/feelpp/doc/manual/tutorial$ We analyzed the laplacian with dirichlet condition weakly, the laplacian.cpp. All kind of a vartiational for- mulation of a problem is also called weak formulation. The key item is to bring a new function and to integrate by parts. In the mathematical formulation example, we would like to solve this problem : Problem 1 : nd u such that −∆υ = inΩ = [−1; 1]2 (4) with = 2π2 g (5) and g is the exact solution g = sin (πx) cos (πy)(6) The following boundary conditions apply υ = g|x = ±1, ∂(υ) ∂(υ) = 0|y = ±1 (7) The alternative mathematical formulation handles the Dirichlet condition weakly and hence we have a uniform treatment for all types of boundary conditions. In another alternative formulation which allows to treat weakly Dirichlet boundary condition to Neumann and Robin conditions. Following a similar development as in the previous section, the problem reads Ω u. ν + |x=−1,x=1 ∂(υ) ∂(ν) ν − υ ∂(ν) ∂(n) + µ h υν = Ω f.ν + |x=−1,x=1 −g ∂(ν) ∂(n) + µ h gν (8)
3.1. Execution Options and Feel++ Implementation

In C++, the Laplacian example defines all the options available at execution time through the makeOptions function. This routine returns the list of options using the Boost.Program_options library, and the returned data is used as an argument of a Feel++ Application subclass. In our experiment, we focused on hsize (the characteristic mesh size), shape (the shape of the reference element, either hypercube or simplex), Dim (the dimension: 1D, 2D or 3D) and finally weakdir (set when the user wants to use the weak Dirichlet condition). First we define the options; the main lines of Feel++ and of the makeOptions function are:

#include <feel/feel.hpp>
/** use Feel namespace */
using namespace Feel;

inline po::options_description
makeOptions()
{
    po::options_description laplacianoptions( "Laplacian options" );
    laplacianoptions.add_options()
        ( "hsize", po::value<double>()->default_value( 0.5 ), "mesh size" )
        ( "shape", Feel::po::value<std::string>()->default_value( "hypercube" ), "shape of the domain (either simplex or hypercube)" )
        ( "nu", po::value<double>()->default_value( 1 ), "grad.grad coefficient" )
        ( "weakdir", po::value<int>()->default_value( 1 ), "use weak Dirichlet condition" )
        ( "penaldir", Feel::po::value<double>()->default_value( 10 ), "penalisation parameter for the weak boundary Dirichlet formulation" )
        ( "exact1D", po::value<std::string>()->default_value( "sin(2*Pi*x)" ), "exact 1D solution" )
        ( "exact2D", po::value<std::string>()->default_value( "sin(2*Pi*x)*cos(2*Pi*y)" ), "exact 2D solution" )
        ( "exact3D", po::value<std::string>()->default_value( "sin(2*Pi*x)*cos(2*Pi*y)*cos(2*Pi*z)" ), "exact 3D solution" )
        ( "rhs1D", po::value<std::string>()->default_value( "" ), "right hand side 1D" )
        ( "rhs2D", po::value<std::string>()->default_value( "" ), "right hand side 2D" )
        ( "rhs3D", po::value<std::string>()->default_value( "" ), "right hand side 3D" )
        ;
    return laplacianoptions.add( Feel::feel_options() );
}

Initially, the geometric dimension of the problem is set to Dim=2. The Laplacian class exposes the following public types:

• the numerical type (double);
• the type of the geometric entities composing the mesh, here Simplex of dimension Dim and order 1;
• the mesh type;
• the approximation function space type;
• the exporter factory type.

The Laplacian() constructor retrieves the mesh size and the shape as defined below:

Laplacian()
    :
    super(),
    meshSize( this->vm()["hsize"].template as<double>() ),
    shape( this->vm()["shape"].template as<std::string>() )
{}

The run() member function is parametrized by template<int Dim, int Order>:
template<int Dim, int Order>
void
Laplacian<Dim,Order>::run()
{
    LOG(INFO) << "------------------------------------------------\n";
    LOG(INFO) << "Execute Laplacian<" << Dim << ">\n";
    Environment::changeRepository( boost::format( "doc/manual/tutorial/%1%/%2%-%3%/P%4%/h_%5%/" )
                                   % this->about().appName()
                                   % shape
                                   % Dim
                                   % Order
                                   % meshSize );
    mesh_ptrtype mesh = createGMSHMesh( _mesh=new mesh_type,
                                        _desc=domain( _name=( boost::format( "%1%-%2%" ) % shape % Dim ).str(),
                                                      _usenames=true,
                                                      _shape=shape,
                                                      _h=meshSize,
                                                      _xmin=-1,
                                                      _ymin=-1 ) );
    ...
}

The parameter _h of the domain function sets the mesh characteristic size to M_meshsize, which is given for example on the command line (option hsize) through the Application framework. The createGMSHMesh function generates a mesh file (.msh) automatically from a description file (e.g. a .geo file) passed through the _desc parameter, and stores the generated mesh in the _mesh parameter allocated when calling the function.

To compile laplacian.cpp, we first invoke the compiler with make:

~/Devel/feel.opt$ make feelpp_doc_laplacian

To run the laplacian executable, we invoke the command line in the directory:

~/Devel/feel.opt/doc/manual/tutorial$

To run in parallel, we use the mpirun command line to launch the laplacian executable on N processors. The Laplacian was executed with 2 and with 4 processors, but we have not implemented a timing class in Feel++ to measure the CPU allocation time. One of the main limitations is therefore the time measurement during execution as the number of processors is scaled up (a simple MPI-based workaround is sketched after the table below). At this point of the experiment, we defined the execution parameters related to the mesh structure:

• hsize : mesh size;
• shape : hypercube or simplex shape;
• dim : selects the 2D or 3D model;
• nu : grad.grad coefficient, set to nu=1;
• weakdir : use the weak Dirichlet condition.

The MPI command line is

% mpirun -np 4 ./feelpp_doc_laplacian

With these parameters, the executed meshes follow the ordering shape, dim, model:

Shape      Dim   Model
Simplex    2D    Triangle
Simplex    3D    Tetrahedron
Hypercube  2D    Rectangle
Hypercube  3D    Cube
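As a stop-gap for the missing timing facility mentioned above, a small scoped timer such as the following sketch (plain C++/MPI, hypothetical, not part of Feel++) can be placed around the run() call to report the per-process CPU time and the maximum wall-clock time over all ranks for each processor count.

#include <mpi.h>
#include <chrono>
#include <ctime>
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical scoped timer: on destruction (reached collectively by all ranks),
// reports the CPU time of the calling process and the maximum wall time over ranks.
class ScopedTimer
{
public:
    explicit ScopedTimer( std::string name )
        : M_name( std::move( name ) ),
          M_wall0( std::chrono::steady_clock::now() ),
          M_cpu0( std::clock() )
    {}
    ~ScopedTimer()
    {
        double wall = std::chrono::duration<double>( std::chrono::steady_clock::now() - M_wall0 ).count();
        double cpu  = double( std::clock() - M_cpu0 ) / CLOCKS_PER_SEC;
        double wallMax = 0;
        int rank = 0, size = 1;
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        MPI_Comm_size( MPI_COMM_WORLD, &size );
        MPI_Reduce( &wall, &wallMax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD );
        if ( rank == 0 )
            std::printf( "[%s] np=%d  max wall=%.3fs  cpu(rank 0)=%.3fs\n",
                         M_name.c_str(), size, wallMax, cpu );
    }
private:
    std::string M_name;
    std::chrono::steady_clock::time_point M_wall0;
    std::clock_t M_cpu0;
};

// Usage (inside the application, after MPI has been initialised):
//     { ScopedTimer t( "laplacian.run" ); app.run(); }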
Figure 1. Paraview visualization of the cube in 2D.
Figure 2. Paraview visualization of the rectangle in 3D.

For all simulations we set the mesh size to 0.1. Below we list the command lines for these parameters, running with 4 processors:

% mpirun -np 4 ./feelpp_doc_laplacian --hsize=0.1 --shape=simplex --dim=2 --nu=1 --weakdir=1
% mpirun -np 4 ./feelpp_doc_laplacian --hsize=0.1 --shape=simplex --dim=3 --nu=1 --weakdir=1
% mpirun -np 4 ./feelpp_doc_laplacian --hsize=0.1 --shape=hypercube --dim=2 --nu=1 --weakdir=1
% mpirun -np 4 ./feelpp_doc_laplacian --hsize=0.1 --shape=hypercube --dim=3 --nu=1 --weakdir=1

After the simulations, note that Feel++ essentially supports the Gmsh mesh file format, but we used Paraview for visualization; Feel++ also provides classes to manipulate Gmsh .geo files and generate .msh files. All Gmsh and Paraview files generated by the executions can be found in the Feel output directory:

~/feel/doc/manual/tutorial/laplacian$ ls
hypercube-2  hypercube-3  simplex-2  simplex-3

To visualize the cube (2D) and rectangle (3D) with mesh size 0.1 and 4 processors, we invoked Paraview:

~/feel/doc/manual/tutorial/laplacian/hypercube-2/P2/h_0.1/np_4$ paraview laplacian-4.sos
~/feel/doc/manual/tutorial/laplacian/hypercube-3/P2/h_0.1/np_4$ paraview laplacian-4.sos

To visualize the triangle (2D) and tetrahedron (3D) with mesh size 0.1 and 4 processors, we invoked Paraview:

~/feel/doc/manual/tutorial/laplacian/simplex-2/P2/h_0.1/np_4$ paraview laplacian-4.sos
~/feel/doc/manual/tutorial/laplacian/simplex-3/P2/h_0.1/np_4$ paraview laplacian-4.sos
4. Introduction to Scalasca

In the context of numerical methods and algorithms for high-performance computing, the HPC community requires powerful and robust performance-analysis tools that make the optimization of parallel applications possible. Such tools are effective and efficient not only in helping to improve the scalability characteristics of scientific codes, but also in allowing experts to concentrate on the underlying science rather than spending a major fraction of their time tuning their application for a particular machine.

In this section, we introduce Scalasca, an open-source performance-analysis toolset developed at the Jülich Supercomputing Centre in cooperation with the University of Tennessee as the successor of KOJAK [14]. It is specifically designed for use on large-scale systems including IBM Blue Gene and Cray XT, as well as for small and medium-scale HPC platforms.

The Scalasca architecture [6] defines the basic analysis workflow. The layered model comprises the instrumentation, the measurement system, the trace analysis, the trace utilities and the report tools, and is divided into the following phases: preparation (insertion of probes), execution (event measurement collection), post-mortem processing (event summarization and analysis) and, finally, presentation. First, the target application must be instrumented, that is, probes that carry out the measurements must be inserted into the code. This can happen at different levels, including source code, object code or library. Before running the instrumented executable on the parallel machine, the user can choose between generating a runtime summary report or an event trace. Figure 5 shows Scalasca's performance-analysis workflow.

Scalasca relates to the following research topics in performance-analysis tools and techniques [1]:

• Profile analysis: summary of aggregated metrics (per function/call path, or per process/thread) and tools that can generate or present such profiles (event traces), e.g. gprof, mpiP, ompP, Scalasca, TAU, Vampir, ...
Figure 5. Performance-analysis workflow. When tracing is enabled, each process generates a trace file containing records for its process-local events. It is generally recommended to optimize the instrumentation based on a previously generated summary report. During the analysis, Scalasca searches for wait states and at the end delivers the result in the same structure as the summary report.

• Time-line analysis: visual representation of the space/time sequence of events; requires an execution trace, e.g. Vampir, Paraver, JumpShot, Intel TAC, Sun Studio, ...
• Pattern analysis: search for event sequences characteristic of inefficiencies; can be done manually (via a visual time line) or automatically (via KOJAK, Scalasca, Periscope, ...).

Among the related productivity tools, the following are important in the technology-integration life cycle:

• KCachegrind: call-graph-based cache simulation analysis.
• Marmot/MUST: MPI correctness checking.
• PAPI: interfacing to hardware performance counters.
• Periscope: automatic analysis via an on-line distributed search.
• SCALASCA: large-scale parallel performance analysis.
• TAU: integrated parallel performance system.
• Vampir/VampirTrace: event tracing and graphical trace visualization analysis.

In the next section, we explain the measurement and analysis configuration for the C test package run with MPI, with emphasis on the performance analysis.

4.1. Setup Experiment

The Scalasca measurement system that gets linked with the instrumented application executable can be configured via environment variables or configuration files to specify whether runtime summaries and/or event traces should be collected, along with optional hardware-counter metrics. In the introduction of Section 4 we presented the performance-analysis workflow based on the following steps: preparation, measurement, analysis, examination, optimization. The preparation step prepares the application by inserting extra code (probes). The measurement step collects the data relevant to the execution performance analysis. The analysis step computes the metrics and identifies performance problems. The examination step presents the results in an accessible form. Finally, the optimization step modifies the code to eliminate or reduce the performance problems.

The performance analysis follows these steps:

• 0 : reference preparation for validation;
• 1 : program instrumentation : skin;
• 2 : summary measurement collection and analysis : scan [-s];
• 3 : summary analysis report examination : square;
• 4 : summary experiment scoring : square -s;
• 5 : event trace collection and analysis : scan -t;
• 6 : event trace analysis report examination : square.

The samples are available in MPI, OpenMP and hybrid OpenMP/MPI variants. In our setup experiment, we show the ctest sample with the POMP/MPI execution. In this case, the Scalasca measurement system provides the configuration files (Makefile and Makefile.defs), which means that the ctest.c code is compiled with
the command scalasca -instrument. Prefixing the compile/link commands in the Makefile definitions (config/make.def) with the Scalasca instrumenter is essential. First, load the Scalasca module; then run scalasca without arguments for brief usage information, as shown below.

% module load scalasca
scalasca/1.4.2 loaded
~/scalasca-1.4.2$ scalasca
Scalasca 1.4.2 - Toolset for scalable performance analysis of large-scale parallel applications
usage: scalasca [-v][-n] {action}
 1. prepare application objects and executable for measurement:
    scalasca -instrument <compile-or-link-command>   # skin
 2. run application under control of measurement system:
    scalasca -analyze <application-launch-command>   # scan
 3. interactively explore measurement analysis report:
    scalasca -examine <experiment-archive|report>    # square
 -v: enable verbose commentary
 -n: show actions without taking them
 -h: show quick reference guide (only)

The Scalasca Makefile example provides the following PREP definitions, which explain how to set up the instrumentation of routines by the compiler:

• default instrumentation of routines by the compiler: PREP = scalasca -instrument;
• build without any instrumentation: PREP = scalasca -instrument -mode=none;
• manual EPIK instrumentation (only): PREP_EPIK = $(PREP) -user -comp=all;
• manual POMP instrumentation (only): PREP_POMP = $(PREP) -pomp -comp=all.

In our case, the PREP instrumentation is set in the Makefile (including Makefile.defs); by default, the PREP macro is not set and no instrumentation is performed for a regular production build. A PREP value must be specified either in the Makefile or on the make command line, as in:

% make PREP='scalasca -instrument'

Before the instrumentation step, we describe the configuration of the setup experiment. A few details must be checked.

To locate the Scalasca executable (installed here in ~/home/bin):

% echo $PATH

To configure the PATH:

% export PATH=$PATH:~/home/bin

To modify Makefile.defs (standard Linux GNU configuration):

PREFIX = /opt/scalasca  →  PREFIX = /home/bin
OPARI2DIR = $(PREFIX)

We must also modify MPILIB (-lmpich) by editing the MPI settings:

#MPILIB = -lmpich
MPILIB =
PMPILIB = -lpmpich
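To illustrate what the manual EPIK instrumentation mentioned above looks like in user code, here is a small sketch; the macro names (EPIK_USER_REG, EPIK_USER_START, EPIK_USER_END) and the epik_user.h header follow the Scalasca 1.x user guide [7], and the region itself is purely illustrative.

#include <mpi.h>
#include "epik_user.h"   // EPIK user-instrumentation macros; active when built with -DEPIK

int main( int argc, char** argv )
{
    MPI_Init( &argc, &argv );

    // Register a named region, then bracket the code of interest so that it
    // appears as a distinct call path in the Scalasca analysis report.
    EPIK_USER_REG( r_solve, "laplacian:solve" );
    EPIK_USER_START( r_solve );
    /* ... assemble and solve the linear system ... */
    EPIK_USER_END( r_solve );

    MPI_Finalize();
    return 0;
}

Such a region would then be built through the PREP_EPIK rule above (scalasca -instrument -user -comp=all), so that both the user region and the MPI calls are measured.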
This setup experiment uses the ctest sample. ctest is a simple toy C program that can be used to test the instrumentation and measurement features of the toolset; we use the ctest-pomp-mpi variant. The example is found in:

~/Documents/scalasca-1.4.2/example

Regarding the Scalasca toolset components, the ctest experiment involves the following combinations:

• program source (with compiler/instrumenter) = instrumented executable;
• application + measurement library = trace analysis;
• application + measurement library = summary analysis.

The Makefile provides the command line of the Scalasca instrumenter, scalasca -instrument. We used the following performance-analysis steps:

Scalasca instrumenter = prepare application objects and executable for measurement:
% scalasca -instrument <compile-or-link-command>
Scalasca measurement collector and analyzer = run application under control of the measurement system:
% scalasca -analyze <application-launch-command>
Scalasca analysis report examiner = post-process and explore the measurement analysis report:
% scalasca -examine <experiment-archive|report>

In the example directory, two ctest variants are listed: ctest-pomp-mpi and ctest-epik-mpi. The versions of these files that include manual instrumentation using the EPIK API or POMP directives carry -epik and -pomp in their names. For ctest-pomp-mpi, we prepared the command which invokes the compiler with make, for example:

~/scalasca-1.4.2/example$ make ctest-pomp-mpi
scalasca -instrument -pomp -comp=all mpicc -m32 ctest-pomp.c -o ctest-pomp-mpi
INFO: Instrumented executable for MPI measurement

Then we prepend the command which runs the application executable with scalasca -analyze, here with 4 threads and 4 processes:

~/scalasca-1.4.2/example$ OMP_NUM_THREADS=4 scalasca -analyze -t mpiexec -np 4 ctest-pomp-mpi
S=C=A=N: Scalasca 1.4.2 trace collection and analysis
S=C=A=N: ./epik_ctest-pomp-mpi_4_trace experiment archive
S=C=A=N: Wed Jan 9 17:15:45 2013: Collect start
/usr/bin/mpiexec -np 4 ctest-pomp-mpi
Max. memory usage : 5.715MB
Total processing time : 0.044s
S=C=A=N: Wed Jan 9 17:15:50 2013: Analyze done (status=0) 2s
Warning: 0.035MB of analyzed trace data retained in ./epik_ctest-pomp-mpi_4_trace/ELG!
S=C=A=N: ./epik_ctest-pomp-mpi_4_trace complete.

Running the ctest-pomp-mpi executable creates the measurement archive directory, which contains the scout.cube file:

cd epik_ctest-pomp-mpi_4_trace

The measurement archive directory ultimately contains a copy of the execution output (epik.log), together with:

• a record of the measurement configuration (epik.conf);
• the basic analysis report that was collated after measurement (epitome.cube);
• the complete analysis report produced during post-processing (summary.cube.gz).

To visualize the results, we execute:

~/scalasca-1.4.2/example/epik_ctest-pomp-mpi_4_trace$ cube3 scout.cube
Figure 6. Analysis report presentation.

CUBE, in the version shipped with Scalasca 1.4.2, is a parallel program analysis report exploration tool that provides libraries for reading and writing the XML report format, algebra utilities for report processing, and a GUI for interactive analysis exploration (requiring Qt4). It represents the values (severity matrix) along three hierarchical axes:

• performance property (metric);
• call-tree path (program location);
• system location (process/thread).

Figure 6 shows the visualization of the analysis presentation and exploration. The metric tree provides the main information about time (CPU allocation time, including time allocated to idle threads) and visits. The time result is 0.12 with 1716 visits; all values are in nanoseconds. In the call tree, we observed a bug in the generated trace when analysing the MPI communication. The system tree presents the Linux cluster, the machine used for the execution (@marie) with 4 processes. Finally, we execute the command scalasca -examine:

~/scalasca-1.4.2/example$ scalasca -examine epik_ctest-pomp-mpi_4_trace
Archive ./epik_ctest-pomp-mpi_4_trace
INFO: Post-processing runtime summarization report...
INFO: Post-processing trace analysis report...
INFO: Displaying ./epik_ctest-pomp-mpi_4_trace/trace.cube.gz...

The scalasca -examine command displays the file trace.cube.gz, whose preview visualization is shown with CUBE; the trace report is shown in Figure 7.

In the last example, we used the ctest.c program to test the manual instrumentation with the EPIK API [7] and the measurement features of the toolset. Scalasca provides several possibilities to instrument user application code: besides the automatic compiler-based instrumentation, it provides manual instrumentation using the EPIK API. For MPI library calls, the instrumentation is accomplished using the standard MPI profiling interface (PMPI); to enable it, the application program has to be linked against the EPIK measurement library plus MPI-specific libraries. We restart by executing the command make ctest-epik-mpi to perform the instrumentation:

~/scalasca-1.4.2/example$ make ctest-epik-mpi
mpicc -m32 -DEPIK `kconfig --32 --cflags` ctest-epik.c -o ctest-epik-mpi `kconfig --mpi --32 --libs`

Now we run the application executable with scalasca -analyze:

~/Documents/jussara/scalasca-1.4.2/example$ scalasca -analyze -t mpiexec -np 4 ctest-epik-mpi
Figure 7. Examination Trace Report.

S=C=A=N: Scalasca 1.4.2 trace collection and analysis
S=C=A=N: ./epik_ctest-epik-mpi_4_trace experiment archive
S=C=A=N: Fri Jan 11 10:10:35 2013: Collect start
/usr/bin/mpiexec -np 4 ctest-epik-mpi
...
Max. memory usage : 5.711MB
Total processing time : 0.585s
S=C=A=N: Fri Jan 11 10:10:41 2013: Analyze done (status=0) 2s
Warning: 0.020MB of analyzed trace data retained in ./epik_ctest-epik-mpi_4_trace/ELG!
S=C=A=N: ./epik_ctest-epik-mpi_4_trace complete.

In the last step, we execute the command scalasca -examine:

~/Documents/jussara/scalasca-1.4.2/example$ scalasca -examine epik_ctest-epik-mpi_4_trace/
Archive ./epik_ctest-epik-mpi_4_trace
INFO: Post-processing runtime summarization report...
INFO: Post-processing trace analysis report...
INFO: Displaying ./epik_ctest-epik-mpi_4_trace/trace.cube.gz...

Figure 8 presents the post-processing runtime summarization report. Finally, we examine the first window of the examination report, the metric tree of the system, where we analyzed the following topics: execution time, synchronizations, communications, bytes transferred and computational imbalance.

• Time : total CPU allocation time.
• Visits : number of visits.
• Synchronizations : number of point-to-point/collective synchronization operations.
• Communications : number of point-to-point communication operations.
• Bytes transferred : number of bytes transferred in communication operations.
• Computational imbalance : computational load imbalance heuristic (overload and underload).
Figure 8. Examination Trace Report. The metric tree shows 0.19% CPU time, 1164 visits, 48 collective synchronization operations, 30 point-to-point sends and 30 point-to-point receives.

4.2. Laplacian Instrumentation

5. Conclusion

References

[1] Brian J. N. Wylie. Scalable performance analysis of large-scale parallel applications. http://www2.fz-juelich.de. [Online; accessed 8-January-2013].
[2] Markus Geimer and Brian Wylie. Scalable performance analysis of large-scale parallel applications. www.training.prace-ri.eu/uploads/tx.../Scalasca_Overview_01.pdf. [Online; accessed 9-January-2013].
[3] Markus Geimer and Brian Wylie. Tutorial Exercise NPB-MZ-MPI/BT - PRACE training. www.training.prace-ri.eu/uploads/tx.../MPIBTExercise_BG.pdf. [Online; accessed 9-January-2013].
[4] Philippe G. Ciarlet. The Finite Element Method for Elliptic Problems. SIAM, 1st edition, 2002.
[5] Markus Geimer. Introduction to Scalasca. http://www.linksceem.eu/ls2/images/stories/Introduction_to_Scalasca.pdf. [Online; accessed 9-January-2013].
[6] Markus Geimer, Felix Wolf, Brian J. N. Wylie, Erika Ábrahám, Daniel Becker, and Bernd Mohr. The Scalasca performance toolset architecture. Concurrency and Computation: Practice and Experience, 22(6):702-719, April 2010.
[7] Scalasca User Guide. Scalable Automatic Performance Analysis. http://www2.fz-juelich.de/jsc/datapool/scalasca/UserGuide.pdf. [Online; accessed 9-January-2013].
[8] Parry Husbands, Costin Iancu, and Katherine Yelick. A performance analysis of the Berkeley UPC compiler. In Proceedings of the 17th Annual International Conference on Supercomputing, ICS '03, pages 63-73, New York, NY, USA, 2003. ACM.
[9] J. Issa and S. Figueira. Graphics performance analysis using Amdahl's law. In Performance Evaluation of Computer and Telecommunication Systems (SPECTS), 2010 International Symposium on, pages 127-132, July 2010.
[10] MPI. The Message Passing Interface (MPI) standard. http://www-unix.mcs.anl.gov/mpi. [Online; accessed 7-January-2013].
[11] OpenMP. OpenMP: Simple, Portable, Scalable SMP Programming. http://www.openmp.org. [Online; accessed 7-January-2013].
[12] Christophe Prud'homme, Vincent Chabannes, Vincent Doyeux, Mourad Ismail, Abdoulaye Samake, and Gonçalo Pena. Feel++: A Computational Framework for Galerkin Methods and Advanced Numerical Methods. ESAIM Proceedings, page 27, December 2012. ISLE/CHPID.
[13] Christophe Prud'homme, Vincent Chabannes, Vincent Doyeux, Mourad Ismail, Abdoulaye Samake, Gonçalo Pena, Cécile Daversin, and Christophe Trophime. Advances in Feel++: a domain specific embedded language in C++ for partial differential equations. In Eccomas'12 - European Congress on Computational Methods in Applied Sciences and Engineering, Vienna, Austria, September 2012. MS404-1 Automation of computational modeling by advanced software tools and techniques, FRAE/RB4FASTSIM, ISLE/CHPID.
[14] F. Wolf and B. Mohr. Automatic performance analysis of hybrid MPI/OpenMP applications. In Parallel, Distributed and Network-Based Processing, 2003. Proceedings. Eleventh Euromicro Conference on, pages 13-22, February 2003.