ESAIM: PROCEEDINGS, Vol. ?, 2013, 1-10
Editors: Will be set by the publisher
COMPILATION ANALYSIS, PERFORMANCE ANALYSIS, SCALABILITY USING
SCALASCA WITH FEEL++ SCIENTIFIC APPLICATIONS∗, ∗∗
Jussara Marandola1
Résumé. ...
Abstract. This paper presents the following contributions: a compilation analysis of the Feel++ language, presenting one example, representative of the state of the art in mesh manipulation, that provides 1D, 2D and 3D models (triangle, rectangle, tetrahedron, cube) for the Laplacian code. The focus is the analysis of the compilation options during execution (mpirun, MPI execution with two and four processors). As a first step, we show the importance of performance analysis to compare Feel++ scientific applications, either using a Feel++ Time class or introducing Scalasca instrumentation, in order to obtain the CPU allocation time and the throughput with respect to the number of threads and the scalability of processors on clusters.
1. Introduction
In this paper, we introduce the compilation executions and performance analysis of Feel++, a Finite Element Embedded Language in C++. Feel++ is a C++ library for arbitrary order Galerkin methods (e.g. finite and spectral element methods), continuous or discontinuous, in 1D, 2D and 3D. It includes many features such as geometries in 1D, 2D and 3D and lower topological dimensions, 1D (curve) in 2D and 3D or 2D (surface) in 3D. Through its support of Gmsh for mesh generation and Paraview for post-processing, we can analyze the models by visualization.
Designed as a library to solve problems arising from partial differential equations (PDEs) through generalized Galerkin methods [13], Feel++ handles the complexity of differential models and the implementation of state-of-the-art robust numerical methods: a clear language to express problems specialized to a type of equation, e.g. Navier-Stokes or linear elasticity models, and finally a wealth of solution algorithms. The latest advances in Feel++ describe the mesh data structure as well as the mesh entities (elements, faces, edges, nodes) and the algebraic representations. In a parallel context, mesh entities are indexed by the MPI process id.
Feel++ relies on MPI for parallel computations and the Application class initialises the MPI environment. The makefile of a Feel++ project enables the MPI mode option, which offers parallel computation through the mpirun command.
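For readers who are not familiar with MPI, the following minimal sketch in plain MPI (illustrative only, not Feel++ code) shows what this initialisation provides to each process launched by mpirun: a process rank, used for instance to index mesh entities, and the total number of processes.

// Minimal plain-MPI sketch: what an MPI-aware application environment
// must do before and after the parallel computation.
#include <mpi.h>
#include <iostream>

int main( int argc, char** argv )
{
    MPI_Init( &argc, &argv );                // started once per process by mpirun

    int rank = 0, size = 1;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );  // process id, used e.g. to index mesh entities
    MPI_Comm_size( MPI_COMM_WORLD, &size );  // number of processes given by mpirun -np N

    std::cout << "process " << rank << " of " << size << std::endl;

    MPI_Finalize();
    return 0;
}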
The main goal is the introduction of the compilation and execution in the linear algebra environment, providing all mesh options, displaying the models through Gmsh or Paraview, presenting the base concepts of performance analysis (equations, metrics and a Time class in Feel++) and, finally, introducing the use of Scalasca to analyze the scalability and performance reports on Feel++.
∗ Thanks : Laboratoire Jean Kuntzmann, Université Joseph Fourier Grenoble 1, BP53 38041 Grenoble Cedex 9, France, e-mail :
christophe.prudhomme@ujf-grenoble.fr
∗∗ Thanks : Embedded Real Time Systems Foundation Laboratory (LaSTRE), CEA LIST, CEA-Saclay Nano-INNOV PC172,
F91191 Gif-sur-Yvette cedex, France, e-mail : stephane.louise@cea.fr
1 jussara@nerim.net
© EDP Sciences, SMAI 2013
Scalasca [6] is a performance toolset that has been specifically designed to analyze parallel application execution behavior on large-scale systems. It offers an incremental performance analysis procedure that integrates runtime summaries with in-depth studies of concurrent behavior via event tracing, adopting a strategy of successively refined measurement configurations.
We will focus our case study on the linear algebra environment, especially the standard formulation: the Laplacian case. Feel++ supports three different linear algebra environments, which we shall call backends: Gmm, PETSc and Trilinos. Regarding the function space definitions, several types of polynomials (P) are used: Lagrange, Legendre, Dubiner, Crouzeix-Raviart, Raviart-Thomas. Feel++ also supports modal bases, e.g. Legendre or Dubiner [12], as well as finite elements (FE) following the standard definition, set in [4], as a triplet (κ, P, Σ) where κ is a convex, P the polynomial space and Σ the dual space.
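As a concrete instance of this classical definition (a standard example, not specific to Feel++), the lowest-order Lagrange element on the reference triangle reads
$$ \kappa = \{(x,y) : x \ge 0,\; y \ge 0,\; x+y \le 1\}, \qquad P = \mathbb{P}_1(\kappa) = \operatorname{span}\{1, x, y\}, \qquad \Sigma = \{\sigma_i : p \mapsto p(a_i),\; i = 1,2,3\}, $$
where a_1, a_2, a_3 are the vertices of κ.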
The remainder of this paper is organized as follows.
In Section 2, we introduce the analysis of execution time for parallel algorithms as well as the concepts of speedup, execution time components and efficiency, and finally the importance of Amdahl's law in the performance analysis of graphics processors. The advantages of the most common model, message passing (MPI), in the context of the Feel++ programming environment during compilation are also discussed. Section 3 describes the standard formulation of the Laplacian with strong and weak Dirichlet conditions, explaining the compilation options driven by the following parameters: hsize and shape. The latter provides different polynomials and dimensions according to fine and coarse grain (represented by the mesh size). In Section 4, we present an overview of the current version of Scalasca in terms of instrumentation and measurement; the layered model of the Scalasca architecture, the analysis configuration and the MPI sample ctest-pomp-mpi are described, after which we show the Laplacian case using Scalasca. Finally, Section 5 concludes with the results of the setup experiments using Scalasca, the contributions and limitations of Scalasca with Feel++, the advances of the compilation options with the mesh data structure and, finally, the performance analysis of the programming environment in terms of serial/parallel execution time of the code and the use of the toolset for instrumentation and measurement.
2. Performance Analysis
This paper presents the Scalasca toolset for scalable performance analysis of large-scale parallel applications; it offers a basis to examine the effectiveness of parallel performance tools. The main goal is to optimize real applications (benchmarks and the Laplacian case) by understanding the barriers to high performance and predicting the possible improvement. Performance tools usually provide metrics about the parallelization of a program.
The first objective, related to the optimization of parallel algorithms in the Feel++ language, is the analysis of execution time, communication time and topology view (in terms of the analysis of processes on a Cartesian grid) for parallel algorithms using MPI programming on the codes and compilation options. We use the message passing model because of its advantage in scalability, since it runs on distributed-memory multiprocessors; most large-scale parallel machines use the common message passing model, MPI [10].
In this context, one of the challenges [8] faced by application developers for high performance parallel systems is the relatively difficult programming environment. Depending on the parallel machine, high performance can be obtained with programming models such as OpenMP [11], threads, data parallelism or automatic parallelization. Recent performance tools support hybrid programming (MPI/OpenMP) to optimize the application.
Recent studies on graphics performance analysis [9] using Amdahl's law are very important for designing future graphics processors as well as special-purpose processors. Amdahl's law is the most important formulation describing why increasing the number of processors does not lead to a proportional increase in performance: the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. The model of [9] can be written in the form
$$ T = T_0 + (T_1 - T_0)\,\frac{f_1}{f}, \qquad (1) $$
where T1 is the measured execution time at frequency f1, T0 is the non-scaled time, and f is the frequency at which T is predicted.
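As an illustration with made-up numbers, suppose T1 = 10 s is measured at f1 = 1 GHz and the non-scaled part is T0 = 2 s. Running at f = 2 GHz, the model predicts
$$ T = 2 + (10 - 2)\cdot\frac{1}{2} = 6 \ \text{s}, $$
i.e. doubling the frequency yields a speedup of only 10/6 ≈ 1.67 instead of 2, because the non-scaled fraction T0 does not benefit from the faster mode.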
Amdahl's law is very important in performance analysis with regard to scalability, that is, the increasing number of processors, in terms of measures of computation time (throughput and execution time) and communication time. The analysis of scalability requires many factors to measure performance, involving the concepts of speedup and efficiency.
The speedup measures how much faster a parallel code runs than its sequential counterpart; it is the ratio of the sequential execution time to the parallel execution time. We distinguish the following execution time components:
• Inherently sequential computations: σ(n)
• Potentially parallel computations: ϕ(n)
• Communication operations and other redundant computations: κ(n, p)
Regarding the speedup, ψ(n, p) for a problem of size n on p processors is bounded by
$$ \psi(n,p) \;\le\; \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)}. \qquad (2) $$
Finally, the efficiency can be measured through processor utilization, the ratio of the speedup to the number of processors used; consequently, the efficiency ε(n, p) for a problem of size n on p processors is bounded by
$$ \varepsilon(n,p) \;\le\; \frac{\sigma(n) + \varphi(n)}{p\,\bigl(\sigma(n) + \varphi(n)/p + \kappa(n,p)\bigr)}. \qquad (3) $$
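As a minimal sketch (with made-up values for σ(n), ϕ(n) and κ(n, p); these are not measurements from the Laplacian case), the two bounds can be evaluated as follows:

// Illustrative only: evaluate the speedup and efficiency bounds (2)-(3)
// for made-up values of sigma(n), phi(n) and kappa(n,p).
#include <cstdio>
#include <initializer_list>

int main()
{
    const double sigma = 1.0;    // inherently sequential time
    const double phi   = 99.0;   // potentially parallel time
    const double kappa = 0.5;    // communication/redundant time per run

    for ( int p : { 2, 4, 8, 16 } )
    {
        double psi = ( sigma + phi ) / ( sigma + phi / p + kappa );  // bound (2)
        double eps = psi / p;                                        // bound (3)
        std::printf( "p = %2d  speedup <= %6.2f  efficiency <= %4.2f\n", p, psi, eps );
    }
    return 0;
}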
3. Laplacian Case
The Feel++ documentation sources list all the samples. For the Laplacian case, there are the following examples: the Laplacian with weak Dirichlet conditions (problem in 2D), the full Laplacian, and the Laplacian with Lagrange multipliers. All codes are found in this directory:
~/Devel/feelpp/doc/manual/tutorial$
We analyzed the Laplacian with weak Dirichlet conditions, laplacian.cpp. The variational formulation of a problem is also called the weak formulation; the key idea is to introduce a test function and to integrate by parts. As a mathematical formulation example, we would like to solve the following problem.
Problem 1: find u such that
$$ -\Delta u = f \quad \text{in } \Omega = [-1,1]^2, \qquad (4) $$
with
$$ f = 2\pi^2 g, \qquad (5) $$
where g is the exact solution
$$ g = \sin(\pi x)\cos(\pi y). \qquad (6) $$
The following boundary conditions apply:
$$ u = g \ \text{ on } x = \pm 1, \qquad \frac{\partial u}{\partial n} = 0 \ \text{ on } y = \pm 1. \qquad (7) $$
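Indeed, a direct computation (a one-line check added here for completeness) confirms that g is the exact solution:
$$ -\Delta g = -\left(\partial_{xx} + \partial_{yy}\right)\sin(\pi x)\cos(\pi y) = 2\pi^2 \sin(\pi x)\cos(\pi y) = 2\pi^2 g = f, $$
and on y = ±1 we have ∂g/∂n = ±∂g/∂y = ∓π sin(πx) sin(πy) = 0, so the homogeneous Neumann condition holds.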
The alternative mathematical formulation handles the Dirichlet condition weakly and hence provides a uniform treatment for all types of boundary conditions: the formulation that treats the Dirichlet boundary condition weakly extends naturally to Neumann and Robin conditions. Following a similar development as in the previous section, the problem reads
$$ \int_\Omega \nabla u \cdot \nabla v \;+\; \int_{x=\pm 1} \Bigl( -\frac{\partial u}{\partial n}\,v \;-\; u\,\frac{\partial v}{\partial n} \;+\; \frac{\mu}{h}\,u\,v \Bigr) \;=\; \int_\Omega f\,v \;+\; \int_{x=\pm 1} \Bigl( -\,g\,\frac{\partial v}{\partial n} \;+\; \frac{\mu}{h}\,g\,v \Bigr), \qquad (8) $$
where v denotes the test function, n the outward unit normal, h the local mesh size and μ the penalisation parameter (the penaldir option below).
3.1. Execution Options and Feel++ Implementation
In the C++ code, the Laplacian class defines, through the makeOptions function, all the options available at execution time. This function, a routine that returns the list of options, uses the Boost program options library. The data returned is used as an argument of a Feel Application subclass. In our experiment, we focused on hsize (the characteristic mesh size), shape (the shape of the domain, either hypercube or simplex), dim (the dimension: 1D, 2D or 3D) and finally weakdir (whether the user wants to use the weak Dirichlet condition).
First we define the options; the main lines of the Feel++ makeOptions function are:
#include <feel/feel.hpp>
/** use Feel namespace */
using namespace Feel;

inline
po::options_description
makeOptions()
{
    po::options_description laplacianoptions( "Laplacian options" );
    laplacianoptions.add_options()
        ( "hsize", po::value<double>()->default_value( 0.5 ), "mesh size" )
        ( "shape", Feel::po::value<std::string>()->default_value( "hypercube" ), "shape of the domain (either simplex or hypercube)" )
        ( "nu", po::value<double>()->default_value( 1 ), "grad.grad coefficient" )
        ( "weakdir", po::value<int>()->default_value( 1 ), "use weak Dirichlet condition" )
        ( "penaldir", Feel::po::value<double>()->default_value( 10 ),
          "penalisation parameter for the weak boundary Dirichlet formulation" )
        ( "exact1D", po::value<std::string>()->default_value( "sin(2*Pi*x)" ), "exact 1D solution" )
        ( "exact2D", po::value<std::string>()->default_value( "sin(2*Pi*x)*cos(2*Pi*y)" ), "exact 2D solution" )
        ( "exact3D", po::value<std::string>()->default_value( "sin(2*Pi*x)*cos(2*Pi*y)*cos(2*Pi*z)" ), "exact 3D solution" )
        ( "rhs1D", po::value<std::string>()->default_value( "" ), "right hand side 1D" )
        ( "rhs2D", po::value<std::string>()->default_value( "" ), "right hand side 2D" )
        ( "rhs3D", po::value<std::string>()->default_value( "" ), "right hand side 3D" )
        ;
    return laplacianoptions.add( Feel::feel_options() );
}
Initially, the geometric dimension of the problem is set to Dim=2. In the public part of the Laplacian class, we list several types:
• Numerical type: double
• Geometric entity type composing the mesh, here Simplex in dimension Dim of order 1
• Mesh type
• Approximation function space type
• Exporter factory type
The Laplacian() constructor retrieves the mesh size and the shape from the command-line options, as defined below:
Laplacian()
    :
    super(),
    meshSize( this->vm()["hsize"].template as<double>() ),
    shape( this->vm()["shape"].template as<std::string>() )
{
}
The execution is defined by the run member function, parametrised by template<int Dim, int Order>:
template<int Dim, int Order>
void Laplacian<Dim,Order>::run()
{
    LOG(INFO) << "--------------------------------\n";
    LOG(INFO) << "Execute Laplacian<" << Dim << ">\n";
    Environment::changeRepository( boost::format( "doc/manual/tutorial/%1%/%2%-%3%/P%4%/h_%5%/" )
                                   % this->about().appName()
                                   % shape
                                   % Dim
                                   % Order
                                   % meshSize );
    mesh_ptrtype mesh = createGMSHMesh( _mesh=new mesh_type,
                                        _desc=domain( _name=( boost::format( "%1%-%2%" ) % shape % Dim ).str(),
                                                      _usenames=true,
                                                      _shape=shape,
                                                      _h=meshSize,
                                                      _xmin=-1,
                                                      _ymin=-1 ) );
    ...
}
The parameter _h in the function domain allows changing the characteristic mesh size to M_meshsize, which is given for example on the command line using the Application framework. createGMSHMesh is a function which generates a mesh file (.msh) automatically from a description file (.geo for example) via the _desc parameter, and stores the generated mesh into the mesh parameter allocated when calling the function.
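The assembly of the weak formulation (8) is not reproduced in this report; the following fragment is only an illustrative sketch based on the Feel++ variational DSL, where Xh, u, v, f, g, nu, penaldir and the face marker "Dirichlet" are placeholders, and the exact keyword names may differ between Feel++ versions:

// Illustrative sketch of assembling formulation (8) with the Feel++ DSL (not the paper's code).
auto a = form2( _test=Xh, _trial=Xh );                       // bilinear form
a  = integrate( _range=elements( mesh ),
                _expr=nu*gradt( u )*trans( grad( v ) ) );     // grad-grad term
a += integrate( _range=markedfaces( mesh, "Dirichlet" ),      // weak Dirichlet (Nitsche) terms
                _expr= -( gradt( u )*N() )*id( v )
                       -( grad( v )*N() )*idt( u )
                       + penaldir*idt( u )*id( v )/hFace() );

auto l = form1( _test=Xh );                                   // linear form
l  = integrate( _range=elements( mesh ), _expr=f*id( v ) );
l += integrate( _range=markedfaces( mesh, "Dirichlet" ),
                _expr= -g*( grad( v )*N() ) + penaldir*g*id( v )/hFace() );

a.solve( _rhs=l, _solution=u );                               // solve a(u,v) = l(v)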
To compile laplacian.cpp, we first invoke the compiler with make:
~/Devel/feel.opt$ make feelpp_doc_laplacian
To run the Laplacian executable, we must invoke the command line in this directory:
~/Devel/feel.opt/doc/manual/tutorial$
To run in parallel, we used the mpirun command line to run the Laplacian executable on N processors. The Laplacian was executed with 2 and 4 processors, but we had not implemented the Time class in Feel++ to measure the CPU allocation time. One of the main limitations is therefore the time measurement during execution in terms of processor scalability (increasing the number of processors N).
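In the absence of a dedicated Time class, a first wall-clock measurement can be obtained by bracketing the run with MPI_Wtime; the following is a minimal sketch (illustrative only, not the Feel++ implementation):

// Minimal wall-clock timing sketch: how the Laplacian run could be timed
// without a dedicated Time class.
#include <mpi.h>
#include <iostream>

int main( int argc, char** argv )
{
    MPI_Init( &argc, &argv );

    double t0 = MPI_Wtime();      // wall-clock time on this process

    // ... run the Laplacian case here (mesh generation, assembly, solve) ...

    double local = MPI_Wtime() - t0, tmax = 0.0;
    // the slowest process determines the parallel execution time
    MPI_Reduce( &local, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD );

    int rank = 0;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    if ( rank == 0 )
        std::cout << "elapsed time: " << tmax << " s" << std::endl;

    MPI_Finalize();
    return 0;
}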
At this point of our experiment, we defined the execution parameters related to the mesh structure:
• hsize : size of the mesh.
• shape : hypercube or simplex shape.
• dim : selects the 2D or 3D model.
• nu : gradient coefficient, set to nu=1.
• weakdir : use the weak Dirichlet condition.
The MPI command line was defined as
% mpirun -np 4 ./feelpp_doc_laplacian
Considering these parameters, each execution produces a mesh whose model is determined by the shape and the dimension, as summarized below:
Shape      Dim   Model
Simplex    2D    Triangle
Simplex    3D    Tetrahedron
Hypercube  2D    Rectangle
Hypercube  3D    Cube
Figure 1. Paraview Visualization of the Hypercube in 2D
Figure 2. Paraview Visualization of the Hypercube in 3D
For all simulations, we set the mesh size to 0.1. Below, we list the command lines for these parameters, running with 4 processors.
% mpirun -np 4 ./feelpp_doc_laplacian --hsize=0.1 --shape=simplex --dim=2 --nu=1 --weakdir=1
% mpirun -np 4 ./feelpp_doc_laplacian --hsize=0.1 --shape=simplex --dim=3 --nu=1 --weakdir=1
% mpirun -np 4 ./feelpp_doc_laplacian --hsize=0.1 --shape=hypercube --dim=2 --nu=1 --weakdir=1
% mpirun -np 4 ./feelpp_doc_laplacian --hsize=0.1 --shape=hypercube --dim=3 --nu=1 --weakdir=1
After running the simulations, note that Feel++ essentially supports the Gmsh mesh file format, but we used Paraview for visualization; Feel++ also provides some classes to manipulate Gmsh .geo files and generate .msh files. All Gmsh and Paraview files generated by the execution can be visualized from the Feel output directory:
~/feel/doc/manual/tutorial/laplacian$ ls
hypercube-2 hypercube-3 simplex-2 simplex-3
To visualize the hypercube meshes in 2D and 3D with mesh size 0.1 and 4 processors, we invoked Paraview:
~/feel/doc/manual/tutorial/laplacian/hypercube-2/P2/h_0.1/np_4$ paraview laplacian-4.sos
~/feel/doc/manual/tutorial/laplacian/hypercube-3/P2/h_0.1/np_4$ paraview laplacian-4.sos
To visualize the simplex meshes, triangle (2D) and tetrahedron (3D), with mesh size 0.1 and 4 processors, we invoked Paraview:
~/feel/doc/manual/tutorial/laplacian/simplex-2/P2/h_0.1/np_4$ paraview laplacian-4.sos
~/feel/doc/manual/tutorial/laplacian/simplex-3/P2/h_0.1/np_4$ paraview laplacian-4.sos
Figure 3. Paraview Visualization of Triangle in 2D
Figure 4. Paraview Visualization of Tetrahedron in 3D
4. Introduction to Scalasca
In the context of numerical methods and algorithms for high performance computing, the HPC community requires powerful and robust performance-analysis tools that make the optimization of parallel applications possible. Such tools are effective and efficient not only in helping to improve the scalability characteristics of scientific codes, but also in allowing experts to concentrate on the underlying science rather than spending a major fraction of their time tuning their application for a particular machine.
In this section, we introduce Scalasca, an open-source performance-analysis toolset developed at the Jülich Supercomputing Centre in cooperation with the University of Tennessee as the successor of KOJAK [14], specifically designed for use on large-scale systems including IBM Blue Gene and Cray XT, as well as for small and medium-scale HPC platforms.
The Scalasca architecture [6] defines the basic analysis workflow. The layered model comprises the instrumentation, the measurement system, the trace analysis, the trace utilities and the report tools. It is divided into the following stages: preparation (insertion of probes), execution (event measurement and collection) and post-mortem (event summarization and analysis, and finally presentation).
First, the target application must be instrumented, that is, probes must be inserted into the code that carry out the measurements. This can happen at different levels, including source code, object code, or library. Before running the instrumented executable on the parallel machine, the user can choose between generating a runtime summary report or an event trace. Figure 5 shows Scalasca's performance-analysis workflow.
Scalasca relates to the following research topics regarding performance analysis tools and techniques [1]:
• Profile analysis: summary of aggregated metrics (per function/call path and/or per process/thread) and tools that can generate and/or present such profiles (event traces). Examples: gprof, mpiP, ompP, Scalasca, TAU, Vampir, ...
Figure 5. Performance-analysis workflow. When tracing is enabled, each process generates a trace file containing records for its process-local events. It is generally recommended to optimize the instrumentation based on a previously generated summary report. During the analysis, Scalasca searches for wait states and, at the end, provides the result in a structure similar to the summary report.
• Time-line analysis: visual representation of the space/time sequence of events; requires an execution trace. Examples: Vampir, Paraver, JumpShot, Intel TAC, Sun Studio, ...
• Pattern analysis: search for event sequences characteristic of inefficiencies; can be done manually (via visual time-lines) or automatically (via KOJAK, Scalasca, Periscope, ...).
In the state of the art related to productivity tools, the following tools are very important in the life cycle of technology integration:
• KCachegrind : Callgraph-based cache simulation analysis.
• Marmot/MUST : MPI correctness checking.
• PAPI : Interfacing to hardware performance counters.
• Periscope : Automatic analysis via an on-line distributed search.
• SCALASCA : Large-scale parallel performance analysis.
• TAU : Integrated parallel performance system.
• Vampir/VampirTrace : Event tracing and graphical trace visualization and analysis.
In the next section, we explain the measurement and analysis configuration for the CTEST package running with MPI, with emphasis on the performance analysis.
4.1. Setup Experiment
The Scalasca measurement system that gets linked with the instrumented application executable can be configured via environment variables or configuration files to specify whether runtime summaries and/or event traces should be collected, along with optional hardware counter metrics.
In the introduction of Section 4 we presented the performance-analysis workflow based on the following steps: preparation, measurement, analysis, examination, optimization. The preparation step prepares the application and inserts extra code (probes). The measurement step collects the data relevant to the execution performance analysis. The analysis step computes the metrics and identifies the performance problems. The examination step presents the results in an accessible form. Finally, the optimization step modifies the code to eliminate or reduce the performance problems.
The performance analysis comprises the following steps:
The perforrmance analysis presents the following steps :
• 0 : reference preparation for validation
• 1 : Program instrumentation : skin
• 2 : Summary measurement collection  analysis : scan [ -s]
• 3 : Summary analysis report examination : square
• 4 : Summary experiment scoring : scan -t
• 5 : Event trace collection  analysis : scan -t
• 6 : Event trace analysis report examination : square
The samples are available in MPI, OpenMP and hybrid OpenMP/MPI execution modes. In our setup experiment, we show the CTEST sample with POMP-MPI execution. In this case, the Scalasca measurement system provides the configuration files (Makefile and Makefile.defs), which means that the ctest.c code is compiled through the scalasca -instrument command. Prefixing the compile/link commands in the Makefile definitions (config/make.def) with the Scalasca instrumenter is the essential step.
First, you must load the Scalasca module; then you can run scalasca without arguments to get brief usage information, as shown below.
% module load scalasca
scalasca /1.4.2 loaded
~/scalasca-1.4.2$ scalasca
Scalasca 1.4.2
Toolset for scalable performance analysis of large-scale parallel applications
usage: scalasca [-v][-n] {action}
1. prepare application objects and executable for measurement:
scalasca -instrument compile-or-link-command # skin
2. run application under control of measurement system:
scalasca -analyze application-launch-command # scan
3. interactively explore measurement analysis report:
scalasca -examine experiment-archive|report # square
-v: enable verbose commentary
-n: show actions without taking them
-h: show quick reference guide (only)
The Scalasca example Makefile provides the following PREP definitions, which explain how to set up the instrumentation of routines by the compiler:
• Default instrumentation of routines by the compiler: PREP = scalasca -instrument.
• Build without any instrumentation: PREP = scalasca -instrument -mode=none.
• Manual EPIK instrumentation (only): PREP_EPIK = $(PREP) -user -comp=all.
• Manual POMP instrumentation (only): PREP_POMP = $(PREP) -pomp -comp=all.
In our case, the PREP instrumentation is set in the Makefile (which includes Makefile.defs). By default, the PREP macro is not set and no instrumentation is performed for a regular production build. You must specify a PREP value in the Makefile or on the make command line, e.g.:
% make PREP="scalasca -instrument"
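As an example, a hypothetical Makefile fragment for the ctest-pomp-mpi target would simply prefix the usual MPI compile/link command with the instrumenter (target and flags follow the example shown later in this section):

# Hypothetical Makefile fragment (illustrative only)
PREP      = scalasca -instrument
PREP_POMP = $(PREP) -pomp -comp=all

ctest-pomp-mpi: ctest-pomp.c
	$(PREP_POMP) mpicc -m32 ctest-pomp.c -o ctest-pomp-mpi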
Before the instrumentation step, we describe the configuration of the setup experiment. A few details must be set up and checked.
The Scalasca executables are installed in ~/home/bin; check that this directory is in the PATH:
% echo $PATH
To configure the PATH:
% export PATH=$PATH:~/home/bin
To modify Makefile.defs for Linux GNU (standard configuration):
PREFIX = /opt/scalasca  →  PREFIX = /home/bin
OPARI2DIR = $(PREFIX)
We must also modify MPILIB (-lmpich) by editing the MPI settings:
#MPILIB = -lmpich
MPILIB =
PMPILIB = -lpmpich
This setup experiment describes the CTEST sample. CTEST is a simple toy C program that can be used to test the instrumentation and measurement features of the toolset. We use the ctest-pomp-mpi variant. The example is found in:
% ~/Documents/scalasca-1.4.2/example
Regarding the Scalasca toolset components, the CTEST experiment involves the following:
• Program source (with compiler / instrumenter) → instrumented executable.
• Application + measurement library → trace analysis.
• Application + measurement library → summary analysis.
The Makefile provides the command line of the Scalasca instrumenter, scalasca -instrument.
We used the following performance analysis steps.
Scalasca instrumenter (prepare application objects and executable for measurement):
% scalasca -instrument compile-or-link-command
Scalasca measurement collector and analyzer (run the application under control of the measurement system):
% scalasca -analyze application-launch-command
Scalasca analysis report examiner (post-process and explore the measurement analysis report):
% scalasca -examine experiment-archive|report
In the example directory, there are two CTEST variants: ctest-pomp-mpi and ctest-epik-mpi. The versions of these files that include manual instrumentation, using the EPIK API or POMP directives, carry -epik or -pomp in their names. For ctest-pomp-mpi, we invoke the compiler with make:
~/scalasca-1.4.2/example$ make ctest-pomp-mpi
scalasca -instrument -pomp -comp=all mpicc -m32 ctest-pomp.c -o ctest-pomp-mpi
INFO: Instrumented executable for MPI measurement
Next, we run the application executable under scalasca -analyze with 4 threads and 4 processes:
~/scalasca-1.4.2/example$ OMP_NUM_THREADS=4 scalasca -analyze -t mpiexec -np 4 ctest-pomp-mpi
S=C=A=N: Scalasca 1.4.2 trace collection and analysis
S=C=A=N: ./epik_ctest-pomp-mpi_4_trace experiment archive
S=C=A=N: Wed Jan 9 17:15:45 2013: Collect start
/usr/bin/mpiexec -np 4 ctest-pomp-mpi
Max. memory usage : 5.715MB
Total processing time : 0.044s
SCAN: Wed Jan 9 17:15:50 2013: Analyze done (status=0) 2s
Warning: 0.035MB of analyzed trace data retained in ./epik_ctest-pomp-mpi_4_trace/ELG!
SCAN: ./epik_ctest-pomp-mpi_4_trace complete.
Running the ctest-pomp-mpi executable creates the experiment archive directory, which contains the scout.cube file:
cd epik_ctest-pomp-mpi_4_trace
The measurement archive directory ultimately contains a copy of the execution output (epik.log), together with:
• a record of the measurement configuration (epik.conf);
• the basic analysis report that was collated after measurement (epitome.cube);
• the complete analysis report produced during post-processing (summary.cube.gz).
To visualize the results, we execute the command:
~/scalasca-1.4.2/example/epik_ctest-pomp-mpi_4_trace$ cube3 scout.cube
Figure 6. Analysis report presentation
The CUBE tool shipped with Scalasca 1.4.2 is a parallel program analysis report exploration tool that provides libraries for XML report reading and writing, algebra utilities for report processing and a GUI for interactive analysis exploration (requiring Qt4). It represents the values (severity matrix) along three hierarchical axes:
• Performance property (metric)
• Call-tree path (program location)
• System location (process/thread)
Figure 6 shows the visualization of the analysis presentation and exploration. The metric tree provides the main information about time (CPU allocation time, which includes the time allocated for idle threads) and visits: the reported time is 0.12 and the number of visits is 1716; all values are in nanoseconds. In the call tree, we observed a bug in the generation of the trace for the analysis of MPI communication. The system tree presents the Linux cluster and the machine used for the execution (@marie) with 4 processes.
Finally, we execute the command scalasca -examine:
~/scalasca-1.4.2/example$ scalasca -examine epik_ctest-pomp-mpi_4_trace
Archive ./epik_ctest-pomp-mpi_4_trace
INFO: Post-processing runtime summarization report...
INFO: Post-processing trace analysis report...
INFO: Displaying ./epik_ctest-pomp-mpi_4_trace/trace.cube.gz...
The scalasca -examine command displays the file trace.cube.gz; the preview visualization is shown with CUBE. Figure 7 shows the trace report.
In the last example, we used the ctest-epik.c program to test the manual instrumentation with the EPIK API [7] and the measurement features of the toolset. Scalasca provides several possibilities to instrument user application code: besides the automatic compiler-based instrumentation, it provides manual instrumentation using the EPIK API.
For MPI library calls, the instrumentation is accomplished using the standard MPI profiling interface PMPI. To enable it, the application program has to be linked against the EPIK measurement library plus MPI-specific libraries.
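As a minimal sketch of manual EPIK user instrumentation (the region name and placement are hypothetical, not taken from ctest-epik.c; see the EPIK user API in [7]), compiled with -DEPIK as in the ctest-epik-mpi example:

/* Illustrative sketch of manual EPIK user instrumentation. */
#include <mpi.h>
#include "epik_user.h"   /* EPIK user instrumentation API */

int main( int argc, char** argv )
{
    MPI_Init( &argc, &argv );

    EPIK_USER_REG( r_work, "work_loop" );   /* register a user region (hypothetical name) */
    EPIK_USER_START( r_work );              /* enter the instrumented region */
    /* ... computational kernel to be measured ... */
    EPIK_USER_END( r_work );                /* leave the instrumented region */

    MPI_Finalize();
    return 0;
}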
We then run make ctest-epik-mpi to perform the instrumentation:
~/scalasca-1.4.2/example$ make ctest-epik-mpi
mpicc -m32 -DEPIK `kconfig --32 --cflags` ctest-epik.c \
-o ctest-epik-mpi `kconfig --mpi --32 --libs`
Next, we run the application executable with scalasca -analyze:
~/Documents/jussara/scalasca-1.4.2/example$ scalasca -analyze -t mpiexec -np 4 ctest-epik-mpi
Figure 7. Examination Trace Report
scalasca -analyze -t mpiexec -np 4 ctest-epik-mpi
S=C=A=N: Scalasca 1.4.2 trace collection and analysis
S=C=A=N: ./epik_ctest-epik-mpi_4_trace experiment archive
S=C=A=N: Fri Jan 11 10:10:35 2013: Collect start
/usr/bin/mpiexec -np 4 ctest-epik-mpi
...
Max. memory usage : 5.711MB
Total processing time : 0.585s
S=C=A=N: Fri Jan 11 10:10:41 2013: Analyze done (status=0) 2s
Warning: 0.020MB of analyzed trace data retained in ./epik_ctest-epik-mpi_4_trace/ELG!
S=C=A=N: ./epik_ctest-epik-mpi_4_trace complete.
In the last step, we execute the command scalasca -examine:
~/Documents/jussara/scalasca-1.4.2/example$ scalasca -examine epik_ctest-epik-mpi_4_trace/
Archive ./epik_ctest-epik-mpi_4_trace
INFO: Post-processing runtime summarization report...
INFO: Post-processing trace analysis report...
INFO: Displaying ./epik_ctest-epik-mpi_4_trace/trace.cube.gz...
Figure 8 presents the post-processing runtime summarization report.
Finally, we explain the first window of the examination report, the system metric tree. In this report, we analyzed the following topics: execution time, synchronizations, communications, bytes transferred and computational imbalance.
• Time : total CPU allocation time.
• Visits : number of visits.
• Synchronizations : number of point-to-point/collective synchronization operations.
• Communications : number of point-to-point communication operations.
• Bytes transferred : number of bytes transferred in communication operations.
• Computational imbalance : computational load imbalance heuristic (overload and underload).
Figure 8. Examination Trace Report
The metric tree reports 0.19% CPU time, 1164 visits, 48 collective synchronization operations, 30 point-to-point send operations and 30 point-to-point receive operations.
4.2. Laplacian Instrumentation
5. Conclusion
References
[1] Brian. Scalable performance analysis of large-scale parallel applications. http://www2.fz-juelich.de. [Online ; accessed 8-
January-2013].
[2] Markus Geimer and Brian Wylie. Scalable performance analysis of large-scale parallel applications. www.training.prace-ri.eu/uploads/tx.../Scalasca_Overview_01.pdf. [Online; accessed 9-January-2013].
[3] Markus Geimer and Brian Wylie. Tutorial Exercise NPB-MZ-MPI/BT - PRACE training. www.training.prace-ri.eu/uploads/tx.../MPIBTExercise_BG.pdf. [Online; accessed 9-January-2013].
[4] Philippe G. Ciarlet. The Finite Element Method for Elliptic Problems. SIAM, 1st edition, 2002.
[5] Marcus Geimer. Introduction to Scalasca. http://www.linksceem.eu/ls2/images/stories/Introduction_to_Scalasca.pdf.
[Online ; accessed 9-January-2013].
[6] Markus Geimer, Felix Wolf, Brian J. N. Wylie, Erika Ábrahám, Daniel Becker, and Bernd Mohr. The Scalasca performance toolset architecture. Concurr. Comput.: Pract. Exper., 22(6):702-719, April 2010.
[7] User Guide. Scalable Automatic Performance Analysis. http://www2.fz-juelich.de/jsc/datapool/scalasca/UserGuide.pdf. [Online; accessed 9-January-2013].
[8] Parry Husbands, Costin Iancu, and Katherine Yelick. A performance analysis of the Berkeley UPC compiler. In Proceedings of the 17th Annual International Conference on Supercomputing, ICS '03, pages 63-73, New York, NY, USA, 2003. ACM.
[9] J. Issa and S. Figueira. Graphics performance analysis using Amdahl's law. In Performance Evaluation of Computer and Telecommunication Systems (SPECTS), 2010 International Symposium on, pages 127-132, July 2010.
[10] MPI. The Message Passing Interface (MPI) standard. http://www-unix.mcs.anl.gov/mpi. [Online ; accessed 7-January-2013].
[11] OpenMP. OpenMP : Simple, Portable, Scalable SMP Programming. http://www.openmp.org. [Online ; accessed 7-January-
2013].
[12] Christophe Prud'Homme, Vincent Chabannes, Vincent Doyeux, Mourad Ismail, Abdoulaye Samake, and Gonçalo Pena.
Feel++ : A Computational Framework for Galerkin Methods and Advanced Numerical Methods. ESAIM Proceedings, page 27,
December 2012. ISLE/CHPID.
[13] Christophe Prud'Homme, Vincent Chabannes, Vincent Doyeux, Mourad Ismail, Abdoulaye Samake, Gonçalo Pena, Cécile Daversin, and Christophe Trophime. Advances in Feel++: a domain specific embedded language in C++ for partial differential equations. In Eccomas'12 - European Congress on Computational Methods in Applied Sciences and Engineering, Vienna, Austria, September 2012. MS404-1 Automation of computational modeling by advanced software tools and techniques FRAE/RB4FASTSIM, ISLE/CHPID.
[14] F. Wolf and B. Mohr. Automatic performance analysis of hybrid MPI/OpenMP applications. In Parallel, Distributed and Network-Based Processing, 2003. Proceedings. Eleventh Euromicro Conference on, pages 13-22, February 2003.

Contenu connexe

Tendances

aMCfast: Automation of Fast NLO Computations for PDF fits
aMCfast: Automation of Fast NLO Computations for PDF fitsaMCfast: Automation of Fast NLO Computations for PDF fits
aMCfast: Automation of Fast NLO Computations for PDF fitsjuanrojochacon
 
A time study in numerical methods programming
A time study in numerical methods programmingA time study in numerical methods programming
A time study in numerical methods programmingGlen Alleman
 
D. Vulcanov: Symbolic Computation Methods in Cosmology and General Relativity...
D. Vulcanov: Symbolic Computation Methods in Cosmology and General Relativity...D. Vulcanov: Symbolic Computation Methods in Cosmology and General Relativity...
D. Vulcanov: Symbolic Computation Methods in Cosmology and General Relativity...SEENET-MTP
 
Advanced property tracking Industrial Modeling Framework
Advanced property tracking Industrial Modeling FrameworkAdvanced property tracking Industrial Modeling Framework
Advanced property tracking Industrial Modeling FrameworkAlkis Vazacopoulos
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...NECST Lab @ Politecnico di Milano
 
A Load-Balanced Parallelization of AKS Algorithm
A Load-Balanced Parallelization of AKS AlgorithmA Load-Balanced Parallelization of AKS Algorithm
A Load-Balanced Parallelization of AKS AlgorithmTELKOMNIKA JOURNAL
 
Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check  Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check IJECEIAES
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and AlgorithmDhaval Kaneria
 
Contention - Aware Scheduling (a different approach)
Contention - Aware Scheduling (a different approach)Contention - Aware Scheduling (a different approach)
Contention - Aware Scheduling (a different approach)Dimos Raptis
 
High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)
High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)
High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)Enrique Monzo Solves
 
Parallel programming in modern world .net technics shared
Parallel programming in modern world .net technics   sharedParallel programming in modern world .net technics   shared
Parallel programming in modern world .net technics sharedIT Weekend
 

Tendances (19)

Matrix Multiplication Report
Matrix Multiplication ReportMatrix Multiplication Report
Matrix Multiplication Report
 
Chap12 slides
Chap12 slidesChap12 slides
Chap12 slides
 
aMCfast: Automation of Fast NLO Computations for PDF fits
aMCfast: Automation of Fast NLO Computations for PDF fitsaMCfast: Automation of Fast NLO Computations for PDF fits
aMCfast: Automation of Fast NLO Computations for PDF fits
 
A time study in numerical methods programming
A time study in numerical methods programmingA time study in numerical methods programming
A time study in numerical methods programming
 
Model checking
Model checkingModel checking
Model checking
 
D. Vulcanov: Symbolic Computation Methods in Cosmology and General Relativity...
D. Vulcanov: Symbolic Computation Methods in Cosmology and General Relativity...D. Vulcanov: Symbolic Computation Methods in Cosmology and General Relativity...
D. Vulcanov: Symbolic Computation Methods in Cosmology and General Relativity...
 
Advanced property tracking Industrial Modeling Framework
Advanced property tracking Industrial Modeling FrameworkAdvanced property tracking Industrial Modeling Framework
Advanced property tracking Industrial Modeling Framework
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...
 
A Load-Balanced Parallelization of AKS Algorithm
A Load-Balanced Parallelization of AKS AlgorithmA Load-Balanced Parallelization of AKS Algorithm
A Load-Balanced Parallelization of AKS Algorithm
 
Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check  Area efficient parallel LFSR for cyclic redundancy check
Area efficient parallel LFSR for cyclic redundancy check
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and Algorithm
 
Chap10 slides
Chap10 slidesChap10 slides
Chap10 slides
 
Flex ch
Flex chFlex ch
Flex ch
 
Contention - Aware Scheduling (a different approach)
Contention - Aware Scheduling (a different approach)Contention - Aware Scheduling (a different approach)
Contention - Aware Scheduling (a different approach)
 
Fulltext
FulltextFulltext
Fulltext
 
Chap4 slides
Chap4 slidesChap4 slides
Chap4 slides
 
High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)
High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)
High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)
 
Chap9 slides
Chap9 slidesChap9 slides
Chap9 slides
 
Parallel programming in modern world .net technics shared
Parallel programming in modern world .net technics   sharedParallel programming in modern world .net technics   shared
Parallel programming in modern world .net technics shared
 

En vedette

cemracs_vivabrain_slides
cemracs_vivabrain_slidescemracs_vivabrain_slides
cemracs_vivabrain_slidesJussara F.M.
 
presentation_cemracs2012
presentation_cemracs2012presentation_cemracs2012
presentation_cemracs2012Jussara F.M.
 
Catalogue ANGLAIS DÉLICES
Catalogue ANGLAIS DÉLICESCatalogue ANGLAIS DÉLICES
Catalogue ANGLAIS DÉLICESSylvie Schmit
 
بررسی روشهای مسیریابی شبکه های فرصت طلبانه
بررسی روشهای مسیریابی شبکه های فرصت طلبانهبررسی روشهای مسیریابی شبکه های فرصت طلبانه
بررسی روشهای مسیریابی شبکه های فرصت طلبانهabedin753
 
Gestion operativa en instituciones-educativas
Gestion operativa en instituciones-educativasGestion operativa en instituciones-educativas
Gestion operativa en instituciones-educativasFrank Ruiz
 

En vedette (11)

cemracs_vivabrain_slides
cemracs_vivabrain_slidescemracs_vivabrain_slides
cemracs_vivabrain_slides
 
2015_J_Kilmister_Resume
2015_J_Kilmister_Resume2015_J_Kilmister_Resume
2015_J_Kilmister_Resume
 
SoC-2012-pres-2
SoC-2012-pres-2SoC-2012-pres-2
SoC-2012-pres-2
 
presentation_cemracs2012
presentation_cemracs2012presentation_cemracs2012
presentation_cemracs2012
 
Catalogue ANGLAIS DÉLICES
Catalogue ANGLAIS DÉLICESCatalogue ANGLAIS DÉLICES
Catalogue ANGLAIS DÉLICES
 
Recruiting in the New Economy GENERIC
Recruiting in the New Economy GENERICRecruiting in the New Economy GENERIC
Recruiting in the New Economy GENERIC
 
بررسی روشهای مسیریابی شبکه های فرصت طلبانه
بررسی روشهای مسیریابی شبکه های فرصت طلبانهبررسی روشهای مسیریابی شبکه های فرصت طلبانه
بررسی روشهای مسیریابی شبکه های فرصت طلبانه
 
Antartica
AntarticaAntartica
Antartica
 
Gestion operativa en instituciones-educativas
Gestion operativa en instituciones-educativasGestion operativa en instituciones-educativas
Gestion operativa en instituciones-educativas
 
Human Capital Analytics 2.2016
Human Capital Analytics 2.2016Human Capital Analytics 2.2016
Human Capital Analytics 2.2016
 
MML New Resume
MML New ResumeMML New Resume
MML New Resume
 

Similaire à Rapport_Cemracs2012

Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP IJCSEIT Journal
 
cis97003
cis97003cis97003
cis97003perfj
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Rusif Eyvazli
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and AlgorithmDhaval Kaneria
 
Cupdf.com introduction to-data-structures-and-algorithm
Cupdf.com introduction to-data-structures-and-algorithmCupdf.com introduction to-data-structures-and-algorithm
Cupdf.com introduction to-data-structures-and-algorithmTarikuDabala1
 
Complier design
Complier design Complier design
Complier design shreeuva
 
Taking r to its limits. 70+ tips
Taking r to its limits. 70+ tipsTaking r to its limits. 70+ tips
Taking r to its limits. 70+ tipsIlya Shutov
 
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...IRJET Journal
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterSudhang Shankar
 
Performance measures
Performance measuresPerformance measures
Performance measuresDivya Tiwari
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving SystemsPRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving SystemsNECST Lab @ Politecnico di Milano
 

Similaire à Rapport_Cemracs2012 (20)

Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
Performance Analysis of Parallel Algorithms on Multi-core System using OpenMP
 
cis97003
cis97003cis97003
cis97003
 
DATA STRUCTURE.pdf
DATA STRUCTURE.pdfDATA STRUCTURE.pdf
DATA STRUCTURE.pdf
 
DATA STRUCTURE
DATA STRUCTUREDATA STRUCTURE
DATA STRUCTURE
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...
 
Introduction to data structures and Algorithm
Introduction to data structures and AlgorithmIntroduction to data structures and Algorithm
Introduction to data structures and Algorithm
 
Cupdf.com introduction to-data-structures-and-algorithm
Cupdf.com introduction to-data-structures-and-algorithmCupdf.com introduction to-data-structures-and-algorithm
Cupdf.com introduction to-data-structures-and-algorithm
 
Complier design
Complier design Complier design
Complier design
 
Taking r to its limits. 70+ tips
Taking r to its limits. 70+ tipsTaking r to its limits. 70+ tips
Taking r to its limits. 70+ tips
 
Matopt
MatoptMatopt
Matopt
 
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
 
Oct.22nd.Presentation.Final
Oct.22nd.Presentation.FinalOct.22nd.Presentation.Final
Oct.22nd.Presentation.Final
 
Tutorial
TutorialTutorial
Tutorial
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC cluster
 
Performance measures
Performance measuresPerformance measures
Performance measures
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving SystemsPRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
 
KMAP PAPER (1)
KMAP PAPER (1)KMAP PAPER (1)
KMAP PAPER (1)
 
cug2011-praveen
cug2011-praveencug2011-praveen
cug2011-praveen
 
final
finalfinal
final
 

Rapport_Cemracs2012

  • 1. ESAIM: PROCEEDINGS, Vol. ?, 2013, 1-10 Editors: Will be set by the publisher COMPILATION ANALYSIS, PERFORMANCE ANALYSIS, SCALABILITY USING SCALASCA WITH FEEL++ SCIENTIFIC APPLICATIONS∗, ∗∗ Jussara Marandola1 Résumé. ... Abstract. This paper presents the following contributions: the compilation analysis on Feel++ lan- guage presenting one example as state of art in mesh manipulation to provide 1D, cube in 2D, tetrahe- dron, cube in 3D models of Laplacian code. The focus was analyzed the compilation options during the execution - mpirun, mpi execution with duo, four processors. In rst step, we showed the importance to realize the performance analyze to compare the feel++ scientic application using Feel++ TIME class or introducing Scalasca instrumentation to get the CPU time allocation and throughput by scalability in terms of numbers of threads and scalability of processors for clusters. 1. Introduction In this paper, we introduce the compilation executions and performance analysis on Feel++, Finite Element Embedded Language in C++. Feel++ is a C++ library for arbitrary order Galerkin methods (e.g. nite and spectral element methods ) continuous or discontinuous in 1D 2D and 3D. It include many features such as geometries 1D, 2D, 3D and lower topological dimension 1D(curve) in 2D and 3D or 2D(surfacee) in 3D. Through supporting Gmsh for mesh generation and Paraview for post-processing we can analyze the models by visualization. Thinking in terms of libraries to solve problems arising from partial dierential equations(PDEs) through generalized Galerkin methods [13], Feel++ provides the complexity of dierential models and implementation of state of the art robust numerical methods : a language clear to express problems specialized in a type of equation, e.g Navier-Stokes or linear elasticity models and nally a wealth of solution algorithm. The last advances in Feel++ describes the mesh data structure as well mesh entities(elements, faces, edges, nodes) and algebrics representations. Mesh entities are indexed by process id by MPI in parallel context. Feel ++ relies on MPI for parallel computations and the class Application initialises the MPI environment. The makele of Feel++ project enable the option MPI mode that oers the parallel computation with mpirun command. The main goal will be the introduction of compilation executation in linear algebra environment, providing all mesh options, displaying the models through gmesh or paraview, the base concepts of performance analysis (equations, metrics and Time class into Feel++) and nally to introduce the use of Scalasca to analyze the ∗ Thanks : Laboratoire Jean Kuntzmann, Université Joseph Fourier Grenoble 1, BP53 38041 Grenoble Cedex 9, France, e-mail : christophe.prudhomme@ujf-grenoble.fr ∗∗ Thanks : Embedded Real Time Systems Foundation Laboratory (LaSTRE), CEA LIST, CEA-Saclay Nano-INNOV PC172, F91191 Gif-sur-Yvette cedex, France, e-mail : stephane.louise@cea.fr 1 jussara@nerim.net c EDP Sciences, SMAI 2013
  • 2. 2 ESAIM: PROCEEDINGS scalability performance reports on Feel++. Scalasca [6] is a performance toolset that has been specically designed to analyze parallel application execution behavior on large-scale systems. It oers an incremental performance anlaysis procedure that integrates runtime summaries with in-depth studies of current behavior via event tracing, adopting a strategy of successively rened measurement congurations. We will focus our case study on linear algebra environment, specially standard formulation : the laplacian case. Feel ++ supports three dierent linear algebra environments that we shall call backends such as Gmm, Petsc4, Trilinos5. Regarding the fucntion spaces denition, several types of the polynomials (P) are used as following : Lagrange, Legendre, Dubiner, Crouzeix-Raviart, Raviart-Thomas. It supports also modal basis, e.g. Legendre or Dubiner [12], as well as nite elements (FE) following the standard denition, set in [4], as a triplet (κ, P, Σ) where κ is a convex, P the polynomial space and Σ the dual space. Remainder of this paper is organized as follows. Section 2, we will introduce the analysis of execution time for parallel algorithms as well concepts related with speedup, execution time components and eciency and nally, the importance of Amdahl's law in performance analysis of graphics processors. The advantage of the most common model (MPI), message passing in the context of programming environment of Feel++ during compilation will be discuted also. In the Section 3 will be to describe the standard formulation of laplacian without strong and weak Dirichlet conditions explaining the compilation options by executing with the following ordering : hzise and shape. The last one provides dierents polynomials dimensions according ne and coarse grain (represented by size). In Section 4, we present the overview of current version of Scalasca in terms of instrumentation and measurement. The layered model of the Scalasca architecture, the analysis conguration and the MPI sample ctest-pomp-mpi will be described. After that we will show the laplacian case using Scalasca. And nally, we present the Section 5 concluding all results of setup experiment using Scalasca, the contributions and limitations of Scalasca with Feel++, advances of compilation options with mesh data structure and nally, the performance analysis on programming environment in terms of serial/parallel time execution of code or use of toolset for instrumentation and measurement. 2. Performance Analysis This paper presents the Scalasca toolset for scalable performance analysis of large-scale parallel applications. It oers a basis to examine the eectiveness of parallel performance tools. The main goal is to optimize the real applications (benchmarks and laplacian case) understing the barriers to high performance and predict improvement. The performance tools usually provides a program metrics parallelization. The rst objective related with optimization of parallel algorithms on Feel++ language will be the analysis of execution time, communication time and topology view (in terms of analysis of processes in cartesian grid) for parallel algorithm using MPI programming on codes and compilation options. We use the message passing model because it has the advantage in scalability, since it runs on distributed memory multiprocessors. For most of large-scale parallel machines use the common model message passing - MPI [10]. 
In this context, one of the challenges [8], faced by the application developers for high performance parallel systems is the relatively diculty programming environment. Depending of parallel machine, the high perfor- mance could be obtained by relative programming models such as OpenMP [11], threads, data parallelism or automatic parallelization. The recent performance tools are using the hybrid programming (MPI/OpenMP) to optimize your application. The recent studies about graphics performance analysis [9] using Amdahl's Law is very important for design- ing future graphics processors as well specic purpose-processors. The Amadahl's Law is the most important formulation that describes the number of processors does not lead to a proportional increase in performance. The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. The denition can be written in the form of T = T0 + (T1 − T0) f1 f , (1) where T1 : is the measured execution time at frequency f1 ; T0 : is the non-scale time ;
  • 3. ESAIM: PROCEEDINGS 3 The Amadhal's Law is very important at performance analysis regarding the scalability, it means the in- creasing number of processors, in terms of measures of computation time (throughput and execution time) and communication time. Analysis of scalability requires many factors to measure the performance that involves concepts of speedup and eciency. The Speedup measures how one parallel code can run faster than the sequential counterpart and ratio of sequential execution time to parallel execution time. We can explain the following execution time components : • Inherently sequential computations : σ(n) • Pontentially parallel computations : ϕ(n) • Communication operations and other repeat computations : κ(n, p) Regarding Speedup, ψ(n) solves a problem of size n on p processors limited by ψ(n, p) ≤ σ(n) + ϕ(n) σ(n) + ϕ(n) p + κ(n, p) , (2) Finally, the eciency could be measured through processor utilization, ratio of speedup to number of pro- cessors used, consequently, eciency e n,p for problem of size n on p processors is given by ε(n, p) ≤ σ(n) + ϕ(n) p × (σ(n) + σ(n) p + κ(n, p)) , (3) 3. Laplacian Case In Feel++ document source, we list all samples. In Laplacian case, there are the following examples : laplacian with dirichlet condition weakly (problem in 2D), laplacian full, and laplacian lagrange multiplier. All codes are found in this directory : ~/Devel/feelpp/doc/manual/tutorial$ We analyzed the laplacian with dirichlet condition weakly, the laplacian.cpp. All kind of a vartiational for- mulation of a problem is also called weak formulation. The key item is to bring a new function and to integrate by parts. In the mathematical formulation example, we would like to solve this problem : Problem 1 : nd u such that −∆υ = inΩ = [−1; 1]2 (4) with = 2π2 g (5) and g is the exact solution g = sin (πx) cos (πy)(6) The following boundary conditions apply υ = g|x = ±1, ∂(υ) ∂(υ) = 0|y = ±1 (7) The alternative mathematical formulation handles the Dirichlet condition weakly and hence we have a uniform treatment for all types of boundary conditions. In another alternative formulation which allows to treat weakly Dirichlet boundary condition to Neumann and Robin conditions. Following a similar development as in the previous section, the problem reads Ω u. ν + |x=−1,x=1 ∂(υ) ∂(ν) ν − υ ∂(ν) ∂(n) + µ h υν = Ω f.ν + |x=−1,x=1 −g ∂(ν) ∂(n) + µ h gν (8)
3.1. Execution Options and Feel++ Implementation

In C++, the Laplacian example defines all the options available at execution time through the makeOptions function. This routine returns the list of options using the Boost.Program_options library, and the returned data is used as an argument of a Feel++ Application subclass. In our experiment, we focused on hsize (the characteristic mesh size), shape (the shape of the reference element, either hypercube or simplex), Dim (the dimension: 1D, 2D or 3D) and finally weakdir (set when the user wants to use the weak Dirichlet condition). First we define the options; the main lines of Feel++ and of the makeOptions function are:

#include <feel/feel.hpp>
/** use Feel namespace */
using namespace Feel;

inline po::options_description
makeOptions()
{
    po::options_description laplacianoptions( "Laplacian options" );
    laplacianoptions.add_options()
        ( "hsize", po::value<double>()->default_value( 0.5 ), "mesh size" )
        ( "shape", Feel::po::value<std::string>()->default_value( "hypercube" ), "shape of the domain (either simplex or hypercube)" )
        ( "nu", po::value<double>()->default_value( 1 ), "grad.grad coefficient" )
        ( "weakdir", po::value<int>()->default_value( 1 ), "use weak Dirichlet condition" )
        ( "penaldir", Feel::po::value<double>()->default_value( 10 ), "penalisation parameter for the weak boundary Dirichlet formulation" )
        ( "exact1D", po::value<std::string>()->default_value( "sin(2*Pi*x)" ), "exact 1D solution" )
        ( "exact2D", po::value<std::string>()->default_value( "sin(2*Pi*x)*cos(2*Pi*y)" ), "exact 2D solution" )
        ( "exact3D", po::value<std::string>()->default_value( "sin(2*Pi*x)*cos(2*Pi*y)*cos(2*Pi*z)" ), "exact 3D solution" )
        ( "rhs1D", po::value<std::string>()->default_value( "" ), "right hand side 1D" )
        ( "rhs2D", po::value<std::string>()->default_value( "" ), "right hand side 2D" )
        ( "rhs3D", po::value<std::string>()->default_value( "" ), "right hand side 3D" )
        ;
    return laplacianoptions.add( Feel::feel_options() );
}

Initially, the geometric dimension of the problem is set to Dim=2. The Laplacian class exposes the following public types:

• the numerical type (double);
• the type of the geometric entities composing the mesh, here Simplex of dimension Dim and order 1;
• the mesh type;
• the approximation function space type;
• the exporter factory type.

The Laplacian() constructor retrieves the mesh size and the shape as defined below:

Laplacian()
    :
    super(),
    meshSize( this->vm()["hsize"].template as<double>() ),
    shape( this->vm()["shape"].template as<std::string>() )
{}

The run() member function is parametrized by template<int Dim, int Order>:
template<int Dim, int Order>
void
Laplacian<Dim,Order>::run()
{
    LOG(INFO) << "------------------------------------------------\n";
    LOG(INFO) << "Execute Laplacian<" << Dim << ">\n";
    Environment::changeRepository( boost::format( "doc/manual/tutorial/%1%/%2%-%3%/P%4%/h_%5%/" )
                                   % this->about().appName()
                                   % shape
                                   % Dim
                                   % Order
                                   % meshSize );
    mesh_ptrtype mesh = createGMSHMesh( _mesh=new mesh_type,
                                        _desc=domain( _name=( boost::format( "%1%-%2%" ) % shape % Dim ).str(),
                                                      _usenames=true,
                                                      _shape=shape,
                                                      _h=meshSize,
                                                      _xmin=-1,
                                                      _ymin=-1 ) );
    ...
}

The parameter _h of the domain function sets the mesh characteristic size to M_meshsize, which is given for example on the command line (option hsize) through the Application framework. The createGMSHMesh function generates a mesh file (.msh) automatically from a description file (e.g. a .geo file) passed through the _desc parameter, and stores the generated mesh in the _mesh parameter allocated when calling the function.

To compile laplacian.cpp, we first invoke the compiler with make:

~/Devel/feel.opt$ make feelpp_doc_laplacian

To run the laplacian executable, we invoke the command line in the directory:

~/Devel/feel.opt/doc/manual/tutorial$

To run in parallel, we use the mpirun command line to launch the laplacian executable on N processors. The Laplacian was executed with 2 and with 4 processors, but we have not implemented a timing class in Feel++ to measure the CPU allocation time. One of the main limitations is therefore the time measurement during execution as the number of processors is scaled up (a simple MPI-based workaround is sketched after the table below). At this point of the experiment, we defined the execution parameters related to the mesh structure:

• hsize : mesh size;
• shape : hypercube or simplex shape;
• dim : selects the 2D or 3D model;
• nu : grad.grad coefficient, set to nu=1;
• weakdir : use the weak Dirichlet condition.

The MPI command line is

% mpirun -np 4 ./feelpp_doc_laplacian

With these parameters, the executed meshes follow the ordering shape, dim, model:

Shape      Dim   Model
Simplex    2D    Triangle
Simplex    3D    Tetrahedron
Hypercube  2D    Rectangle
Hypercube  3D    Cube
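As a stop-gap for the missing timing facility mentioned above, a small scoped timer such as the following sketch (plain C++/MPI, hypothetical, not part of Feel++) can be placed around the run() call to report the per-process CPU time and the maximum wall-clock time over all ranks for each processor count.

#include <mpi.h>
#include <chrono>
#include <ctime>
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical scoped timer: on destruction (reached collectively by all ranks),
// reports the CPU time of the calling process and the maximum wall time over ranks.
class ScopedTimer
{
public:
    explicit ScopedTimer( std::string name )
        : M_name( std::move( name ) ),
          M_wall0( std::chrono::steady_clock::now() ),
          M_cpu0( std::clock() )
    {}
    ~ScopedTimer()
    {
        double wall = std::chrono::duration<double>( std::chrono::steady_clock::now() - M_wall0 ).count();
        double cpu  = double( std::clock() - M_cpu0 ) / CLOCKS_PER_SEC;
        double wallMax = 0;
        int rank = 0, size = 1;
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        MPI_Comm_size( MPI_COMM_WORLD, &size );
        MPI_Reduce( &wall, &wallMax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD );
        if ( rank == 0 )
            std::printf( "[%s] np=%d  max wall=%.3fs  cpu(rank 0)=%.3fs\n",
                         M_name.c_str(), size, wallMax, cpu );
    }
private:
    std::string M_name;
    std::chrono::steady_clock::time_point M_wall0;
    std::clock_t M_cpu0;
};

// Usage (inside the application, after MPI has been initialised):
//     { ScopedTimer t( "laplacian.run" ); app.run(); }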
Figure 1. Paraview visualization of the cube in 2D.
Figure 2. Paraview visualization of the rectangle in 3D.

For all simulations we set the mesh size to 0.1. Below we list the command lines for these parameters, running with 4 processors:

% mpirun -np 4 ./feelpp_doc_laplacian --hsize=0.1 --shape=simplex --dim=2 --nu=1 --weakdir=1
% mpirun -np 4 ./feelpp_doc_laplacian --hsize=0.1 --shape=simplex --dim=3 --nu=1 --weakdir=1
% mpirun -np 4 ./feelpp_doc_laplacian --hsize=0.1 --shape=hypercube --dim=2 --nu=1 --weakdir=1
% mpirun -np 4 ./feelpp_doc_laplacian --hsize=0.1 --shape=hypercube --dim=3 --nu=1 --weakdir=1

After the simulations, note that Feel++ essentially supports the Gmsh mesh file format, but we used Paraview for visualization; Feel++ also provides classes to manipulate Gmsh .geo files and generate .msh files. All Gmsh and Paraview files generated by the executions can be found in the Feel output directory:

~/feel/doc/manual/tutorial/laplacian$ ls
hypercube-2  hypercube-3  simplex-2  simplex-3

To visualize the cube (2D) and rectangle (3D) with mesh size 0.1 and 4 processors, we invoked Paraview:

~/feel/doc/manual/tutorial/laplacian/hypercube-2/P2/h_0.1/np_4$ paraview laplacian-4.sos
~/feel/doc/manual/tutorial/laplacian/hypercube-3/P2/h_0.1/np_4$ paraview laplacian-4.sos

To visualize the triangle (2D) and tetrahedron (3D) with mesh size 0.1 and 4 processors, we invoked Paraview:

~/feel/doc/manual/tutorial/laplacian/simplex-2/P2/h_0.1/np_4$ paraview laplacian-4.sos
~/feel/doc/manual/tutorial/laplacian/simplex-3/P2/h_0.1/np_4$ paraview laplacian-4.sos
4. Introduction to Scalasca

In the context of numerical methods and algorithms for high-performance computing, the HPC community requires powerful and robust performance-analysis tools that make the optimization of parallel applications possible. Such tools are effective and efficient not only in helping to improve the scalability characteristics of scientific codes, but also in allowing experts to concentrate on the underlying science rather than spending a major fraction of their time tuning their application for a particular machine.

In this section, we introduce Scalasca, an open-source performance-analysis toolset developed at the Jülich Supercomputing Centre in cooperation with the University of Tennessee as the successor of KOJAK [14]. It is specifically designed for use on large-scale systems including IBM Blue Gene and Cray XT, as well as for small and medium-scale HPC platforms.

The Scalasca architecture [6] defines the basic analysis workflow. The layered model comprises the instrumentation, the measurement system, the trace analysis, the trace utilities and the report tools, and is divided into the following phases: preparation (insertion of probes), execution (event measurement collection), post-mortem processing (event summarization and analysis) and, finally, presentation. First, the target application must be instrumented, that is, probes that carry out the measurements must be inserted into the code. This can happen at different levels, including source code, object code or library. Before running the instrumented executable on the parallel machine, the user can choose between generating a runtime summary report or an event trace. Figure 5 shows Scalasca's performance-analysis workflow.

Scalasca relates to the following research topics in performance-analysis tools and techniques [1]:

• Profile analysis: summary of aggregated metrics (per function/call path, or per process/thread) and tools that can generate or present such profiles (event traces), e.g. gprof, mpiP, ompP, Scalasca, TAU, Vampir, ...
Figure 5. Performance-analysis workflow. When tracing is enabled, each process generates a trace file containing records for its process-local events. It is generally recommended to optimize the instrumentation based on a previously generated summary report. During the analysis, Scalasca searches for wait states and at the end delivers the result in the same structure as the summary report.

• Time-line analysis: visual representation of the space/time sequence of events; requires an execution trace, e.g. Vampir, Paraver, JumpShot, Intel TAC, Sun Studio, ...
• Pattern analysis: search for event sequences characteristic of inefficiencies; can be done manually (via a visual time line) or automatically (via KOJAK, Scalasca, Periscope, ...).

Among the related productivity tools, the following are important in the technology-integration life cycle:

• KCachegrind: call-graph-based cache simulation analysis.
• Marmot/MUST: MPI correctness checking.
• PAPI: interfacing to hardware performance counters.
• Periscope: automatic analysis via an on-line distributed search.
• SCALASCA: large-scale parallel performance analysis.
• TAU: integrated parallel performance system.
• Vampir/VampirTrace: event tracing and graphical trace visualization analysis.

In the next section, we explain the measurement and analysis configuration for the C test package run with MPI, with emphasis on the performance analysis.

4.1. Setup Experiment

The Scalasca measurement system that gets linked with the instrumented application executable can be configured via environment variables or configuration files to specify whether runtime summaries and/or event traces should be collected, along with optional hardware-counter metrics. In the introduction of Section 4 we presented the performance-analysis workflow based on the following steps: preparation, measurement, analysis, examination, optimization. The preparation step prepares the application by inserting extra code (probes). The measurement step collects the data relevant to the execution performance analysis. The analysis step computes the metrics and identifies performance problems. The examination step presents the results in an accessible form. Finally, the optimization step modifies the code to eliminate or reduce the performance problems.

The performance analysis follows these steps:

• 0 : reference preparation for validation;
• 1 : program instrumentation : skin;
• 2 : summary measurement collection and analysis : scan [-s];
• 3 : summary analysis report examination : square;
• 4 : summary experiment scoring : square -s;
• 5 : event trace collection and analysis : scan -t;
• 6 : event trace analysis report examination : square.

The samples are available in MPI, OpenMP and hybrid OpenMP/MPI variants. In our setup experiment, we show the ctest sample with the POMP/MPI execution. In this case, the Scalasca measurement system provides the configuration files (Makefile and Makefile.defs), which means that the ctest.c code is compiled with
the command scalasca -instrument. Prefixing the compile/link commands in the Makefile definitions (config/make.def) with the Scalasca instrumenter is essential. First, load the Scalasca module; then run scalasca without arguments for brief usage information, as shown below.

% module load scalasca
scalasca/1.4.2 loaded
~/scalasca-1.4.2$ scalasca
Scalasca 1.4.2 - Toolset for scalable performance analysis of large-scale parallel applications
usage: scalasca [-v][-n] {action}
 1. prepare application objects and executable for measurement:
    scalasca -instrument <compile-or-link-command>   # skin
 2. run application under control of measurement system:
    scalasca -analyze <application-launch-command>   # scan
 3. interactively explore measurement analysis report:
    scalasca -examine <experiment-archive|report>    # square
 -v: enable verbose commentary
 -n: show actions without taking them
 -h: show quick reference guide (only)

The Scalasca Makefile example provides the following PREP definitions, which explain how to set up the instrumentation of routines by the compiler:

• default instrumentation of routines by the compiler: PREP = scalasca -instrument;
• build without any instrumentation: PREP = scalasca -instrument -mode=none;
• manual EPIK instrumentation (only): PREP_EPIK = $(PREP) -user -comp=all;
• manual POMP instrumentation (only): PREP_POMP = $(PREP) -pomp -comp=all.

In our case, the PREP instrumentation is set in the Makefile (including Makefile.defs); by default, the PREP macro is not set and no instrumentation is performed for a regular production build. A PREP value must be specified either in the Makefile or on the make command line, as in:

% make PREP='scalasca -instrument'

Before the instrumentation step, we describe the configuration of the setup experiment. A few details must be checked.

To locate the Scalasca executable (installed here in ~/home/bin):

% echo $PATH

To configure the PATH:

% export PATH=$PATH:~/home/bin

To modify Makefile.defs (standard Linux GNU configuration):

PREFIX = /opt/scalasca  →  PREFIX = /home/bin
OPARI2DIR = $(PREFIX)

We must also modify MPILIB (-lmpich) by editing the MPI settings:

#MPILIB = -lmpich
MPILIB =
PMPILIB = -lpmpich
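To illustrate what the manual EPIK instrumentation mentioned above looks like in user code, here is a small sketch; the macro names (EPIK_USER_REG, EPIK_USER_START, EPIK_USER_END) and the epik_user.h header follow the Scalasca 1.x user guide [7], and the region itself is purely illustrative.

#include <mpi.h>
#include "epik_user.h"   // EPIK user-instrumentation macros; active when built with -DEPIK

int main( int argc, char** argv )
{
    MPI_Init( &argc, &argv );

    // Register a named region, then bracket the code of interest so that it
    // appears as a distinct call path in the Scalasca analysis report.
    EPIK_USER_REG( r_solve, "laplacian:solve" );
    EPIK_USER_START( r_solve );
    /* ... assemble and solve the linear system ... */
    EPIK_USER_END( r_solve );

    MPI_Finalize();
    return 0;
}

Such a region would then be built through the PREP_EPIK rule above (scalasca -instrument -user -comp=all), so that both the user region and the MPI calls are measured.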
This setup experiment uses the ctest sample. ctest is a simple toy C program that can be used to test the instrumentation and measurement features of the toolset; we use the ctest-pomp-mpi variant. The example is found in:

~/Documents/scalasca-1.4.2/example

Regarding the Scalasca toolset components, the ctest experiment involves the following combinations:

• program source (with compiler/instrumenter) = instrumented executable;
• application + measurement library = trace analysis;
• application + measurement library = summary analysis.

The Makefile provides the command line of the Scalasca instrumenter, scalasca -instrument. We used the following performance-analysis steps:

Scalasca instrumenter = prepare application objects and executable for measurement:
% scalasca -instrument <compile-or-link-command>
Scalasca measurement collector and analyzer = run application under control of the measurement system:
% scalasca -analyze <application-launch-command>
Scalasca analysis report examiner = post-process and explore the measurement analysis report:
% scalasca -examine <experiment-archive|report>

In the example directory, two ctest variants are listed: ctest-pomp-mpi and ctest-epik-mpi. The versions of these files that include manual instrumentation using the EPIK API or POMP directives carry -epik and -pomp in their names. For ctest-pomp-mpi, we prepared the command which invokes the compiler with make, for example:

~/scalasca-1.4.2/example$ make ctest-pomp-mpi
scalasca -instrument -pomp -comp=all mpicc -m32 ctest-pomp.c -o ctest-pomp-mpi
INFO: Instrumented executable for MPI measurement

Then we prepend the command which runs the application executable with scalasca -analyze, here with 4 threads and 4 processes:

~/scalasca-1.4.2/example$ OMP_NUM_THREADS=4 scalasca -analyze -t mpiexec -np 4 ctest-pomp-mpi
S=C=A=N: Scalasca 1.4.2 trace collection and analysis
S=C=A=N: ./epik_ctest-pomp-mpi_4_trace experiment archive
S=C=A=N: Wed Jan 9 17:15:45 2013: Collect start
/usr/bin/mpiexec -np 4 ctest-pomp-mpi
Max. memory usage : 5.715MB
Total processing time : 0.044s
S=C=A=N: Wed Jan 9 17:15:50 2013: Analyze done (status=0) 2s
Warning: 0.035MB of analyzed trace data retained in ./epik_ctest-pomp-mpi_4_trace/ELG!
S=C=A=N: ./epik_ctest-pomp-mpi_4_trace complete.

Running the ctest-pomp-mpi executable creates the measurement archive directory, which contains the scout.cube file:

cd epik_ctest-pomp-mpi_4_trace

The measurement archive directory ultimately contains a copy of the execution output (epik.log), together with:

• a record of the measurement configuration (epik.conf);
• the basic analysis report that was collated after measurement (epitome.cube);
• the complete analysis report produced during post-processing (summary.cube.gz).

To visualize the results, we execute:

~/scalasca-1.4.2/example/epik_ctest-pomp-mpi_4_trace$ cube3 scout.cube
Figure 6. Analysis report presentation.

CUBE, in the version shipped with Scalasca 1.4.2, is a parallel program analysis report exploration tool that provides libraries for reading and writing the XML report format, algebra utilities for report processing, and a GUI for interactive analysis exploration (requiring Qt4). It represents the values (severity matrix) along three hierarchical axes:

• performance property (metric);
• call-tree path (program location);
• system location (process/thread).

Figure 6 shows the visualization of the analysis presentation and exploration. The metric tree provides the main information about time (CPU allocation time, including time allocated to idle threads) and visits. The time result is 0.12 with 1716 visits; all values are in nanoseconds. In the call tree, we observed a bug in the generated trace when analysing the MPI communication. The system tree presents the Linux cluster, the machine used for the execution (@marie) with 4 processes. Finally, we execute the command scalasca -examine:

~/scalasca-1.4.2/example$ scalasca -examine epik_ctest-pomp-mpi_4_trace
Archive ./epik_ctest-pomp-mpi_4_trace
INFO: Post-processing runtime summarization report...
INFO: Post-processing trace analysis report...
INFO: Displaying ./epik_ctest-pomp-mpi_4_trace/trace.cube.gz...

The scalasca -examine command displays the file trace.cube.gz, whose preview visualization is shown with CUBE; the trace report is shown in Figure 7.

In the last example, we used the ctest.c program to test the manual instrumentation with the EPIK API [7] and the measurement features of the toolset. Scalasca provides several possibilities to instrument user application code: besides the automatic compiler-based instrumentation, it provides manual instrumentation using the EPIK API. For MPI library calls, the instrumentation is accomplished using the standard MPI profiling interface (PMPI); to enable it, the application program has to be linked against the EPIK measurement library plus MPI-specific libraries. We restart by executing the command make ctest-epik-mpi to perform the instrumentation:

~/scalasca-1.4.2/example$ make ctest-epik-mpi
mpicc -m32 -DEPIK `kconfig --32 --cflags` ctest-epik.c -o ctest-epik-mpi `kconfig --mpi --32 --libs`

Now we run the application executable with scalasca -analyze:

~/Documents/jussara/scalasca-1.4.2/example$ scalasca -analyze -t mpiexec -np 4 ctest-epik-mpi
Figure 7. Examination Trace Report.

S=C=A=N: Scalasca 1.4.2 trace collection and analysis
S=C=A=N: ./epik_ctest-epik-mpi_4_trace experiment archive
S=C=A=N: Fri Jan 11 10:10:35 2013: Collect start
/usr/bin/mpiexec -np 4 ctest-epik-mpi
...
Max. memory usage : 5.711MB
Total processing time : 0.585s
S=C=A=N: Fri Jan 11 10:10:41 2013: Analyze done (status=0) 2s
Warning: 0.020MB of analyzed trace data retained in ./epik_ctest-epik-mpi_4_trace/ELG!
S=C=A=N: ./epik_ctest-epik-mpi_4_trace complete.

In the last step, we execute the command scalasca -examine:

~/Documents/jussara/scalasca-1.4.2/example$ scalasca -examine epik_ctest-epik-mpi_4_trace/
Archive ./epik_ctest-epik-mpi_4_trace
INFO: Post-processing runtime summarization report...
INFO: Post-processing trace analysis report...
INFO: Displaying ./epik_ctest-epik-mpi_4_trace/trace.cube.gz...

Figure 8 presents the post-processing runtime summarization report. Finally, we examine the first window of the examination report, the metric tree of the system, where we analyzed the following topics: execution time, synchronizations, communications, bytes transferred and computational imbalance.

• Time : total CPU allocation time.
• Visits : number of visits.
• Synchronizations : number of point-to-point/collective synchronization operations.
• Communications : number of point-to-point communication operations.
• Bytes transferred : number of bytes transferred in communication operations.
• Computational imbalance : computational load imbalance heuristic (overload and underload).
Figure 8. Examination Trace Report. The metric tree shows 0.19% CPU time, 1164 visits, 48 collective synchronization operations, 30 point-to-point sends and 30 point-to-point receives.

4.2. Laplacian Instrumentation

5. Conclusion

References

[1] Brian J. N. Wylie. Scalable performance analysis of large-scale parallel applications. http://www2.fz-juelich.de. [Online; accessed 8-January-2013].
[2] Markus Geimer and Brian Wylie. Scalable performance analysis of large-scale parallel applications. www.training.prace-ri.eu/uploads/tx.../Scalasca_Overview_01.pdf. [Online; accessed 9-January-2013].
[3] Markus Geimer and Brian Wylie. Tutorial Exercise NPB-MZ-MPI/BT - PRACE training. www.training.prace-ri.eu/uploads/tx.../MPIBTExercise_BG.pdf. [Online; accessed 9-January-2013].
[4] Philippe G. Ciarlet. The Finite Element Method for Elliptic Problems. SIAM, 1st edition, 2002.
[5] Markus Geimer. Introduction to Scalasca. http://www.linksceem.eu/ls2/images/stories/Introduction_to_Scalasca.pdf. [Online; accessed 9-January-2013].
[6] Markus Geimer, Felix Wolf, Brian J. N. Wylie, Erika Ábrahám, Daniel Becker, and Bernd Mohr. The Scalasca performance toolset architecture. Concurrency and Computation: Practice and Experience, 22(6):702-719, April 2010.
[7] Scalasca User Guide. Scalable Automatic Performance Analysis. http://www2.fz-juelich.de/jsc/datapool/scalasca/UserGuide.pdf. [Online; accessed 9-January-2013].
[8] Parry Husbands, Costin Iancu, and Katherine Yelick. A performance analysis of the Berkeley UPC compiler. In Proceedings of the 17th Annual International Conference on Supercomputing, ICS '03, pages 63-73, New York, NY, USA, 2003. ACM.
[9] J. Issa and S. Figueira. Graphics performance analysis using Amdahl's law. In Performance Evaluation of Computer and Telecommunication Systems (SPECTS), 2010 International Symposium on, pages 127-132, July 2010.
[10] MPI. The Message Passing Interface (MPI) standard. http://www-unix.mcs.anl.gov/mpi. [Online; accessed 7-January-2013].
[11] OpenMP. OpenMP: Simple, Portable, Scalable SMP Programming. http://www.openmp.org. [Online; accessed 7-January-2013].
[12] Christophe Prud'homme, Vincent Chabannes, Vincent Doyeux, Mourad Ismail, Abdoulaye Samake, and Gonçalo Pena. Feel++: A Computational Framework for Galerkin Methods and Advanced Numerical Methods. ESAIM Proceedings, page 27, December 2012. ISLE/CHPID.
[13] Christophe Prud'homme, Vincent Chabannes, Vincent Doyeux, Mourad Ismail, Abdoulaye Samake, Gonçalo Pena, Cécile Daversin, and Christophe Trophime. Advances in Feel++: a domain specific embedded language in C++ for partial differential equations. In Eccomas'12 - European Congress on Computational Methods in Applied Sciences and Engineering, Vienna, Austria, September 2012. MS404-1 Automation of computational modeling by advanced software tools and techniques, FRAE/RB4FASTSIM, ISLE/CHPID.
[14] F. Wolf and B. Mohr. Automatic performance analysis of hybrid MPI/OpenMP applications. In Parallel, Distributed and Network-Based Processing, 2003. Proceedings. Eleventh Euromicro Conference on, pages 13-22, February 2003.