Presentation of the open source CFD code Code_Saturne

HPC and CFD at
EDF with
Code_Saturne
Yvan Fournier, Jérôme Bonelle

EDF R&D
Fluid Dynamics, Power Generation and Environment Department

Open Source CFD International Conference Barcelona 2009

Summary

1. General Elements on Code_Saturne
2. Real-world performance of Code_Saturne
3. Example applications: fuel assemblies
4. Parallel implementation of Code_Saturne
5. Ongoing work and future directions

2 Open Source CFD International Conference 2009

General elements on
Code_Saturne


Code_Saturne: main capabilities
Physical modelling
Single-phase laminar and turbulent flows: k-ε, k-ω SST, v2f, RSM, LES
Radiative heat transfer (DOM, P-1)
Combustion coal, gas, heavy fuel oil (EBU, pdf, LWP)
Electric arc and Joule effect
Lagrangian module for dispersed particle tracking
Atmospheric flows (aka Mercure_Saturne)
Specific engineering module for cooling towers
ALE method for deformable meshes
Conjugate heat transfer (SYRTHES & 1D)
Common structure with NEPTUNE_CFD for eulerian multiphase flows

Flexibility
Portability (UNIX, Linux and MacOS X)
Standalone GUI and integrated in SALOME platform
Parallel on distributed memory machines
Periodic boundaries (parallel, arbitrary interfaces)
Wide range of unstructured meshes with arbitrary interfaces
Code coupling capabilities (Code_Saturne/Code_Saturne, Code_Saturne/Code_Aster, ...)


Code_Saturne: general features
Technology
Co-located finite volume, arbitrary unstructured meshes (polyhdral cells), predictor-corrector method
500 000 lines of code, 50% FORTRAN 90, 40% C, 10% Python
Development
1998: Prototype (long time EDF in-house experience, ESTET-ASTRID, N3S, ...)
2000: version 1.0 (basic modelling, wide range of meshes)
2001: Qualification for single phase nuclear thermal-hydraulic applications
2004: Version 1.1 (complex physics, LES, parallel computing)
2006: Version 1.2 (state of the art turbulence models, GUI)
2008: Version 1.3 (massively parallel, ALE, code coupling, ...)
Released as open source (GPL licence)
2008: Dvpt version 1.4 (parallel IO, multigrid, atmospheric, cooling towers, ...)
2009: Dvp version 2.0-beta (parallel mesh joining, code coupling, easy install & packaging, extended
GUI)
schedule for industrial release beginning of 2010

Code_Saturne developed under Quality Assurance


External libraries (EDF, LGPL):
• BFT: Base Functions and Types
Code_Saturne subsystems • FVM: Finite Volume Mesh
• MEI: Mathematical Expression Interpreter

Code_Saturne BFT library

Meshes Preprocessor
Run-time environment
mesh import Memory logging
mesh joining
periodicity
domain partitioning
FVM code coupling
Restart library
files parallel treatment
Parallel Kernel parallel mesh Code_Saturne
management
SYRTHES
Parallel mesh setup Code_Aster
CFD Solver SALOME platform
...
MEI library
Mathematical Post-
Expression Xml data file
processing
Interpreter
GUI output


Code_Saturne environment
Graphical User Interface
setting up of calculation parameters
parameter stored in Xml file
interactive launch of calculations
some specific physics not yet covered by GUI
advanced setting up by Fortran user routines

Integration in the SALOME platform
extension of GUI capabilities
mouse selection of boundary zones
advanced user files management
from CAD to post-processing in one tool


Allowable mesh examples

Example of mesh with
stretched cells and hanging nodes

PWR lower
Example of composite mesh
plenum

3D
polyhedral cells


Joining of non-conforming meshes
Arbitrary interfaces
Meshes may be contained in one single file or in several separate files, in any order
Arbitrary interfaces can be selected by mesh references
Caution must be exercised if arbitrary interfaces are used:
in critical regions, or with LES
with very different mesh refinements, or on curved CAD surfaces
Often used in ways detrimental to mesh quality,but a functionality we can not do without
as long as we do not have a proven alternative.
Joining of meshes built in several pieces may also be used to circumvent meshing tool memory limitations.
Periodicity is also constructed as an extension of mesh joining.


Real-world performance
of Code_Saturne


Code_Saturne Features of note to HPC

Segregated solver
All variables are solved or independently, coupling terms are explicit
Diagonal-preconditioned CG used for pressure equation, Jacobi (or bi-CGstab) used for other variables
More important, matrices have no block structure, and are very sparse
Typically 7 non-zeroes per row for hexahedra, 5 for tetrahedra
Indirect addressing + no dense blocs means less opportunities for MatVec optimization, as
memory bandwidth is as important as peak flops.

Linear equation solvers usually amount to 80% of CPU cost
(dominated by pressure), gradient reconstruction about 20%
The larger the mesh, the higher the relative cost of the pressure step


Current performance (1/3)

2 LES test cases (most I/O factored out)
1 M cells: (n_cells_min + n_cells_max)/2 = 880 at 1024 cores, 109 at
8192
10 M cells (n_cells_min + n_cells_max)/2 = 9345 at 1024 cores, 1150
at 8192
FATHER
HYP P I
1 M hexahedra LES tes t cas e
10 M h e xa h e d ra LES te st c a se
100000 500000

10000 50000
O pteron + infiniband
O pteron + Myrinet Opte ron + Myrine t
NovaS cale Opte ron + infiniba nd
Ela ps e d time

Blue G ene/L Nova S c a le

Elapsed tile
Blue Ge ne /L

1000 5000

100 500
1 10 100 1000 10000 1 10 100 1000 10000
n c ore s
n cores



RANS, 100 M tetrahedra + polyhedra (most I/O factored out)
Polyhedra due to mesh joinings may lead to higher load imbalance in
local MatVec for large core counts
96286/102242 min/max cells/core at 1024 cores
11344/12781 min/max cells cores at 8192 cores

FA Grid
RANS te s t ca s e
10000
Ela pse d tim e pe r ite ra tion

1000
Nov a S c a le
Blue Ge ne /L (C O)
Blue Ge ne /L (VN)

100

10
100 1000 10000
n c o re s



Efficiency often goes through an optimum (due du better cache hit
rates) before dropping (due to latency induced by parallel
synchronization)
Example shown here: HYPI (10 M cell LES test case)

1,6

1,4

1,2

1
Efficacité

0,8

0,6

0,4

0,2

0
1 10 100 1000 10000
Number of MPI ranks

Cluster Chatou Tantale Platine BlueGene


High Performance Computing with Code_Saturne

Code_Saturne used extensively on HPC machines
in-house EDF clusters
CCRT calculation centre (CEA based)
EDF IBM BlueGene machines (8 000 and 32 000 cores)
Run also on MareNostrum (Barcelona Computing Centre),
Cray XT, …

Code_Saturne used as reference in PRACE European project
reference code for CFD benchmarks on 6 large european HPC centres
Code_Saturne obtained “gold medal” status in scalability by Daresbury Laboratory (UK,
HPCx) machine)


Example HPC
applications:
fuel assemblies


Fuel Assembly Studies

Conflicting design goals
Good thermal mixing properties, requiring tubulent flow
Limit head loss
Limit vibrations
Fuel rods held by dimples and springs, and not welded,
as they lengthen slightly over the years due to irradiation

Complex core geometry
Circa 150 to 250 fuel assemblies per core depending
on reactor type, 8 to 10 grids per fuel assembly,
17x17 grid (mostly fuel rods, 24 guide tubes)
Geometry almost periodic, except for mix of several fuel assembly types in a given core (reload by 1/3
or 1/4)
Inlet an wall conditions not periodic, heat production not uniform at fine scale

Why we study these flows
Deformation may lead to difficulties in core unload/reload
Turbulent-induced vibrations of fuel assemblies in PWR power plants is a potential cause of
deformation and of fretting wear damage
These may lead to weeks or months of interruption of operations


Prototype FA calculation with Code_Saturne

PWR nuclear reactor mixing grid mock-up (5x5)
100 million cells
calculation run on 4 000
to 8 000 cores
Main issue is mesh generation


LES simulation of reduced FA domain

Particular features for LES
SIMPLEC algorithm with Rhie and Chow
interpolation
2nd order in time (Crank-Nicolson and Adams-
Bashforth)
2nd order in space (fully centered and sub-
iterations for non-orthogonal faces)
Fully hexahedral mesh, 8 million cells
Boundary Conditions
Implicit periodicity in x and y directions
Constant inlet conditions
Wall function when needed
Free outlet
Simulation
1 million time-steps: 40 flow passes, 20 flow
passes for averaging (no homogeneous
direction)
CFLmax= 0.8 (dt=5.10-6s)
BlueGene/L system, 1024 processors
Per time-step: 5s
For 100 000 time-steps: 1week


Parallel implementation
of Code_Saturne


Base parallel operations (1/4)
Distributed memory parallelism using domain partitioning
Use classical “ghost cell” method for both parallelism and periodicity
Most operations require only ghost cells sharing faces
Extended neighborhoods for gradients also require ghost cells sharing vertices

Global reductions (dot products) are also used, especially by the preconditioned
conjugate gradient algorithm

Periodicity uses the same mechanism
Vector and tensor
rotation also required


Base parallel operations (/)

Use of global numbering
We associate a global number to each mesh entity
A specific C type (fvm_gnum_t) is used for this. Currently an unsigned integer
(usually 32-bit), but an unsigned long integer (64-bit) will be necessary
Face-cell connectivity for hexahedral cells : size 4.n_faces, and
n_faces about 3.n_cells, → size around 12.n_cells, so numbers
requiring 64 bit around 350 million cells.
Currently equal to the initial (pre-partitioning) number
Allows for partition-independent single-image files
Essential for restart files, also used for postprocessor output
Also used for legacy coupling where matches can be saved



Use of global numbering
Redistribution on n blocks
n blocks ≤ n cores
Minimum block size may be set to avoid many small
blocks (for some communication or usage schemes),
or to force 1 block (for I/O with non-parallel libraries)
In the future, using at most 1 of every p
processors may improve MPI/IO performance if
we use a smaller communicator (to be tested)



Conversely, simply using global numbers allows reconstructing neighbor partition
entity equivalents mapping
Used for parallel ghost cell construction from
initially partitioned mesh with no ghost data
Arbitrary distribution, inefficient for halo
exchange, but allow for simpler data
structure related algorithms with
deterministic performance bounds
Owning processor determined simply by
global number, messages are aggregated


Parallel IO (1/2)

We prefer using single (partition independent) files
Easily run different stages or restarts of a calculation on different machines or queues
Avoids having thousands or tens of thousands of files in a directory
Better transparency of parallelism for the user

Use MPI I/O when available
Uses block to partition exchange when reading, partition to block when writing
Use of indexed datatypes may be tested in the future, but will not be possible everywhere
Used for reading of preprocessor and partitioner output, as well as for restart
files
These files use a unified binary format, consisting of an simple header an a succession of
sections
MPI IO pattern is thus a succession of global reads (or local read + broadcast) for
section headers and collective reading of data (with a different portion for each rank)
We could switch to HDF5 but preferred a lighter model and also avoid an extra
dependency or dependency conflicts
Infrastructure in progress for postprocessor output
Layered approach as we allow for multiple formats


Parallel IO (2/2)

Parallel I/O only of benefit with parallel filesystems
Use of MPI IO may be disabled either at build time, or for a given file using
specific hints
Without MPI IO, data for each block is written or read successively by rank 0,
using the same FVM file API

Not much feedback yet, but initial results
dissapointing
Similar performance with and without MPI IO on at least 2 systems
Whether using MPI_File_read/write_at_all or MPI_File_read/write_all
Need to retest this forcing less processors in the MPI IO communicator
Bugs encountered in several MPI/IO implementations

Ongoing work and
future directions


Parallelization of mesh joining (2008-2009)

Parallelizing this algorithm requires the same main steps as the serial
algorithm:
Detect intersections (within a given tolerance) between edges of overlapping faces
Uses parallel octree for face bounding boxes, built in a bottom-up fashion (no balance
condition required)
Subdivide edges according to inserted intersection vertices
Merge coincident or nearly-coincident vertices/intersections
This is the most complex
Must be synchronized in parallel
Choice of merging criteria has a profound impact on the quality of the resulting mesh
Re-build sub-faces

With parallel mesh joining, the most memory-intensive serial
preprocessing step is removed
We will add parallel mesh « append » within a few months (for version 2.1);
this will allow generation of huge meshes even with serial meshing tools


Coupling of Code_Saturne with itself
Objective
coupling of different models (RANS/LES)
fluid-structure interaction with large displacements
rotating machines
Two kinds of communications
data exchange at boundaries for interface coupling
volume forcing for overlapping regions
Still under development, but ...
data exchange already implemented in FVM library
optimised localisation algorithm
compliance with parallel/parallel coupling
prototype versions with promising results
more work needed on conservativity at the exchange
first version adapted to pump modelling implemented in version 2.0
rotor/stator coupling
compares favourably with CFX


Multigrid

Currently, multigrid coarsening does not cross processor
boundaries
This implies that on p processors, the coarsest matrix may not contain less
than p cells
With a high processor count, less grid levels will be used, and solving for the
coarsest matrix may be significantly more expensive than with a low processor
count
This reduces scalability, and may be checked (if suspected) using the solver summary info at
the end of the log file

Planned solution: move grids to nearest rank multiple of 4 or
8 when mean local grid size is too small
The communication pattern is not expected to change too much, as partitioning
is of a recursive nature, and should already exhibit a “multigrid” nature
This may be less optimal than repartitioning at each level, but setup time
should also remain much cheaper
Important, as grids may be rebuilt each time step

Partitioning
We currently use METIS or SCOTCH, but should move to
ParMETIS or Pt-SCOTCH within a few months
The current infrastructure makes this quite easy
We have recently added a « backup » partitioning based on
space-filling curves
We currently use the Z curve (from our octree construction for parallel joining), but the
appropriate changes in the coordinate comparison rules should allow switching to a
Hilbert curve (reputed to lead to better partitioning)
This is fully parallel and deterministic
Performance on initial tests is
about 20% worse on a single
10-million cell case on 256 processes
reasonable compared to
unoptimized partitioning


Tool chain evolution
Code_Saturne V1.3 (current production version) added many
HPC-oriented improvements compared to prior versions:
Post-processor output handled by FVM / Kernel
Ghost cell construction handled by FVM / Kernel
Up to 40% gain in preprocessor memory peak compared to V1.2
Parallelized and scales (manages 2 ghost cell sets and multiple periodicities)
Well adapted up to 150 million cells (with 64 Gb for preprocessing)
All fundamental limitations are pre-processing related

Pre-Processor: Kernel + FVM: Post-
Meshes processing
serial run distributed run
output

Version 2.0 separates partitioning from preprocessing
Also reduces their memory footprint a bit, moving newly parallelized operations
to the kernel

Pre-Processor: Partitioner: Kernel + FVM: Post-
Meshes processing
serial run serial run distributed run output


Future direction: Hybrid MPI / OpenMP (1/2)

Currently, a pure MPI model is used:
Everything is parallel, synchronization is explicit when required
On multiprocessor / multicore nodes, shared memory
parallelism could also be used (using OpenMP directives)
Parallel sections must be marked, and parallel loops must avoid
modifying the same values
Specific numberings must be used, similar to those used for
vectorization, but with different constraints:
Avoid false sharing, keep locality to limit cache misses


Future direction: Hybrid MPI / OpenMP (2/2)

Hybrid MPI / OpenMP is being tested
IBM is testing this on Blue Gene/P
Requires work on renumbering algorithms
OpenMP parallelism would ease of packaging / installation on workstations
No dependencies on source but not binary-compatible MPI library choices,
only on the compiler runtime
Good enough for current multicore workstations
Coupling the code with itself or with with SYRTHES 4 will still require MPI
The main goal is to allow MPI communicators of “only”
10000’s of ranks on machines with 100000 cores
Performance benefits expected mainly at the very high end
Reduce risk of medium-term issues with MPI_Alltoallv used in I/O and parallelism-related
data redistribution
Though sparse collective algorithms is the long term solution for this specific issue


Code_Saturne HPC roadmap
2003 2006 2007 2010 2015
Consecutive to the Civaux The whole vessel
9 fuel assemblies
thermal fatigue event reactor

No experimental approach up
Computations enable to better to now
understand the wall thermal
loading in an injection. Will enable the study of side
effects implied by the flow
Knowing the root causes of the Computation with an around neighbour fuel
event ⇒ define a new design to L.E.S. approach for assemblies.
avoid this problem. turbulent modelling
Part of a fuel assembly Better understanding of
Refined mesh near the vibration phenomena and
3 grid assemblies
wall. wear-out of the rods.

106 cells 107 cells 108 cells 109 cells 1010 cells
3.1013 operations 6.1014 operations 1016 operations 3.1017 operations 5.1018 operations

Fujistu VPP 5000 Cluster, IBM Power5 IBM Blue Gene/L « Frontier » 30 times the power of 500 times the power of
1 of 4 vector processors 400 processors 8000 processors IBM Blue Gene/L « Frontier » IBM Blue Gene/L « Frontier »
2 month length computation 9 days # 1 month # 1 month # 1 month

# 1 Gb of storage # 15 Gb of storage # 200 Gb of storage # 1 Tb of storage # 10 Tb of storage
2 Gb of memory 25 Gb of memory 250 Gb of memory 2,5 Tb of memory 25 Tb of memory

Power of the computer Pre-processing not parallelized Pre-processing not parallelized … ibid. … … ibid. …
Mesh generation … ibid. … … ibid. …
Scalability / Solver … ibid. …
Visualisation


Thank you for your attention!


Additional Notes


Load imbalance (1/3)

In this example, using 8 partitions (with METIS), we
have the following local minima and maxima:
Cells:
416 / 440 (6% imbalance)
Cells + ghost cells:
469/519 (11% imbalance)
Interior faces:
852/946 (11% imbalance)
Most loops are on cells,
but some are on cells + ghosts,
and MatVec is in cells + faces



If load imbalance increases with processor count,
scalability decreases

If load imbalance reaches a high value (say 30% to
50%) but does not increase, scalability is maintained,
though some processor power is wasted
Perfect balancing is impossible to reach, as different loops show
different imbalance levels, an synchronizations may be required
between these loops
GCP uses MatVec and dot products
Load imbalance might be reduced using weights for domain
partitioning, with Cell weight = 1 + f(n_faces)



Another possible source of load imbalance is different
cache miss rates on different ranks
Difficult to estimate a priori
With otherwise balanced loops, if a processor has a cache miss every
300 instructions, and another a cache miss every 400 instructions,
considering that the cost of a cache miss is at least 100 instructions,
the corresponding imbalance reaches 20%


Presentation of the open source CFD code Code_Saturne

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (18)

Similaire à Presentation of the open source CFD code Code_Saturne

Similaire à Presentation of the open source CFD code Code_Saturne (20)

Dernier

Dernier (20)

Presentation of the open source CFD code Code_Saturne