This talk is a general discussion of the current state of Open MPI, and a deep dive on two new features:
1. The flexible process affinity system (I presented many of these slides at the Madrid EuroMPI'13 conference in September 2013).
2. The MPI-3 "MPI_T" tools interface.
I originally gave this talk at Lawrence Berkeley Labs on Thursday, November 7, 2013.
6. Fun stats
• ohloh.net says:
§ 819,741 lines of code
§ Average 10-20 committers at a time
§ “Well-commented source code”
• I rank in the top 25 ohloh stats for:
§ C
§ Automake
§ Shell script
§ Fortran (…ouch)
7. Current status
• Version 1.6.5 / stable series
§ Unlikely to see another release
• Version 1.7.3 / feature series
§ v1.7.4 due (hopefully) by end of 2013
§ Plan to transition to v1.8 in Q1 2014
8. MPI conformance
• MPI-2.2 conformant as of v1.7.3
§ Finally finished several 2.2 issues that no one really cares about
• MPI-3 conformance just missing the new RMA
§ Tracked on wiki:
https://svn.open-mpi.org/trac/ompi/wiki/MPIConformance
§ Hope to be done by v1.7.4
9. New MPI-3 features
• Mo’ betta Fortran bindings
§ You should “use mpi_f08”. Really.
• Matched probe
• Sparse and neighborhood collectives
• “MPI_T” tools interface
• Nonblocking communicator duplication
• Noncollective communicator creation
• Hindexed block datatype
10. New Open MPI features
• Better support for more runtime systems
§ PMI2 scalability, etc.
• New generalized processor affinity system
• Better CUDA support
• Java MPI bindings (!)
• Transports:
§ Cisco usNIC support
§ Mellanox MXM2 and hcoll support
§ Portals 4 support
11. My new favorite random feature
• mpirun CLI option <tab> completion
§ Bash and zsh
§ Contributed by Nathan Hjelm, LANL
shell$ mpirun --mca btl_usnic_<tab>
btl_usnic_cq_num             Number of completion queue…
btl_usnic_eager_limit        Eager send limit (0 = use…
btl_usnic_if_exclude         Comma-delimited list of de…
btl_usnic_if_include         Comma-delimited list of de…
btl_usnic_max_btls           Maximum number of usNICs t…
btl_usnic_mpool              Name of the memory pool to…
btl_usnic_prio_rd_num        Number of pre-posted prior…
btl_usnic_prio_sd_num        Maximum priority send desc…
btl_usnic_priority_limit     Max size of "priority" mes…
btl_usnic_rd_num             Number of pre-posted recei…
btl_usnic_retrans_timeout    Number of microseconds bef…
btl_usnic_rndv_eager_limit   Eager rendezvous limit (0 …
btl_usnic_sd_num             Maximum send descriptors t…
12. Two features to discuss in detail…
1. “MPI_T” interface
2. Flexible process affinity system
14. MPI_T interface
• Added in MPI-3.0
• So-called “MPI_T” because all the functions start with that prefix
§ T = tools
• APIs to get/set MPI implementation values
§ Control variables (e.g., implementation tunables)
§ Performance variables (e.g., run-time stats)
15. MPI_T control variables (“cvar”)
• Another interface to MCA param values
• In addition to existing methods:
§ mpirun CLI options
§ Environment variables
§ Config file(s)
• Allows tools / applications to programmatically list all OMPI MCA params
16. MPI_T cvar example
• MPI_T_cvar_get_num()
§ Returns the number of control variables
• MPI_T_cvar_get_info(index, …) returns:
§ String name and description
§ Verbosity level (see next slide)
§ Type of the variable (integer, double, etc.)
§ Type of MPI object (communicator, etc.)
§ “Writability” scope
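The two calls above can be sketched in C. This is a minimal sketch, assuming an MPI-3 implementation; the buffer sizes are arbitrary choices (not MPI requirements) and error checking is omitted:

```c
/* Enumerate all MPI_T control variables and print their names and
 * descriptions.  MPI_T has its own init/finalize, separate from
 * MPI_Init/MPI_Finalize, and may be used before MPI_Init. */
#include <stdio.h>
#include <mpi.h>

int main(void)
{
    int provided, num_cvars;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    MPI_T_cvar_get_num(&num_cvars);
    printf("Implementation exports %d control variables\n", num_cvars);

    for (int i = 0; i < num_cvars; ++i) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, bind, scope;
        MPI_Datatype datatype;
        MPI_T_enum enumtype;

        /* name_len / desc_len are in-out: buffer size in, length out */
        MPI_T_cvar_get_info(i, name, &name_len, &verbosity,
                            &datatype, &enumtype,
                            desc, &desc_len, &bind, &scope);
        printf("cvar %d: %s -- %s\n", i, name, desc);
    }

    MPI_T_finalize();
    return 0;
}
```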
17. Verbosity levels
Level name      Level description
USER_BASIC      Basic information of interest to users
USER_DETAIL     Detailed information of interest to users
USER_ALL        All remaining information of interest to users
TUNER_BASIC     Basic information of interest for tuning
TUNER_DETAIL    Detailed information of interest for tuning
TUNER_ALL       All remaining information of interest for tuning
MPIDEV_BASIC    Basic information for MPI implementers
MPIDEV_DETAIL   Detailed information for MPI implementers
MPIDEV_ALL      All remaining information for MPI implementers
18. Open MPI interpretation of verbosity levels
• Audience:
1. User: parameters required for correctness; as few as possible
2. Tuner: tweak MPI performance; resource levels, etc.
3. MPI developer: for Open MPI devs
• Detail:
1. Basic: even for less-advanced users and tuners
2. Detailed: useful, but you won’t need to change them often
3. All: anything else
19. “Writability” scope
Level name   Level description
CONSTANT     Read-only, constant value
READONLY     Read-only, but the value may change
LOCAL        Writing is a local operation
GROUP        Writing must be done as a group, and all values must be consistent
GROUP_EQ     Writing must be done as a group, and all values must be exactly the same
ALL          Writing must be done by all processes, and all values must be consistent
ALL_EQ       Writing must be done by all processes, and all values must be exactly the same
20. Reading / writing a cvar
• MPI_T_cvar_handle_alloc(index, handle, …)
§ Allocates an MPI_T handle
§ Binds it to a specific MPI handle (e.g., a communicator), or BIND_NO_OBJECT
• MPI_T_cvar_read(handle, buf)
• MPI_T_cvar_write(handle, buf)
→ OMPI has very, very few writable control variables after MPI_INIT
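A hedged sketch of the read path: look up a cvar by name, bind a handle, and read its value. The cvar name `mpi_show_handle_leaks` is an Open MPI MCA parameter used purely for illustration; error checking is omitted:

```c
/* Find one control variable by name and read its current value. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(void)
{
    int provided, num_cvars;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_cvar_get_num(&num_cvars);

    for (int i = 0; i < num_cvars; ++i) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, bind, scope;
        MPI_Datatype datatype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &datatype,
                            &enumtype, desc, &desc_len, &bind, &scope);

        if (0 == strcmp(name, "mpi_show_handle_leaks") &&
            MPI_INT == datatype) {
            MPI_T_cvar_handle handle;
            int count, value;

            /* This cvar is not bound to any MPI object, so pass NULL
             * for the object argument */
            MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
            MPI_T_cvar_read(handle, &value);
            printf("%s = %d\n", name, value);
            MPI_T_cvar_handle_free(&handle);
            break;
        }
    }

    MPI_T_finalize();
    return 0;
}
```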
21. MPI_T performance variables (“pvar”)
• New information available from OMPI
§ Run-time statistics of implementation details
§ Similar interface to control variables
• Not many available in OMPI yet
• The Cisco usNIC BTL exports 24 pvars
§ Per usNIC interface
§ Stats about the underlying network
(more details to be provided in the usNIC talk)
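Enumerating pvars looks much like enumerating cvars, with a few extra output fields (variable class, read-only / continuous / atomic flags). A minimal sketch, assuming an MPI-3 implementation, with error checking omitted:

```c
/* List all MPI_T performance variables exported by the implementation.
 * Which pvars exist (if any) is entirely implementation-dependent. */
#include <stdio.h>
#include <mpi.h>

int main(void)
{
    int provided, num_pvars;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_pvar_get_num(&num_pvars);
    printf("Implementation exports %d performance variables\n", num_pvars);

    for (int i = 0; i < num_pvars; ++i) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, bind, readonly, continuous, atomic;
        MPI_Datatype datatype;
        MPI_T_enum enumtype;

        MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                            &datatype, &enumtype, desc, &desc_len,
                            &bind, &readonly, &continuous, &atomic);
        printf("pvar %d: %s -- %s\n", i, name, desc);
    }

    MPI_T_finalize();
    return 0;
}
```

Actually reading a pvar additionally requires creating a session (MPI_T_pvar_session_create) and allocating a handle within it, mirroring the cvar handle flow shown earlier.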
23. Locality matters
• Goals:
§ Minimize data transfer distance
§ Reduce network congestion and contention
• …and this matters inside the server, too!
24. [hwloc topology diagram of the test machine: Intel Xeon E5-2690 (“Sandy Bridge”), 2 sockets, 8 cores and 64GB per socket, 128GB total; per socket: one NUMA node and a 20MB shared L3; per core: 256KB L2, 32KB L1d, 32KB L1i, and 2 PUs (hardware threads); PCI devices include NICs eth0–eth7 and disks sda/sdb. Indexes: physical.]
25. [Same hwloc topology diagram, annotated: per-core L1 and L2 caches; shared L3 per socket; hyperthreading enabled (2 PUs per core); 1G NICs (eth0–eth3) and 10G NICs (eth4–eth7). Intel Xeon E5-2690 (“Sandy Bridge”), 2 sockets, 8 cores, 64GB per socket.]
26. A user’s playground
The intent of this work is to provide a mechanism that allows users to explore the process-placement space within the scope of their own applications.
28. LAMA
• Supports a wide range of regular mapping patterns
§ Drawn from much prior work
§ Most notably, heavily inspired by the BlueGene/P and /Q mapping systems
29. Launching MPI applications
• Three steps in MPI process placement
1. Mapping
2. Ordering
3. Binding
• Let's discuss how these work in Open MPI
31. Mapping
• MPI's runtime must create a map, pairing processes to processors (and memory)
• Basic technique:
§ Gather hwloc topologies from the allocated nodes
§ The mapping agent then makes a plan for which resources are assigned to processes
32. Mapping agent
• The act of planning mappings:
§ Specify which process will be launched on each server
§ Identify whether any hardware resource will be oversubscribed
• Processes are mapped at the resolution of a single processing unit (PU)
§ Smallest unit of allocation: a hardware thread
§ In HPC, usually the same as a processor core
33. Oversubscription
• Common / usual definition:
§ When a single PU is assigned more than one process
• Complicating the definition:
§ Some applications may need more than one PU per process (e.g., multithreaded applications)
• How can the user express what their application means by “oversubscription”?
36. Ordering
• Each process must be assigned a unique rank in MPI_COMM_WORLD
• Two common types of ordering:
§ Natural: the order in which processes are mapped determines their rank in MCW
§ Sequential: processes are numbered sequentially, starting at the first processing unit and continuing until the last
38. Binding
• The process-launching agent works with the OS to limit where each process can run:
1. No restrictions
2. Limited set of restrictions
3. Specific resource restrictions
• “Binding width”
§ The number of PUs to which a process is bound
39. Command Line Interface (CLI)
• 4 levels of abstraction for the user
§ Level 1: None
§ Level 2: Simple, common patterns
§ Level 3: Regular LAMA process-layout patterns
§ Level 4: Irregular patterns (not described in this talk)
40. CLI: Level 1 (none)
• No mapping or binding options specified
§ May or may not specify the number of processes to launch (-np)
§ If not specified, default to the number of cores available in the allocation
§ One process is mapped to each core in the system in a "by-core" style
§ Processes are not bound
• …for backwards-compatibility reasons ☹
41. CLI: Level 2 (common)
• Simple, common patterns for mapping and binding
§ Specify the mapping pattern with --map-by X (e.g., --map-by socket)
§ Specify the binding option with --bind-to Y (e.g., --bind-to core)
§ All of these options are translated to Level 3 options for processing by LAMA
(full list of X / Y values shown later)
42. CLI: Level 3 (regular patterns)
• Regular LAMA process-layout patterns
§ For power users wanting something unique for their application
§ Four MCA run-time parameters:
• rmaps_lama_map: mapping process layout
• rmaps_lama_bind: binding width
• rmaps_lama_order: ordering of MCW ranks
• rmaps_lama_mppr: maximum allowable number of processes per resource (oversubscription)
43. rmaps_lama_map (map)
• Takes as an argument the "process layout"
§ A series of nine tokens, allowing 9! (362,880) mapping permutations
§ Specifies the preferred iteration order for LAMA
• innermost iteration specified first
• outermost iteration specified last
54. rmaps_lama_order (order)
• Selects which ranks are assigned to processes in MCW
§ Natural order for map-by-node (default)
§ Sequential order for any mapping
• There are other possible orderings, but no one has asked for them yet…
55. rmaps_lama_mppr (mppr)
• mppr (pronounced “mip-per”) sets the Maximum number of allowable Processes Per Resource
§ A user-specified definition of oversubscription
• Comma-delimited list of <#:resource>
§ 1:c → at most one process per core
§ 1:c,2:s → at most one process per core, and at most two processes per socket
56. MPPR
§ 1:c → at most one process per core
[hwloc topology diagram of NUMANode P#0 / Socket P#0 (8 cores, 2 PUs each) illustrating the 1:c limit]
57. MPPR
§ 1:c,2:s → at most one process per core and two processes per socket
[hwloc topology diagram of NUMANode P#0 / Socket P#0 (8 cores, 2 PUs each) illustrating the 2:s limit]
61. Report bindings
• Displays a prettyprint representation of the binding actually used for each process
§ Visual feedback = quite helpful when exploring

shell$ mpirun -np 4 --mca rmaps lama --mca rmaps_lama_bind 1c \
    --mca rmaps_lama_map nbsch --mca rmaps_lama_mppr 1:c \
    --report-bindings hello_world
MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../..][../../../../../../../..]
MCW rank 3 bound to socket 1[core 9[hwt 0-1]]: [../../../../../../../..][../BB/../../../../../..]
62. Feedback
• Available in Open MPI v1.7.2 (and later)
• Open questions to users:
§ Are more flexible ordering options useful?
§ What common mapping patterns are useful?
§ What additional features would you like to see?