An introduction to the basic concepts of the Message Passing Interface (MPI), and a brief overview of Open MPI, an open source implementation of the MPI specification.
12. Fortran? Really? What most modern developers associate with “Fortran”
13. Yes, really. Some of today’s most advanced simulation codes are written in Fortran.
14. Yes, really. Yes, that Intel: optimized for Nehalem, Westmere, and beyond!
15. Fortran is great for what it is: a simple language for mathematical expressions and computations, targeted at scientists and engineers …not computer scientists or web developers or database developers or …
20. Really? Is that all MPI is? “Can’t I just do that with sockets?” Yes! (…and no)
21. Comparison. (TCP) sockets: connections based on IP addresses and ports; point-to-point communication; stream-oriented; raw data (bytes / octets); network-independent; “slow.” MPI: based on peer integer “rank” (e.g., 8); point-to-point and collective and one-sided and …; message-oriented; typed messages; network-independent; blazing fast.
22. Comparison. MPI: based on peer integer “rank”; point-to-point, collective, and one-sided communication; message-oriented; typed messages; network-independent; blazing fast. Whoa! What are these?
23. Peer integer “rank”. Every process in an MPI job is addressed by an integer rank, 0 through N-1 (the slides show a 12-process job, ranks 0 through 11).
25. “Collective”: broadcast. One rank (the root) sends the same message to every rank in the group.
26. “Collective”: broadcast. (Figure: the data propagates from the root out to all 12 ranks.)
27. “Collective”: scatter. The root splits one large buffer and sends a different piece to each rank.
28. “Collective”: scatter. (Figure: each of the 12 ranks receives its own piece.)
29. “Collective”: gather. The inverse of scatter: each rank sends its piece to the root.
30. “Collective”: gather. (Figure: the root assembles the 12 pieces into one buffer.)
31. “Collective”: reduce. Each rank contributes a value, and the values are combined with an operation such as sum, minimum, or maximum.
32. “Collective”: reduce. (Figure: the 12 ranks’ values are summed, producing 42 at the root.)
33. “Collective”: …and others. MPI provides many more collectives, such as allgather, alltoall, allreduce, and barrier.
34. Messages, not bytes. The entire message is sent and received as a unit, not as a stream of individual bytes.
35. Messages, not bytes. A message’s contents are typed elements: 17 integers, 23 doubles, 98 structs …or whatever. Not a bunch of bytes!
36. Network independent. The same MPI_Send(…) and MPI_Recv(…) calls run over whatever network is underneath: Ethernet (TCP, iWARP, RoCE), Myrinet, InfiniBand, or shared memory.
37. Network independent. Regardless of the underlying network or transport protocol, the application code stays the same.
38. Blazing fast. Point-to-point latency can be as low as one microsecond (!) …more on performance later.
39. What is MPI? (Figure: a network protocol stack; MPI is probably somewhere around here.)
40. What is MPI? MPI hides all the layers underneath.
41. What is MPI? A high-level network programming abstraction: no IP addresses, no byte streams, no raw bytes.
42. What is MPI? A high-level network programming abstraction. IP addresses, byte streams, raw bytes: nothing to see here, please move along.
43. So what? What’s all this message passing stuff got to do with supercomputers?
52. So if that’s a supercomputer… Rack of 36 1U servers
53. How is it different from my web farm? Rack of 36 1U servers
54. Just a bunch of servers? The difference between supercomputers and web farms and database farms (and …): all the servers act together to solve a single computational problem.
55. Acting together. Take your computational problem… (Figure: input flows into the computational problem, which produces output.)
61. Why go to so much trouble? The computational problem divides into 21 one-processor-hour pieces; 1 processor = …a long time…
62. Why go to so much trouble? 21 processors = ~1 hour (!) Disclaimer: scaling is rarely perfect.
63. High Performance Computing. HPC = using supercomputers to solve real-world problems that are TOO BIG for laptops, desktops, or individual servers.
64. Why does HPC love MPI? Network abstraction: are these ranks cores…
65. Why does HPC love MPI? …or servers? It doesn’t matter; the code is the same either way.
66. Why does HPC love MPI? Message semantics: an array of 10,000 integers is sent and received as a single typed message.
68. Why does HPC love MPI? Ultra-low network latency (depending on your network type!): around one microsecond.
73. MPI Basics “6 function MPI” MPI_Init(): startup MPI_Comm_size(): how many peers? MPI_Comm_rank(): my unique (ordered) ID MPI_Send(): send a message MPI_Recv(): receive a message MPI_Finalize(): shutdown Can implement a huge number of parallel applications with just these 6 functions
75. MPI Hello, World

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);               /* Initialize MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* Who am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* Num. peers? */
    printf("Hello, world! I am %d of %d\n", rank, size);
    MPI_Finalize();                       /* Shut down MPI */
    return 0;
}
76. Compile it with Open MPI
shell$ mpicc hello.c -o hello
shell$
Open MPI comes standard in many Linux and BSD distributions (and OS X). Hey – what’s that? Where’s gcc?
77. “Wrapper” compiler. mpicc simply fills in a bunch of compiler command line options for you:
shell$ mpicc hello.c -o hello --showme
gcc hello.c -o hello -I/opt/openmpi/include -pthread -L/opt/openmpi/lib -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl
shell$
78. Now let’s run it
shell$ mpirun -np 4 hello
Hey – what’s that? Why don’t I just run “./hello”?
79. mpirun launcher. mpirun launches N copies of your program and “wires them up”:
shell$ mpirun -np 4 hello
“-np” = “number of processes”; this command launches a 4-process parallel job.
80. mpirun launcher
shell$ mpirun -np 4 hello
Four copies of “hello” are launched, then they are “wired up” on the network.
81. Now let’s run it
shell$ mpirun -np 4 hello
Hello, world! I am 0 of 4
Hello, world! I am 1 of 4
Hello, world! I am 2 of 4
Hello, world! I am 3 of 4
shell$
By default, all copies run on the local host.
82. Run on multiple servers!
shell$ cat my_hostfile
host1.example.com
host2.example.com
host3.example.com
host4.example.com
shell$
83. Run on multiple servers!
shell$ mpirun --hostfile my_hostfile -np 4 hello
Hello, world! I am 0 of 4 (ran on host1)
Hello, world! I am 1 of 4 (ran on host2)
Hello, world! I am 2 of 4 (ran on host3)
Hello, world! I am 3 of 4 (ran on host4)
shell$
84. Run it again
shell$ mpirun --hostfile my_hostfile -np 4 hello
Hello, world! I am 2 of 4
Hello, world! I am 3 of 4
Hello, world! I am 0 of 4
Hello, world! I am 1 of 4
shell$
Hey – why are the numbers out of order?
85. Standard output re-routing
shell$ mpirun --hostfile my_hostfile -np 4 hello
Hello, world! I am 0 of 4
Hello, world! I am 1 of 4
Hello, world! I am 2 of 4
Hello, world! I am 3 of 4
Each “hello” program’s standard output is intercepted and sent across the network to mpirun.
86. Standard output re-routing. But the exact ordering of the received printf output is non-deterministic.
87. Printf debugging = Bad If you can’t rely on output ordering, printf debugging is pretty lousy (!)
88. Parallel debuggers. Fortunately, there are parallel debuggers and other tools; a parallel debugger attaches to all processes in the MPI job.
90. Send a simple message

int rank;
double buffer[SIZE];
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (0 == rank) {
    /* …initialize buffer[]… */
    MPI_Send(buffer, SIZE, MPI_DOUBLE, 1, 123, MPI_COMM_WORLD);
} else if (1 == rank) {
    MPI_Recv(buffer, SIZE, MPI_DOUBLE, 0, 123, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}

If I’m rank 0, send the buffer[] array to rank 1; if I’m rank 1, receive the buffer[] array from rank 0.
92. Open MPI. Project founded in 2003 after intense discussions between multiple open source MPI implementations: PACX-MPI, LAM/MPI, LA-MPI, FT-MPI, and Sun CT 6.
96. Fun stats. ohloh.net says: 517,400 lines of code; 30 developers (over time); “well-commented source code.” I rank in the top 25 ohloh stats for: C, Automake, shell script, and Fortran (ouch!).
97. Open MPI has grown It’s amazing (to me) that the Open MPI project works so well New features, new releases, new members Long live Open MPI!
98. Recap Defined Message Passing Interface (MPI) Defined “supercomputers” Defined High Performance Computing (HPC) Showed what MPI is Showed some trivial MPI codes Discussed Open MPI
99. Additional Resources MPI Forum web site The only site for the official MPI standards http://www.mpi-forum.org/ NCSA MPI basic and intermediate tutorials Requires a free account http://ci-tutor.ncsa.uiuc.edu/login.php “MPI Mechanic” magazine columns http://cw.squyres.com/
100. Additional Resources Research, Computing, and Engineering (RCE) podcast http://www.rce-cast.com/ My blog: MPI_BCAST http://blogs.cisco.com/category/performance/