An introduction to the basic concepts of the Message Passing Interface (MPI), and a brief overview of Open MPI, an open source implementation of the MPI specification.
12. Fortran? Really? What most modern developers associate with “Fortran”
13. Yes, really. Some of today’s most advanced simulation codes are written in Fortran.
14. Yes, really. Yes, that Intel: optimized for Nehalem, Westmere, and beyond!
15. Fortran is great for what it is: a simple language for mathematical expressions and computations, targeted at scientists and engineers …not computer scientists or web developers or database developers or …
20. Really? Is that all MPI is? “Can’t I just do that with sockets?” Yes! (…and no)
21. Comparison. (TCP) sockets: connections based on IP addresses and ports; point-to-point communication; stream-oriented; raw data (bytes / octets); network-independent; “slow.” MPI: based on peer integer “rank” (e.g., 8); point-to-point and collective and one-sided and …; message-oriented; typed messages; network-independent; blazing fast.
22. Comparison. MPI: based on peer integer “rank”; point-to-point, collective, and one-sided communication; message-oriented; typed messages; network-independent; blazing fast. Whoa! What are these?
23. Peer integer “rank”. Every process in an MPI job is addressed by an integer rank, 0 through N-1 (the slides show a 12-process job, ranks 0 through 11).
25. “Collective”: broadcast. One rank (the root) sends the same message to every rank in the group.
26. “Collective”: broadcast. (Figure: the data propagates from the root out to all 12 ranks.)
27. “Collective”: scatter. The root splits one large buffer and sends a different piece to each rank.
28. “Collective”: scatter. (Figure: each of the 12 ranks receives its own piece.)
29. “Collective”: gather. The inverse of scatter: each rank sends its piece to the root.
30. “Collective”: gather. (Figure: the root assembles the 12 pieces into one buffer.)
31. “Collective”: reduce. Each rank contributes a value, and the values are combined with an operation such as sum, minimum, or maximum.
32. “Collective”: reduce. (Figure: the 12 ranks’ values are summed, producing 42 at the root.)
33. “Collective”: …and others. MPI provides many more collectives, such as allgather, alltoall, allreduce, and barrier.
34. Messages, not bytes. The entire message is sent and received as a unit, not as a stream of individual bytes.
35. Messages, not bytes. A message’s contents are typed elements: 17 integers, 23 doubles, 98 structs …or whatever. Not a bunch of bytes!
36. Network independent. The same MPI_Send(…) and MPI_Recv(…) calls run over whatever network is underneath: Ethernet (TCP, iWARP, RoCE), Myrinet, InfiniBand, or shared memory.
37. Network independent. Regardless of the underlying network or transport protocol, the application code stays the same.
38. Blazing fast. Point-to-point latency can be as low as one microsecond (!) …more on performance later.
39. What is MPI? (Figure: a network protocol stack; MPI is probably somewhere around here.)
40. What is MPI? MPI hides all the layers underneath.
41. What is MPI? A high-level network programming abstraction: no IP addresses, no byte streams, no raw bytes.
42. What is MPI? A high-level network programming abstraction. IP addresses, byte streams, raw bytes: nothing to see here, please move along.
43. So what? What’s all this message passing stuff got to do with supercomputers?
52. So if that’s a supercomputer… Rack of 36 1U servers
53. How is it different from my web farm? Rack of 36 1U servers
54. Just a bunch of servers? The difference between supercomputers and web farms and database farms (and …): all the servers act together to solve a single computational problem.
55. Acting together. Take your computational problem… (Figure: input flows into the computational problem, which produces output.)
61. Why go to so much trouble? The computational problem divides into 21 one-processor-hour pieces; 1 processor = …a long time…
62. Why go to so much trouble? 21 processors = ~1 hour (!) Disclaimer: scaling is rarely perfect.
63. High Performance Computing. HPC = using supercomputers to solve real-world problems that are TOO BIG for laptops, desktops, or individual servers.
64. Why does HPC love MPI? Network abstraction: are these ranks cores…
65. Why does HPC love MPI? …or servers? It doesn’t matter; the code is the same either way.
66. Why does HPC love MPI? Message semantics: an array of 10,000 integers is sent and received as a single typed message.
68. Why does HPC love MPI? Ultra-low network latency (depending on your network type!): around one microsecond.
73. MPI Basics “6 function MPI” MPI_Init(): startup MPI_Comm_size(): how many peers? MPI_Comm_rank(): my unique (ordered) ID MPI_Send(): send a message MPI_Recv(): receive a message MPI_Finalize(): shutdown Can implement a huge number of parallel applications with just these 6 functions
75. MPI Hello, World

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);               /* Initialize MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* Who am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* Num. peers? */
    printf("Hello, world! I am %d of %d\n", rank, size);
    MPI_Finalize();                       /* Shut down MPI */
    return 0;
}
76. Compile it with Open MPI
shell$ mpicc hello.c -o hello
shell$
Open MPI comes standard in many Linux and BSD distributions (and OS X). Hey – what’s that? Where’s gcc?
77. “Wrapper” compiler. mpicc simply fills in a bunch of compiler command line options for you:
shell$ mpicc hello.c -o hello --showme
gcc hello.c -o hello -I/opt/openmpi/include -pthread -L/opt/openmpi/lib -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl
shell$
78. Now let’s run it
shell$ mpirun -np 4 hello
Hey – what’s that? Why don’t I just run “./hello”?
79. mpirun launcher. mpirun launches N copies of your program and “wires them up”:
shell$ mpirun -np 4 hello
“-np” = “number of processes”; this command launches a 4-process parallel job.
80. mpirun launcher
shell$ mpirun -np 4 hello
Four copies of “hello” are launched, then they are “wired up” on the network.
81. Now let’s run it
shell$ mpirun -np 4 hello
Hello, world! I am 0 of 4
Hello, world! I am 1 of 4
Hello, world! I am 2 of 4
Hello, world! I am 3 of 4
shell$
By default, all copies run on the local host.
82. Run on multiple servers!
shell$ cat my_hostfile
host1.example.com
host2.example.com
host3.example.com
host4.example.com
shell$
83. Run on multiple servers!
shell$ mpirun --hostfile my_hostfile -np 4 hello
Hello, world! I am 0 of 4 (ran on host1)
Hello, world! I am 1 of 4 (ran on host2)
Hello, world! I am 2 of 4 (ran on host3)
Hello, world! I am 3 of 4 (ran on host4)
shell$
84. Run it again
shell$ mpirun --hostfile my_hostfile -np 4 hello
Hello, world! I am 2 of 4
Hello, world! I am 3 of 4
Hello, world! I am 0 of 4
Hello, world! I am 1 of 4
shell$
Hey – why are the numbers out of order?
85. Standard output re-routing
shell$ mpirun --hostfile my_hostfile -np 4 hello
Hello, world! I am 0 of 4
Hello, world! I am 1 of 4
Hello, world! I am 2 of 4
Hello, world! I am 3 of 4
Each “hello” program’s standard output is intercepted and sent across the network to mpirun.
86. Standard output re-routing. But the exact ordering of the received printf output is non-deterministic.
87. Printf debugging = Bad If you can’t rely on output ordering, printf debugging is pretty lousy (!)
88. Parallel debuggers. Fortunately, there are parallel debuggers and other tools; a parallel debugger attaches to all processes in the MPI job.
90. Send a simple message

int rank;
double buffer[SIZE];
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (0 == rank) {
    /* …initialize buffer[]… */
    MPI_Send(buffer, SIZE, MPI_DOUBLE, 1, 123, MPI_COMM_WORLD);
} else if (1 == rank) {
    MPI_Recv(buffer, SIZE, MPI_DOUBLE, 0, 123, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}

If I’m rank 0, send the buffer[] array to rank 1; if I’m rank 1, receive the buffer[] array from rank 0.
92. Open MPI. Project founded in 2003 after intense discussions between multiple open source MPI implementations: PACX-MPI, LAM/MPI, LA-MPI, FT-MPI, and Sun CT 6.
96. Fun stats. ohloh.net says: 517,400 lines of code; 30 developers (over time); “well-commented source code.” I rank in the top 25 ohloh stats for: C, Automake, shell script, and Fortran (ouch!).
97. Open MPI has grown It’s amazing (to me) that the Open MPI project works so well New features, new releases, new members Long live Open MPI!
98. Recap Defined Message Passing Interface (MPI) Defined “supercomputers” Defined High Performance Computing (HPC) Showed what MPI is Showed some trivial MPI codes Discussed Open MPI
99. Additional Resources MPI Forum web site The only site for the official MPI standards http://www.mpi-forum.org/ NCSA MPI basic and intermediate tutorials Requires a free account http://ci-tutor.ncsa.uiuc.edu/login.php “MPI Mechanic” magazine columns http://cw.squyres.com/
100. Additional Resources Research, Computing, and Engineering (RCE) podcast http://www.rce-cast.com/ My blog: MPI_BCAST http://blogs.cisco.com/category/performance/