PREDICTING THE TIME OF OBLIVIOUS PROGRAMS
The BSP model can be extended with a zero-cost synchronization mechanism, which can be used when the number of messages to be received is known. This mechanism, usually known as "oblivious synchronization", implies that different processors can be in different supersteps at the same time. An unwanted consequence of these software improvements is a loss of accuracy in prediction. This paper proposes an extension of the BSP complexity model to deal with oblivious barriers and shows its accuracy.
7. BSP Model vs OBSP Model
[Figure: timing diagrams for processors P0 and P1 under BSP and OBSP. In BSP, each of the two supersteps costs 2*w + g*h + L, so T_BSP = 4*w + 2*(g*h + L); under OBSP, the overlap of supersteps gives Phi_{2,i} = 3*w + 2*(g*h + L_b).]
8. FFT Analysis using the OBSP Model
[Figure: OBSP execution diagram of the FFT on processors P0-P3, with phases seq_fft, Division (bsp_partition) and Combination, closed by bsp_done. The initial machine X^(0) = {0,1,2,3} is partitioned into X_0^(1) = {0,1} and X_1^(1) = {2,3}, and then into the singleton machines X_k^(2) = {k}, k = 0,...,3. Blocks are labelled with the computation costs w_{s,i} and the communication costs g*h_{s,i} + L_b of each superstep.]
9. OBSP Prediction Accuracy
[Tables: OBSP parameter values on the CRAY T3E (g in bytes per second, p = 16); real and OBSP-predicted times for the FFT algorithm (N = 2048) and for the RAP algorithm (N = 1000, M = 1000) on the CRAY T3E.]
Good afternoon, ladies and gentlemen. In this paper we propose a parallel computing model that extends the well-known Bulk Synchronous Parallel model to work with algorithms that don't require global barrier synchronisation, and that deals with new programming features such as processor-partition operations and oblivious synchronisation. This last feature gives the model its name: the Oblivious BSP.
The presentation starts with a brief introduction to the concepts of the BSP model, and then I will present the Oblivious BSP model. A methodology for predicting the execution time is shown using a trivial example. After that, I will show the preliminary results obtained using the OBSP model to predict the execution time of two algorithms: the FFT, which is an example of data parallelism, and the RAP, which is solved by a communication-intensive pipeline algorithm. To conclude the presentation, I will mention current and future work along this line.
The Bulk Synchronous Parallel model was proposed by Prof. Valiant in 1990. It considers a parallel machine made of a set of p processors with private memory, interconnected through a global communication network, together with a mechanism for synchronising the processors. The BSP model can be characterised by the following parameters: the communication gap g, defined as the unary packet transmission time, which reflects the per-processor bandwidth, and the latency L, which corresponds to the time needed to synchronise all processors. These values depend on the number of processors p. A BSP computation is organised into supersteps, each of which consists of local computation, inter-process communication, and a global synchronisation. The execution time of a superstep s is given by the largest amount of work performed by any processor during the superstep, w_s, plus g times the largest number of packets sent or received by any processor during the superstep, h_s, plus the time required by the global synchronisation.
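The superstep cost just described, and the resulting total time of a BSP computation with R supersteps, can be written as:

```latex
\[
T_s = w_s + g \cdot h_s + L,
\qquad
T_{\mathrm{BSP}} = \sum_{s=1}^{R} \left( w_s + g \cdot h_s \right) + R \cdot L .
\]
```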
The OBSP model extends the BSP model to deal with oblivious synchronisation and processor-partition operations. When the number of messages to be received by a processor in a superstep is known, a zero-cost synchronisation mechanism can be used to reduce the synchronisation overhead: an oblivious synchronisation blocks a processor only until the expected number of messages has been received. A partition operation splits the current set of processors into several subsets, each of which acts as an autonomous BSP machine with its own processor numbering and synchronisation points. The communication capabilities of an OBSP machine are characterised by the following parameters: the gap g, the synchronising latency L, the oblivious latency L_b, and the special values g_0 and L_b0 for small packet sizes.
The Paderborn University BSP library (PUB) is a parallel C library based on the BSP model. In addition to the most common BSP features, PUB provides routines to perform oblivious synchronisation, partition operations, and collective communications.
In an OBSP prediction analysis, we assume that: 1) supersteps are numbered starting at 1; 2) all processors perform the same number of supersteps, R; and 3) because processors can be in different supersteps at the same time, a processor in its superstep s can send a message to another processor that is still in a previous superstep. The system ensures that the communication does not take effect until the receiving processor finishes its superstep s. Instead of using a global barrier, the OBSP model defines the incoming partners of each processor, Omega_{s,i}, as the set of processors that send a message to processor i in superstep s, together with processor i itself. h_{s,i} denotes the maximum number of packets communicated by processor i in superstep s, and Phi_{s,i} denotes the time spent by processor i up to the end of superstep s, which is given by recursive formulas. When a partition operation is performed, this scheme is applied recursively within each submachine.
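The recursive formulas from the slide can be reconstructed from the definitions above; a sketch of the recursion, with Phi_{0,i} = 0, is:

```latex
\[
\Phi_{s,i} \;=\; \max_{j \in \Omega_{s,i}} \left( \Phi_{s-1,j} + w_{s,j} \right)
\;+\; g \cdot h_{s,i} \;+\; L_b,
\qquad \Phi_{0,i} = 0 .
\]
```

Evaluated on the two-processor example of slide 7, this reconstruction yields the predicted time Phi_{2,i} = 3*w + 2*(g*h + L_b).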
In this slide I compare both execution models using a trivial example. In the first superstep, one processor performs local computation and sends a message to the other processor, which has to do twice the amount of work. Then they synchronise, and the second superstep is symmetrical. Using the BSP model, the maximum amount of local computation in each superstep is 2w, so the total computing time is T_BSP = 4*w + 2*(g*h + L). Using the OBSP model, the first processor can start the second superstep while the second processor remains in the first superstep. The system buffers the message until the receiving processor is ready to receive it. This overlapping reduces the total execution time.
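As a sanity check, the two predictions for this example can be evaluated numerically. The parameter values below are arbitrary, chosen only for illustration; whenever L_b is smaller than L, the oblivious prediction is strictly cheaper:

```python
def bsp_time(w, g, h, L):
    """BSP prediction for the two-superstep example:
    each superstep costs max(w, 2w) + g*h + L."""
    return 2 * (2 * w) + 2 * (g * h + L)

def obsp_time(w, g, h, Lb):
    """OBSP prediction: the lightly loaded processor overlaps one w
    of its work with the other's superstep, saving w overall."""
    return 3 * w + 2 * (g * h + Lb)

# Example parameter values (arbitrary, for illustration only).
w, g, h, L, Lb = 10.0, 2.0, 4.0, 5.0, 1.0
print(bsp_time(w, g, h, L))    # -> 66.0, i.e. 4w + 2(gh + L)
print(obsp_time(w, g, h, Lb))  # -> 48.0, i.e. 3w + 2(gh + Lb)
```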
This figure represents the FFT execution under the OBSP model. Coloured blocks correspond to local computation, and black blocks denote inter-processor communication. The blue lines on the right denote the supersteps performed by a machine X^(j), while the black lines mark the computing and communication parts of every superstep. In the original set of processors, each processor performs some local computing that includes a partition into two subsets to solve the transformation of the odd and even components. This partition process continues until only one processor remains in each submachine. Each of these innermost submachines performs a single superstep to compute a sequential transformation, and then rejoins the outer machine. Local computation in the first superstep includes the work performed by the inner submachine. The superstep finishes with a data exchange, and the second superstep consists of the combination of the odd and even transformed signals.
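The divide-and-combine structure just described mirrors the standard radix-2 FFT recursion. A minimal sequential sketch (not the actual PUB program; the partition into submachines is modelled here by plain recursive calls) is:

```python
import cmath

def fft(x):
    """Radix-2 recursive FFT mirroring the slide's structure:
    partition into even/odd halves (the two submachines), transform
    each recursively (seq_fft at the innermost level), then combine.
    len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return x
    even = fft(x[0::2])   # submachine for the even components
    odd = fft(x[1::2])    # submachine for the odd components
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t          # combination step
        out[k + n // 2] = even[k] - t
    return out

print(fft([1, 1, 1, 1]))  # -> [(4+0j), 0j, 0j, 0j]
```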
Preliminary results have been obtained on a CRAY T3E. The first table shows the values of the model parameters for this machine. Note that the values for small packet sizes were not available. In the second table, we can see the measured time and the OBSP-predicted time for the FFT algorithm with an input vector of two million elements. The prediction accuracy is quite good: percentage errors are less than 3% for the overall algorithm. After this paper's acceptance, some experiments were carried out with a fine-grain, communication-intensive pipeline algorithm that solves the RAP. Percentage errors are larger than in the previous example, but we point out that this algorithm uses small message sizes while the model parameters used are g and L_b.
To conclude: we have proposed a new parallel computing model that extends the BSP model to work with oblivious synchronisation and partition operations. Preliminary results show that the prediction accuracy is as good as that of the BSP model. In future work we want to obtain the parameter values for small message sizes, and to extend the analysis to other algorithms and parallel platforms.
In the first superstep, processor 1 has to do twice as much work as processor 0. Processor 1 receives a message from processor 0, so its Omega set includes both processors. If h is the amount of communicated data, the Phi values for each processor are ... Processor 0 starts its second superstep while processor 1 still remains in the previous one. The system buffers the message to ensure it is delivered when the receiving processor demands it. Processor 1 has less work to do in the second superstep, so it sends the message back and finishes.
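A worked reconstruction of these Phi values (assuming each Phi_{s,i} adds the slowest incoming partner's finish time plus the communication cost g*h + L_b):

```latex
\[
\begin{aligned}
\Phi_{1,0} &= w + g h + L_b, \\
\Phi_{1,1} &= \max(w,\, 2w) + g h + L_b = 2w + g h + L_b, \\
\Phi_{2,0} &= \max\!\left( \Phi_{1,0} + 2w,\; \Phi_{1,1} + w \right) + g h + L_b
            = 3w + 2\,(g h + L_b),
\end{aligned}
\]
```

which matches the OBSP prediction Phi_{2,i} = 3*w + 2*(g*h + L_b) shown on slide 7.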