International Journal of Advanced Research in Engineering and Technology (IJARET), ISSN 0976-6480 (Print), ISSN 0976-6499 (Online), Volume 5, Issue 5, May (2014), pp. 82-90 © IAEME
PERFORMANCE EVALUATION OF PARALLEL
COMPUTING SYSTEMS
Dr. Narayan Joshi
Associate Professor, CSE Department, Institute of Technology,
Nirma University, Ahmedabad, Gujarat, India
Parjanya Vyas
CSE Department, Institute of Technology,
NIRMA University, Ahmedabad, Gujarat, India
ABSTRACT
Optimizing a computational problem's algorithm for a single high-performance processor can improve its execution only up to a point. To improve execution further, and thus overall system performance, the technique of parallel processing is now widely adopted. Parallel processing helps attain better system performance and computational efficiency while keeping the clock frequency at a normal level. Pthreads and CUDA are well-known parallel programming techniques for the CPU and GPU respectively; however, both exhibit interesting performance behavior. This paper evaluates their performance under varying conditions. The results and chart show CUDA to be the better approach. Furthermore, we present suggestions for attaining better performance with Pthreads and CUDA.
Keywords: Parallel Computing, Multi-Core CPU, GPGPU, Pthreads, CUDA.
1. INTRODUCTION
Before the era of multi-core processing units began, the only way to make a CPU faster was to increase its clock frequency. More and more transistors were packed onto a chip to increase performance. However, adding more transistors to the processor chip demanded ever more electricity and thereby increased heat emission, which imposed a limit on the attainable clock frequency. In order to overcome this severe limitation and achieve higher processor performance, advancements in microelectronics technology enabled an era of
parallel processing, proposing the idea of using more than one CPU with a common address space. Processors which support parallel programming are called multi-core processors.
Computational problems which involve vast amounts of data, with operations that are entirely independent of each other's intermediate results, can be solved much more efficiently with parallel programming than with sequential programming. Ideally, a parallel program applies an operation simultaneously to all data elements, producing the final result in the time a sequential program needs to process a single data element. Hence, for problems whose operations are fully independent, or only partially dependent on intermediate results, parallel computing provides an immense efficiency improvement over sequential programs.
Section 2 describes the literature survey. Section 3 presents the architecture and working of the GPGPU. The CPU parallel programming approach is described in section 4. Section 5 consists of a comparative study of the two parallel programming approaches. In section 6 the behavior of these two approaches is discussed. The concluding remarks and further course of study are described in section 7.
2. RELATED WORK
Because of its significant benefits, the technique of parallel computing has been widely adopted by research and development sectors to solve medium- to large-scale computational problems. Many researchers have studied the area of parallel programming, including analysis of its various techniques. Jakimovska et al. have suggested an optimal method for parallel programming with Pthreads and OpenMPI [3]. Shuai Che et al. have presented a performance study of general-purpose applications on graphics processors using CUDA [10]; they compared GPU performance to both single-core and multi-core CPU performance. A unique approach towards gaining optimum performance is presented by Yi Yang et al., who propose utilizing CPU resources to facilitate the execution of GPGPU programs on fused CPU-GPU architectures [11]. Abu Asaduzzaman et al. have presented a detailed power consumption analysis of OpenMPI and POSIX threads [1], summarizing the variations in MPI performance. Researchers have also worked on Pthreads optimization: one such work, by Stevens and Chouliaras, concerns a parameterizable multiprocessor chip with hardware Pthreads support [4], and another study, by Cerin et al., describes an experimental analysis of various thread scheduling libraries [2].
3. GPGPU PARALLEL PROGRAMMING
A typical CUDA program follows the general steps stated below [8]:
1. The CPU allocates storage on the GPU
2. The CPU copies the input data from CPU to GPU
3. The CPU launches kernel(s) on the GPU to process the input data
4. The GPU processes the data with multiple threads, as defined by the CPU
5. The CPU copies the results from GPU back to the CPU
GPGPU-based parallel programs follow the master-slave processing model. The CPU acts as the master, controlling the sequence of steps, whereas the GPU is a collection of a high number of slave processors, resulting in efficient execution of multiple threads in parallel [9][13].
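The five steps above can be sketched as a minimal host program. The vector-increment kernel and all names here are illustrative only, not taken from the paper's own listing:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* Step 4: each slave thread processes one data element. */
__global__ void add_one(int *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] += 1;
}

int main(void)
{
    int n = 256, h[256], *d;
    for (int i = 0; i < n; i++) h[i] = i;

    cudaMalloc((void **)&d, n * sizeof(int));                  /* step 1 */
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice); /* step 2 */
    add_one<<<1, n>>>(d, n);                                   /* step 3 */
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost); /* step 5 */
    cudaFree(d);
    printf("%d %d\n", h[0], h[n - 1]);
    return 0;
}
```

Note that the CPU (master) issues every allocation, copy, and launch; the GPU only runs the kernel.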
Figure 1: GPGPU memory model
As shown in figure 1, a GPGPU consists of several streaming multiprocessor units (SMs); each SM comprises many co-operating thread processors, each of which can execute an instruction at a time. All of these thread processors have their own private local memory; in addition, each SM has its private memory, which acts as a global memory for the thread processors belonging to that SM [14]. Furthermore, the GPU also contains its device memory, which is global to all SMs. The memory model of a GPGPU is shown in figure 2 [16][17][18].
Figure 2: GPGPU architecture
4. CPU PARALLEL PROGRAMMING
Pthreads, OpenMP, TBB (Thread Building Blocks), Cilk++ and MPI are some of the well-known CPU parallel programming techniques available [7]; Ensar Ajkunic et al. have discussed a comparative study of them in [5]. As discussed in the introductory section above, this paper focuses on Pthreads and CUDA, so this section describes the Pthread library and its working [12].
The Pthread library considered in this paper for performance evaluation uses a two-level scheduler Pthreads implementation, as shown in figure 3. The Pthread library provides the pthread_create() function for user-level thread creation, and the Pthread library scheduler is responsible for thread scheduling inside a process. As and when a particular thread is scheduled by the Pthread library scheduler, it is associated with a kernel thread from the thread pool, and the scheduling of these kernel threads is done by the OS. The mapping of user-level threads to kernel threads is NOT fixed or unique: the associated kernel thread's ID can change over time, as a user-level thread can be associated with a different kernel thread each time. As and when a new thread is scheduled, the mapping to a kernel thread is performed again, which necessitates mode switching [15].
Figure 3: Pthreads model
Each individual thread has its own copy of the stack, whereas all threads of a process share the same global memory (heap and data section). Many other global resources, such as the process ID, parent process ID, open file descriptors, etc., are shared among the threads of a single process.
5. STUDY
The problem of matrix multiplication is addressed using Pthreads and Nvidia CUDA on the CPU and GPU respectively. The test bed description of our experimental environment is shown below:
Hardware:
CPU: Intel® Core™ i7-2670QM CPU @ 2.20 GHz
Main memory: 4 GB
GPU: Nvidia GeForce® GT 520MX
GPU memory: 1 GB
Software:
OS: Ubuntu 13.04, kernel version 3.8.0-35-generic #50-Ubuntu SMP Tue Dec 3 01:24:59 x86_64 GNU/Linux
Driver: NVIDIA-Linux-x86_64-331.49
CUDA version: Nvidia CUDA 6.0
Pthreads library: libpthread 2.17
Pthreads Program:
int M, N, P;   /* dimensions of the input matrices and answer matrix are
                  MxN, NxP and MxP respectively */
unsigned long long int mat1[1000][1000];
unsigned long long int mat2[1000][1000];
unsigned long long int ans[1000][1000];

void *matmul(void *arg)
{
    int i, *arr;
    arr = (int *)arg;
    for (i = 0; i < N; i++)
        ans[*arr][*(arr+1)] += mat1[*arr][i] * mat2[i][*(arr+1)];
    return NULL;
}

int main(int argc, char *argv[])
{
    /* declare and initialize the pthreads, pthread arguments and time variables */
    /* initialize the input matrices */
    ...
    gettimeofday(&start, NULL);
    for (i = 0; i < M; i++)
    {
        for (j = 0; j < P; j++, k++)
        {
            arg[k][0] = i; arg[k][1] = j;  /* pass i and j via the arg array */
            pthread_create(&p[i][j], NULL, matmul, (void *)(arg[k]));
            ...
        }
    }
    for (i = 0; i < M; i++)
    {
        for (j = 0; j < P; j++)
            pthread_join(p[i][j], NULL);
        ...
    }
    ...
    gettimeofday(&end, NULL);
    /* determine the execution time between end and start */
    return 0;
}
CUDA program:
__global__ void matmul(unsigned long long int *d_outp, unsigned long long int *d_inp1,
                       unsigned long long int *d_inp2, int M, int N, int P)
{
    ...
    /* get the current working row and column numbers into integer variables r and c */
    if (r < M && c < P)   /* only threads mapping to a valid output element work */
    {
        unsigned long long int temp = 0;
        for (int i = 0; i < N; i++)
            temp += d_inp1[r*N + i] * d_inp2[i*P + c];
        d_outp[r*P + c] = temp;
    }
}

int main(int argc, char *argv[])
{
    /* declare and initialize the constants M, N, P with the dimensions */
    /* determine the total number of threads and blocks; initialize thrd and blk accordingly */
    /* initialize the input matrices */
    const int sz1 = M * N * sizeof(unsigned long long int);
    const int sz2 = N * P * sizeof(unsigned long long int);
    const int anssz = M * P * sizeof(unsigned long long int);
    unsigned long long int *d_inp1, *d_inp2, *d_outp;
    gettimeofday(&start, NULL);
    cudaMalloc((void **)&d_inp1, sz1);
    cudaMalloc((void **)&d_inp2, sz2);
    cudaMalloc((void **)&d_outp, anssz);
    ...
    cudaMemcpy(d_inp1, inputarray1, sz1, cudaMemcpyHostToDevice);
    cudaMemcpy(d_inp2, inputarray2, sz2, cudaMemcpyHostToDevice);
    matmul<<<blk, dim3(thrd, thrd, 1)>>>(d_outp, d_inp1, d_inp2, M, N, P);
    cudaMemcpy(outputarray, d_outp, anssz, cudaMemcpyDeviceToHost);
    ...
    gettimeofday(&end, NULL);
    /* determine the execution time between end and start */
    cudaFree(d_inp1);
    cudaFree(d_inp2);
    cudaFree(d_outp);
return 0;
}
Implementation and results
In both the CPU and GPGPU cases, the two input matrices have dimensions MxN and NxP respectively, so the resultant matrix has dimensions MxP. The experiments were carried out on the CPU and GPGPU with equal values of M, N, and P. The results, in terms of the time taken by each experiment on the CPU and GPGPU, are shown in Table 1.
Table 1: Results in terms of time taken (milliseconds)
Dimensions   Pthreads   CUDA
50           52         178
55           63         183
60           71         184
65           80         184
70           94         183
75           107        182
80           130        180
85           156        184
90           168        177
95           195        175
6. DISCUSSION
Chart 1: Line chart of results in Table 1
Chart 1 depicts the behavior of both parallel programming approaches based on the results shown in table 1.
The chart clearly shows that initially, for lower numbers of threads and smaller dimensions, Pthreads starts with high efficiency, i.e., very little time taken. However, as the dimensions, and thus the number of Pthreads, gradually increase, the time required to complete the experiment grows approximately linearly.
On the other side, the chart shows that the execution time taken by CUDA for the same set of experiments is nearly identical irrespective of the dimensions.
This observation supports our earlier discussion that the CUDA approach is highly efficient at executing multiple threads in parallel: the GPGPU possesses a very high number of processors on which the individual threads can execute in parallel.
The CPU approach, however, remains less efficient in spite of the high-performance CPU, due to the constraint of the limited number of cores in the CPU.
Hence, on a GPGPU, scheduling does not become a constraint, and the GPGPU therefore solves a computational problem in nearly constant time irrespective of the dimensions and the number of threads. The CPU, being a high-performance processor, can solve a problem in less time than the GPU for a small number of threads; but as the number of threads increases, the scheduling and management of these threads must also be done by the CPU, which is an overhead. This overhead keeps increasing with the number of threads, resulting in an overall increase in the total time taken by the CPU to solve the computational problem as the dimensions and the number of threads grow.
Chart 1 also depicts an exceptional behavior of the GPGPU: CUDA takes nearly the same time to solve the computational problem even for small dimensions and few threads, because apart from the thread management operations, a CUDA program involves two-way CPU-GPU I/O operations.
As stated in section 4, each time a new thread is scheduled, a mode switch is mandatory in addition to a context switch, since each user-level thread is associated with a kernel-level thread. These mode switches and context switches increase with the number of threads, which explains the increase in execution time represented by the Pthreads line in chart 1.
Based on the discussion above, some performance-centric suggestions are made here. Problems comprising fewer threads than the threshold point should be assigned to the CPU for execution using Pthreads; alternatively, the problem may be submitted to the GPGPU as a whole, or divided to run jointly on the CPU and GPGPU simultaneously. Furthermore, it may become desirable to adopt fused CPU-GPU processing units [11] and deploy the computational problem upon them.
Another notable suggestion is directed towards Pthreads library maintainers: while solving a computational problem, the library could divide the total threads between the CPU and the GPGPU with respect to the threshold point discussed above.
One more suggestion, for CPU manufacturers and operating system developers, is to reserve and designate one specific core as a master core dedicated to scheduling, mode switching and context switching. This would free the other cores, i.e., the slave cores, from scheduling and switching responsibilities, dedicating them solely to solving the computational problems and thus increasing overall performance.
7. CONCLUDING REMARKS
A novel approach comprising the performance determination of the CUDA and Pthreads parallel programming techniques is presented in this paper. The extraordinary performance behavior of CUDA, along with the threshold point, is also highlighted. Furthermore,
significant suggestions pertaining to performance improvement with the CUDA and Pthreads parallel programming approaches are presented. In future we intend to continue our work in the direction of improving the open-source Pthreads technique.
REFERENCES
1. A. Asaduzzaman, F. Sibai, H. El-Sayed (2013), "Performance and power comparisons of MPI vs Pthread implementations on multicore systems", Innovations in Information Technology (IIT), 2013 9th International Conference on, pp. 1-6.
2. C. Cerin, H. Fkaier, M. Jemni (2008), "Experimental Study of Thread Scheduling Libraries on
Degraded CPU," Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE
International Conference on, vol., no., pp.697, 704.
3. D. Jakimovska, G. Jakimovski, A. Tentov, D. Bojchev (2012), “Performance estimation of parallel processing techniques on various platforms”, Telecommunications Forum (TELFOR), 2012.
4. D. Stevens, V. Chouliaras (2010), "LE1: A Parameterizable VLIW Chip-Multiprocessor with
Hardware PThreads Support," VLSI (ISVLSI), 2010 IEEE Computer Society Annual
Symposium on , vol., no., pp.122, 126.
5. E. Ajkunic, H. Fatkic, E. Omerovic, K. Talic and N. Nosovic (2012), “A Comparison of Five Parallel Programming Models for C++”, MIPRO 2012, Opatija, Croatia.
6. F. Mueller (1993), “A Library Implementation of POSIX Threads under UNIX”, In Proceedings of the USENIX Conference, Florida State University, San Diego, CA, pp. 29-41.
7. G. Narlikar, G. Blelloch (1998), “Pthreads for dynamic and irregular parallelism”, In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing (SC '98), IEEE Computer Society, Washington, DC, USA, pp. 1-16.
8. M. Ujaldon (2012), "High performance computing and simulations on the GPU using
CUDA," High Performance Computing and Simulation (HPCS), 2012 International
Conference on, vol., no., pp.1, 7.
9. NVIDIA (2006), “NVIDIA GeForce 8800 GPU Architecture Overview”, TB-02787-001_v01.
10. S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, K. Skadron (2008), "A Performance Study of General-Purpose Applications on Graphics Processors Using CUDA", Journal of Parallel and Distributed Computing.
11. Y. Yang, P. Xiang, M. Mantor, H. Zhou (2012), "CPU-assisted GPGPU on fused CPU-GPU
architectures," High Performance Computer Architecture (HPCA), 2012 IEEE 18th
International Symposium on, vol., no., pp. 1, 12.
12. B. Nichols, D. Buttlar, J. Farrell, “Pthread Programming”, O’Reilly Media Inc., USA.
13. https://www.pgroup.com/lit/articles/insider/v2n1a5.htm
14. http://www.yuwang-cg.com/project1.html
15. http://man7.org/linux/man-pages/man7/pthreads.7.html
16. http://www.nvidia.in/object/cuda-parallel-computing-in.html
17. http://www.nvidia.in/object/nvidia-kepler-in.html
18. http://www.nvidia.in/object/gpu-computing-in.html
19. Aakash Shah, Gautami Nadkarni, Namita Rane and Divya Vijan, “Ubiqutous Computing
Enabled in Daily Life”, International Journal of Computer Engineering & Technology
(IJCET), Volume 4, Issue 5, 2013, pp. 217 - 223, ISSN Print: 0976 – 6367, ISSN Online:
0976 – 6375.
20. Vinod Kumar Yadav, Indrajeet Gupta, Brijesh Pandey and Sandeep Kumar Yadav,
“Overlapped Clustering Approach for Maximizing the Service Reliability of Heterogeneous
Distributed Computing Systems”, International Journal of Computer Engineering &
Technology (IJCET), Volume 4, Issue 4, 2013, pp. 31 - 44, ISSN Print: 0976 – 6367,
ISSN Online: 0976 – 6375.