In this presentation I present the performance metrics and results of running the PARSEC benchmark's raytrace application on UPC's boada server.
8. Paraver
● Detailed quantitative analysis of a program's performance.
● Concurrent comparative analysis of several traces.
● Support for mixed message passing and shared memory.
● Building of derived metrics.
9. Configuration (1/4)
Boada server:
● Dual CPU, six cores per CPU, with Hyper-Threading.
● 24 GB of RAM.
● Kills applications after a few minutes.
Limited configuration:
● Used cpulimit to cap CPU usage at four cores.
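A hypothetical sketch of how the four-core cap could be applied with cpulimit (the binary name and input are placeholders, not the actual run commands from the experiments):

```shell
# cpulimit expresses the limit as a percentage of one core,
# so four cores' worth of CPU time is roughly 400%.
./raytrace input.obj &     # launch the benchmark (placeholder command)
cpulimit -p $! -l 400      # -p: target PID, -l: CPU percentage cap
```

Note that cpulimit throttles total CPU time rather than pinning threads to specific cores; pinning (e.g. `taskset -c 0-3`) would be the stricter alternative.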
10. Configuration (2/4)
Installed and/or configured:
● Parsec 2.1 with the raytrace package only.
● Extrae 2.2.1.
● Paraver 4.3.0 (on my laptop).
● cpulimit.
● Minor configurations in .bashrc.
● Multiple scripts to clean, build and run.
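A hypothetical sketch of how a run could be traced with Extrae (all paths and the binary name are assumptions, not the actual setup on boada):

```shell
# Point Extrae at its XML configuration (selects counters/events to record).
export EXTRAE_HOME=/opt/extrae-2.2.1
export EXTRAE_CONFIG_FILE=extrae.xml
# Preload the pthreads tracing library so the unmodified binary emits a trace:
LD_PRELOAD=$EXTRAE_HOME/lib/libpttrace.so ./raytrace input.obj
```

The intermediate trace files are then merged into a .prv trace and opened with Paraver on the laptop.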
15. Raytrace
Code
for every pixel in the image
    calculate trajectory of ray striking pixel
    find closest intersection point of ray with scene geometry
    calculate contribution of all lights at intersection point
    recursively trace specularly reflected ray
end for
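The pseudocode above can be sketched as a minimal sphere-only ray tracer; every name here is illustrative and unrelated to the actual PARSEC raytrace sources:

```python
import math

def dot(a, b): return sum(x * y for x, y in zip(a, b))
def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def add(a, b): return tuple(x + y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)
def norm(a): return scale(a, 1.0 / math.sqrt(dot(a, a)))

def intersect_sphere(o, d, center, radius):
    """Smallest positive t with |o + t*d - center| = radius, or None.
    Assumes d is unit length (so the quadratic's 'a' term is 1)."""
    oc = sub(o, center)
    b = 2.0 * dot(oc, d)
    c = dot(oc, oc) - radius * radius
    disc = b * b - 4.0 * c
    if disc < 0:
        return None
    t = (-b - math.sqrt(disc)) / 2.0
    return t if t > 1e-6 else None

def trace(o, d, spheres, lights, depth=0, max_depth=2):
    # find closest intersection point of ray with scene geometry
    hits = [(t, s) for s in spheres
            if (t := intersect_sphere(o, d, s[0], s[1])) is not None]
    if not hits:
        return 0.0                       # ray missed the scene
    t, (center, radius) = min(hits)
    p = add(o, scale(d, t))
    n = norm(sub(p, center))
    # calculate contribution of all lights (simple Lambertian term)
    color = sum(max(0.0, dot(n, norm(sub(l, p)))) for l in lights)
    # recursively trace the specularly reflected ray
    if depth < max_depth:
        r = sub(d, scale(n, 2.0 * dot(d, n)))
        color += 0.5 * trace(add(p, scale(n, 1e-4)), r,
                             spheres, lights, depth + 1, max_depth)
    return color

def render(width, height, spheres, lights):
    img = []
    for y in range(height):              # for every pixel in the image
        for x in range(width):
            # calculate trajectory of ray striking pixel (pinhole camera)
            d = norm((x / width - 0.5, y / height - 0.5, 1.0))
            img.append(trace((0.0, 0.0, 0.0), d, spheres, lights))
    return img
```

In the PARSEC version, iterations of the outer pixel loop are what gets distributed across threads, which is why the workload balances so evenly.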
16. Raytrace
Inputs
● simsmall - 1 million polygons (480x270)
● simmedium - 1 million polygons (960x540)
● simlarge - 1 million polygons (1920x1080)
● native - 10 million polygons (1920x1080)
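A hypothetical sketch of how builds and runs with these inputs could be launched through PARSEC's management script (exact paths and thread counts are assumptions):

```shell
# parsecmgmt drives PARSEC packages; -i selects one of the inputs above.
./bin/parsecmgmt -a build -p raytrace                  # build the package
./bin/parsecmgmt -a run -p raytrace -i simlarge -n 8   # run with 8 threads
```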
21. Raytrace
Cache and instructions
High number of cache misses vs. very low number of cache misses (trace comparison).
There were no significant differences in IPC between threads.
22. Raytrace
Execution time (1/3)
These are average times from multiple executions of the parallel code only, without Extrae overhead.
There was a high average deviation of 0.3 seconds in the experiments.
Measurements with bigger inputs were more accurate.
23. Raytrace
Execution time (2/3)
There was a smaller average deviation of 0.03 seconds.
With 64 threads it runs almost three times faster!
24. Raytrace
Execution time (3/3)
There was an even smaller average deviation of 0.02 seconds.
With 64 threads it runs almost three times faster!
25. Raytrace
Configuration comparison
In the case of the limited configuration, although performance doesn't seem to degrade, the execution time seems to stabilize for more than 8 threads.
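One way to reason about why the execution time stabilizes beyond a certain thread count is Amdahl's law. A minimal sketch, where the 30% serial fraction is a purely hypothetical value chosen for illustration, not a measured one:

```python
def amdahl_speedup(serial_fraction, n_threads):
    """Amdahl's law: speedup = 1 / (s + (1 - s) / n)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)

# With a hypothetical 30% serial fraction, the speedup saturates near
# 1 / 0.3 ~= 3.3x no matter how many threads are added, which would be
# consistent with "almost three times faster" at 64 threads:
for n in (1, 2, 4, 8, 16, 64):
    print(n, round(amdahl_speedup(0.3, n), 2))
```

Memory-bandwidth contention (mentioned in the conclusions) would have a similar flattening effect once the bus is saturated.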
28. Conclusions (1/3)
● The system seemed to perform worse when the number of threads was a multiple of the total number of physical cores.
● The program has good load balancing.
● Fine-grained parallelism.
29. Conclusions (2/3)
● Although it wasn't possible to verify, increasing the input should cause more cache misses, because of the big working sets that won't fit in memory.
● Memory bandwidth should be the main limiting factor for good speedups.
● Boada killed almost all the native input executions.
30. Conclusions (3/3)
● Paraver simplifies the process of analyzing an application's performance.
● Better knowledge of the system's architecture would be needed in order to further analyse the performance of the application.