SlideShare une entreprise Scribd logo
1  sur  16
Télécharger pour lire hors ligne
Characteristics of an On-Chip Cache
on NEC SX Vector Architecture
Akihiro MUSA1;3Ã
, Yoshiei SATO1y
, Ryusuke EGAWA1;2z
, Hiroyuki TAKIZAWA1;2x
,
Koki OKABE2}
, and Hiroaki KOBAYASHI1;2k
1
Graduate School of Information Sciences, Tohoku University, Sendai 980-8578, Japan
2
Cyberscience Center, Tohoku University, Sendai 980-8578, Japan
3
NEC Corporation, Tokyo 108-8001, Japan
Received May 6, 2008; final version accepted December 10, 2008
Thanks to the highly effective memory bandwidth of the vector systems, they can achieve the high computation
efficiency for computation-intensive scientific applications. However, they have been encountering the memory
wall problem and the effective memory bandwidth rate has decreased, resulting in the decrease in the bytes per
flop rates of recent vector systems from 4 (SX-7 and SX-8) to 2 (SX-8R) and 2.5 (SX-9). The situation is getting
worse as many functions units and/or cores will be brought into a single chip, because the pin bandwidth is limited
and does not scale. To solve the problem, we propose an on-chip cache, called vector cache, to maintain the
effective memory bandwidth rate of future vector supercomputers. The vector cache employs a bypass mechanism
between the main memory and register files under software controls. We evaluate the performance of the vector
cache on the NEC SX vector processor architecture with bytes per flop rates of 2 B/FLOP and 1 B/FLOP, to
clarify the basic characteristics of the vector cache. For the evaluation, we use the NEC SX-7 simulator extended
with the vector cache mechanism. Benchmark programs for performance evaluation are two DAXPY-like loops
and five leading scientific applications. The results indicate that the vector cache boosts the computational
efficiencies of the 2 B/FLOP and 1 B/FLOP systems up to the level of the 4 B/FLOP system. Especially, in the
case where cache hit rates exceed 50%, the 2 B/FLOP system can achieve a performance comparable to the 4 B/
FLOP system. The vector cache with the bypass mechanism can provide the data both from the main memory and
the cache simultaneously. In addition, from the viewpoints of designing the cache, we investigate the impact of
cache associativity on the cache hit rate, and the relationship between cache latency and the performance. The
results also suggest that the associativity hardly affects the cache hit rate, and the effects of the cache latency
depend on the vector loop length of applications. The cache shorter latency contributes to the performance
improvement of the applications with shorter loop lengths, even in the case of the 4 B/FLOP system. In the case of
longer loop lengths of 256 or more, the latency can effectively be hidden, and the performance is not sensitive to
the cache latency. Finally, we discuss the effects of selective caching using the bypass mechanism and loop
unrolling on the vector cache performance for the scientific applications. The selective caching is effective for
efficient use of the limited cache capacity. The loop unrolling is also effective for the improvement of
performance, resulting in a synergistic effect with caching. However, there are exceptional cases; the loop
unrolling worsens the cache hit rate due to an increase in the working space to process the unrolled loops over the
cache. In this case, an increase in the cache miss rate cancels the gain obtained by unrolling.
KEYWORDS: Vector processing, Vector cache, Performance characterization, Memory system,
Scientific application
1. Introduction
Vector supercomputers have high computation efficiency for computation-intensive scientific applications. In our
previous work (Kobayashi (2006), Musa et al. (2006)), we have shown that NEC SX vector supercomputers can
achieve high sustained performance in several leading applications. This is due to their high bytes per flop rates
compared to scalar systems. However, as recent advantages in VLSI technology have also been accelerating processor
speeds, vector supercomputers have been encountering the memory wall problem. As a result, it is getting harder for
vector supercomputers to keep a high memory bandwidth balanced with the improvement of their flop/s performance.
On the NEC SX systems, the bytes per flop rate has decreased from 8 B/FLOP to 2 B/FLOP during the last twenty
years.
Ã
E-mail: musa@sc.isc.tohoku.ac.jp
y
E-mail: yoshiei@sc.isc.tohoku.ac.jp
z
E-mail: egawa@isc.tohoku.ac.jp
x
E-mail: tacky@isc.tohoku.ac.jp
}
E-mail: okabe@isc.tohoku.ac.jp
k
E-mail: koba@isc.tohoku.ac.jp
Interdisciplinary Information Sciences Vol. 15, No. 1 (2009) 51–66
#Graduate School of Information Sciences, Tohoku University
ISSN 1340-9050 print/1347-6157 online
DOI 10.4036/iis.2009.51
We have proposed an on-chip cache, called vector cache, to maintain the effective memory bandwidth rate of future
vector supercomputers (Kobayashi et al. (2007)) and provided the early evaluation of the vector cache using the NEC
SX simulator (Musa et al. (2007)). The vector cache employs a bypass mechanism between a vector register and a main
memory under software controls, and load/store data are selectively cached. The bypass mechanism and the cache
work complementarily together to provide data to the register file, resulting in the high effective memory bandwidth
rate. However, we did not investigate the fundamental configuration of the vector cache, i.e. cache associativity and
cache latency. Moreover, our previous work did not discuss the relationship between loop unrolling and caching. The
main contribution of this paper is to clarify the characteristics of the on-chip vector cache on the vector
supercomputers. We investigate the relationship among cache associativity, cache latency and performance using five
practical applications selected from the areas of advanced computational sciences. In addition, we discuss in detail
selective caching and loop unrolling under the vector cache to improve the performance.
The rest of the paper is organized as follows. Section 2 presents related work. Section 3 describes a vector
architecture with an on-chip vector cache discussed in this paper. Section 4 provides our experimental methodology
and benchmark programs for performance evaluation. Section 5 presents experimental results when executing the
kernel loops and five real applications on our architecture. Through the experimental results, we clarify the
characteristics of the vector cache. In Section 6, we examine the effect of selective caching on the performance, in
which only the data with high localities of reference are cached. In addition, we discuss the relationship between loop
unrolling and the vector cache. Finally, Section 7 summarizes the paper with some future work.
2. Related Work
Vector caches have been previously studied by a number of researchers using trace driven simulators of convectional
vector architectures in the 1990’s. Gee et al. have provided an evaluation of the cache performance in vector
architectures: Cray X-MP and Ardent Titan (Gee and Smith (1992), Gee and Smith (1994)). Their caches employ a full-
associative cache and an n-way set associative cache. The line sizes range from 16 to 128 bytes, and the maximum
cache size is 4 MB. Their benchmark programs are selected from Livermore Loops, NAS kernels, Los Alamos
benchmark set and real applications in chemistry, fluid dynamics and linear analysis. They showed that vector
references contained somewhat less temporal locality, but large amounts of spatial locality compared to instruction and
scalar references, and the cache improved the computational performance of the vector processors. Kontothanassis
et al. have evaluated the cache performance of Cray C90 architecture using NAS parallel benchmarks (Kontothanassis
et al. (1994)). Their cache is a direct-mapped cache with a 128 bytes line size, and the maximum cache size is 8 MB.
They showed that the cache reduced memory traffic from off-chip memory, and a DRAM memory system with the
cache was competitive to a SRAM memory system.
Modern vector supercomputer Cray X1 has a 2 MB vector cache, called Ecache, organized as a 2-way set associative
write-back cache with 32 bytes line size and the least recently used (LRU) replacement policy (Abts et al. (2003)). The
performance of Cray X1 has been evaluated using real scientific applications by several researchers (Dunigan Jr. et al.
(2005), Oliker et al. (2004), Shan and Strohmaier (2004)). These studies have compared the performance of Cray X1
with that of other platforms; IBM Power, SGI Altix and NEC SX. However, they have not quantitatively discussed the
effect of the cache on its performance.
The aim of our research is to quantitatively investigate the effect of the vector cache using modern vector
supercomputer NEC SX architecture. We clarify the basic characteristics of the vector cache and discuss its effective
usages.
3. On-Chip Cache Memory for Vector Architecture
We design an on-chip vector cache for a vector processor architecture as shown in Figure 1. Our vector processor has
a superscalar unit, a vector unit and an address control unit. The vector unit contains five types of vector arithmetic
pipes (Mask, Logical, Multiply, Add/Shift and Divide), vector registers and a vector cache. Four parallel vector pipes
1B/FLOP
~ 4B/FLOP
Vector Registers
Main Memory
Vector
Cache
Vector
Cache
Vector
Cache
Vector
Cache
Vector
Cache ×32
4B/FLOP
Vector
Cache
Vector
Registers
Mask Reg. Mask
Logical
Multiply
Add/Shift
Divide
Vector Unit
Scalar Unit
×4
×4
×4
×4
×4
MainMemoryUnit
Address Control
Unit
. .
Figure 1. Vector architecture with vector cache and memory system block diagram.
52 MUSA et al.
are provided for each type; total 20 pipes are included in the vector unit. The address control unit queues and issues
vector operation instructions to the vector unit. The vector operation instructions contain vector memory instructions
and vector computational instructions. Each instruction concurrently deals with 256 double-precision floating-point
data.
The main memory unit employs an interleaved memory system. The main memory is divided into 32 parts, each of
which operates independently and supplies data to the vector processor. Thus, the main memory and the vector
processor are interconnected through 32 memory ports. Moreover, we implement the vector cache in each memory
part; the vector cache consists of 32 sub-caches, and the sub-cache is a non-blocking cache with a bypass mechanism
between a vector register file and the main memory. The bypass mechanism is software-controlled, and vector load/
store instructions have a flag for cache control, i.e. cache on/off. When the flag indicates ‘‘cache on,’’ the supplied data
by the vector load/store instruction are cached. Meanwhile, when the flag is ‘‘cache off,’’ the vector load/store
instruction provides data via the bypass mechanism: the data are not cached. The sub-cache employs a set-associative
write-through cache with the LRU replacement policy. The line size is 8 bytes; it is the unit for memory accesses.
4. Experimental Environment
4.1 Methodology
We use a trace-driven simulator that can simulate the behavior of the proposed vector architecture at the register
transfer level. It is based on the NEC SX simulator, which accurately models a single processor of the SX architecture;
the vector unit, the scalar unit and the memory system. The simulator takes a system parameter file and a trace file as
input, and the output of the simulator contains instruction cycle counts of a benchmark program and cache hit
information. The system parameter file has configuration parameters of a vector architecture and setting parameters,
i.e., the cache size, the associativity, the cache latency and the memory bandwidth.
The trace file contains an instruction sequence of a benchmark program and directives of cache control: cache on/
off. These directives are used in selective caching and set the cache control flags in the vector load/store instructions. A
benchmark program is compiled by the NEC FORTRAN compiler: FORTRAN90/SX. It supports ANSI/ISO
Fortran95 in addition to functions of automatic vectorization and automatic parallelization. The executable program
runs on the SX trace generator to produce the trace file. In this work, we use two kernel loops and the original source
codes of five applications, and compile them with the highest optimizations option (-C hopt) and inlining subroutines.
To evaluate on-chip vector caching for future vector processors with a higher flop/s rate but a relatively lower off-
chip memory bandwidth, we evaluate the effect of the cache on the vector processor by limiting its memory bandwidth
per flop/s rate from 4 B/FLOP down to 1 B/FLOP. Here, we adopt the NEC SX-7 architecture (Kitagawa et al. (2003))
to our vector processor. The setting parameters are shown in Table 11
.
4.2 Benchmark Programs
To clarify the basic characteristics and validity of the vector cache, we select basic kernel loops and five leading
applications. In this section, we describe our benchmark programs in detail.
The following kernel loops are used to evaluate the performance of the vector processor with the cache.
AðiÞ ¼ XðiÞ þ YðiÞ: ð1Þ
AðiÞ ¼ XðiÞ Â YðiÞ þ ZðiÞ: ð2Þ
Table 1. Summary of setting parameters.
Base System Architecture NEC SX-7
Main Memnory DDR-SDRAM
Vector Cache SRAM
Total Size (Sub-cache Size) 256 KB–8 MB (8 KB–256 KB)
Cache Policy LRU, Write-through
Associativity 2WAY, 4WAY, 8WAY
Cache Latency 15%, 50%, 85%, 100%
(Rate of main memory latency)
Cache Bank Cycle 5% of memory cycle
Line Size 8 B
Memory—Cache bandwidth per flop/s 1 B/FLOP, 2 B/FLOP, 4 B/FLOP
Cache—Register bandwidth per flop/s 4 B/FLOP
1
The absolute values of memory related parameters such as latency and bank cycle cannot be disclosed in this paper because of their confidentiality.
Characteristics of an On-Chip Cache on NEC SX Vector Architecture 53
As leading applications, we have selected five applications in three scientific computing areas. The applications have
been developed by researchers of Tohoku University for the SX-7 system and are representative of each research area.
Table 2 shows the summary of the evaluated applications, which methods are standard in individual areas (Kobayashi
(2006)).
The GPR simulation code evaluates a performance of the SAR-GPR (Synthetic Aperture Radar–Ground Penetrating
Radar) in detection of buried anti personnel mines under conditions of inhomogeneous subsurface mediums (Kobayashi
et al. (2003)). The simulation method is based on the three dimensional FDTD (Finite Difference Time Domain)
method (Kunz and Luebbers (1993)). The vector operation ratio is 99.7% and the vector loop length is over 500.
The APFA simulation is applied to design of a high gain antenna; an Anti-Podal Fermi Antenna (APFA) (Takagi
et al. (2004)). The simulation calculates directional characteristics of the APFA using the FDTD method and the
Fourier transform. The vector operation ratio is 99.9% and the vector loop length is 255.
The PRF simulation provides numerical simulations of two dimensional Premixed Reactive Flow (PRF) in
combustion for design of engines of planes (Tsuboi and Masuya (2003)). The simulation uses the sixth-order compact
finite difference scheme and the third-order Runge-Kutta method for time advancement. The vector operation ratio is
99.3% and the vector loop length is 513.
The SFHT simulation realizes direct numerical simulations of three-dimensional laminar Separated Flow and Heat
Transfer (SFHT) on surfaces of a plane (Nakajima et al. (2005)). The finite-difference forms are the fifth-order upwind
difference scheme for space derivatives and the Crank–Nicholson method for a time derivative. The resulting finite-
difference equations are solved using the SMAC method. The vector operation ratio is 99.4% and the vector loop length
is 349.
The PBM simulation uses the three-dimensional numerical Plate Boundary Models (PBM) to explain an observed
variation in propagation speed of postseismic slip (Ariyoshi et al. (2007)). This is a quasi-static simulation in an elastic
half-space including a rate- and state-dependent friction. The vector operation ratio is 99.5% and the vector loop length
is 32400.
5. Performance Evaluation of Vector Cache
We evaluate the performance of the proposed vector processor architecture by using the benchmark programs and
clarify the basic characteristics of the vector cache.
5.1 Impact of Memory Bandwidth on Effective Performance
To evaluate the impact of the memory bandwidth on performance of the vector processor without the vector cache,
we varied memory bandwidth per flop/s rates from 4 B/FLOP down to 1 B/FLOP. Figure 2 shows the computation
efficiency, the ratio of the sustained performance to the peak performance, in the execution of the benchmark programs.
Here, the 4 B/FLOP case is equal to the memory bandwidth per flop/s rate of the SX-8 or earlier system, the 2 B/FLOP
is the same as Cary X1 and BlackWidow (Abts et al. (2007)), and the 1 B/FLOP is the same level as the commodity-
based high performance scalar systems such as NEC TX7 series (Senta et al. (2003)) and SGI Altix series (SGI (2007)).
Figure 2 indicates that the computational performance is seriously affected by the B/FLOP rate. In particular, the
performances of GPR and PBM, which are memory-intensive programs, are degraded by half and quarter when the
memory bandwidth per flop/s rate is reduced to 2 B/FLOP and 1 B/FLOP from 4 B/FLOP. Even if APFA is
computation-intensive, the performance on the 1 B/FLOP system is decreased to 69% of the performance on the
4 B/FLOP system. Therefore, the memory bandwidth per flop/s rate needs the 4 B/FLOP rate on the vector
supercomputer.
Table 2. Summary of benchmark programs.
Area Name and Description
Method
Subdivision
Memory
Size
Electromagnetic Analysis
GPR simulation: Simulation of
Array Antenna Ground Penetrating Radar
FDTD
50 Â 750 Â 750
8.9 GB
APFA simulation: Simulation of
Anti-Podal Fermi Antenna
FDTD
612 Â 105 Â 505
12 GB
CFD/Heat Analysis
PRF simulation: Simulation of
Premixed Reactive Flow in Combustion
DNS
513 Â 513
1.4 GB
SFHT simulation: Simulation of
Separated Flow and Heat Transfer
SMAC
711 Â 91 Â 221
6.6 GB
Seismology
PBM simulation: Simulation of
Plate Boundary Model on Seismic Slow Slip
Friction Law
32400 Â 32400
8 GB
54 MUSA et al.
5.2 Relationship between Efficiency and Cache Hit Rate on Kernel Loops
We simulate the execution of the kernel loop (1) on the vector processor with the vector cache. Since this loop has no
temporal locality, we store a part of XðiÞ and YðiÞ data on the vector cache in advance of the loop execution. In the
following two ways, we select the data for caching to change the range of cache hit rates, and examine their effects on
performance. One is to store both XðiÞ and YðiÞ with the same index i on the cache, and load instructions of both XðiÞ
and YðiÞ are set to ‘‘cache on,’’ named Case 1. The other is to cache only XðiÞ, and load instructions of XðiÞ and YðiÞ are
set to ‘‘cache on’’ and ‘‘cache off,’’ respectively. This is Case 2. Here, the range of index i is 1 i 25600.
Figure 3 shows the relationship between the cache hit rate and the relative memory bandwidth that is obtained by
normalizing the effective memory bandwidth of the systems with the vector cache by that of the 4 B/FLOP system
without the cache. Figure 4 shows the relationship between the cache hit rate and the computation efficiency. Figure 3
indicates that the relative memory bandwidth of each case increases as the cache hit rate increases. Therefore, the
vector cache is one of the promising solutions to cover a lack of the memory bandwidth. Similar to Figure 3, the
computation efficiency of each case in Figure 4 improves as the cache hit rate increases.
Figures 3 and 4 show the correlation between the relative memory bandwidth and the computation efficiency. On
both the 2 B/FLOP and 1 B/FLOP systems, the performance of Case 2 is greater than that of Case 1. In particular,
Case 2 with the vector cache of the 50% hit rate on the 2 B/FLOP system achieves the same performance of the 4 B/
FLOP system. On the 4 B/FLOP system, each of XðiÞ and YðiÞ is provided at a 2 B/FLOP bandwidth rate on average.
When the cache hit rate is 50% in Case 2, all data of XðiÞ are supplied directly from the vector cache and all data of YðiÞ
are supplied from the memory at the 2 B/FLOP bandwidth rate through the bypass mechanism. Consequently, each of
XðiÞ and YðiÞ is provided at the 2 B/FLOP bandwidth rate on the vector registers, and the total amount of data provided
to vector registers in time is equal to that of the 4 B/FLOP system. However, the 1 B/FLOP system with vector caching
with the 50% hit rate does not achieve the performance of the 4 B/FLOP system. Because YðiÞ is provided at a 1 B/
FLOP bandwidth rate, the total bandwidth rate at the vector registers does not reach the 4 B/FLOP. On the other hand,
in Case 1, the data of XðiÞ and YðiÞ are supplied from either of the vector cache or the memory at the same index i.
While supplying the data from the memory, XðiÞ and YðiÞ have to share the memory bandwidth rate. Therefore, the 1 B/
0
0.2
0.4
0.6
0.8
1
0 10 20 30 40 50 60 70 80 90 100
RelativeMemoryBandwidth
Cache Hit Rate (%)
4B/F 2B/FCase1
2B/FCase2 1B/FCase1
1B/FCase2
Figure 3. Relationship between cache hit rate and relative memory bandwidth on Kernel loop (1).
0
10
20
30
40
50
60
X +Y X*Y+Z GPR APFA PRF SFHT PBM
Efficiency(%)
4B/FLOP 2B/FLOP 1B/FLOP
Figure 2. Effect of memory bandwidth per flop/s rate.
Characteristics of an On-Chip Cache on NEC SX Vector Architecture 55
FLOP and 2 B/FLOP systems need a cache hit rate of 100% to achieve a performance comparable to that of the 4 B/
FLOP system.
Similarly, we examine the performance of Kernel (2) on the 2 B/FLOP system in the following three cases. In
Case 1, all of XðiÞ, YðiÞ, and ZðiÞ are cached with the same index i in advance. In Case 2, both XðiÞ and YðiÞ are
provided from the vector cache, and therefore the maximum cache hit rate is 66%. In Case 3, only XðiÞ is in the cache,
and hence the maximum cache hit rate is 33%. Figure 5 shows the relationship between the cache hit rate and the
change in the relative memory bandwidth regarding Kernel (2). On the 4 B/FLOP system, each of XðiÞ, YðiÞ and ZðiÞ is
provided at a 4/3 B/FLOP bandwidth rate on average. The performance of Case 2 is comparable to that of the 4 B/
FLOP system when the cache hit rate is 66%, because the bandwidth per flop/s rate of each data is the 4/3 B/FLOP
rate on average. However, in Case 3, both YðiÞ and ZðiÞ are provided from the memory, and YðiÞ and ZðiÞ have to share
the 2 B/FLOP bandwidth rate at the vector register. As a result, the relative memory bandwidth never reaches that of
the 4 B/FLOP system.
These results indicate that the vector cache has a potential for the future SX systems to cover the shortage of memory
bandwidth. In additions, all data do not need to be cached from the discussions on Figures 3 and 5. The key data
determining the performance should be cached to make good use of limited on-chip capacity.
5.3 Relationship between Efficiency and Cache Hit Rate on Five Scientific Applications
We simulate the execution of five scientific applications with the vector cache varying memory bandwidth per flop/s
rate; 1 B/FLOP and 2 B/FLOP. Here, all vector load/store instructions are set to ‘‘cache on.’’ Figure 6 shows the
relationship between the cache hit rate of the 8 MB vector cache and the relative memory bandwidth. Figure 6 indicates
that the vector cache can improve the effective memory bandwidth, depending on the cache hit rate of the applications.
Cache hit rates vary from 13% in SFHT to 96% in APFA, because these depend on the locality of reference in
individual applications.
In APFA, the relative memory bandwidth is 0.96 at the 2 B/FLOP system with almost 100% hit of the 8 MB cache,
resulting in an effective memory bandwidth almost equal to the 4 B/FLOP system. In this case, most data are provided
0
2
4
6
8
10
12
14
16
18
0 10 20 30 40 50 60 70 80 90 100
Efficiency(%)
Cache Hit Rate (%)
4B/F 2B/FCase1
2B/FCase2 1B/FCase1
1B/FCase2
Figure 4. Relationship between cache hit rate and computation efficiency on Kernel loop (1).
0.5
0.6
0.7
0.8
0.9
1
0 10 20 30 40 50 60 70 80 90 100
RelativeMemoryBandwidth
Cache Hit Rate (%)
4B/F
2B/F Case1
2B/F Case2
2B/F Case3
Figure 5. Relationship between cache hit rate and relative memory bandwidth on Kernel loop (2).
56 MUSA et al.
from the cache. PBM of the 2 B/FLOP system reaches the effective memory bandwidth of the 4 B/FLOP system at the
cache hit rate 50%. Figure 7 shows the main routine of PBM. The inner loop, index j, is vectorized and arrays gd dip
and wary are stored in the vector cache by vector load instructions. However, gd dip is spilled from the cache, because
gd dip needs 7.6 GB. On the other hand, wary can be held in the cache, because its needs only 250 KB and the array is
defined in the preceding loop. Therefore, the cache hit rate of this loop is 50%. Then all data of wary are supplied
directly from the vector cache at the 2 B/FLOP rate and all data of gd dip are supplied from the memory at the 2 B/
FLOP rate through the cache, and the total amount of data provided to vector registers in time is equal to that of the
4 B/FLOP system.
The relative memory bandwidths in others applications, SFHT, GPR and PRF are over 0.6 on the 2 B/FLOP system
and 0.3 on the 1 B/FLOP system. Here, the relative memory bandwidth without the cache is 0.5 on the 2 B/FLOP
system and 0.25 on the 1 B/FLOP system. Thus, the improvement of the relative memory bandwidth is over 20% using
the 8 MB cache.
Figure 8 shows the computation efficiency of five scientific applications on the 4 B/FLOP system without the cache,
the 2 B/FLOP and 1 B/FLOP systems with/without the 8 MB cache. Figure 8 indicates that the efficiency of the 2 B/
FLOP and 1 B/FLOP systems is increased by the cache. In particular, the 2 B/FLOP system with the cache achieves the
same efficiency as the 4 B/FLOP system in APFA and PBM. Figure 9 shows the relationship between the cache hit rate
of the 8 MB cache and the recovery rate of the performance by the cache from the 2 B/FLOP and 1 B/FLOP systems to
the 4 B/FLOP system. The recovery rate goes up as the cache hit rate increases. The vector cache can increase the
recovery rate by 18 to 99% on the 2 B/FLOP system and 9 to 96% on the 1 B/FLOP system, depending on the data
access locality of each application. The results indicate that the vector cache has a potential for the future SX systems to
cover the shortage of memory bandwidth on real scientific applications.
5.4 Relationship between Associativity and Cache Hit Rate
In this section, we examine the effect of the associativity ranging from 2 to 8 ways on the performance. Figure 10
shows that the cache hit rate on each set associative cache, covering cache sizes from 256 KB to 8 MB. Four
applications, APFA, PRF, SFHT and PBM, approximately have constant cache hit rates across three associativity cases.
In GPR, the cache hit rate slightly varies with the associativity.
The cache hit rate is dependent on among placements of the arrays on memory addresses, memory access patterns of
the programs and the associativity. In GPR, its basic computational structure consists of triple nested loops. The loops
access many arrays at intervals of 584 bytes stride addresses, and the cache data are frequently replaced. Therefore, the
0
0.25
0.5
0.75
1
0 20 40 60 80 100
RelativeMemoryBandwidth
Cache Hit Rate (%)
2B/F 8MB
1B/F 8MB
PBM
SFHT
GPR
PRF
APFA
Figure 6. Relationship between cache hit rate and relative memory bandwidth on five applications.
01 do i=1,ncells
02 do j=1,ncells
03 wf_dip(i)=wf_dip(i)+gd_dip(j,i)*wary(j)
04 end do
05 end do
Figure 7. Main routine of PBM.
Characteristics of an On-Chip Cache on NEC SX Vector Architecture 57
cache hit rate varies with the cache capacities and the associativity. In SFHT, the memory access patterns have 8 bytes
stride accesses. On the 2 MB case, the cached data which will be reused on the 2-way set associative cache are evicted
by some other data. Thus this is due to a placement of memory addresses of the arrays. Other programs have constant
cache hit rates across three associativity cases. These memory access patterns are 8 bytes or 16 bytes stride accesses,
and the placement of the arrays on memory do not cause the conflict of reused data.
These results provide that the associativity hardly have effects on the cache hit rates, and the cache capacity mainly
influences the cache hit rate across the five applications.
5.5 Effects of Vector Cache Latency on Performance
In general, the latency of the on-chip cache is considerably shorter than that of the off-chip memory. We investigate
the effects of the cache access latency on the performance of the five applications. Figure 11 shows the relationship
between the computation efficiency of the applications and four cache latencies; 15, 50, 85 and 100% of the memory
access latency on the 2 B/FLOP system with the 2 MB cache. The five applications have almost the same efficiency
when changing the cache latency. Because the vector architecture can hide the memory access times by pipelined
vector operations when a vector operation ratio and a vector loop length of applications are enough large. In addition,
the vector cache consists of 32 sub-caches between a vector register file and the main memory via each memory port.
The sub-caches connected to 32 memory ports provide multiple words at a time. The cache latency of the second
memory access or later is hidden in the sub-cache system.
For comparison, we examine the performance of shorter loop length cases using Kernel loops (1) and (2) on the 2 B/
FLOP system when changing the cache access latency. Figure 12 shows that the relationship between the computation
efficiency of Kernel loops and four cache latency cases. Here, we examine three loop length cases in the 100% cache hit
rate; 1 i 64, 1 i 128, 1 i 256. On the shorter loop length cases, Figure 12 indicates that the vector cache
latency is an importance factor in the performance. Especially, on the loop length 64 case, the 15% latency case has two
times higher efficiency in Kernel (1) than the 100% case, and the 15% latency case has 12% higher efficiency in Kernel
0
20
40
60
80
100
0 20 40 60 80 100
PerformanceRecoveryRate(%)
Cache Hit Rate (%)
2B/F 8MB
1B/F 8MB
PBM
SFHT
GPR
PRF
APFA
Figure 9. Recovery rate of performance on 8 MB cache.
0
10
20
30
40
50
GPR APFA PRF SFHT PBM
Efficiency(%)
4B/F 2B/F non-cache 2B/F 8MB 1B/F non-cache 1B/F 8MB
Figure 8. Computation efficiency of five applications with/without 8 MB vector cache.
58 MUSA et al.
(2). However, the longer the loop length, the lower the effect of the cache latency. Figure 13 shows that the
computation efficiency of Kernel loops in the 4 B/FLOP and 2 B/FLOP systems with/without the vector cache. On
both the loop lengths 64 and 128 of Kernel (1), the efficiency of the 4 B/FLOP system with the cache is higher than that
of the 4 B/FLOP system without the cache, and the loop length 128 case of the 4 B/FLOP system with the cache
achieves the highest efficiency. Meanwhile, in Kernel (2), the loop length 64 case of the 4 B/FLOP system with the
cache has the highest efficiency. This is because the ratio of memory access time to the processing time becomes the
GPR APFA
PRF SFHT
PBM
0
10
20
30
40
50
2WAY 4WAY 8WAY
CacheHitRate(%)
0.25MB 0.5MB 1MB 2MB 4MB 8MB
0
20
40
60
80
100
2WAY 4WAY 8WAY
CacheHitRate(%)
0.25MB 0.5MB 1MB 2MB 4MB 8MB
0
10
20
30
40
50
2WAY 4WAY 8WAY
CacheHitRate(%)
0.25MB 0.5MB 1MB 2MB 4MB 8MB
0
10
20
30
40
50
2WAY 4WAY 8WAY
CacheHitRate(%)
0.25MB 0.5MB 1MB 2MB 4MB 8MB
0
10
20
30
40
50
2WAY 4WAY 8WAY
CacheHitRate(%)
0.25MB 0.5MB 1MB 2MB 4MB 8MB
Figure 10. Cache hit rate vs. associativity on five applications.
0
10
20
30
40
50
60
15% 50% 85% 100%
Efficiency(%)
Cache Latency Case
GPR APFA PRF SFHT PBM
Figure 11. Computation efficiency of five applications on four cache latency cases.
Characteristics of an On-Chip Cache on NEC SX Vector Architecture 59
smallest due to the decrease in memory latency by the cache, and in these kernel loops the memory access time of the
shorter loop length case cannot be entirely hidden by pipelined vector operations only. Therefore, the performance of
shorter loop length cases is sensitive to the vector cache latency, and the vector cache can boost the performance of the
applications with short vector lengths on the 4 B/FLOP system.
Moreover, the cache latency is not hidden when the bank conflicts occur. In this work, the bank conflicts hardly occur
in the five applications. As our future work, quantitative discussions on the influence of the bank conflicts will be
addressed.
6. Optimizations for Vector Caching: Selective Caching and Loop Unrolling
In this section, we discuss some optimization techniques for the vector cache to reduce cache miss rates on scientific
applications. In particular, we show that the relationship between loop-unrolling and caching. Many compilers
automatically optimize programs using the loop-unrolling, which is the most basic loop optimization.
6.1 Effects of Selective Caching
The vector cache has a bypass mechanism, and data are selectively cached: selective caching. The simulation
methods of the four scientific applications, GPR, APFA, PRF and SFHT, are based on the difference schemes; Finite
Difference Time Domain (FDTD) method and Direct Numerical Simulation of flow and heart (DNS), and a part of
arrays has a high locality in many cases. As the on-chip cache size is limited, selective caching might be an effective
approach to efficient use of the on-chip cache. We evaluate the selective caching for two applications, GPR and PRF,
which are typical examples of FDTD and DNS, respectively.
Figure 14 shows one of kernel loops of GPR; the loop is a main routine of the FDTD method, and many routines of
GPR are similar to this kernel loop. The innermost loop is indexed with j. Usually, FORTRAN compilers interchange
loops j and i, however, FORTRAN90/SX compiler does not interchange the loops, because the compiler decides that
the computational performance cannot be improved due to the short length of loop i. Loop i has a length of 50, and j is
750.
As the size of each array used in GPR is 330 MB, the cache cannot store all the arrays. However, the difference
scheme as shown in Figure 14 generally has temporal locality in accessing arrays; H x, H y and H z. We select these
arrays for selective caching. Figure 15 shows the effect of selective caching on the 2 B/FLOP system with the vector
cache in GPR. Figure 16 shows the recovery rate of the performance from the 2 B/FLOP system without the cache to
the 4 B/FLOP system by selective caching in GPR. We discuss the cache sizes from 256 KB up to 8 MB. Here, ‘‘ALL’’
in the figures indicates that all the arrays in the loop are cached. ‘‘Selective’’ shows that some of the arrays are
selectively cached. ‘‘Cache Usage Rate’’ indicates the ratio of number of cache hit references to the total memory
references.
0
2
4
6
8
10
12
14
16
18
15% 50% 85% 100%
Efficiency(%)
Cache Latency Case
Kernel Loop (1)
64 128 256
0
5
10
15
20
25
30
15% 50% 85% 100%
Efficiency(%)
Cache Latency Case
Kernel Loop (2)
64 128 256
Figure 12. Computation efficiency of two kernel loops on four cache latency cases.
0
2
4
6
8
10
12
14
16
18
20
64 128 256
Efficiency(%)
Loop Length
Kernel Loop (1)
4B/F 4B/F 2MB 2B/F 2B/F 2MB
0
5
10
15
20
25
30
35
64 128 256
Efficiency(%)
Loop Length
Kernel Loop (2)
4B/F 4B/F 2MB 2B/F 2B/F 2MB
Figure 13. Computation efficiency of two kernel loops on 4 B/FLOP and 2 B/FLOP.
60 MUSA et al.
In the 256 KB cache case, the efficiencies of ‘‘ALL’’ and ‘‘Selective’’ are 33.3 and 34.1%, and the cache usage rate
are 9.6 and 16.6%, respectively. The arrays H yði; j; kÞ, H yði À 1; j; kÞ (line 08 in Figure 14), H xði; j; kÞ (line 11) and
H zði À 1; j; kÞ (line 12) can load data from the cache on ‘‘Selective.’’ In ‘‘ALL,’’ H yði À 1; j; kÞ (line 08) and
01 DO 10 k=0,Nz
02 DO 10 i=0,Nx
03 DO 10 j=0,Ny
04 E_x(i,j,k) = C_x_a(i,j,k)*E_x(i,j,k)
05 & + C_x_b(i,j,k) * (( H_z(i,j,k) - H_z(i,j-1,k))/dy
06 & - (H_y(i,j,k) - H_y(i,j,k-1))/dz - E_x_Current(i,j,k))
07 E_z(i,j,k) = C_z_a(i,j,k)*E_z(i,j,k)
08 & + C_z_b(i,j,k) * ((H_y(i,j,k)-H_y(i-1,j,k) )/dx
09 & - (H_x(i,j,k)-H_x(i,j-1,k) )/dy - E_z_Current(i,j,k))
10 E_y(i,j,k) = C_y_a(i,j,k)*E_y(i,j,k)
11 & + C_y_b(i,j,k) * ((H_x(i,j,k) - H_x(i,j,k-1) )/dz
12 & - (H_z(i,j,k) - H_z(i-1,j,k))/dx - E_y_Current(i,j,k))
13 10 CONTINU
Figure 14. A kernel loop of GPR.
0
10
20
30
30
32
34
36
38
40
256K 512K 1M 2M 4M 8M
CacheUsageRate(%)
Efficiency(%)
Cache Size (B)
ALL (Performance)
Selective (Performance)
ALL (Cache usage rate)
Selective (Cache usage rate)
Figure 15. Efficiency of selective caching and cache hit rate on 2 B/FLOP system (GPR).
0
10
20
30
0
10
20
30
40
256K 512K 1M 2M 4M 8M
CacheUsageRate(%)
RecoveryRateofPerformanc(%)
Cache Size (B)
ALL (Recover Rate)
Selective (Recover Rate)
ALL (Cache usage rate)
Selective (Cache usage rate)
Figure 16. Recovery rate of performance and cache hit rate on selective caching (GPR).
Characteristics of an On-Chip Cache on NEC SX Vector Architecture 61
H zði À 1; j; kÞ (line 12) cause cache misses, because the reused data of these arrays are replaced by other array data.
Figures 15 and 16 indicate that the performance and the cache usage rate increase as the cache size increase. On the
4 MB cache, the cache hit rate of ‘‘Selective’’ is 24% and its efficiency is 3% higher than that of ‘‘ALL.’’ The recovery
rate of performance is mere 34%. However, GPR cannot achieve high cache usage rates and recovery rates of
performance by selective caching. This is because this loop has many non-locality arrays; C x a, E x, E x Current etc.
Figure 17 shows one of kernel loops of PRF; the loop is a high cost routine. The size of each array in PRF is 18 MB,
and many loops are doubly nested loop. Thus, the arrays are treated as two dimensions array, and the size of cached
data per array is 2 MB only. wDY1YC1, wDY1YC2 and wDY1YC3 in Figure 17 are defined in the preceding loop. In
addition, wDY1YC1 and wDY1YC2 have the spatial locality. We select these arrays for selective caching. Here,
wDY1YC3ðI; KK À 1; LÞ (line 04 in Figure 17), wDY1YC2ðI; KK; LÞ (line 05), wDY1YC1ðI; KK; LÞ (line 06) and
wDY1YC2ðI; KK; LÞ (line 06) can reuse data on the register files. Thus, these arrays do not access the cache.
Figures 18 and 19 show the efficiency and the recovery rate of the performance from the 2 B/FLOP system to the
4 B/FLOP system by selective caching in PRF. Figure 18 indicates that the cache hit rates and efficiency of ‘‘Selective’’
are the same as that of ‘‘ALL’’ until 2 MB cache size. wDY1YC2ðI; KK À 1; LÞ (line 03) and wDY1YC1ðI; KK À 1; LÞ
(line 08) are cache hit arrays in each case, and the cache hit rate is 33%. Over 4 MB cache size, ‘‘Selective’’ achieves a
66% cache hit rate, resulting in a 30+% improvement in efficiency. Moreover, Figure 19 indicates that the recovery
rate is 78%.
The results of GPR and PRF show that the selective caching improves the performance of the applications.
Compared with GPR, PRF has higher values regarding both the cache usage rate and improved efficiency, because the
ratio of arrays with locality per loop on PRF is higher than that of GPR. In GPR the reused data are only defined in the
loop, and the cache miss always occurs in this loop. For example, arrays H yði; j; kÞ (line 08 in Figure 14) hits the data
of H yði; j; kÞ (line 06), but H yði; j; kÞ (line 06) cannot hit the data due to their first accesses (cold miss). On the other
hand, the cached data of PRF are provided in the immediately preceding loop, thus cache misses do not occur like the
first access in PRF.
01 DO KK = 2,NJ
02 DO I = 1,NI
03 wDY1YC3(I,KK-1,L) = wDY1YC3(I,KK-1,L) * wDY1YC2(I,KK-1,L)
04 wDY1YC2(I,KK,L) = wDY1YC2(I,KK,L) – wDY1YC1(I,KK,L) * wDY1YC3(I,KK-1,L)
05 wDY1YC2(I,KK,L) = 1.D0 / wDY1YC2(I,KK,L)
06 wDY1YC(I,KK,L) = (wPHIY12(I,KK) – wDY1YC1(I,KK,L) * wDY1YC1(I,KK-1,L)) * wDY1YC2(I,KK,L)
07 END DO
08 END DO
Figure 17. High cost kernel loop of PRF.
0
10
20
30
40
50
60
70
20
25
30
35
256K 512K 1M 2M 4M 8M
CacheUsageRate(%)
Efficiency(%)
Cache Size (B)
ALL (Performance)
Selective (Performance)
ALL (Cache usage rate)
Selective (Cache usage rate)
Figure 18. Efficiency and cache hit rate of selective caching on performance (PRF).
62 MUSA et al.
6.2 Effects of Loop-unrolling and Caching
Loop unrolling is effective for higher utilization of vector pipelines, however, loop unrolling also needs more cache
capacity to capture all the arrays of unrolled loops. Therefore, we discuss the relationship between loop unrolling and
vector caching on the performance.
Figure 20 shows a code outer unrolled 4 times, whose original code is shown in Figure 7. In the outer unrolling case,
array wary of the outer loop index i (line 03 in Figure 20) reuses the data of the index i À 1 on the cache and arrays
wary (line 04, 05 and 06) reuse the data on register files of wary (line 03), hence, memory references are reduced.
However, the cache capacity for gd dip increases due to an increase in the number of arrays reference in the innermost
loop. Figure 21 shows the efficiency and the cache hit rate of the outer unrolling case on PBM with the 2 B/FLOP
system. ‘‘2B/F’’ indicates the without-cache case, and the efficiency constantly increases as the degree of unrolling
increases. The cache hit rate of the 0.5 MB cache case is 49% on the non-unrolling case and becomes poor as the loop
unrolling proceeds, because the cached data of wary are evicted from the cache by gd dip. Thus, the efficiency of the
0.5 MB cache case achieves 49% on the non-unrolling case, and it is similar to that of the 2 B/FLOP system with
unrolling. In this case, a conflicting effect between caching and unrolling appears. Moreover, the cache hit rates of the
8 MB cache case decrease as the degree of unrolling increases. The cached data remain on the cache in this case, but the
cache hit rate decreases due to a decrease in the number of wary (line 03) references. However, the efficiency is
approximately constant; 48 or 49% because unrolling covers the losing effect of caching. This case shows that the
effects of caching are comparable to that of unrolling.
Figure 22 shows a code obtained by unrolling code two times shown in Figure 17. Outer unrolling reduces the
memory references of the arrays wDY1YC2ðI; KK; LÞ (line 07 in Figure 22) and wDY1YC1ðI; KK; LÞ (line 10).
Figure 23 shows the efficiency and the cache hit rate of outer unrolling on PRF with the 2 B/FLOP system. The cache
hit rates of both the 0.5 and 8 MB caches gradually reduce as the degree of unrolling increase, because of an increase in
the cache miss rate. In this case, the highest efficiency is 33% on the 8 MB cache with the unrolling degree of 4 since
the gain by loop unrolling covers the loss due to the increase in the miss rate. It is higher than the selective caching
case; it is 30.7% (Ref. Figure 18). Thus, this case indicates the synergistic effect of both caching and unrolling.
0
10
20
30
40
50
60
70
20
40
60
80
256K 512K 1M 2M 4M 8M
CacheUsageRate(%)
RecoveryRateofperformance(%)
Cache Size (B)
ALL (Performance)
Selective (Performance)
ALL (Cache usage rate)
Selective (Cache usage rate)
Figure 19. Recovery rate of performance and cache hit rate on selective caching (PRF).
01 do i=1,ncells,4
02 do j=1,ncells
03 wf_dip(i)=wf_dip(i)+gd_dip(j,i)*wary(j)
04 wf_dip(i+1)=wf_dip(i+1)+gd_dip(j,i+1)*wary(j)
05 wf_dip(i+2)=wf_dip(i+2)+gd_dip(j,i+2)*wary(j)
06 wf_dip(i+3)=wf_dip(i+3)+gd_dip(j,i+3)*wary(j)
07 end do
08 end do
Figure 20. Outer unrolling loop of PBM.
Characteristics of an On-Chip Cache on NEC SX Vector Architecture 63
These results indicate that loop unrolling has both conflicting and synergistic effects of caching. Loop unrolling is a
most basic optimization and has beneficial effects in various applications. Therefore, caching has to be used carefully as
a complementary tuning option for loop unrolling.
7. Conclusions
This paper has presented the characteristics of an on-chip vector cache with the bypass mechanism on the vector
supercomputer NEC SX architecture. We have demonstrated the relationship between the cache hit rate and the
0
10
20
30
40
50
60
0
10
20
30
40
50
60
Non-Unroll Unroll2 Unroll4 Unroll8 Unroll16
CacheHitRate(%)
Efficiency(%)
2B/F 0.5MB 8MB Hit rate 0.5MB Hit rate 8MB
Figure 21. Efficiency and cache hit rate of outer unrolling on performance (PBM).
01 DO KK = 2,NJ
02 DO I = 1,NI
03 wDY1YC3(I,KK-1,L) = wDY1YC3(I,KK-1,L) * wDY1YC2(I,KK-1,L)
04 wDY1YC2(I,KK,L) = wDY1YC2(I,KK,L) – wDY1YC(I,KK,L) * wDY1YC3(I,KK-1,L)
05 wDY1YC2(I,KK,L) = 1.D0 / wDY1YC2(I,KK,L)
06 wDY1YC(I,KK,L) = (wPHIY12(I,KK) – wDY1YC1(I,KK,L) * wDY1YC1(I,KK-1,L)) * wDY1YC2(I,KK,L)
07 wDY1YC3(I,KK,L) = wDY1YC3(I,KK,L) * wDY1YC2(I,KK,L)
08 wDY1YC2(I,KK+1,L) = wDY1YC2(I,KK+1,L) – wDY1YC(I,KK+1,L) * wDY1YC3(I,KK,L)
09 wDY1YC2(I,KK+1,L) = 1.D0 / wDY1YC2(I,KK+1,L)
10 wDY1YC(I,KK+1,L) = (wPHIY12(I,KK+1) – wDY1YC1(I,KK+1,L) * wDY1YC1(I,KK,L)) * wDY1YC2(I,KK+1,L)
11 END DO
12 END DO
Figure 22. Outer unrolling loop of PRF.
0
10
20
30
40
50
60
0
5
10
15
20
25
30
35
Non-Unroll Unroll 2 Unroll 4 Unroll 8 Unroll 16
CacheHitRate(%)
Efficiency(%)
2B/F 0.5MB 8MB hit rate 0.5MB hit rate 8MB
Figure 23. Efficiency and cache hit rate of outer unrolling on performance (PRF).
64 MUSA et al.
performance using two DAXPY-like loops. Moreover, we have clarified the same tendency of the relationship on the
five scientific applications. The vector cache can recover the lack of the memory bandwidth, and boosts the
computation efficiency of the 2 B/FLOP and 1 B/FLOP systems. Especially, as cache hit rates are 50% or more, the
2 B/FLOP system can achieve a performance comparable to the 4 B/FLOP system. The vector cache with a bypass
mechanism can provide the data from both the memory and the cache in time, and the sustained memory bandwidth for
the registers are increased. Therefore, the vector cache has a potential for the future SX systems to cover the shortage of
their memory bandwidth.
We have also discussed the relationship between performance and cache design parameters such as cache
associativity and cache latency. We have examined the effect of the cache associativity setting from 2-way to 8-way.
When the associativity is two or more, the cache hit rate is not sensitive to the associativity in the five applications. In
addition, we have examined the effect of the cache latency on the performance when changing it to 15, 50, 85 and
100% of the memory latency. We have demonstrated the computational efficiencies of five applications are constant
across these latency changes, when the vector loop lengths of the applications are 256 or more. In these cases, the
latency is hidden by pipelined vector operations. However, in the case of shorter vector loop lengths, the cache latency
affects the performance, and the 15% latency case of Kernel (1) has two times higher efficiency than the 100% case in
the 2 B/FLOP system. In addition, the 4 B/FLOP system is also boosted due to the effect of the short latency of the
cache. The above results indicate that the performance of cache does not largely depend on the cache associativity,
however, the cache latency affects the performance of applications with short vector loop lengths.
Finally we have discussed selective caching and the relationship between loop-unrolling and caching. We have
shown that selective caching, which is controlled by means of the bypass mechanism, is effective for efficient use of
limited on-chip caches. We have examined two cases; the ratios of arrays with locality per loop are higher and lower
cases. In each case, the higher performance is obtained by selective caching, compared with all the data caching. In
addition, the loop unrolling is useful in the improvement of performance, and caching is complementary to the effect of
loop unrolling.
In the future work, we will evaluate other scientific applications using the vector cache, and quantitatively
discussions the relationship between the bank conflicts and the cache latency. In addition, we will also investigate the
mechanism of vector cache on multi core vector processors.
Acknowledgments
The authors would like to gratefully thank: Dr. M. Sato, Dr. A. Hasegawa, Professor G. Masuya, Professor T. Ota,
Professor K. Sawaya of Tohoku University, for their advice about applications. We also would like to thank Matsuo
Aoyama and Hirofumi Akiba of NEC Software Tohoku for their assistance in experiments.
REFERENCES
[1] Abts, D., et al., ‘‘The Cray BlackWidow: A Highly Scalable Vector Multiprocessor,’’ In Proceedings of the ACM/IEEE
SC2007 conference (2007).
[2] Abts, D., Scott, S., and Lilja, D. J., ‘‘So Many States, So Little Time: Verifying Memory Coherence in the Cray X1,’’ In
International Parallel and Distributed Processing Symposium (2003, April).
[3] Ariyoshi, K., Matsuzawa, T., and Hasegawa, A., ‘‘The key frictional parameters controlling spatial variation in the speed of
postseismic slip propagation on a subduction plate boundary,’’ Earth Planet. Sci. Lett., 256: 136–146 (2007).
[4] Dunigan Jr., T. H., Vetter, J. S., White III, J. B., and Worley, P. H., ‘‘Performance Evaluation of The Cray X1 Distributed
Shared-Memory Architecture,’’ In IEEE MICRO, 25: 30–40 (2005).
[5] Gee, J. D., and Smith, A. J., ‘‘Vector Processor Caches,’’ Technical Report UCB/CSD-92-707, University of California (1992,
October).
[6] Gee, J. D., and Smith, A. J., ‘‘The Effectiveness of Caches for Vector Processors,’’ In The 8th international Conference on
Supercomputing 1994, pp. 333–343 (1994, July).
[7] Kitagawa, K., Tagaya, S., Hagihara, Y., and Kanoh, Y., ‘‘A Hardware Overview of SX-6 and SX-7 Supercomputer,’’ NEC
Research & Development, 44: 2–7 (2003).
[8] Kobayashi, H., ‘‘Implication of Memory Performance in Vector-Parallel and Scalar-Parallel HEC,’’ In M. Resch et al. (Eds.),
High Performance Computing on Vector Systems 2006, pp. 21–50, Springer-Verlag (2006).
[9] Kobayashi, H., Musa, A., Sato, Y., Takizawa, H., and Okabe, K., ‘‘The potential of On-chip Memory Systems for Future
Vector Architectures,’’ In M. Resch et al. (Eds.), High Performance Computing on Vector Systems 2007, pp. 247–264,
Springer-Verlag (2007).
[10] Kobayashi, T., Feng, X., and Sato, M., ‘‘FDTD simulation on array antenna SAR-GPR for land mine detection,’’ In
Proceedings of SSR2003: 1st International Symposium on Systems and Human Science, Osaka, Japan, pp. 279–283 (2003,
November).
[11] Kontothanassis, L. I., Sugumar, R. A., Faanes, G. J., Smith, J. E., and Scott, M. L., ‘‘Cache Performance in Vector
Supercomputers,’’ In Proceedings of the 1994 conference on Supercomputing, Washington D.C., pp. 255–264 (1994).
[12] Kunz, K. S., and Luebbers, R. J., ‘‘The Finite Difference Time Domain Method for Electromagnetics,’’ Boca Raton, FL: CRC
Press (1993).
[13] Musa, A., Sato, Y., Egawa, R., Takizawa, H., Okabe, K., and Kobayashi, H., ‘‘An On-Chip cache Design for Vector
Characteristics of an On-Chip Cache on NEC SX Vector Architecture 65
Processors,’’ In Proceedings of the 8th MEDEA workshop on Memory performance: Dealing with Applications, systems and
architecture, Romania, pp. 17–23 (2007).
[14] Musa, A., Takizawa, H., Okabe, K., Soga, T., and Kobayashi, H., ‘‘Implications of Memory Performance for Highly Efficient
Supercomputing of Scientific Applications,’’ In Proceedings of the 4th International Symposium on Parallel and Distributed
Processing and Application 2006, Italy, pp. 845–858 (2006).
[15] Nakajima, M., Yanaoka, H., Yoshikawa, H., and Ota, T., ‘‘Numerical Simulation of Three-Dimensional Separated Flow and
Heat Transfer around Staggerd Surface-Mounted Rectangular Blocks in a Channel,’’ Numerical Heat Transfer (PartA), 47:
691–708 (2005).
[16] Oliker, L., et al., ‘‘A Performance Evaluation of the Cray X1 for Scientific Applications,’’ In VECPAR’04: 6th International
Meeting on High Performance Computing for Computational Science, Valencia, Spain (2004, June).
[17] Senta, T., Takahashi, A., Kadoi, T., Itoh, T., Kawaguchi, S., and Kishida, Y., ‘‘Itanium2 32-way Server System Architecture,’’
NEC Research & Development, 44: 8–12 (2003).
[18] SGI, ‘‘SGI Altix 4700 Servers and Supercomputers,’’ Technical report, SGI Inc., http://www.sgi.com/pdfs/3867.pdf (2007).
[19] Shan, H., and Strohmaier, E., ‘‘Performance Characteristics of the Cray X1 and Their Implications for Application
Performance Tuning,’’ In 18th Annual International Conference on Supercomputing, Malo, France, pp. 175–183 (2004).
[20] Takagi, Y., Sato, H., Wagatsuma, Y., Mizuno, K., and Sawaya, K., ‘‘Study of High Gain and Broadband Antipodal Fermi
Anetnna with Corrugation,’’ In 2004 International Symposium on Antennas and Propagation, Vol. 1, pp. 69–72 (2004).
[21] Tsuboi, K., and Masuya, G., ‘‘Direct Numerical Simulations for Instabilities of Remixed Planar Flames,’’ In The Fourth Asia-
Pacific Conference on Combustion, Nanjing, China, pp. 136–139 (2003, November).
66 MUSA et al.

Contenu connexe

Tendances

Design and Simulation of a 16kb Memory using Memory Banking technique
Design and Simulation of a 16kb Memory using Memory Banking techniqueDesign and Simulation of a 16kb Memory using Memory Banking technique
Design and Simulation of a 16kb Memory using Memory Banking techniqueIRJET Journal
 
Different Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache MemoryDifferent Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache MemoryDhritiman Halder
 
Introduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A SupercomputerIntroduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A SupercomputerFörderverein Technische Fakultät
 
Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...eSAT Publishing House
 
(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...Bomm Kim
 
Internet data mining 2006
Internet data mining   2006Internet data mining   2006
Internet data mining 2006raj_vij
 
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...Serhan
 
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with PrioritiesA Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Prioritiesidescitation
 
Analyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation NetworksAnalyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation Networksbalmanme
 
NETWORK-AWARE DATA PREFETCHING OPTIMIZATION OF COMPUTATIONS IN A HETEROGENEOU...
NETWORK-AWARE DATA PREFETCHING OPTIMIZATION OF COMPUTATIONS IN A HETEROGENEOU...NETWORK-AWARE DATA PREFETCHING OPTIMIZATION OF COMPUTATIONS IN A HETEROGENEOU...
NETWORK-AWARE DATA PREFETCHING OPTIMIZATION OF COMPUTATIONS IN A HETEROGENEOU...IJCNCJournal
 
C++ neural networks and fuzzy logic
C++ neural networks and fuzzy logicC++ neural networks and fuzzy logic
C++ neural networks and fuzzy logicJamerson Ramos
 
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA csandit
 
THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN...
THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN...THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN...
THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN...IJCNCJournal
 
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...Kyong-Ha Lee
 
High performance modified bit-vector based packet classification module on lo...
High performance modified bit-vector based packet classification module on lo...High performance modified bit-vector based packet classification module on lo...
High performance modified bit-vector based packet classification module on lo...IJECEIAES
 

Tendances (18)

shashank_hpca1995_00386533
shashank_hpca1995_00386533shashank_hpca1995_00386533
shashank_hpca1995_00386533
 
Crayl
CraylCrayl
Crayl
 
Design and Simulation of a 16kb Memory using Memory Banking technique
Design and Simulation of a 16kb Memory using Memory Banking techniqueDesign and Simulation of a 16kb Memory using Memory Banking technique
Design and Simulation of a 16kb Memory using Memory Banking technique
 
Different Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache MemoryDifferent Approaches in Energy Efficient Cache Memory
Different Approaches in Energy Efficient Cache Memory
 
Introduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A SupercomputerIntroduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A Supercomputer
 
RL-Cache: Learning-Based Cache Admission for Content Delivery
RL-Cache: Learning-Based Cache Admission for Content DeliveryRL-Cache: Learning-Based Cache Admission for Content Delivery
RL-Cache: Learning-Based Cache Admission for Content Delivery
 
Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...
 
(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...(Im2col)accelerating deep neural networks on low power heterogeneous architec...
(Im2col)accelerating deep neural networks on low power heterogeneous architec...
 
Internet data mining 2006
Internet data mining   2006Internet data mining   2006
Internet data mining 2006
 
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...
Implementation and Optimization of FDTD Kernels by Using Cache-Aware Time-Ske...
 
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with PrioritiesA Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
A Kernel-Level Traffic Probe to Capture and Analyze Data Flows with Priorities
 
Analyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation NetworksAnalyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation Networks
 
NETWORK-AWARE DATA PREFETCHING OPTIMIZATION OF COMPUTATIONS IN A HETEROGENEOU...
NETWORK-AWARE DATA PREFETCHING OPTIMIZATION OF COMPUTATIONS IN A HETEROGENEOU...NETWORK-AWARE DATA PREFETCHING OPTIMIZATION OF COMPUTATIONS IN A HETEROGENEOU...
NETWORK-AWARE DATA PREFETCHING OPTIMIZATION OF COMPUTATIONS IN A HETEROGENEOU...
 
C++ neural networks and fuzzy logic
C++ neural networks and fuzzy logicC++ neural networks and fuzzy logic
C++ neural networks and fuzzy logic
 
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
LARGE SCALE IMAGE PROCESSING IN REAL-TIME ENVIRONMENTS WITH KAFKA
 
THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN...
THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN...THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN...
THRESHOLD BASED VM PLACEMENT TECHNIQUE FOR LOAD BALANCED RESOURCE PROVISIONIN...
 
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
 
High performance modified bit-vector based packet classification module on lo...
High performance modified bit-vector based packet classification module on lo...High performance modified bit-vector based packet classification module on lo...
High performance modified bit-vector based packet classification module on lo...
 

Similaire à Characteristics of an on chip cache on nec sx

Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...ijesajournal
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...ijesajournal
 
Interface for Performance Environment Autoconfiguration Framework
Interface for Performance Environment Autoconfiguration FrameworkInterface for Performance Environment Autoconfiguration Framework
Interface for Performance Environment Autoconfiguration FrameworkLiang Men
 
Analysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsAnalysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsJames McGalliard
 
2014 IEEE DOTNET NETWORKING PROJECT Content caching-and-scheduling-in-wireles...
2014 IEEE DOTNET NETWORKING PROJECT Content caching-and-scheduling-in-wireles...2014 IEEE DOTNET NETWORKING PROJECT Content caching-and-scheduling-in-wireles...
2014 IEEE DOTNET NETWORKING PROJECT Content caching-and-scheduling-in-wireles...IEEEFINALSEMSTUDENTSPROJECTS
 
1.multicore processors
1.multicore processors1.multicore processors
1.multicore processorsHebeon1
 
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURES
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURESREDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURES
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURESijcsit
 
Reducing Competitive Cache Misses in Modern Processor Architectures
Reducing Competitive Cache Misses in Modern Processor ArchitecturesReducing Competitive Cache Misses in Modern Processor Architectures
Reducing Competitive Cache Misses in Modern Processor ArchitecturesAIRCC Publishing Corporation
 
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURES
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURESREDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURES
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURESijcsit
 
Panasas ® Los Alamos National Laboratory
Panasas ® Los Alamos National LaboratoryPanasas ® Los Alamos National Laboratory
Panasas ® Los Alamos National LaboratoryPanasas
 
From Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersFrom Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersRyousei Takano
 
IRJET-A Review on Trends in Multicore Processor Based on Cache and Power Diss...
IRJET-A Review on Trends in Multicore Processor Based on Cache and Power Diss...IRJET-A Review on Trends in Multicore Processor Based on Cache and Power Diss...
IRJET-A Review on Trends in Multicore Processor Based on Cache and Power Diss...IRJET Journal
 
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSSTUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSijdpsjournal
 
Design of storage benchmark kit framework for supporting the file storage ret...
Design of storage benchmark kit framework for supporting the file storage ret...Design of storage benchmark kit framework for supporting the file storage ret...
Design of storage benchmark kit framework for supporting the file storage ret...IJECEIAES
 
Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util...
Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util...Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util...
Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util...VLSICS Design
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingeSAT Journals
 

Similaire à Characteristics of an on chip cache on nec sx (20)

Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
 
Interface for Performance Environment Autoconfiguration Framework
Interface for Performance Environment Autoconfiguration FrameworkInterface for Performance Environment Autoconfiguration Framework
Interface for Performance Environment Autoconfiguration Framework
 
Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
 
IEEExeonmem
IEEExeonmemIEEExeonmem
IEEExeonmem
 
Analysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific ApplicationsAnalysis of Multicore Performance Degradation of Scientific Applications
Analysis of Multicore Performance Degradation of Scientific Applications
 
Cache memory
Cache memoryCache memory
Cache memory
 
2014 IEEE DOTNET NETWORKING PROJECT Content caching-and-scheduling-in-wireles...
2014 IEEE DOTNET NETWORKING PROJECT Content caching-and-scheduling-in-wireles...2014 IEEE DOTNET NETWORKING PROJECT Content caching-and-scheduling-in-wireles...
2014 IEEE DOTNET NETWORKING PROJECT Content caching-and-scheduling-in-wireles...
 
1.multicore processors
1.multicore processors1.multicore processors
1.multicore processors
 
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURES
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURESREDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURES
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURES
 
Reducing Competitive Cache Misses in Modern Processor Architectures
Reducing Competitive Cache Misses in Modern Processor ArchitecturesReducing Competitive Cache Misses in Modern Processor Architectures
Reducing Competitive Cache Misses in Modern Processor Architectures
 
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURES
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURESREDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURES
REDUCING COMPETITIVE CACHE MISSES IN MODERN PROCESSOR ARCHITECTURES
 
An efficient multi-level cache system for geometrically interconnected many-...
An efficient multi-level cache system for geometrically  interconnected many-...An efficient multi-level cache system for geometrically  interconnected many-...
An efficient multi-level cache system for geometrically interconnected many-...
 
Panasas ® Los Alamos National Laboratory
Panasas ® Los Alamos National LaboratoryPanasas ® Los Alamos National Laboratory
Panasas ® Los Alamos National Laboratory
 
From Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersFrom Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computers
 
IRJET-A Review on Trends in Multicore Processor Based on Cache and Power Diss...
IRJET-A Review on Trends in Multicore Processor Based on Cache and Power Diss...IRJET-A Review on Trends in Multicore Processor Based on Cache and Power Diss...
IRJET-A Review on Trends in Multicore Processor Based on Cache and Power Diss...
 
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORSSTUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
STUDY OF VARIOUS FACTORS AFFECTING PERFORMANCE OF MULTI-CORE PROCESSORS
 
Design of storage benchmark kit framework for supporting the file storage ret...
Design of storage benchmark kit framework for supporting the file storage ret...Design of storage benchmark kit framework for supporting the file storage ret...
Design of storage benchmark kit framework for supporting the file storage ret...
 
Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util...
Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util...Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util...
Optimized Design of 2D Mesh NOC Router using Custom SRAM & Common Buffer Util...
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passing
 

Plus de Léia de Sousa

Algoritmos geometricos
Algoritmos geometricosAlgoritmos geometricos
Algoritmos geometricosLéia de Sousa
 
Dark silicon and the end of multicore scaling
Dark silicon and the end of multicore scalingDark silicon and the end of multicore scaling
Dark silicon and the end of multicore scalingLéia de Sousa
 
Coerãªncia+ +memoria
Coerãªncia+ +memoriaCoerãªncia+ +memoria
Coerãªncia+ +memoriaLéia de Sousa
 
Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736...
Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736...Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736...
Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736...Léia de Sousa
 
Chrome server2 print_http_www_hardware_com_br_dicas_entendendo_cache_ht_13737...
Chrome server2 print_http_www_hardware_com_br_dicas_entendendo_cache_ht_13737...Chrome server2 print_http_www_hardware_com_br_dicas_entendendo_cache_ht_13737...
Chrome server2 print_http_www_hardware_com_br_dicas_entendendo_cache_ht_13737...Léia de Sousa
 
Data cache design itanium 2
Data cache design itanium 2Data cache design itanium 2
Data cache design itanium 2Léia de Sousa
 
Cache performance-x86-2009
Cache performance-x86-2009Cache performance-x86-2009
Cache performance-x86-2009Léia de Sousa
 

Plus de Léia de Sousa (11)

Algoritmos geometricos
Algoritmos geometricosAlgoritmos geometricos
Algoritmos geometricos
 
Dark silicon and the end of multicore scaling
Dark silicon and the end of multicore scalingDark silicon and the end of multicore scaling
Dark silicon and the end of multicore scaling
 
Cray xt3
Cray xt3Cray xt3
Cray xt3
 
Coerãªncia+ +memoria
Coerãªncia+ +memoriaCoerãªncia+ +memoria
Coerãªncia+ +memoria
 
Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736...
Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736...Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736...
Chrome server2 print_http_www_uni_mannheim_de_acm97_papers_soderquist_m_13736...
 
Chrome server2 print_http_www_hardware_com_br_dicas_entendendo_cache_ht_13737...
Chrome server2 print_http_www_hardware_com_br_dicas_entendendo_cache_ht_13737...Chrome server2 print_http_www_hardware_com_br_dicas_entendendo_cache_ht_13737...
Chrome server2 print_http_www_hardware_com_br_dicas_entendendo_cache_ht_13737...
 
Data cache design itanium 2
Data cache design itanium 2Data cache design itanium 2
Data cache design itanium 2
 
Cache performance-x86-2009
Cache performance-x86-2009Cache performance-x86-2009
Cache performance-x86-2009
 
Ultrasparc3 (mpr)
Ultrasparc3 (mpr)Ultrasparc3 (mpr)
Ultrasparc3 (mpr)
 
Sx 6-single-node
Sx 6-single-nodeSx 6-single-node
Sx 6-single-node
 
Gr spice wrks
Gr spice wrksGr spice wrks
Gr spice wrks
 

Dernier

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 

Dernier (20)

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

Characteristics of an on chip cache on nec sx

  • 1. Characteristics of an On-Chip Cache on NEC SX Vector Architecture Akihiro MUSA1;3Ã , Yoshiei SATO1y , Ryusuke EGAWA1;2z , Hiroyuki TAKIZAWA1;2x , Koki OKABE2} , and Hiroaki KOBAYASHI1;2k 1 Graduate School of Information Sciences, Tohoku University, Sendai 980-8578, Japan 2 Cyberscience Center, Tohoku University, Sendai 980-8578, Japan 3 NEC Corporation, Tokyo 108-8001, Japan Received May 6, 2008; final version accepted December 10, 2008 Thanks to the highly effective memory bandwidth of the vector systems, they can achieve the high computation efficiency for computation-intensive scientific applications. However, they have been encountering the memory wall problem and the effective memory bandwidth rate has decreased, resulting in the decrease in the bytes per flop rates of recent vector systems from 4 (SX-7 and SX-8) to 2 (SX-8R) and 2.5 (SX-9). The situation is getting worse as many functions units and/or cores will be brought into a single chip, because the pin bandwidth is limited and does not scale. To solve the problem, we propose an on-chip cache, called vector cache, to maintain the effective memory bandwidth rate of future vector supercomputers. The vector cache employs a bypass mechanism between the main memory and register files under software controls. We evaluate the performance of the vector cache on the NEC SX vector processor architecture with bytes per flop rates of 2 B/FLOP and 1 B/FLOP, to clarify the basic characteristics of the vector cache. For the evaluation, we use the NEC SX-7 simulator extended with the vector cache mechanism. Benchmark programs for performance evaluation are two DAXPY-like loops and five leading scientific applications. The results indicate that the vector cache boosts the computational efficiencies of the 2 B/FLOP and 1 B/FLOP systems up to the level of the 4 B/FLOP system. Especially, in the case where cache hit rates exceed 50%, the 2 B/FLOP system can achieve a performance comparable to the 4 B/ FLOP system. The vector cache with the bypass mechanism can provide the data both from the main memory and the cache simultaneously. In addition, from the viewpoints of designing the cache, we investigate the impact of cache associativity on the cache hit rate, and the relationship between cache latency and the performance. The results also suggest that the associativity hardly affects the cache hit rate, and the effects of the cache latency depend on the vector loop length of applications. The cache shorter latency contributes to the performance improvement of the applications with shorter loop lengths, even in the case of the 4 B/FLOP system. In the case of longer loop lengths of 256 or more, the latency can effectively be hidden, and the performance is not sensitive to the cache latency. Finally, we discuss the effects of selective caching using the bypass mechanism and loop unrolling on the vector cache performance for the scientific applications. The selective caching is effective for efficient use of the limited cache capacity. The loop unrolling is also effective for the improvement of performance, resulting in a synergistic effect with caching. However, there are exceptional cases; the loop unrolling worsens the cache hit rate due to an increase in the working space to process the unrolled loops over the cache. In this case, an increase in the cache miss rate cancels the gain obtained by unrolling. KEYWORDS: Vector processing, Vector cache, Performance characterization, Memory system, Scientific application 1. Introduction Vector supercomputers have high computation efficiency for computation-intensive scientific applications. In our previous work (Kobayashi (2006), Musa et al. (2006)), we have shown that NEC SX vector supercomputers can achieve high sustained performance in several leading applications. This is due to their high bytes per flop rates compared to scalar systems. However, as recent advantages in VLSI technology have also been accelerating processor speeds, vector supercomputers have been encountering the memory wall problem. As a result, it is getting harder for vector supercomputers to keep a high memory bandwidth balanced with the improvement of their flop/s performance. On the NEC SX systems, the bytes per flop rate has decreased from 8 B/FLOP to 2 B/FLOP during the last twenty years. Ã E-mail: musa@sc.isc.tohoku.ac.jp y E-mail: yoshiei@sc.isc.tohoku.ac.jp z E-mail: egawa@isc.tohoku.ac.jp x E-mail: tacky@isc.tohoku.ac.jp } E-mail: okabe@isc.tohoku.ac.jp k E-mail: koba@isc.tohoku.ac.jp Interdisciplinary Information Sciences Vol. 15, No. 1 (2009) 51–66 #Graduate School of Information Sciences, Tohoku University ISSN 1340-9050 print/1347-6157 online DOI 10.4036/iis.2009.51
  • 2. We have proposed an on-chip cache, called vector cache, to maintain the effective memory bandwidth rate of future vector supercomputers (Kobayashi et al. (2007)) and provided the early evaluation of the vector cache using the NEC SX simulator (Musa et al. (2007)). The vector cache employs a bypass mechanism between a vector register and a main memory under software controls, and load/store data are selectively cached. The bypass mechanism and the cache work complementarily together to provide data to the register file, resulting in the high effective memory bandwidth rate. However, we did not investigate the fundamental configuration of the vector cache, i.e. cache associativity and cache latency. Moreover, our previous work did not discuss the relationship between loop unrolling and caching. The main contribution of this paper is to clarify the characteristics of the on-chip vector cache on the vector supercomputers. We investigate the relationship among cache associativity, cache latency and performance using five practical applications selected from the areas of advanced computational sciences. In addition, we discuss in detail selective caching and loop unrolling under the vector cache to improve the performance. The rest of the paper is organized as follows. Section 2 presents related work. Section 3 describes a vector architecture with an on-chip vector cache discussed in this paper. Section 4 provides our experimental methodology and benchmark programs for performance evaluation. Section 5 presents experimental results when executing the kernel loops and five real applications on our architecture. Through the experimental results, we clarify the characteristics of the vector cache. In Section 6, we examine the effect of selective caching on the performance, in which only the data with high localities of reference are cached. In addition, we discuss the relationship between loop unrolling and the vector cache. Finally, Section 7 summarizes the paper with some future work. 2. Related Work Vector caches have been previously studied by a number of researchers using trace driven simulators of convectional vector architectures in the 1990’s. Gee et al. have provided an evaluation of the cache performance in vector architectures: Cray X-MP and Ardent Titan (Gee and Smith (1992), Gee and Smith (1994)). Their caches employ a full- associative cache and an n-way set associative cache. The line sizes range from 16 to 128 bytes, and the maximum cache size is 4 MB. Their benchmark programs are selected from Livermore Loops, NAS kernels, Los Alamos benchmark set and real applications in chemistry, fluid dynamics and linear analysis. They showed that vector references contained somewhat less temporal locality, but large amounts of spatial locality compared to instruction and scalar references, and the cache improved the computational performance of the vector processors. Kontothanassis et al. have evaluated the cache performance of Cray C90 architecture using NAS parallel benchmarks (Kontothanassis et al. (1994)). Their cache is a direct-mapped cache with a 128 bytes line size, and the maximum cache size is 8 MB. They showed that the cache reduced memory traffic from off-chip memory, and a DRAM memory system with the cache was competitive to a SRAM memory system. Modern vector supercomputer Cray X1 has a 2 MB vector cache, called Ecache, organized as a 2-way set associative write-back cache with 32 bytes line size and the least recently used (LRU) replacement policy (Abts et al. (2003)). The performance of Cray X1 has been evaluated using real scientific applications by several researchers (Dunigan Jr. et al. (2005), Oliker et al. (2004), Shan and Strohmaier (2004)). These studies have compared the performance of Cray X1 with that of other platforms; IBM Power, SGI Altix and NEC SX. However, they have not quantitatively discussed the effect of the cache on its performance. The aim of our research is to quantitatively investigate the effect of the vector cache using modern vector supercomputer NEC SX architecture. We clarify the basic characteristics of the vector cache and discuss its effective usages. 3. On-Chip Cache Memory for Vector Architecture We design an on-chip vector cache for a vector processor architecture as shown in Figure 1. Our vector processor has a superscalar unit, a vector unit and an address control unit. The vector unit contains five types of vector arithmetic pipes (Mask, Logical, Multiply, Add/Shift and Divide), vector registers and a vector cache. Four parallel vector pipes 1B/FLOP ~ 4B/FLOP Vector Registers Main Memory Vector Cache Vector Cache Vector Cache Vector Cache Vector Cache ×32 4B/FLOP Vector Cache Vector Registers Mask Reg. Mask Logical Multiply Add/Shift Divide Vector Unit Scalar Unit ×4 ×4 ×4 ×4 ×4 MainMemoryUnit Address Control Unit . . Figure 1. Vector architecture with vector cache and memory system block diagram. 52 MUSA et al.
  • 3. are provided for each type; total 20 pipes are included in the vector unit. The address control unit queues and issues vector operation instructions to the vector unit. The vector operation instructions contain vector memory instructions and vector computational instructions. Each instruction concurrently deals with 256 double-precision floating-point data. The main memory unit employs an interleaved memory system. The main memory is divided into 32 parts, each of which operates independently and supplies data to the vector processor. Thus, the main memory and the vector processor are interconnected through 32 memory ports. Moreover, we implement the vector cache in each memory part; the vector cache consists of 32 sub-caches, and the sub-cache is a non-blocking cache with a bypass mechanism between a vector register file and the main memory. The bypass mechanism is software-controlled, and vector load/ store instructions have a flag for cache control, i.e. cache on/off. When the flag indicates ‘‘cache on,’’ the supplied data by the vector load/store instruction are cached. Meanwhile, when the flag is ‘‘cache off,’’ the vector load/store instruction provides data via the bypass mechanism: the data are not cached. The sub-cache employs a set-associative write-through cache with the LRU replacement policy. The line size is 8 bytes; it is the unit for memory accesses. 4. Experimental Environment 4.1 Methodology We use a trace-driven simulator that can simulate the behavior of the proposed vector architecture at the register transfer level. It is based on the NEC SX simulator, which accurately models a single processor of the SX architecture; the vector unit, the scalar unit and the memory system. The simulator takes a system parameter file and a trace file as input, and the output of the simulator contains instruction cycle counts of a benchmark program and cache hit information. The system parameter file has configuration parameters of a vector architecture and setting parameters, i.e., the cache size, the associativity, the cache latency and the memory bandwidth. The trace file contains an instruction sequence of a benchmark program and directives of cache control: cache on/ off. These directives are used in selective caching and set the cache control flags in the vector load/store instructions. A benchmark program is compiled by the NEC FORTRAN compiler: FORTRAN90/SX. It supports ANSI/ISO Fortran95 in addition to functions of automatic vectorization and automatic parallelization. The executable program runs on the SX trace generator to produce the trace file. In this work, we use two kernel loops and the original source codes of five applications, and compile them with the highest optimizations option (-C hopt) and inlining subroutines. To evaluate on-chip vector caching for future vector processors with a higher flop/s rate but a relatively lower off- chip memory bandwidth, we evaluate the effect of the cache on the vector processor by limiting its memory bandwidth per flop/s rate from 4 B/FLOP down to 1 B/FLOP. Here, we adopt the NEC SX-7 architecture (Kitagawa et al. (2003)) to our vector processor. The setting parameters are shown in Table 11 . 4.2 Benchmark Programs To clarify the basic characteristics and validity of the vector cache, we select basic kernel loops and five leading applications. In this section, we describe our benchmark programs in detail. The following kernel loops are used to evaluate the performance of the vector processor with the cache. AðiÞ ¼ XðiÞ þ YðiÞ: ð1Þ AðiÞ ¼ XðiÞ Â YðiÞ þ ZðiÞ: ð2Þ Table 1. Summary of setting parameters. Base System Architecture NEC SX-7 Main Memnory DDR-SDRAM Vector Cache SRAM Total Size (Sub-cache Size) 256 KB–8 MB (8 KB–256 KB) Cache Policy LRU, Write-through Associativity 2WAY, 4WAY, 8WAY Cache Latency 15%, 50%, 85%, 100% (Rate of main memory latency) Cache Bank Cycle 5% of memory cycle Line Size 8 B Memory—Cache bandwidth per flop/s 1 B/FLOP, 2 B/FLOP, 4 B/FLOP Cache—Register bandwidth per flop/s 4 B/FLOP 1 The absolute values of memory related parameters such as latency and bank cycle cannot be disclosed in this paper because of their confidentiality. Characteristics of an On-Chip Cache on NEC SX Vector Architecture 53
  • 4. As leading applications, we have selected five applications in three scientific computing areas. The applications have been developed by researchers of Tohoku University for the SX-7 system and are representative of each research area. Table 2 shows the summary of the evaluated applications, which methods are standard in individual areas (Kobayashi (2006)). The GPR simulation code evaluates a performance of the SAR-GPR (Synthetic Aperture Radar–Ground Penetrating Radar) in detection of buried anti personnel mines under conditions of inhomogeneous subsurface mediums (Kobayashi et al. (2003)). The simulation method is based on the three dimensional FDTD (Finite Difference Time Domain) method (Kunz and Luebbers (1993)). The vector operation ratio is 99.7% and the vector loop length is over 500. The APFA simulation is applied to design of a high gain antenna; an Anti-Podal Fermi Antenna (APFA) (Takagi et al. (2004)). The simulation calculates directional characteristics of the APFA using the FDTD method and the Fourier transform. The vector operation ratio is 99.9% and the vector loop length is 255. The PRF simulation provides numerical simulations of two dimensional Premixed Reactive Flow (PRF) in combustion for design of engines of planes (Tsuboi and Masuya (2003)). The simulation uses the sixth-order compact finite difference scheme and the third-order Runge-Kutta method for time advancement. The vector operation ratio is 99.3% and the vector loop length is 513. The SFHT simulation realizes direct numerical simulations of three-dimensional laminar Separated Flow and Heat Transfer (SFHT) on surfaces of a plane (Nakajima et al. (2005)). The finite-difference forms are the fifth-order upwind difference scheme for space derivatives and the Crank–Nicholson method for a time derivative. The resulting finite- difference equations are solved using the SMAC method. The vector operation ratio is 99.4% and the vector loop length is 349. The PBM simulation uses the three-dimensional numerical Plate Boundary Models (PBM) to explain an observed variation in propagation speed of postseismic slip (Ariyoshi et al. (2007)). This is a quasi-static simulation in an elastic half-space including a rate- and state-dependent friction. The vector operation ratio is 99.5% and the vector loop length is 32400. 5. Performance Evaluation of Vector Cache We evaluate the performance of the proposed vector processor architecture by using the benchmark programs and clarify the basic characteristics of the vector cache. 5.1 Impact of Memory Bandwidth on Effective Performance To evaluate the impact of the memory bandwidth on performance of the vector processor without the vector cache, we varied memory bandwidth per flop/s rates from 4 B/FLOP down to 1 B/FLOP. Figure 2 shows the computation efficiency, the ratio of the sustained performance to the peak performance, in the execution of the benchmark programs. Here, the 4 B/FLOP case is equal to the memory bandwidth per flop/s rate of the SX-8 or earlier system, the 2 B/FLOP is the same as Cary X1 and BlackWidow (Abts et al. (2007)), and the 1 B/FLOP is the same level as the commodity- based high performance scalar systems such as NEC TX7 series (Senta et al. (2003)) and SGI Altix series (SGI (2007)). Figure 2 indicates that the computational performance is seriously affected by the B/FLOP rate. In particular, the performances of GPR and PBM, which are memory-intensive programs, are degraded by half and quarter when the memory bandwidth per flop/s rate is reduced to 2 B/FLOP and 1 B/FLOP from 4 B/FLOP. Even if APFA is computation-intensive, the performance on the 1 B/FLOP system is decreased to 69% of the performance on the 4 B/FLOP system. Therefore, the memory bandwidth per flop/s rate needs the 4 B/FLOP rate on the vector supercomputer. Table 2. Summary of benchmark programs. Area Name and Description Method Subdivision Memory Size Electromagnetic Analysis GPR simulation: Simulation of Array Antenna Ground Penetrating Radar FDTD 50 Â 750 Â 750 8.9 GB APFA simulation: Simulation of Anti-Podal Fermi Antenna FDTD 612 Â 105 Â 505 12 GB CFD/Heat Analysis PRF simulation: Simulation of Premixed Reactive Flow in Combustion DNS 513 Â 513 1.4 GB SFHT simulation: Simulation of Separated Flow and Heat Transfer SMAC 711 Â 91 Â 221 6.6 GB Seismology PBM simulation: Simulation of Plate Boundary Model on Seismic Slow Slip Friction Law 32400 Â 32400 8 GB 54 MUSA et al.
  • 5. 5.2 Relationship between Efficiency and Cache Hit Rate on Kernel Loops We simulate the execution of the kernel loop (1) on the vector processor with the vector cache. Since this loop has no temporal locality, we store a part of XðiÞ and YðiÞ data on the vector cache in advance of the loop execution. In the following two ways, we select the data for caching to change the range of cache hit rates, and examine their effects on performance. One is to store both XðiÞ and YðiÞ with the same index i on the cache, and load instructions of both XðiÞ and YðiÞ are set to ‘‘cache on,’’ named Case 1. The other is to cache only XðiÞ, and load instructions of XðiÞ and YðiÞ are set to ‘‘cache on’’ and ‘‘cache off,’’ respectively. This is Case 2. Here, the range of index i is 1 i 25600. Figure 3 shows the relationship between the cache hit rate and the relative memory bandwidth that is obtained by normalizing the effective memory bandwidth of the systems with the vector cache by that of the 4 B/FLOP system without the cache. Figure 4 shows the relationship between the cache hit rate and the computation efficiency. Figure 3 indicates that the relative memory bandwidth of each case increases as the cache hit rate increases. Therefore, the vector cache is one of the promising solutions to cover a lack of the memory bandwidth. Similar to Figure 3, the computation efficiency of each case in Figure 4 improves as the cache hit rate increases. Figures 3 and 4 show the correlation between the relative memory bandwidth and the computation efficiency. On both the 2 B/FLOP and 1 B/FLOP systems, the performance of Case 2 is greater than that of Case 1. In particular, Case 2 with the vector cache of the 50% hit rate on the 2 B/FLOP system achieves the same performance of the 4 B/ FLOP system. On the 4 B/FLOP system, each of XðiÞ and YðiÞ is provided at a 2 B/FLOP bandwidth rate on average. When the cache hit rate is 50% in Case 2, all data of XðiÞ are supplied directly from the vector cache and all data of YðiÞ are supplied from the memory at the 2 B/FLOP bandwidth rate through the bypass mechanism. Consequently, each of XðiÞ and YðiÞ is provided at the 2 B/FLOP bandwidth rate on the vector registers, and the total amount of data provided to vector registers in time is equal to that of the 4 B/FLOP system. However, the 1 B/FLOP system with vector caching with the 50% hit rate does not achieve the performance of the 4 B/FLOP system. Because YðiÞ is provided at a 1 B/ FLOP bandwidth rate, the total bandwidth rate at the vector registers does not reach the 4 B/FLOP. On the other hand, in Case 1, the data of XðiÞ and YðiÞ are supplied from either of the vector cache or the memory at the same index i. While supplying the data from the memory, XðiÞ and YðiÞ have to share the memory bandwidth rate. Therefore, the 1 B/ 0 0.2 0.4 0.6 0.8 1 0 10 20 30 40 50 60 70 80 90 100 RelativeMemoryBandwidth Cache Hit Rate (%) 4B/F 2B/FCase1 2B/FCase2 1B/FCase1 1B/FCase2 Figure 3. Relationship between cache hit rate and relative memory bandwidth on Kernel loop (1). 0 10 20 30 40 50 60 X +Y X*Y+Z GPR APFA PRF SFHT PBM Efficiency(%) 4B/FLOP 2B/FLOP 1B/FLOP Figure 2. Effect of memory bandwidth per flop/s rate. Characteristics of an On-Chip Cache on NEC SX Vector Architecture 55
  • 6. FLOP and 2 B/FLOP systems need a cache hit rate of 100% to achieve a performance comparable to that of the 4 B/ FLOP system. Similarly, we examine the performance of Kernel (2) on the 2 B/FLOP system in the following three cases. In Case 1, all of XðiÞ, YðiÞ, and ZðiÞ are cached with the same index i in advance. In Case 2, both XðiÞ and YðiÞ are provided from the vector cache, and therefore the maximum cache hit rate is 66%. In Case 3, only XðiÞ is in the cache, and hence the maximum cache hit rate is 33%. Figure 5 shows the relationship between the cache hit rate and the change in the relative memory bandwidth regarding Kernel (2). On the 4 B/FLOP system, each of XðiÞ, YðiÞ and ZðiÞ is provided at a 4/3 B/FLOP bandwidth rate on average. The performance of Case 2 is comparable to that of the 4 B/ FLOP system when the cache hit rate is 66%, because the bandwidth per flop/s rate of each data is the 4/3 B/FLOP rate on average. However, in Case 3, both YðiÞ and ZðiÞ are provided from the memory, and YðiÞ and ZðiÞ have to share the 2 B/FLOP bandwidth rate at the vector register. As a result, the relative memory bandwidth never reaches that of the 4 B/FLOP system. These results indicate that the vector cache has a potential for the future SX systems to cover the shortage of memory bandwidth. In additions, all data do not need to be cached from the discussions on Figures 3 and 5. The key data determining the performance should be cached to make good use of limited on-chip capacity. 5.3 Relationship between Efficiency and Cache Hit Rate on Five Scientific Applications We simulate the execution of five scientific applications with the vector cache varying memory bandwidth per flop/s rate; 1 B/FLOP and 2 B/FLOP. Here, all vector load/store instructions are set to ‘‘cache on.’’ Figure 6 shows the relationship between the cache hit rate of the 8 MB vector cache and the relative memory bandwidth. Figure 6 indicates that the vector cache can improve the effective memory bandwidth, depending on the cache hit rate of the applications. Cache hit rates vary from 13% in SFHT to 96% in APFA, because these depend on the locality of reference in individual applications. In APFA, the relative memory bandwidth is 0.96 at the 2 B/FLOP system with almost 100% hit of the 8 MB cache, resulting in an effective memory bandwidth almost equal to the 4 B/FLOP system. In this case, most data are provided 0 2 4 6 8 10 12 14 16 18 0 10 20 30 40 50 60 70 80 90 100 Efficiency(%) Cache Hit Rate (%) 4B/F 2B/FCase1 2B/FCase2 1B/FCase1 1B/FCase2 Figure 4. Relationship between cache hit rate and computation efficiency on Kernel loop (1). 0.5 0.6 0.7 0.8 0.9 1 0 10 20 30 40 50 60 70 80 90 100 RelativeMemoryBandwidth Cache Hit Rate (%) 4B/F 2B/F Case1 2B/F Case2 2B/F Case3 Figure 5. Relationship between cache hit rate and relative memory bandwidth on Kernel loop (2). 56 MUSA et al.
  • 7. from the cache. PBM of the 2 B/FLOP system reaches the effective memory bandwidth of the 4 B/FLOP system at the cache hit rate 50%. Figure 7 shows the main routine of PBM. The inner loop, index j, is vectorized and arrays gd dip and wary are stored in the vector cache by vector load instructions. However, gd dip is spilled from the cache, because gd dip needs 7.6 GB. On the other hand, wary can be held in the cache, because its needs only 250 KB and the array is defined in the preceding loop. Therefore, the cache hit rate of this loop is 50%. Then all data of wary are supplied directly from the vector cache at the 2 B/FLOP rate and all data of gd dip are supplied from the memory at the 2 B/ FLOP rate through the cache, and the total amount of data provided to vector registers in time is equal to that of the 4 B/FLOP system. The relative memory bandwidths in others applications, SFHT, GPR and PRF are over 0.6 on the 2 B/FLOP system and 0.3 on the 1 B/FLOP system. Here, the relative memory bandwidth without the cache is 0.5 on the 2 B/FLOP system and 0.25 on the 1 B/FLOP system. Thus, the improvement of the relative memory bandwidth is over 20% using the 8 MB cache. Figure 8 shows the computation efficiency of five scientific applications on the 4 B/FLOP system without the cache, the 2 B/FLOP and 1 B/FLOP systems with/without the 8 MB cache. Figure 8 indicates that the efficiency of the 2 B/ FLOP and 1 B/FLOP systems is increased by the cache. In particular, the 2 B/FLOP system with the cache achieves the same efficiency as the 4 B/FLOP system in APFA and PBM. Figure 9 shows the relationship between the cache hit rate of the 8 MB cache and the recovery rate of the performance by the cache from the 2 B/FLOP and 1 B/FLOP systems to the 4 B/FLOP system. The recovery rate goes up as the cache hit rate increases. The vector cache can increase the recovery rate by 18 to 99% on the 2 B/FLOP system and 9 to 96% on the 1 B/FLOP system, depending on the data access locality of each application. The results indicate that the vector cache has a potential for the future SX systems to cover the shortage of memory bandwidth on real scientific applications. 5.4 Relationship between Associativity and Cache Hit Rate In this section, we examine the effect of the associativity ranging from 2 to 8 ways on the performance. Figure 10 shows that the cache hit rate on each set associative cache, covering cache sizes from 256 KB to 8 MB. Four applications, APFA, PRF, SFHT and PBM, approximately have constant cache hit rates across three associativity cases. In GPR, the cache hit rate slightly varies with the associativity. The cache hit rate is dependent on among placements of the arrays on memory addresses, memory access patterns of the programs and the associativity. In GPR, its basic computational structure consists of triple nested loops. The loops access many arrays at intervals of 584 bytes stride addresses, and the cache data are frequently replaced. Therefore, the 0 0.25 0.5 0.75 1 0 20 40 60 80 100 RelativeMemoryBandwidth Cache Hit Rate (%) 2B/F 8MB 1B/F 8MB PBM SFHT GPR PRF APFA Figure 6. Relationship between cache hit rate and relative memory bandwidth on five applications. 01 do i=1,ncells 02 do j=1,ncells 03 wf_dip(i)=wf_dip(i)+gd_dip(j,i)*wary(j) 04 end do 05 end do Figure 7. Main routine of PBM. Characteristics of an On-Chip Cache on NEC SX Vector Architecture 57
  • 8. cache hit rate varies with the cache capacities and the associativity. In SFHT, the memory access patterns have 8 bytes stride accesses. On the 2 MB case, the cached data which will be reused on the 2-way set associative cache are evicted by some other data. Thus this is due to a placement of memory addresses of the arrays. Other programs have constant cache hit rates across three associativity cases. These memory access patterns are 8 bytes or 16 bytes stride accesses, and the placement of the arrays on memory do not cause the conflict of reused data. These results provide that the associativity hardly have effects on the cache hit rates, and the cache capacity mainly influences the cache hit rate across the five applications. 5.5 Effects of Vector Cache Latency on Performance In general, the latency of the on-chip cache is considerably shorter than that of the off-chip memory. We investigate the effects of the cache access latency on the performance of the five applications. Figure 11 shows the relationship between the computation efficiency of the applications and four cache latencies; 15, 50, 85 and 100% of the memory access latency on the 2 B/FLOP system with the 2 MB cache. The five applications have almost the same efficiency when changing the cache latency. Because the vector architecture can hide the memory access times by pipelined vector operations when a vector operation ratio and a vector loop length of applications are enough large. In addition, the vector cache consists of 32 sub-caches between a vector register file and the main memory via each memory port. The sub-caches connected to 32 memory ports provide multiple words at a time. The cache latency of the second memory access or later is hidden in the sub-cache system. For comparison, we examine the performance of shorter loop length cases using Kernel loops (1) and (2) on the 2 B/ FLOP system when changing the cache access latency. Figure 12 shows that the relationship between the computation efficiency of Kernel loops and four cache latency cases. Here, we examine three loop length cases in the 100% cache hit rate; 1 i 64, 1 i 128, 1 i 256. On the shorter loop length cases, Figure 12 indicates that the vector cache latency is an importance factor in the performance. Especially, on the loop length 64 case, the 15% latency case has two times higher efficiency in Kernel (1) than the 100% case, and the 15% latency case has 12% higher efficiency in Kernel 0 20 40 60 80 100 0 20 40 60 80 100 PerformanceRecoveryRate(%) Cache Hit Rate (%) 2B/F 8MB 1B/F 8MB PBM SFHT GPR PRF APFA Figure 9. Recovery rate of performance on 8 MB cache. 0 10 20 30 40 50 GPR APFA PRF SFHT PBM Efficiency(%) 4B/F 2B/F non-cache 2B/F 8MB 1B/F non-cache 1B/F 8MB Figure 8. Computation efficiency of five applications with/without 8 MB vector cache. 58 MUSA et al.
  • 9. (2). However, the longer the loop length, the lower the effect of the cache latency. Figure 13 shows that the computation efficiency of Kernel loops in the 4 B/FLOP and 2 B/FLOP systems with/without the vector cache. On both the loop lengths 64 and 128 of Kernel (1), the efficiency of the 4 B/FLOP system with the cache is higher than that of the 4 B/FLOP system without the cache, and the loop length 128 case of the 4 B/FLOP system with the cache achieves the highest efficiency. Meanwhile, in Kernel (2), the loop length 64 case of the 4 B/FLOP system with the cache has the highest efficiency. This is because the ratio of memory access time to the processing time becomes the GPR APFA PRF SFHT PBM 0 10 20 30 40 50 2WAY 4WAY 8WAY CacheHitRate(%) 0.25MB 0.5MB 1MB 2MB 4MB 8MB 0 20 40 60 80 100 2WAY 4WAY 8WAY CacheHitRate(%) 0.25MB 0.5MB 1MB 2MB 4MB 8MB 0 10 20 30 40 50 2WAY 4WAY 8WAY CacheHitRate(%) 0.25MB 0.5MB 1MB 2MB 4MB 8MB 0 10 20 30 40 50 2WAY 4WAY 8WAY CacheHitRate(%) 0.25MB 0.5MB 1MB 2MB 4MB 8MB 0 10 20 30 40 50 2WAY 4WAY 8WAY CacheHitRate(%) 0.25MB 0.5MB 1MB 2MB 4MB 8MB Figure 10. Cache hit rate vs. associativity on five applications. 0 10 20 30 40 50 60 15% 50% 85% 100% Efficiency(%) Cache Latency Case GPR APFA PRF SFHT PBM Figure 11. Computation efficiency of five applications on four cache latency cases. Characteristics of an On-Chip Cache on NEC SX Vector Architecture 59
  • 10. smallest due to the decrease in memory latency by the cache, and in these kernel loops the memory access time of the shorter loop length case cannot be entirely hidden by pipelined vector operations only. Therefore, the performance of shorter loop length cases is sensitive to the vector cache latency, and the vector cache can boost the performance of the applications with short vector lengths on the 4 B/FLOP system. Moreover, the cache latency is not hidden when the bank conflicts occur. In this work, the bank conflicts hardly occur in the five applications. As our future work, quantitative discussions on the influence of the bank conflicts will be addressed. 6. Optimizations for Vector Caching: Selective Caching and Loop Unrolling In this section, we discuss some optimization techniques for the vector cache to reduce cache miss rates on scientific applications. In particular, we show that the relationship between loop-unrolling and caching. Many compilers automatically optimize programs using the loop-unrolling, which is the most basic loop optimization. 6.1 Effects of Selective Caching The vector cache has a bypass mechanism, and data are selectively cached: selective caching. The simulation methods of the four scientific applications, GPR, APFA, PRF and SFHT, are based on the difference schemes; Finite Difference Time Domain (FDTD) method and Direct Numerical Simulation of flow and heart (DNS), and a part of arrays has a high locality in many cases. As the on-chip cache size is limited, selective caching might be an effective approach to efficient use of the on-chip cache. We evaluate the selective caching for two applications, GPR and PRF, which are typical examples of FDTD and DNS, respectively. Figure 14 shows one of kernel loops of GPR; the loop is a main routine of the FDTD method, and many routines of GPR are similar to this kernel loop. The innermost loop is indexed with j. Usually, FORTRAN compilers interchange loops j and i, however, FORTRAN90/SX compiler does not interchange the loops, because the compiler decides that the computational performance cannot be improved due to the short length of loop i. Loop i has a length of 50, and j is 750. As the size of each array used in GPR is 330 MB, the cache cannot store all the arrays. However, the difference scheme as shown in Figure 14 generally has temporal locality in accessing arrays; H x, H y and H z. We select these arrays for selective caching. Figure 15 shows the effect of selective caching on the 2 B/FLOP system with the vector cache in GPR. Figure 16 shows the recovery rate of the performance from the 2 B/FLOP system without the cache to the 4 B/FLOP system by selective caching in GPR. We discuss the cache sizes from 256 KB up to 8 MB. Here, ‘‘ALL’’ in the figures indicates that all the arrays in the loop are cached. ‘‘Selective’’ shows that some of the arrays are selectively cached. ‘‘Cache Usage Rate’’ indicates the ratio of number of cache hit references to the total memory references. 0 2 4 6 8 10 12 14 16 18 15% 50% 85% 100% Efficiency(%) Cache Latency Case Kernel Loop (1) 64 128 256 0 5 10 15 20 25 30 15% 50% 85% 100% Efficiency(%) Cache Latency Case Kernel Loop (2) 64 128 256 Figure 12. Computation efficiency of two kernel loops on four cache latency cases. 0 2 4 6 8 10 12 14 16 18 20 64 128 256 Efficiency(%) Loop Length Kernel Loop (1) 4B/F 4B/F 2MB 2B/F 2B/F 2MB 0 5 10 15 20 25 30 35 64 128 256 Efficiency(%) Loop Length Kernel Loop (2) 4B/F 4B/F 2MB 2B/F 2B/F 2MB Figure 13. Computation efficiency of two kernel loops on 4 B/FLOP and 2 B/FLOP. 60 MUSA et al.
  • 11. In the 256 KB cache case, the efficiencies of ‘‘ALL’’ and ‘‘Selective’’ are 33.3 and 34.1%, and the cache usage rate are 9.6 and 16.6%, respectively. The arrays H yði; j; kÞ, H yði À 1; j; kÞ (line 08 in Figure 14), H xði; j; kÞ (line 11) and H zði À 1; j; kÞ (line 12) can load data from the cache on ‘‘Selective.’’ In ‘‘ALL,’’ H yði À 1; j; kÞ (line 08) and 01 DO 10 k=0,Nz 02 DO 10 i=0,Nx 03 DO 10 j=0,Ny 04 E_x(i,j,k) = C_x_a(i,j,k)*E_x(i,j,k) 05 & + C_x_b(i,j,k) * (( H_z(i,j,k) - H_z(i,j-1,k))/dy 06 & - (H_y(i,j,k) - H_y(i,j,k-1))/dz - E_x_Current(i,j,k)) 07 E_z(i,j,k) = C_z_a(i,j,k)*E_z(i,j,k) 08 & + C_z_b(i,j,k) * ((H_y(i,j,k)-H_y(i-1,j,k) )/dx 09 & - (H_x(i,j,k)-H_x(i,j-1,k) )/dy - E_z_Current(i,j,k)) 10 E_y(i,j,k) = C_y_a(i,j,k)*E_y(i,j,k) 11 & + C_y_b(i,j,k) * ((H_x(i,j,k) - H_x(i,j,k-1) )/dz 12 & - (H_z(i,j,k) - H_z(i-1,j,k))/dx - E_y_Current(i,j,k)) 13 10 CONTINU Figure 14. A kernel loop of GPR. 0 10 20 30 30 32 34 36 38 40 256K 512K 1M 2M 4M 8M CacheUsageRate(%) Efficiency(%) Cache Size (B) ALL (Performance) Selective (Performance) ALL (Cache usage rate) Selective (Cache usage rate) Figure 15. Efficiency of selective caching and cache hit rate on 2 B/FLOP system (GPR). 0 10 20 30 0 10 20 30 40 256K 512K 1M 2M 4M 8M CacheUsageRate(%) RecoveryRateofPerformanc(%) Cache Size (B) ALL (Recover Rate) Selective (Recover Rate) ALL (Cache usage rate) Selective (Cache usage rate) Figure 16. Recovery rate of performance and cache hit rate on selective caching (GPR). Characteristics of an On-Chip Cache on NEC SX Vector Architecture 61
  • 12. H zði À 1; j; kÞ (line 12) cause cache misses, because the reused data of these arrays are replaced by other array data. Figures 15 and 16 indicate that the performance and the cache usage rate increase as the cache size increase. On the 4 MB cache, the cache hit rate of ‘‘Selective’’ is 24% and its efficiency is 3% higher than that of ‘‘ALL.’’ The recovery rate of performance is mere 34%. However, GPR cannot achieve high cache usage rates and recovery rates of performance by selective caching. This is because this loop has many non-locality arrays; C x a, E x, E x Current etc. Figure 17 shows one of kernel loops of PRF; the loop is a high cost routine. The size of each array in PRF is 18 MB, and many loops are doubly nested loop. Thus, the arrays are treated as two dimensions array, and the size of cached data per array is 2 MB only. wDY1YC1, wDY1YC2 and wDY1YC3 in Figure 17 are defined in the preceding loop. In addition, wDY1YC1 and wDY1YC2 have the spatial locality. We select these arrays for selective caching. Here, wDY1YC3ðI; KK À 1; LÞ (line 04 in Figure 17), wDY1YC2ðI; KK; LÞ (line 05), wDY1YC1ðI; KK; LÞ (line 06) and wDY1YC2ðI; KK; LÞ (line 06) can reuse data on the register files. Thus, these arrays do not access the cache. Figures 18 and 19 show the efficiency and the recovery rate of the performance from the 2 B/FLOP system to the 4 B/FLOP system by selective caching in PRF. Figure 18 indicates that the cache hit rates and efficiency of ‘‘Selective’’ are the same as that of ‘‘ALL’’ until 2 MB cache size. wDY1YC2ðI; KK À 1; LÞ (line 03) and wDY1YC1ðI; KK À 1; LÞ (line 08) are cache hit arrays in each case, and the cache hit rate is 33%. Over 4 MB cache size, ‘‘Selective’’ achieves a 66% cache hit rate, resulting in a 30+% improvement in efficiency. Moreover, Figure 19 indicates that the recovery rate is 78%. The results of GPR and PRF show that the selective caching improves the performance of the applications. Compared with GPR, PRF has higher values regarding both the cache usage rate and improved efficiency, because the ratio of arrays with locality per loop on PRF is higher than that of GPR. In GPR the reused data are only defined in the loop, and the cache miss always occurs in this loop. For example, arrays H yði; j; kÞ (line 08 in Figure 14) hits the data of H yði; j; kÞ (line 06), but H yði; j; kÞ (line 06) cannot hit the data due to their first accesses (cold miss). On the other hand, the cached data of PRF are provided in the immediately preceding loop, thus cache misses do not occur like the first access in PRF. 01 DO KK = 2,NJ 02 DO I = 1,NI 03 wDY1YC3(I,KK-1,L) = wDY1YC3(I,KK-1,L) * wDY1YC2(I,KK-1,L) 04 wDY1YC2(I,KK,L) = wDY1YC2(I,KK,L) – wDY1YC1(I,KK,L) * wDY1YC3(I,KK-1,L) 05 wDY1YC2(I,KK,L) = 1.D0 / wDY1YC2(I,KK,L) 06 wDY1YC(I,KK,L) = (wPHIY12(I,KK) – wDY1YC1(I,KK,L) * wDY1YC1(I,KK-1,L)) * wDY1YC2(I,KK,L) 07 END DO 08 END DO Figure 17. High cost kernel loop of PRF. 0 10 20 30 40 50 60 70 20 25 30 35 256K 512K 1M 2M 4M 8M CacheUsageRate(%) Efficiency(%) Cache Size (B) ALL (Performance) Selective (Performance) ALL (Cache usage rate) Selective (Cache usage rate) Figure 18. Efficiency and cache hit rate of selective caching on performance (PRF). 62 MUSA et al.
  • 13. 6.2 Effects of Loop-unrolling and Caching Loop unrolling is effective for higher utilization of vector pipelines, however, loop unrolling also needs more cache capacity to capture all the arrays of unrolled loops. Therefore, we discuss the relationship between loop unrolling and vector caching on the performance. Figure 20 shows a code outer unrolled 4 times, whose original code is shown in Figure 7. In the outer unrolling case, array wary of the outer loop index i (line 03 in Figure 20) reuses the data of the index i À 1 on the cache and arrays wary (line 04, 05 and 06) reuse the data on register files of wary (line 03), hence, memory references are reduced. However, the cache capacity for gd dip increases due to an increase in the number of arrays reference in the innermost loop. Figure 21 shows the efficiency and the cache hit rate of the outer unrolling case on PBM with the 2 B/FLOP system. ‘‘2B/F’’ indicates the without-cache case, and the efficiency constantly increases as the degree of unrolling increases. The cache hit rate of the 0.5 MB cache case is 49% on the non-unrolling case and becomes poor as the loop unrolling proceeds, because the cached data of wary are evicted from the cache by gd dip. Thus, the efficiency of the 0.5 MB cache case achieves 49% on the non-unrolling case, and it is similar to that of the 2 B/FLOP system with unrolling. In this case, a conflicting effect between caching and unrolling appears. Moreover, the cache hit rates of the 8 MB cache case decrease as the degree of unrolling increases. The cached data remain on the cache in this case, but the cache hit rate decreases due to a decrease in the number of wary (line 03) references. However, the efficiency is approximately constant; 48 or 49% because unrolling covers the losing effect of caching. This case shows that the effects of caching are comparable to that of unrolling. Figure 22 shows a code obtained by unrolling code two times shown in Figure 17. Outer unrolling reduces the memory references of the arrays wDY1YC2ðI; KK; LÞ (line 07 in Figure 22) and wDY1YC1ðI; KK; LÞ (line 10). Figure 23 shows the efficiency and the cache hit rate of outer unrolling on PRF with the 2 B/FLOP system. The cache hit rates of both the 0.5 and 8 MB caches gradually reduce as the degree of unrolling increase, because of an increase in the cache miss rate. In this case, the highest efficiency is 33% on the 8 MB cache with the unrolling degree of 4 since the gain by loop unrolling covers the loss due to the increase in the miss rate. It is higher than the selective caching case; it is 30.7% (Ref. Figure 18). Thus, this case indicates the synergistic effect of both caching and unrolling. 0 10 20 30 40 50 60 70 20 40 60 80 256K 512K 1M 2M 4M 8M CacheUsageRate(%) RecoveryRateofperformance(%) Cache Size (B) ALL (Performance) Selective (Performance) ALL (Cache usage rate) Selective (Cache usage rate) Figure 19. Recovery rate of performance and cache hit rate on selective caching (PRF). 01 do i=1,ncells,4 02 do j=1,ncells 03 wf_dip(i)=wf_dip(i)+gd_dip(j,i)*wary(j) 04 wf_dip(i+1)=wf_dip(i+1)+gd_dip(j,i+1)*wary(j) 05 wf_dip(i+2)=wf_dip(i+2)+gd_dip(j,i+2)*wary(j) 06 wf_dip(i+3)=wf_dip(i+3)+gd_dip(j,i+3)*wary(j) 07 end do 08 end do Figure 20. Outer unrolling loop of PBM. Characteristics of an On-Chip Cache on NEC SX Vector Architecture 63
  • 14. These results indicate that loop unrolling has both conflicting and synergistic effects of caching. Loop unrolling is a most basic optimization and has beneficial effects in various applications. Therefore, caching has to be used carefully as a complementary tuning option for loop unrolling. 7. Conclusions This paper has presented the characteristics of an on-chip vector cache with the bypass mechanism on the vector supercomputer NEC SX architecture. We have demonstrated the relationship between the cache hit rate and the 0 10 20 30 40 50 60 0 10 20 30 40 50 60 Non-Unroll Unroll2 Unroll4 Unroll8 Unroll16 CacheHitRate(%) Efficiency(%) 2B/F 0.5MB 8MB Hit rate 0.5MB Hit rate 8MB Figure 21. Efficiency and cache hit rate of outer unrolling on performance (PBM). 01 DO KK = 2,NJ 02 DO I = 1,NI 03 wDY1YC3(I,KK-1,L) = wDY1YC3(I,KK-1,L) * wDY1YC2(I,KK-1,L) 04 wDY1YC2(I,KK,L) = wDY1YC2(I,KK,L) – wDY1YC(I,KK,L) * wDY1YC3(I,KK-1,L) 05 wDY1YC2(I,KK,L) = 1.D0 / wDY1YC2(I,KK,L) 06 wDY1YC(I,KK,L) = (wPHIY12(I,KK) – wDY1YC1(I,KK,L) * wDY1YC1(I,KK-1,L)) * wDY1YC2(I,KK,L) 07 wDY1YC3(I,KK,L) = wDY1YC3(I,KK,L) * wDY1YC2(I,KK,L) 08 wDY1YC2(I,KK+1,L) = wDY1YC2(I,KK+1,L) – wDY1YC(I,KK+1,L) * wDY1YC3(I,KK,L) 09 wDY1YC2(I,KK+1,L) = 1.D0 / wDY1YC2(I,KK+1,L) 10 wDY1YC(I,KK+1,L) = (wPHIY12(I,KK+1) – wDY1YC1(I,KK+1,L) * wDY1YC1(I,KK,L)) * wDY1YC2(I,KK+1,L) 11 END DO 12 END DO Figure 22. Outer unrolling loop of PRF. 0 10 20 30 40 50 60 0 5 10 15 20 25 30 35 Non-Unroll Unroll 2 Unroll 4 Unroll 8 Unroll 16 CacheHitRate(%) Efficiency(%) 2B/F 0.5MB 8MB hit rate 0.5MB hit rate 8MB Figure 23. Efficiency and cache hit rate of outer unrolling on performance (PRF). 64 MUSA et al.
  • 15. performance using two DAXPY-like loops. Moreover, we have clarified the same tendency of the relationship on the five scientific applications. The vector cache can recover the lack of the memory bandwidth, and boosts the computation efficiency of the 2 B/FLOP and 1 B/FLOP systems. Especially, as cache hit rates are 50% or more, the 2 B/FLOP system can achieve a performance comparable to the 4 B/FLOP system. The vector cache with a bypass mechanism can provide the data from both the memory and the cache in time, and the sustained memory bandwidth for the registers are increased. Therefore, the vector cache has a potential for the future SX systems to cover the shortage of their memory bandwidth. We have also discussed the relationship between performance and cache design parameters such as cache associativity and cache latency. We have examined the effect of the cache associativity setting from 2-way to 8-way. When the associativity is two or more, the cache hit rate is not sensitive to the associativity in the five applications. In addition, we have examined the effect of the cache latency on the performance when changing it to 15, 50, 85 and 100% of the memory latency. We have demonstrated the computational efficiencies of five applications are constant across these latency changes, when the vector loop lengths of the applications are 256 or more. In these cases, the latency is hidden by pipelined vector operations. However, in the case of shorter vector loop lengths, the cache latency affects the performance, and the 15% latency case of Kernel (1) has two times higher efficiency than the 100% case in the 2 B/FLOP system. In addition, the 4 B/FLOP system is also boosted due to the effect of the short latency of the cache. The above results indicate that the performance of cache does not largely depend on the cache associativity, however, the cache latency affects the performance of applications with short vector loop lengths. Finally we have discussed selective caching and the relationship between loop-unrolling and caching. We have shown that selective caching, which is controlled by means of the bypass mechanism, is effective for efficient use of limited on-chip caches. We have examined two cases; the ratios of arrays with locality per loop are higher and lower cases. In each case, the higher performance is obtained by selective caching, compared with all the data caching. In addition, the loop unrolling is useful in the improvement of performance, and caching is complementary to the effect of loop unrolling. In the future work, we will evaluate other scientific applications using the vector cache, and quantitatively discussions the relationship between the bank conflicts and the cache latency. In addition, we will also investigate the mechanism of vector cache on multi core vector processors. Acknowledgments The authors would like to gratefully thank: Dr. M. Sato, Dr. A. Hasegawa, Professor G. Masuya, Professor T. Ota, Professor K. Sawaya of Tohoku University, for their advice about applications. We also would like to thank Matsuo Aoyama and Hirofumi Akiba of NEC Software Tohoku for their assistance in experiments. REFERENCES [1] Abts, D., et al., ‘‘The Cray BlackWidow: A Highly Scalable Vector Multiprocessor,’’ In Proceedings of the ACM/IEEE SC2007 conference (2007). [2] Abts, D., Scott, S., and Lilja, D. J., ‘‘So Many States, So Little Time: Verifying Memory Coherence in the Cray X1,’’ In International Parallel and Distributed Processing Symposium (2003, April). [3] Ariyoshi, K., Matsuzawa, T., and Hasegawa, A., ‘‘The key frictional parameters controlling spatial variation in the speed of postseismic slip propagation on a subduction plate boundary,’’ Earth Planet. Sci. Lett., 256: 136–146 (2007). [4] Dunigan Jr., T. H., Vetter, J. S., White III, J. B., and Worley, P. H., ‘‘Performance Evaluation of The Cray X1 Distributed Shared-Memory Architecture,’’ In IEEE MICRO, 25: 30–40 (2005). [5] Gee, J. D., and Smith, A. J., ‘‘Vector Processor Caches,’’ Technical Report UCB/CSD-92-707, University of California (1992, October). [6] Gee, J. D., and Smith, A. J., ‘‘The Effectiveness of Caches for Vector Processors,’’ In The 8th international Conference on Supercomputing 1994, pp. 333–343 (1994, July). [7] Kitagawa, K., Tagaya, S., Hagihara, Y., and Kanoh, Y., ‘‘A Hardware Overview of SX-6 and SX-7 Supercomputer,’’ NEC Research & Development, 44: 2–7 (2003). [8] Kobayashi, H., ‘‘Implication of Memory Performance in Vector-Parallel and Scalar-Parallel HEC,’’ In M. Resch et al. (Eds.), High Performance Computing on Vector Systems 2006, pp. 21–50, Springer-Verlag (2006). [9] Kobayashi, H., Musa, A., Sato, Y., Takizawa, H., and Okabe, K., ‘‘The potential of On-chip Memory Systems for Future Vector Architectures,’’ In M. Resch et al. (Eds.), High Performance Computing on Vector Systems 2007, pp. 247–264, Springer-Verlag (2007). [10] Kobayashi, T., Feng, X., and Sato, M., ‘‘FDTD simulation on array antenna SAR-GPR for land mine detection,’’ In Proceedings of SSR2003: 1st International Symposium on Systems and Human Science, Osaka, Japan, pp. 279–283 (2003, November). [11] Kontothanassis, L. I., Sugumar, R. A., Faanes, G. J., Smith, J. E., and Scott, M. L., ‘‘Cache Performance in Vector Supercomputers,’’ In Proceedings of the 1994 conference on Supercomputing, Washington D.C., pp. 255–264 (1994). [12] Kunz, K. S., and Luebbers, R. J., ‘‘The Finite Difference Time Domain Method for Electromagnetics,’’ Boca Raton, FL: CRC Press (1993). [13] Musa, A., Sato, Y., Egawa, R., Takizawa, H., Okabe, K., and Kobayashi, H., ‘‘An On-Chip cache Design for Vector Characteristics of an On-Chip Cache on NEC SX Vector Architecture 65
  • 16. Processors,’’ In Proceedings of the 8th MEDEA workshop on Memory performance: Dealing with Applications, systems and architecture, Romania, pp. 17–23 (2007). [14] Musa, A., Takizawa, H., Okabe, K., Soga, T., and Kobayashi, H., ‘‘Implications of Memory Performance for Highly Efficient Supercomputing of Scientific Applications,’’ In Proceedings of the 4th International Symposium on Parallel and Distributed Processing and Application 2006, Italy, pp. 845–858 (2006). [15] Nakajima, M., Yanaoka, H., Yoshikawa, H., and Ota, T., ‘‘Numerical Simulation of Three-Dimensional Separated Flow and Heat Transfer around Staggerd Surface-Mounted Rectangular Blocks in a Channel,’’ Numerical Heat Transfer (PartA), 47: 691–708 (2005). [16] Oliker, L., et al., ‘‘A Performance Evaluation of the Cray X1 for Scientific Applications,’’ In VECPAR’04: 6th International Meeting on High Performance Computing for Computational Science, Valencia, Spain (2004, June). [17] Senta, T., Takahashi, A., Kadoi, T., Itoh, T., Kawaguchi, S., and Kishida, Y., ‘‘Itanium2 32-way Server System Architecture,’’ NEC Research & Development, 44: 8–12 (2003). [18] SGI, ‘‘SGI Altix 4700 Servers and Supercomputers,’’ Technical report, SGI Inc., http://www.sgi.com/pdfs/3867.pdf (2007). [19] Shan, H., and Strohmaier, E., ‘‘Performance Characteristics of the Cray X1 and Their Implications for Application Performance Tuning,’’ In 18th Annual International Conference on Supercomputing, Malo, France, pp. 175–183 (2004). [20] Takagi, Y., Sato, H., Wagatsuma, Y., Mizuno, K., and Sawaya, K., ‘‘Study of High Gain and Broadband Antipodal Fermi Anetnna with Corrugation,’’ In 2004 International Symposium on Antennas and Propagation, Vol. 1, pp. 69–72 (2004). [21] Tsuboi, K., and Masuya, G., ‘‘Direct Numerical Simulations for Instabilities of Remixed Planar Flames,’’ In The Fourth Asia- Pacific Conference on Combustion, Nanjing, China, pp. 136–139 (2003, November). 66 MUSA et al.