Memory Scalability and Performance in Intel64
Xeon SMP Platforms
MICHAEL E. THOMADAKIS
Abstract—
cc-NUMA systems based on the Intel Nehalem and the
Westmere processors are very popular with the scientific
computing communities as they can achieve high floating-
point and integer computation rates on HPC workloads.
However, a closer analysis of the performance of their
memory subsystem reveals that the per-core and per-thread
memory bandwidth of either microprocessor is restricted
to almost one third of its ideal value. Multi-threaded
memory access bandwidth tops out at two thirds of the
maximum limit. In
addition to this, the NUMA effect on latencies increasingly
worsens as cores try to access larger memory resident
data structures and the problem is exacerbated when the
regular 4KiB page sizes are used. Moving from Nehalem
to Westmere, read performance for data already owned
by the same core scales gracefully with the number of
cores and core clock speed. However, when data is within
the L2 cache or beyond, write performance suffers in
Westmere, revealing scalability issues in the design when
the system moved from 4 to 6 cores. This problem gets more
acute when multiple streams of data progress concurrently.
The Westmere memory subsystem compared to that of the
Nehalem, suffers from a worse performance degradation
when threads on a core are modifying cache blocks owned
by different cores within the same or another processor
chip. Applications moving from Nehalem to Westmere-
based platforms could experience unexpected memory
access degradation, even though Westmere was intended
as a “drop-in” replacement for Nehalem.
In this work we attempt to provide an accurate account
of the on-chip and the system-level bandwidth and latency
limitations of Xeons. We study how these two metrics
scale as we move from one generation of a platform to
subsequent ones where the clock speed, the number of
cores and other architecture parameters are different.
Here we also analyze how locality, coherence
state and virtual memory page size of data blocks affect
memory performance. These last three factors are tra-
ditionally overlooked, but if not given adequate attention
they can affect application performance significantly. We believe
the performance analysis presented here can be used by
application developers who strive to tune their code to use
the underlying resources efficiently and avoid unnecessary
bottlenecks or surprising slowdowns.
Fig. 1. A 2-socket ccNUMA Westmere-EP platform with a 6-core
Xeon 5600 in each socket and a QPI for cache coherent data exchange
between them.
I. INTRODUCTION
Xeon 5500 and Xeon 5600 are highly successful
processors, based on the recent Intel µ-architectures
nicknamed respectively, “Nehalem” and “Westmere”.
Nehalem implements the “Intel64” instruction set ar-
chitecture (ISA), on a 45nm lithography, using high-
k metal gate transistor technology [1]–[3]. Westmere
is the follow-on implementation of an almost identical
µ-architecture but on a 32nm lithography, with a 2nd
generation high-k metal gate technology [3]–[5]. Xeons
are chip multi-processors (CMPs) designed to support
varying numbers of cores per chip according to spe-
cific packaging. A CMP is the paradigm of architect-
ing several cores and other components on the same
silicon die to utilize the higher numbers of transistors
as they become available with each new generation of
process technology. CMPs (or multi-core processors)
were adopted by processor manufacturers in their effort
to support feasible power and thermal limits [6]. The
trend of packaging higher numbers of cores on the same
chip is expected to continue in the foreseeable future as
IC feature size continues to decrease. Intel co-designed
the Westmere chip alongside Nehalem making provisions
for the increased system resources necessary in a chip
with higher number of cores [4].
In this work we focus on the two socket 4-core
Xeon 5500 (“Nehalem-EP”) and 6-core Xeon 5600
(“Westmere-EP”) platforms. Fig. 1 illustrates the basic
system components of the 12-core Westmere-EP plat-
form. Nehalem-EP platforms have an almost identical
system architecture but with 4 cores per chip. Other
differences between them will be discussed in later
sections.
Xeons have been employed extensively in high-
performance computing (HPC) platforms¹ as they
achieve high floating-point and integer computation
rates. Xeons’ high-performance is attributed to a number
of enabling technologies incorporated in their design. At
the µ-architecture level they support, among others, spec-
ulative execution and branch-prediction, wide decode
and issue instruction pipelines, multi-scalar out-of-order
execution pipelines, native support for single-instruction
multiple-data (SIMD) instructions, simultaneous multi-
threading and support for a relatively high degree of
instruction level parallelism [3], [9]–[11].
Scientific applications usually benefit from higher num-
bers of execution resources, such as floating point or
integer units. These are available in the out-of-order,
back-end execution pipeline on Xeon cores. However,
in order to sustain high instruction completion rates,
the memory subsystem has to provide each core with
data and instructions at rates that will keep the pipeline
units utilized most of the time. The demand to feed the
pipelines is exacerbated in multi-core systems like the
Xeons, since the memory system has to keep up with
several cores simultaneously. Memory access is almost
always in the critical path of a computation. Clever
techniques are being devised on the architecture side to
mitigate the memory performance bottleneck.
Xeons rely on a number of modern architectural
features to speed up memory access by the cores. These
include an on-chip integrated memory controller (IMC),
multi-level hardware and software pre-fetching, deep
queues in load and store buffers, store-to-load forward-
ing, three levels of cache, two levels of Translation-
Lookaside Buffers, wide data paths, and high-speed
cache coherent inter-chip communication over the QPI
fabric [12], [13]. The on-chip integrated memory con-
troller attaches a Xeon chip to a local DRAM through
three independent DDR3 memory channels which for
Westmere can go up to 1.333GTransfers/s. On the
Xeon-EP platform each one of the two processor chips
directly connects to physically distinct DRAM space
forming a cache-coherent Non-Uniform Memory Access
¹Xeon processors power 65% and 55% of the HPC systems
appearing, respectively, in the June 2011 [7] and the Nov. 2010
“Top-500” lists [8].
(ccNUMA) system. Fig. 1 illustrates the cc-NUMA EP
organization with two processor sockets, separate on-
chip IMC and DRAM per socket and the physical
connectivity of the two sockets by the QPI. Separate
memory controllers per chip support increased scalability
and higher access bandwidths than were possible before
with older generations of Intel processors which relied
on the (in)famous Front Side Bus architecture.
A. Motivation for this Study
Even though great progress has been achieved with
Xeons in speeding up memory access, a closer per-
formance analysis of the memory subsystem reveals
that, the per core and per thread memory bandwidth of
either microprocessor is restricted to almost one third
of their theoretical values. The aggregate, multi-threaded
memory access bandwidth tops out at two thirds of the
maximum limit. In addition to this, the NUMA effect
on latencies increasingly worsens as cores try to access
larger memory-resident data structures and the problem
is exacerbated when the regular 4KiB page sizes are
used. The Westmere memory subsystem compared to
that of the Nehalem, suffers from a worse performance
degradation when threads on a core are writing to cache
blocks owned by different cores within the same chip or
another processor chip.
Moving from Nehalem to Westmere, read performance
for data already owned by the same core scales grace-
fully with the number of cores and core clock speed.
However, when data is within the L2 cache or beyond,
write performance suffers in Westmere revealing scala-
bility issues in their design when the system moved from
the 4 to 6 cores. Applications moving from Nehalem to
Westmere based platforms, could experience unexpected
memory access degradation even though Westmere was
intended as a “drop-in” replacement for Nehalem.
Application developers are faced with several chal-
lenges trying to write efficient code for modern multi-
core cc-NUMA platforms, such as those based on Xeon.
Developers now typically have to partition the computa-
tion into parallel tasks which should utilize the cores
and memory efficiently. The cost of memory access
is given special attention since memory may quickly
become the bottleneck resource. Developers implement
cache-conscious code to maximize reuse of data already
cached and avoid costlier access to higher levels in
the memory hierarchy. Another approach to increase
efficiency of multi-threaded applications, such as OMP
code in scientific applications, is to pin each computation
thread to a particular core and allocate its data
elements from a particular DRAM module. Selecting the
right location to place threads and data on particular
system resources is a tedious and at times lengthy trial
and error process. When memory access cost changes
the code has to be re-tuned.
This work attempts to accurately quantify memory
access cost by a core to memory locations which are
resident in the different levels of memory hierarchy
and owned by threads running on the same or other
cores. We analyze performance scalability limits as we
move from one generation of a platform to subsequent
ones where the clock speed, the number of cores and
other architecture parameters are different. The analysis
presented here can be used by application developers
who strive to understand resource access cost and tune
the code to use the underlying resources efficiently.
B. Related Work
Babka and Tůma [14] attempted to experimentally
quantify the operating cost of Translation Lookaside
Buffers (TLBs) and cache associativity. Peng et al. [15]
used a “ping-pong” methodology to analyze the latency
of cache-to-cache transfers and to compare the memory
performance of older dual-core processors. The well-known
STREAM benchmark [16] measures memory bandwidth
at the user level but disregards the impact on perfor-
mance of relevant architectural features, such as NUMA
memory. Molka and Hackenberg [17], [18] compared
the latency and bandwidth of the memory subsystem on
AMD Shanghai and Intel Nehalem when memory blocks
are in different cache coherency states.
II. XEON MEMORY IDEAL PERFORMANCE LIMITS
Ideal data transfer figures are obtained by multiplying
the transfer rate by the data width of each
system channel. Vendors usually publish these and other
more intimate design details only partially.²
Xeon processor chips consist of two parts, the “Core”
and the “Un-core,” which operate on separate clock
and power domains. Fig. 2 illustrates a 6-core West-
mere chip, the Core and Un-core parts, intra-chip data
paths and some associated ideal data transfer rates. The
un-core consists of the Global Queue, the Integrated
Memory Controller, a shared Level 3 cache and QPI
ports connecting to the other processor chip and to
I/O. It also contains performance monitoring and power
management logic. The Core part houses the processor
cores.
Fig. 3. Detail of the Global Queue, the connectivity to IMC, L3, the
cores and the QPI on a 4-core Nehalem chip, and associated ideal
transfer rates.
A. The “Un-Core” Domain
The Un-core clock usually operates at twice the speed
of the DDR3 channels and for our discussion we will
assume it is at 2.667GHz. The L3 on Xeon-EP platforms
supports 2MiB per core, which is 8MiB and 12MiB for
the Nehalem and Westmere, respectively. The L3 has
32-Byte read and write ports and operates on the Un-
core clock. The QPI subsystem operates on a separate
fixed clock which for the systems we will be considering
supports 6.4 giga-transfers/s. The “Global Queue” (GQ)
structure is the central switching and buffering mech-
anism that schedules data exchanges among the cores,
the L3, the IMC and the QPI. Fig. 3 illustrates GQ
details on a Nehalem chip. The GQ buffers requests
from the Core for memory reads, for write-backs to
local memory and remote peer operations with 32, 16
and 12 slot entries, respectively. The GQ plays a central
²[19] offers a more complete discussion of the memory architec-
ture on Nehalem-EP platforms and ideal performance figures which
also apply to Westmere-EP.
Fig. 4. Cache hierarchy and ideal performance limits in a Xeon
core.
role in the operations and performance of the entire
chip [20]. However, few technical details are available
concerning the GQ. Westmere increased the peak CPU and
I/O bandwidth to DRAM memory by increasing the
per socket un-core buffers to 88 from 64 in Nehalem
[4]. This “deeper” buffering was meant to support more
outstanding memory access operations per core than
possible in Nehalem-EP.
Ideally, the IMC can transfer data to the locally
attached DRAM at the maximum aggregate bandwidth
of the DDR3 paths to the memory DIMMs. The three
DDR3 channels to local DRAM support a bandwidth of
31.992 GB/s = 3 channels × 8 bytes × 1.333 GT/s. Each core
in a socket should be able to capture a major portion of
this memory bandwidth.
The QPI links are full-duplex and their ideal transfer
rate is 12.8 GB/s per direction. When a core accesses
memory locations resident at the DRAM attached to the
other Xeon chip (see Fig. 1), data is transferred over
the QPI link connecting the chips together. The available
bandwidth through the QPI link is approximately 40% of
the theoretical bandwidth to the local DRAM and is the
absolute upper bound to access remote DRAM. The QPI,
L3, GQ and IMC include logic to support the “source-
snooping” MESIF-QPI cache-coherence protocol that the
Xeon-EP platform [12], [13] employs. The QPI logic
uses separate virtual channels to transport data or cache-
coherence messages according to their type. It also
pre-allocates fixed numbers of buffers for each source-
destination QPI pair. This likely exacerbates congestion
between end-points with high traffic.
B. The “Core” Domain
On the Core domain, each core supports two levels
of cache memory, L1 instruction and data, and a unified
L2. Fig. 4 presents details of the cache memory hierar-
chy, associated connectivity and some ideal performance
levels in a Nehalem core. The structure of Westmere cores
is very similar. The L2s connect to the L3, which is
shared by all cores, via the GQ structure. Each core has 2
levels of TLB structures for instructions and one for data.
There are separate TLBs for 4KiB and 2MiB size pages.
Each core includes a “Memory Order Buffer” with 48,
32 and 10 load, store and fill buffers, respectively. Fill
buffers temporarily store new incoming cache blocks.
There can be at most ten cache misses in progress at
a time, placing an upper bound on data retrieval rates
per core. All components in the Core domain operate on
the same clock as the processor. This implies that ideal
transfer rates scale with the clock frequency.
III. XEON MEMORY SYSTEM SCALABILITY
ANALYSIS
A. Design of Experiments
In this Section we analyze performance and scalabil-
ity limits in the memory systems of the Xeon EP
platforms. Conventional performance evaluations measure
memory bandwidth and latency regardless of (a) locality
or residency, that is, where the data is cached or resides,
and (b) the cache coherence state it is in at the time of
the access. Another factor which is usually overlooked
is (c), the virtual memory page size the system is using
to map virtual addresses³ into physical ones.
In our analysis we take all of these aspects into ac-
count since, as we show, bandwidth performance figures
vary drastically with locality and cache state. Page size
affects mostly latency and, to a smaller extent, bandwidth.
Application developers, along with the conventional
raw performance and scalability limits, also have to pay
increased attention to data locality and coherence state,
and select the proper page size, to tune their code
accordingly.
We refer to Fig. 1 to illustrate details of our inves-
tigation applying to all of our experiments. We divide
the investigation into single-core and aggregate, multi-
core performance analysis. The single core focuses on
bandwidth and latency figures a single thread, that is
fixed on a particular physical core, experiences while ac-
cessing memory. The multi-core focuses on the aggregate
system-level performance figures when threads on every
core are all performing the same memory operation. The
³The term “effective address” is used for this address.
latter one evaluates how well contention for common
resources is handled by the architecture and it reveals
limitations and opportunities.
A single-core access pattern pins a software thread
on Core 1 (or “CPU 0”) and evaluates the bandwidth
accessing data on the L1 and L2 cache memories belong-
ing to the same core and on the L1 and L2 memories
belonging to each one of the other cores. It also evaluates
the bandwidth accessing data on the L3 and the DRAMs
attached to the same and to the other processor chips.
In each of the experiments, the cache blocks the
threads access can be in different coherence states per
the MESIF-QPI protocol. We investigate the scalability
as we move from 4-core Nehalem to 6-core Westmere
processors and as we move from cores running at a
certain frequency to cores running at higher frequencies.
Wherever possible, we compare the attained performance
with the ideal performance numbers in that platform and
discuss its dependence on locality and on the particular
coherence state.
We have selected three different Xeon platform con-
figurations on two different working systems. All three
configurations operate in the so-called “IA-32e, full 64-
bit protected sub-mode” [9] which is the fully Intel64
compliant 64-bit mode.
The first system is an IBM iDataPlex cluster
called “Eos”, maintained by the Supercomputing
Facility at Texas A&M University [21]. Eos con-
sists of a mixture of Nehalem-EP (Xeon-5560) and
Westmere-EP (Xeon-5660) nodes, with all cores running
at 2.8GHz. Each node has 24GiBs of DDR3 DRAM
operating at 1.333 GT/s.
The second system is a Dell PowerEdge M610 blade
cluster, called “Lonestar”, maintained by the
Texas Advanced Computing Center (TACC) at the Uni-
versity of Texas at Austin. Lonestar consists only of
Westmere-EP (Xeon-5680) nodes with cores running
at 3.33GHz. Each node has 24GiBs of DDR3 DRAM
operating at 1.333 GT/s.
For all experiments we utilize the “BenchIT” [22],
[23] open-source package⁴, which was built on the target
Xeon EP systems. We used a slightly modified version
of a collection of benchmarks called “X86membench”
[18]. These kernels use Intel64 assembly instructions to
read and write to memory and obtain timing measure-
ments using the “Cycle-counter” hardware register that
is available on each Xeon core.
B. Xeon Memory Bandwidth Analysis
⁴Official web site of the BenchIT project at http://www.benchit.org.
Fig. 5. Bandwidth of a Single Reader Thread, Nehalem-EP, 4KiB
Pages (EOS). [Plot omitted: BenchIT kernel
new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.single-reader;
bandwidth (GB/s) vs. data set size (bytes); curves for CPU0 reading
locally and from memory owned by CPU2–CPU7.]
1) Single Core Data Retrieval: In this experiment we
investigate the effective data retrieval rates by a single
core from the different levels of memory hierarchy and
all possible data localities in the system. This captures
the portion of the system capacity a single core can
utilize effectively. A single reader thread is pinned on
“CPU0” (core 1 in Fig. 1) and reads memory seg-
ments with sizes varying successively from 10KiBs to
200MiBs.
The reader thread retrieves memory blocks from its
own data L1 and L2 caches, then from the L3 and the
DRAM associated with its own processor chip. It then
retrieves data already cached on the L1, L2 of all other
cores on the same chip. Finally it retrieves data cached
on the L1, L2 of all cores, the L3 and DRAM associated
with the other processor chip.
All memory blocks, if already cached, are in the
“Exclusive” MESIF-QPI state in the corresponding own-
ing core. A data block enters this state when it has
been read and cached by exactly one core. By the QPI
protocol, a requested block may be retrieved directly out
of an L3 instead of its home DRAM, if it is already
cached on that L3. As soon as a second core caches
a data block, the state of the first copy changes to
“Shared” and the state in the newly cached one becomes
“Forwarding”. The MESIF protocol allows exactly one copy
to be in the latter state, permitting that cache to quickly forward
it to the next requestor. This operation avoids accessing
the slower home memory and is called a “cache to cache
intervention”.
Fig. 5, Fig. 7 and Fig. 9 plot data retrieval bandwidths
on Nehalem and Westmere parts of EOS and on Lones-
tar, respectively. The top curves plot the bandwidth in GB/s
when the core retrieves data from its own L1, L2, L3 and
DRAM associated with its own and the remote chip. The
observed bandwidths of 43.7 GB/s, 43.6 GB/s and 51.8 GB/s
Fig. 6. Bandwidth of a Single Reader Thread, Nehalem-EP, 2MiB
Pages (EOS). [Plot omitted: BenchIT kernel
new_arch_x86_64_LP.memory_bandwidth.C.pthread.SSE2.single-reader;
bandwidth (GB/s) vs. data set size (bytes); curves for CPU0 reading
locally and from memory owned by CPU2–CPU7.]
Fig. 7. Bandwidth for a Single Reader Thread, Westmere-EP, 4KiB
Pages (EOS). [Plot omitted: BenchIT kernel
new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.single-reader;
bandwidth (GB/s) vs. data set size (bytes); curves for CPU0 reading
locally and from memory owned by CPU2–CPU11.]
Fig. 8. Bandwidth of a Single Reader Thread, Westmere-EP, 2MiB
Pages (EOS). [Plot omitted: BenchIT kernel
new_arch_x86_64_LP.memory_bandwidth.C.pthread.SSE2.single-reader;
bandwidth (GB/s) vs. data set size (bytes); curves for CPU0 reading
locally and from memory owned by CPU2–CPU8.]
Fig. 9. Bandwidth for a Single Reader Thread, Westmere-EP, 4KiB
Pages (LoneStar). [Plot omitted: BenchIT kernel
new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.single-reader;
bandwidth (GB/s) vs. data set size (bytes); curves for CPU0 reading
locally and from memory owned by CPU2–CPU11.]
when accessing the L1 cache (32 KiB) are very close to
the ideal values. The ideal L1 bandwidth for a 2.8 GHz
and a 3.33 GHz Xeon is 44.8 GB/s (44.8 GB/s = 2.8 GHz ×
16 bytes) and 53.28 GB/s (53.28 GB/s = 3.33 GHz × 16 bytes),
respectively.
The L2 bandwidths are measured at 29.7 GB/s, 29.7
GB/s and 35.3 GB/s, respectively. The vendor does not provide
figures on L2 performance except from the latency to
retrieve an L2 block.
Data retrievals scale well when we move from 4 cores
to 6 cores, and they also scale well with the core clock.
For instance, (3.33/2.8) × 29.7 GB/s ≈ 35.3 GB/s, which
matches the measured L2 bandwidth on the Westmere
running at 3.33GHz.
L3 data retrieval figures are not provided by the vendor;
they are measured at 23.8 GB/s, 23.1 GB/s and 25.6 GB/s,
respectively. We notice that 3.33/2.8 ≈ 1.19 > 25.6/23.1 ≈ 1.11,
implying that L3 access does not scale linearly with the
core clock. This is expected since the L3 is in the Un-Core,
which operates at twice the DDR3 rate.
The local DRAM supports 11.8 GB/s, 10.9 GB/s
and 11.1 GB/s, respectively. The remote DRAM supports 7.8
GB/s, 7.7 GB/s and 7.7 GB/s, respectively. Data from remote
DRAM traverse the QPI link but the QPI ideal rate does
not appear to be the limiting factor.
The curves at the middle of the Fig. 5, Fig. 7 and
Fig. 9 plot retrieval rates of data items already cached
in the L1 or L2 of other cores within the same chip.
The L3, which is an inclusive cache, also caches whatever
is cached above it in an L2 or L1. Thus the L3 uses cache
intervention to pass a copy of the block up to core 1.
This explains why accessing data already cached by other
cores has the same performance as accessing data from
the L3. Comparing 4-core and 6-core systems we see
that performance accessing blocks cached by other cores
within the same chip is worse for the 6-core system.
Fig. 10. Bandwidth of Single Writer Thread, Nehalem-EP, 4KiB
Pages (EOS). [Plot omitted: BenchIT kernel
new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.single-writer;
bandwidth (GB/s) vs. data set size (bytes); curves for CPU0 writing
memory used by CPU0 and by CPU2–CPU7.]
Finally, the bottom curves show the rate when core
1 accesses blocks already cached by cores on the other
chip. Rates start for all cases at around 9 GB/s where
data is supplied by the remote L3 and drop to around
7.7 GB/s for larger requests where the remote DRAM
has to be accessed.
The same experiment has been carried out with 2MiB
large VM pages on the Nehalem and Westmere parts
of EOS. Fig. 6 and Fig. 8 plot the respective results.
Bandwidth figures using large pages are similar to those
of the regular 4KiB pages with the only difference that
performance starts dropping a little later as we cross
boundaries in the memory hierarchies.
Overall, measured data retrieval rates are close to the
ideal limits and scale well with clock rate and as we
move from 4 to 6 cores.
It is clear that a single core cannot utilize the entire
available bandwidth to the DRAM. Resource limits along
the path from a core’s Memory Order Buffer to the IMC
are creating this artificial upper bandwidth bound.
Bandwidth quickly deteriorates once the
cache memories can no longer absorb the requests. Application
developers need to take these large performance disparities
into account when they tune their code for the architecture.
2) Single Core Data Updates: This experiment is the
data modification counterpart of the previous experiment
where the writer thread is also pinned on Core 1. All
blocks are initialized to the state “Modified” before the
measurements.
When a core has to write to a data block, the
MESIF protocol requires a “Read-for-Ownership” oper-
ation which snoops and invalidates this memory block if
it is already stored on other caches.
Fig. 10, Fig. 11 and Fig. 12 plot measured bandwidths.
The L1 rates closely match the retrieval rates. Perfor-
mance of L2 and L3 is relatively worse than when data
Fig. 11. Bandwidth of a Single Writer Thread, Westmere-EP, 4KiB
Pages (EOS). [Plot omitted: BenchIT kernel
new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.single-writer;
bandwidth (GB/s) vs. data set size (bytes); curves for CPU0 writing
memory used by CPU0 and by CPU2–CPU11.]
Fig. 12. Bandwidth of a Single Writer Thread, Westmere-EP, 4KiB
Pages (LoneStar). [Plot omitted: BenchIT kernel
new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.single-writer;
bandwidth (GB/s) vs. data set size (bytes); curves for CPU0 writing
memory used by CPU0 and by CPU2–CPU11.]
is retrieved from these caches. Local and remote DRAM
access is even worse. Writing to local DRAM attains 8.8 GB/s,
7.7 GB/s and 8.2 GB/s, respectively. Writing to remote DRAM
is at around 5.5 GB/s for all three cases.
On 6-core systems writing to blocks already cached by
other cores on the same chip is a less scalable operation
than on a 4-core system. For instance, Fig. 10 shows that
on the 4-core Nehalem the L3 gracefully absorbs all
block updates from within the same chip at a stable per-
formance of 17.6 GB/s. However, as we see in Fig. 11
and Fig. 12, the same scenario on the 6-core Westmere
attains 15.5 GB/s and 16.9 GB/s, respectively. The last
figure comes from a system which operates at 3.33 GHz,
that is, 1.19 (= 3.33/2.8) times faster. However, the 2.8 GHz
Nehalem still manages to attain 17.6 GB/s. More importantly, on
the 6-core systems the top curve (accessing data only
cached by itself) quickly deteriorates to 12.9 GB/s and
to even below 10 GB/s. The same curve on the 4-core
Nehalem attains 17.6 GB/s and appears much more
stable.
Applications moving from 4-core to 6-core systems
will experience unexpected performance degradation.
Fig. 13. Bandwidth of a Single Pair of Reader and Writer Streams,
Nehalem-EP, 4KiB Pages (EOS). [Plot omitted: BenchIT kernel
new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.single-r1w1;
bandwidth (GB/s) vs. data set size (bytes); curves for stream pairs
CPU0–CPU0 and CPU0–CPU2 through CPU0–CPU7.]
Fig. 14. Bandwidth of a Single Pair of Reader and Writer Streams,
Westmere-EP, 4KiB Pages (EOS). [Plot omitted: BenchIT kernel
new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.single-r1w1;
bandwidth (GB/s) vs. data set size (bytes); curves for stream pairs
CPU0–CPU0 and CPU0–CPU2 through CPU0–CPU11.]
The hardware provisions made at design time when moving from 4 to 6 cores per chip restrict scalability and performance.
3) Combined Retrieval and Update – Single Stream
Pair: In this experiment, a single thread pinned on a
core, drives simultaneously a retrieval and an update stream from various localities in the memory hierarchy towards its local DRAM.

Fig. 15. Bandwidth of a Single Pair of Reader and Writer Streams, Westmere-EP, 4KiB Pages (LoneStar). [Plot: bandwidth (GB/s) vs. data set size (bytes), kernel new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.single-r1w1; curves for CPU0 paired with CPU0 and CPUs 2–11.]

This investigates the ability
of the memory to simultaneously retrieve and update
different locations. The bandwidth figures reflect the fact
that each data block is used in two memory operations.
Fig. 13, Fig. 14 and Fig. 15 plot measured bandwidths
from the three experimental configurations. The top
curves plot the case where both read and write streams
are on the same DRAM module. For data that fits in the L1s we attain 75.6 GB/s, 75.8 GB/s and 53.3 GB/s, respectively. This shows that the two ports of the L1 can be used simultaneously at respectable rates. The middle curves plot the case where the source is cached in the L1 and L2 caches of cores within the same chip. The bottom curves plot the bandwidths attained by moving blocks cached or resident in the other chip's localities.
However, since the cache memories end up holding two copies of each data block, performance drops much faster, as soon as half of a cache is filled with the inbound blocks. The two EOS configurations have approximately
the same performance, with the Westmere one having
somewhat lower levels. The LoneStar system behaves more erratically with this workload: as soon as the L1 is overwhelmed by inbound and outbound copies, bandwidth drops to 27.5 GB/s, but once the segments grow beyond 120 KiB it jumps back to 41.4 GB/s. With the exception of this abnormal drop, the relative bandwidth in the L1 and L2 follows the clock-frequency ratio. However, L2 performance drops on Westmere compared to Nehalem as L2 occupancy increases.
Finally, when data is streamed in from the remote chip, while the inbound and outbound blocks fit in the L2 and L3 the apparent bandwidth is ≈ 14–15 GB/s, but it then drops to ≈ 9 GB/s.
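The kernel names in the figures (…SSE2.single-r1w1) suggest an SSE2 copy loop driving the read and write streams. The following is a rough illustrative sketch, not the actual BenchIT kernel; `stream_pair` and the buffer handling are hypothetical:

```c
/* Illustrative sketch of a combined read+write (r1w1) stream kernel,
 * in the spirit of the SSE2 benchmarks used here; not the actual
 * BenchIT code. Each 16-byte block is loaded once and stored once,
 * so every block counts for two memory operations. */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdlib.h>

/* Copy n_doubles (a multiple of 2) from a 16-byte aligned src to a
 * 16-byte aligned dst using 128-bit loads and stores. */
static void stream_pair(double *dst, const double *src, size_t n_doubles)
{
    for (size_t i = 0; i + 2 <= n_doubles; i += 2) {
        __m128d v = _mm_load_pd(&src[i]);  /* 128-bit aligned load  */
        _mm_store_pd(&dst[i], v);          /* 128-bit aligned store */
    }
}
```

Timed over buffers of growing size, the reported bandwidth counts both the retrieval and the update, which is why the top curves approach twice the single-direction rates.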
4) Aggregate Data Retrieval Rates: In this experiment, nc threads are split evenly across all nc available cores and simultaneously read disjoint memory segments of sizes up to 200 MiB. This investigates the aggregate throughput a system can provide simultaneously to multiple cores.
Here one thread is pinned on each core. Cached
memory blocks are in the Exclusive state. Fig. 16, Fig. 17
and Fig. 18 plot the aggregate retrieval rates on our three systems, with the x-axis being the sum of all blocks at a particular size. L1 rates are 353.3 GB/s, 524.2 GB/s and 620.9 GB/s, respectively, giving 44.2 GB/s, 43.7 GB/s and 51.7 GB/s per core, all close to the corresponding ideal bandwidths. We also notice that 620.9/524.2 ≈ 3.33/2.8 for the two Westmere systems, implying that performance scales with clock speed.
For the L2s, performance is at 237.5 GB/s, 368.8
Fig. 16. Aggregate Bandwidth of 8 Reader Threads, Nehalem-EP, 4KiB Pages (EOS). [Plot: read bandwidth of CPUs 0–7 (GB/s) vs. total data set size (bytes), kernel new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.multiple-reader.]
Fig. 17. Aggregate Bandwidth of 12 Reader Threads, Westmere-EP, 4KiB Pages (EOS). [Plot: read bandwidth of CPUs 0–11 (GB/s) vs. total data set size (bytes), kernel new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.multiple-reader.]
GB/s and 420 GB/s, respectively, giving 29.7 GB/s, 30.7 GB/s and 35 GB/s per core; the aggregates thus match the corresponding single-core rates times the number of cores.
L3 supports 161.1 GB/s, 172.1 GB/s and 171.8 GB/s,
respectively giving 20.2 GB/s, 14.3 GB/s and 14.3 GB/s,
per core.
Finally, when all requests go directly to DRAM, aggregate read bandwidth settles to approximately 39.8 GB/s, 38.4 GB/s and 38.9 GB/s, respectively, i.e., 19.9 GB/s, 19.2 GB/s and 19.4 GB/s per socket.
This experiment shows that the IMC on a chip delivers data at rates higher than an individual core can attain. The bottleneck thus is not on the DRAM and IMC side but in the Un-Core, caused by artificial limits on the resources dedicated to servicing each individual core.
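The thread-per-core pinning used in these aggregate experiments can be sketched with Linux pthread affinity calls. The `reader` worker below is a hypothetical stand-in for the benchmark's measurement thread, not its actual code:

```c
/* Sketch of the thread-per-core setup of the aggregate experiments:
 * each worker pins itself to its core and then streams through its
 * own disjoint segment. Illustrative only; names are hypothetical. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdlib.h>

#define WORDS (1 << 16)            /* per-thread segment, illustrative */

typedef struct { int core; uint64_t *buf; uint64_t sum; } worker_t;

static void *reader(void *arg)
{
    worker_t *w = arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(w->core, &set);
    /* Best effort: pinning may fail in restricted environments. */
    (void)pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    uint64_t s = 0;
    for (size_t i = 0; i < WORDS; i++)   /* sequential retrieval stream */
        s += w->buf[i];
    w->sum = s;
    return NULL;
}
```

Because each segment is disjoint, no cache line is shared between threads, so the blocks stay in the Exclusive state as the experiment requires.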
5) Aggregate Data Updates: This experiment is the
counterpart of the aggregate retrieval experiment of
Subsection III-B4. As before, nc threads are split evenly
Fig. 18. Aggregate Bandwidth of 12 Reader Threads, Westmere-EP, 4KiB Pages (LoneStar). [Plot: read bandwidth of CPUs 0–11 (GB/s) vs. total data set size (bytes), kernel new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.multiple-reader.]
Fig. 19. Aggregate Bandwidth of 8 Writer Threads, Nehalem-EP, 4KiB Pages (EOS). [Plot: write bandwidth of CPUs 0–7 (GB/s) vs. total data set size (bytes), kernel new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.multiple-writer.]
Fig. 20. Aggregate Bandwidth of 12 Writer Threads, Westmere-EP, 4KiB Pages (EOS). [Plot: write bandwidth of CPUs 0–11 (GB/s) vs. total data set size (bytes), kernel new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.multiple-writer.]
Fig. 21. Aggregate Bandwidth of 12 Writer Threads, Westmere-EP, 4KiB Pages (LoneStar). [Plot: write bandwidth of CPUs 0–11 (GB/s) vs. total data set size (bytes), kernel new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.multiple-writer.]
across all nc available cores and simultaneously update disjoint memory segments of sizes up to 200 MiB. One thread is pinned on each core and memory blocks are in the Modified state.
Fig. 19, Fig. 20 and Fig. 21 plot the aggregate update performance of the three different configurations, with the x-axis being the sum of all blocks at a particular size. Updates to L1 attain 347.1 GB/s, 527.3 GB/s and 617.3 GB/s, respectively, giving 43.4 GB/s, 43.94 GB/s and 51.4 GB/s per core, all close to the corresponding ideal bandwidths.
The L2s can update their contents at 222.9 GB/s, 325.4 GB/s and 384 GB/s, respectively, giving 27.8 GB/s, 27.1 GB/s and 32 GB/s per core, all closely following the L2 retrieval rates.
However, when we update the L3, the attained rates are 52 GB/s, 51.1 GB/s and 50.6 GB/s, respectively, considerably lower than the L3 retrieval rates. The per-core average rate of the aggregate updates is 1/3 to 1/4 of the single-core update rates of Subsection III-B2. This slowdown is somewhat expected, since the L3 is shared among all cores and all updates to it are serialized by the MESIF protocol. Clearly, not enough bandwidth has been provisioned on any of the three configurations to sustain simultaneous updates by all cores.
Finally, update rates for DRAM are much worse than those of the aggregate retrieval case. Here all aggregate rates are at ≈ 20 GB/s, or ≈ 10 GB/s per socket. Looking at the individual update rates of Subsection III-B2, we can see that with more than two individual update streams evenly split across the IMCs, the memory system becomes the bottleneck. Aggregate rates are only a little higher than those a single core attains.
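An aggregate figure like the ≈ 20 GB/s above is simply total bytes written divided by wall time. A minimal, hypothetical single-thread version of such a measurement could look like this (the real benchmark uses SSE2 stores and one pinned thread per core):

```c
/* Sketch of how an update-bandwidth figure is derived: sweep writes
 * over a buffer and divide bytes moved by elapsed wall time.
 * Illustrative only; not the benchmark's actual timing harness. */
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <string.h>
#include <time.h>

static double write_bandwidth_gbs(uint8_t *buf, size_t bytes, int reps)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        memset(buf, r & 0xff, bytes);    /* touch every cache block */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return (double)bytes * (double)reps / secs / 1e9;  /* GB/s */
}
```

Note that every store to a block in another core's cache first triggers a read-for-ownership, which is one reason write sweeps cost more than read sweeps.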
6) Aggregate Combined Retrieval and Update – Multiple Stream Pairs: This experiment investigates the
Fig. 22. Aggregate Bandwidth of 8 Read and Write Stream Pairs, Nehalem-EP, 4KiB Pages (EOS). [Plot: combined bandwidth of CPUs 0–7 (GB/s) vs. total data set size (bytes), kernel new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.multiple-r1w1.]
Fig. 23. Aggregate Bandwidth of 12 Read and Write Stream Pairs, Westmere-EP, 4KiB Pages (EOS). [Plot: combined bandwidth of CPUs 0–11 (GB/s) vs. total data set size (bytes), kernel new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.multiple-r1w1.]
Fig. 24. Aggregate Bandwidth of 12 Read and Write Stream Pairs, Westmere-EP, 4KiB Pages (LoneStar). [Plot: combined bandwidth of CPUs 0–11 (GB/s) vs. total data set size (bytes), kernel new_arch_x86_64.memory_bandwidth.C.pthread.SSE2.multiple-r1w1.]
limits of the system's aggregate ability to retrieve and update multiple data streams concurrently. A thread is pinned on each one of the nc cores on the system and drives its own stream pair. This particular memory
access pattern stresses all parts in the entire memory
infrastructure.
Fig. 22, Fig. 23 and Fig. 24 plot the aggregate combined retrieval and update performance of the three configurations, with the x-axis being the sum of all blocks at a particular size. The L1 streams attain 615.2 GB/s, 918.7 GB/s and 1076.7 GB/s, respectively, giving 76.9 GB/s, 76.6 GB/s and 89.7 GB/s on average per core, all close to the sums reported by the aggregate retrieval and update experiments of Subsections III-B4 and III-B5. For the L1, scaling with core count and clock frequency is attained.
For aggregate L2 we obtain 264.9 GB/s, 394 GB/s
and 472 GB/s, respectively giving 33.1 GB/s, 32.8 GB/s
and 39.3 GB/s, average per core. In each of these three
cases the attained bi-directional bandwidth is only a little
higher than the one from the corresponding aggregate
retrieval or update case. Here the L2 quickly shows its limitations in handling bi-directional streams of blocks. Making the L2 a dual-ported memory could mitigate this problem.
The performance of the L3 caches with bi-directional traffic is even more disappointing, as the achieved figures are 72.5 GB/s, 69.9 GB/s and 68.1 GB/s, respectively, giving 9 GB/s, 5.8 GB/s and 5.7 GB/s on average per core. These figures demonstrate that concurrent bi-directional streams cannot be handled adequately by the L3 sub-system.
Finally, bi-directional block streams are serviced by
DRAM access at 26.4 GB/s, 25.4 GB/s and 26.4 GB/s,
respectively.
This investigation reveals that the Xeons cannot gracefully handle bi-directional streams beyond the L1 cache. Further analysis would require digging into unavailable GQ and QPI details, but there is definitely room for improvement here.
C. Memory Hierarchy Access Latencies
We explore the effect page size has on the latencies to access the memory hierarchy. We focus on the 4 KiB and 2 MiB sizes available on the Xeons, of which 4 KiB is the most widely used. For all experiments a thread is pinned on Core 1 and accesses memory cached or homed at the various localities.
In all subsequent figures, the bottom curve plots cost
to access L1, L2, L3 and local DRAM. The middle ones
plot latencies to access L1, L2 and L3 on another core
within the same chip. The top curves plot latencies to
Fig. 25. Latency to Read a Data Block in Nanoseconds, Nehalem-EP, 4KiB Pages (EOS). [Plot: access time (ns) vs. data set size (bytes), kernel new_arch_x86_64.memory_latency.C.pthread.0.read; curves for CPU0 accessing its own memory and memory homed at CPUs 1–6.]
Fig. 26. Latency to Read a Data Block in Nanoseconds, Nehalem-EP, 2MiB Pages (EOS). [Plot: access time (ns) vs. data set size (bytes), kernel new_arch_x86_64_LP.memory_latency.C.pthread.0.read; curves for CPU0 accessing its own memory and memory homed at CPUs 1–6.]
Fig. 27. Latency to Read a Data Block in Nanoseconds, Westmere-EP, 4KiB Pages (EOS). [Plot: access time (ns) vs. data set size (bytes), kernel new_arch_x86_64.memory_latency.C.pthread.0.read; curves for CPU0 accessing its own memory and memory homed at CPUs 1–10.]
Fig. 28. Latency to Read a Data Block in Nanoseconds, Westmere-EP, 2MiB Pages (EOS). [Plot: access time (ns) vs. data set size (bytes), kernel new_arch_x86_64_LP.memory_latency.C.pthread.0.read; curves for CPU0 accessing its own memory and memory homed at CPUs 1–10.]
Fig. 29. Latency to Read a Data Block in Nanoseconds, Westmere-EP, 4KiB Pages (LoneStar). [Plot: access time (ns) vs. data set size (bytes), kernel new_arch_x86_64.memory_latency.C.pthread.0.read; curves for CPU0 accessing its own memory and memory homed at CPUs 1–10.]
access L1, L2 and L3 of data already cached by cores
on the other chip and finally by remote DRAM. All times
are in nano-seconds.
Fig. 25, Fig. 27 and Fig. 29 plot latencies when 4 KiB pages are used. Access to local DRAM can take up to 114.29 ns, 118.58 ns and 114.95 ns, respectively. The latency of accessing remote DRAM is, respectively, 170.37 ns, 181.8 ns and 173.3 ns. The difference in core clock frequency does not make any significant difference. The NUMA effect, that is, the disparity between the cost to access local vs. remote DRAM, is 56.08 ns, 63.22 ns and 58.35 ns, respectively. In percentage terms, remote DRAM latency is higher by 49%, 53.3% and 51%, which is rather significant.
Using the 2 MiB page size, as Fig. 26 and Fig. 28 show, latencies to local DRAM reach up to 70.7 ns and 74.7 ns, respectively, for the two EOS configurations. Accessing remote DRAM takes, respectively, 112.14 ns and 114.31 ns. The NUMA latency disparity is 41.44 ns and 40 ns, i.e., remote access is longer by 59% and 54%, respectively.
The important observation is that by using 2 MiB pages we can shorten the latency to local DRAM by 43.59 ns and 44.28 ns and to remote DRAM by 58.23 ns and 67.49 ns, respectively. Percentage-wise, we shorten the latencies to local DRAM by 38% and 37% and to remote DRAM by 34% and 37%, respectively. There are also smaller improvements when the capacities of the cache memories are reached.
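On Linux, 2 MiB pages can be requested explicitly with mmap(MAP_HUGETLB). The sketch below, with our own illustrative helper `alloc_buffer` rather than the benchmark's allocator, falls back to regular 4 KiB pages when no huge pages are reserved:

```c
/* Sketch: allocate a buffer backed by 2 MiB pages when available,
 * falling back to regular 4 KiB pages otherwise. Linux-specific and
 * illustrative only. */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

static void *alloc_buffer(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)  /* no 2 MiB pages reserved on this system */
        p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

With MAP_HUGETLB the requested length should be a multiple of the huge-page size, and huge pages must have been reserved by the administrator (e.g. via /proc/sys/vm/nr_hugepages).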
It is clear that the platform does not provide sufficient address-translation resources, such as TLBs, for the regular 4 KiB page size. Applications that must access long lists of memory locations, as in pointer chasing, will definitely suffer performance degradation with 4 KiB pages.
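Pointer chasing is the access pattern that exposes these latencies most directly, since each load depends on the previous one. A minimal sketch (the `chase` helper is illustrative, not the measurement kernel used above):

```c
/* Sketch of a pointer-chasing latency kernel: every load depends on
 * the previous one, so elapsed time divided by steps approximates the
 * access latency of the level the chain lives in. Illustrative names. */
#include <stddef.h>

/* Walk a cyclic chain of indices; returning the final position keeps
 * the compiler from discarding the dependent loads. */
static size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    for (size_t i = 0; i < steps; i++)
        p = next[p];     /* serialized, latency-bound load */
    return p;
}
```

With 4 KiB pages and a chain scattered over many pages, each step can additionally incur a TLB miss, which is exactly the overhead the 2 MiB measurements above avoid.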
IV. CONCLUSIONS
In this work we analyzed and quantified in detail per-
core and system-wide performance and scalability limits
of the memory in recent Xeon platforms. We focused on
a number of fundamental access patterns, with varying
degrees of concurrency and considered blocks in certain
coherence states. Overall, data retrievals scale well with system size and stream count, but when data updates are involved the platforms exhibit behavior which merits system improvements to correct performance and scalability issues. There is a disparity in memory performance according to the locality, concurrency and coherence state of each data block, which requires adequate attention from system designers and application developers.
Specifically, for data retrieval of blocks in Exclusive
state, the platforms scale well moving from 4 to 6 cores
and when the clock frequency increases as long as the
data fits in the various levels of cache hierarchy.
The per-core retrieval rates from local DRAM range from 10.9 to 11.8 GB/s, which is ≈ 1/2 of the bandwidth available per socket (19.2 to 19.9 GB/s) and a little more than 1/4 of the aggregate (38.4 to 39.8 GB/s). This applies to all core counts and clock frequencies and is due to resource scarcity on the DRAM-to-core path, likely in the Un-Core. It can be alleviated by providing more resources to service each core, such as deeper GQs or per-core IMC buffers.
Single-core retrievals from remote DRAM attain ≈ 64% to 70% of the local DRAM bandwidth, so the QPI is not the bottleneck. Multi-stream data retrievals scale well with core count and clock frequency.
However, when updates of blocks in the Modified state are involved, the results are mixed. 4-core chips handle updates more gracefully as data sizes increase. 6-core systems, however, experience unexpected slowdowns and unstable performance as soon as the L3 is involved. Code tuned for a 4-core system will experience a performance drop when it moves to a 6-core system, likely requiring 6-core specific tuning. The bandwidth
available to update local or remote DRAM is significantly lower than when data is retrieved from them.
Multi-stream updates scale well until the L3 is engaged, at which point performance drops significantly, pointing to an inability of the platform to scale with concurrent update streams, likely due to resource scarcity in the Un-Core or QPI. Two streams can already saturate the memory system.
Single or multiple pairs of retrieve-update streams scale well across core counts and clock speeds until the L2 is engaged, at which point performance drops significantly; adding more ports to the L2 could alleviate this. When the L3 or DRAM is involved, performance drops further, pointing to issues with handling single or concurrent bi-directional streams. Future platforms will have to include further provisions for this type of access pattern, which is not uncommon in HPC applications.
Since updates present problematic performance on
6-core systems, moving from a 4-core to a 6-core
system reveals inadequate system provisioning at the
design stage. As core counts are expected to increase, this issue has to be addressed in a scalable way. The areas needing improvement include the efficiency of the L3s, the QPI coherence protocol and the GQ structures. In particular, their ability to handle concurrent streams should be increased to allow more
memory operations to proceed concurrently. This will
require reworking L3 and GQ structure and restricting
unnecessary coherence broadcast operations with snoop
filters or other mechanisms.
With the widely used 4 KiB pages, access latency to DRAM suffers due to scarcity in TLB and other address-translation resources. The use of large 2 MiB pages mitigates the latency problem, reducing the cost by 34% to 38%. System designers will have to increase the translation resources for smaller page sizes in future platforms.
Conventionally, attention focuses on the cost of remote vs. local memory access, or on the various levels of the cache hierarchy. A class of "communication-avoiding" algorithms has been devised to take this into consideration and improve performance. However, updates are much costlier than retrievals, especially when multiple streams are in progress concurrently, a situation common in HPC workloads. Worse, as applications move to platforms with more cores, the disparities may become even greater, requiring another round of lengthy tuning. Both of these factors have to be considered.
We hope that this work provides application developers with tools to understand the cost of accessing system resources in a more quantifiable way and to tune their code accordingly.
ACKNOWLEDGMENTS
We are most grateful to the Supercomputing Facility at Texas A&M University and the TACC center at the University of Texas at Austin for allowing us to use their HPC resources for this investigation.