The Evolution of Memory Tiering at Scale

By Bob Wheeler, Principal Analyst
March 2023
www.wheelersnetwork.com
©2023 Wheeler's Network
With first-generation chips now available, the early hype around CXL is giving way to realistic performance expectations. At the same time, software support for memory tiering is advancing, building on prior work around NUMA and persistent memory. Finally, operators have deployed RDMA to enable storage disaggregation and high-performance workloads. Thanks to these advancements, main-memory disaggregation is now within reach.

Enfabrica sponsored the creation of this white paper, but the opinions and analysis are those of the author.

Tiering Addresses the Memory Crunch

Memory tiering is undergoing major advancements with the recent AMD and Intel server-processor introductions. Both AMD's new Epyc (codenamed Genoa) and Intel's new Xeon Scalable (codenamed Sapphire Rapids) introduce Compute Express Link (CXL), marking the beginning of new memory-interconnect architectures. The first generation of CXL-enabled processors handles Revision 1.1 of the specification, however, whereas the CXL Consortium released Revision 3.0 in August 2022.

When CXL launched, hyperbolic statements about main-memory disaggregation appeared, ignoring the realities of access and time-of-flight latencies. With first-generation CXL chips now shipping, customers are left to address requirements for software to become tier-aware. Operators or vendors must also develop orchestration software to manage pooled and shared memory. In parallel with software, the CXL-hardware ecosystem will take years to fully develop, particularly CXL 3.x components including CPUs, GPUs, switches, and memory expanders. Eventually, CXL promises to mature into a true fabric that can connect CPUs and GPUs to shared memories, but network-attached memory still has a role.

As Figure 1 shows, the memory hierarchy is becoming more granular, trading access latency against capacity and flexibility. The top of the pyramid serves the performance tier, where hot pages must be stored for maximum performance. Cold pages may be demoted to the capacity tier, which storage devices traditionally served. In recent years, however, developers have optimized software to improve performance when pages reside in different NUMA domains in multi-socket servers as well as in persistent (non-volatile) memories such as Intel's Optane. Although Intel discontinued Optane development, its large software investment still applies to CXL-attached memories.

FIGURE 1. MEMORY HIERARCHY (Data source: University of Michigan and Meta Inc.)
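To make the hot/cold split above concrete, the sketch below probes a Linux host's NUMA topology and turns on the kernel's cold-page demotion, which moves cold pages to a slower memory node rather than swapping them out. It is a minimal illustration rather than anything from the white paper itself: it assumes libnuma is installed and that the kernel exposes the /sys/kernel/mm/numa/demotion_enabled knob (present in recent kernels; on kernels without it, the program simply reports an error).

```c
/* Minimal sketch: inspect NUMA tiers and enable kernel demotion of cold
 * pages to a slower tier. Assumes a Linux host with libnuma installed and
 * a kernel exposing /sys/kernel/mm/numa/demotion_enabled. Run as root.
 * Build: cc tiering_probe.c -lnuma -o tiering_probe */
#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* Report the relative distance between every pair of NUMA nodes;
     * slower (capacity-tier) nodes show larger distances. */
    int max = numa_max_node();
    for (int a = 0; a <= max; a++)
        for (int b = 0; b <= max; b++)
            printf("distance node%d -> node%d: %d\n", a, b, numa_distance(a, b));

    /* Ask the kernel to demote cold pages to the slower tier instead of
     * swapping them (knob path may differ on older or newer kernels). */
    FILE *f = fopen("/sys/kernel/mm/numa/demotion_enabled", "w");
    if (!f) {
        fprintf(stderr, "demotion knob not found: %s\n", strerror(errno));
        return 1;
    }
    fputs("1\n", f);
    fclose(f);
    puts("NUMA demotion enabled");
    return 0;
}
```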
Swapping memory pages to SSD introduces a massive performance penalty, creating an opportunity for new DRAM-based capacity tiers. Sometimes referred to as "far memory," this DRAM may reside in another server or in a memory appliance. Over the last two decades, software developers advanced the concept of network-based swap, which enables a server to access remote memory located in another server on the network. By using network interface cards that support remote DMA (RDMA), system architects can reduce the access latency to network-attached memory to less than four microseconds, as Figure 1 shows. As a result, network swap can greatly improve the performance of some workloads compared with traditional swap to storage.

Memory Expansion Drives Initial CXL Adoption

Although it's little more than three years old, CXL has already achieved industry support exceeding that of previous coherent-interconnect standards such as CCIX, OpenCAPI, and HyperTransport. Crucially, AMD supported and implemented CXL despite Intel developing the original specification. The growing CXL ecosystem includes memory controllers (or expanders) that connect DDR4 or DDR5 DRAM to a CXL-enabled server (or host). An important factor in CXL's early adoption is its reuse of the PCI Express physical layer, enabling I/O flexibility without adding to processor pin counts. This flexibility extends to add-in cards and modules, which use the same slots as PCIe devices. For the server designer, adding CXL support requires only the latest Epyc or Xeon processor and some attention to PCIe-lane assignments.

The CXL specification defines three device types and three protocols required for different use cases. Here, we focus on the Type 3 device used for memory expansion, and the CXL.mem protocol for cache-coherent memory access. All three device types require the CXL.io protocol, but Type 3 devices use this only for configuration and control. Compared with CXL.io as well as PCIe, the CXL.mem protocol stack uses different link and transaction layers. The crucial difference is that CXL.mem (and CXL.cache) adopt fixed-length messages, whereas CXL.io uses variable-length packets like PCIe. In Revisions 1.1 and 2.0, CXL.mem uses a 68-byte flow-control unit (or flit), which handles a 64-byte cache line. CXL 3.0 adopts the 256-byte flit introduced in PCIe 6.0 to accommodate forward-error correction (FEC), but it adds a latency-optimized flit that splits error checking (CRC) into two 128-byte blocks. Fundamentally, CXL.mem brings load/store semantics to the PCIe interface, enabling expansion of both memory bandwidth and capacity.

As Figure 2 shows at left, the first CXL use cases revolve around memory expansion, starting with single-host configurations. The simplest example is a CXL memory module, such as Samsung's 512GB DDR5 memory expander with a PCIe Gen5 x8 interface in an EDSFF form factor. This module uses a CXL memory controller from Montage Technology, and the vendors claim support for CXL 2.0. Similarly, Astera Labs offers a DDR5 controller chip with a CXL 2.0 x16 interface. The company developed a PCIe add-in card combining its Leo controller chip with four RDIMM slots that handle up to a combined 2TB of DDR5 DRAM. Unloaded access latency to CXL-attached DRAM should be around 100ns greater than that of DRAM attached to a processor's integrated memory controllers. The memory channel appears as a single logical device (SLD), which can be allocated to only a single host. Memory expansion using a single processor and SLD represents the best case for CXL-memory performance, assuming a direct connection without intermediate devices or layers such as retimers and switches.
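To illustrate how such an expander looks to software, a direct-attached Type 3 device is typically exposed by Linux as a CPU-less NUMA node, so ordinary NUMA placement APIs can put data in it. The sketch below is an assumption-laden example rather than vendor code: node 2 is a hypothetical node ID for the CXL-attached DRAM, and libnuma must be installed.

```c
/* Minimal sketch: place a working buffer in CXL-attached DRAM. On Linux, a
 * direct-attached Type 3 expander typically appears as a CPU-less NUMA node;
 * CXL_NODE below is a hypothetical node ID for that device.
 * Build: cc cxl_alloc.c -lnuma -o cxl_alloc */
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define CXL_NODE 2                    /* hypothetical CXL memory node */
#define BUF_SIZE (256UL << 20)        /* 256MiB test buffer */

int main(void)
{
    if (numa_available() < 0 || CXL_NODE > numa_max_node()) {
        fprintf(stderr, "NUMA node %d not present on this host\n", CXL_NODE);
        return 1;
    }

    /* Place the buffer's pages on the CXL node: subsequent loads and stores
     * become CXL.mem accesses, roughly 100ns slower than local DRAM. */
    char *buf = numa_alloc_onnode(BUF_SIZE, CXL_NODE);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }
    memset(buf, 0xA5, BUF_SIZE);      /* touch the pages to fault them in */
    printf("256MiB resident on node %d (CXL-attached DRAM)\n", CXL_NODE);

    numa_free(buf, BUF_SIZE);
    return 0;
}
```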
FIGURE 2. CXL 1.1/2.0 USE CASES

The next use case is pooled memory, which enables flexible allocation of memory regions to specific hosts. In pooling, memory is assigned and accessible to only a single host—that is, a memory region is not shared by multiple hosts simultaneously. When connecting multiple processors or servers to a memory pool, CXL enables two approaches. The original approach added a CXL switch component between the hosts and one or more expanders (Type 3 devices). The downside of this method is that the switch adds latency, which we estimate at around 80ns. Although customers can design such a system, we do not expect this use case will achieve high-volume adoption, as the added latency decreases system performance.

An alternative approach instead uses a multi-headed (MH) expander to directly connect a small number of hosts to a memory pool, as shown in the center of Figure 2. For example, startup Tanzanite Silicon Solutions demonstrated an FPGA-based prototype with four heads prior to its acquisition by Marvell, which later disclosed a forthcoming chip with eight x8 host ports. These multi-headed controllers can form the heart of a memory appliance offering a pool of DRAM to a small number of servers. The command interface for managing an MH expander wasn't standardized until CXL 3.0, however, meaning early demonstrations used proprietary fabric management.

CXL 3.x Enables Shared-Memory Fabrics

Although it enables small-scale memory pooling, CXL 2.0 has numerous limitations. In terms of topology, it's limited to 16 hosts and a single-level switch hierarchy. More important for connecting GPUs and other accelerators, each host supports only a single Type 2 device, which means CXL 2.0 can't be used to build a coherent GPU server. CXL 3.0 enables up to 16 accelerators per host, allowing it to serve as a standardized coherent interconnect for GPUs. It also adds peer-to-peer (P2P) communications, multi-level switching, and fabrics with up to 4,096 nodes.

Whereas memory pooling enables flexible allocation of DRAM to servers, CXL 3.0 enables true shared memory. The shared-memory expander is called a global fabric-attached memory (G-FAM) device, and it allows multiple hosts or accelerators to coherently share memory regions. The 3.0 specification also adds up to eight dynamic capacity (DC) regions for more granular memory allocation. Figure 3 shows a simple example using a single switch to connect an arbitrary number of hosts to shared memory. In this case, either the hosts or the devices may manage cache coherence.
FIGURE 3. CXL 3.X SHARED MEMORY

For an accelerator to directly access shared memory, however, the expander must implement coherence with back invalidation (HDM-DB), which is new to the 3.0 specification. In other words, for CXL-connected GPUs to share memory, the expander must implement an inclusive snoop filter. This approach introduces potential blocking, as the specification enforces strict ordering for certain CXL.mem transactions. The shared-memory fabric will experience congestion, leading to less-predictable latency and the potential for much greater tail latency. Although the specification includes QoS Telemetry features, host-based rate throttling is optional, and these capabilities are unproven in practice.

RDMA Enables Far Memory

As CXL fabrics grow in size and heterogeneity, the performance concerns expand as well. For example, putting a switch in each shelf of a disaggregated rack is elegant, but it adds a switch hop to every transaction between different resources (compute, memory, storage, and network). Scaling to pods and beyond adds link-reach challenges, and even time-of-flight latency becomes meaningful. When multiple factors cause latency to exceed 600ns, system errors may occur. Finally, although load/store semantics are attractive for small transactions, DMA is generally more efficient for bulk-data transfers such as page swapping or VM migration. Ultimately, the coherency domain needs to extend only so far.

Beyond the practical limits of CXL, Ethernet can serve the need for high-capacity disaggregated memory. From a data-center perspective, Ethernet's reach is unlimited, and hyperscalers have scaled RDMA-over-Ethernet (RoCE) networks to thousands of server nodes. Operators have deployed these large RoCE networks for storage disaggregation using SSDs, however, not DRAM.

Figure 4 shows an example implementation of memory swap over RDMA, in this case the Infiniswap design from the University of Michigan. The researchers' goal was to disaggregate free memory across servers, addressing memory underutilization, also known as stranding. Their approach used off-the-shelf RDMA hardware (RNICs) and avoided application modification. The system software uses an Infiniswap block device, which appears to the virtual memory manager (VMM) as conventional storage. The VMM handles the Infiniswap device as a swap partition, just as it would use a local SSD partition for page swapping.
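For a sense of how such a device plugs into the VMM, the sketch below registers a hypothetical network-backed block device as a high-priority swap target using the Linux swapon(2) call. The device name /dev/infiniswap0 is made up for illustration, and the real Infiniswap module performs its own registration in kernel space, so this is only an approximation of the idea under those assumptions.

```c
/* Minimal sketch: register a network-backed block device as a high-priority
 * swap target, which is how a design like Infiniswap presents far memory to
 * the VMM. /dev/infiniswap0 is a hypothetical device name; it must already
 * exist and have been formatted with mkswap.
 * Build: cc netswap_on.c -o netswap_on   (run as root) */
#include <stdio.h>
#include <sys/swap.h>

#define NETSWAP_DEV  "/dev/infiniswap0"   /* hypothetical remote-memory device */
#define NETSWAP_PRIO 100                  /* prefer it over any SSD swap space */

int main(void)
{
    /* A higher priority makes the VMM page to far memory first and fall back
     * to disk-based swap only when the network tier is exhausted. */
    int flags = SWAP_FLAG_PREFER |
                ((NETSWAP_PRIO << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);

    if (swapon(NETSWAP_DEV, flags) != 0) {
        perror("swapon");
        return 1;
    }
    printf("paging to %s at priority %d\n", NETSWAP_DEV, NETSWAP_PRIO);
    return 0;
}
```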
FIGURE 4. MEMORY SWAP OVER ETHERNET

The target server runs an Infiniswap daemon in user space, handling only the mapping of local memory to remote block devices. Once memory is mapped, read and write requests bypass the target server's CPU using RDMA, resulting in a zero-overhead data plane. In the researchers' system, every server loaded both software components so they could serve as both requestors and targets, but the concept extends to a memory appliance that serves only the target side.
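The sketch below illustrates that one-sided data plane: a single RDMA READ pulls a 4KB page from the target without involving its CPU. It is a fragment under stated assumptions rather than Infiniswap's actual code: the connected queue pair, completion queue, and memory registrations (lkey, rkey, raddr) are assumed to have been exchanged already by the control plane.

```c
/* Sketch of a one-sided far-memory fetch: read a remote 4KB page with an
 * RDMA READ, which completes without involving the target server's CPU.
 * Assumes a reliable-connected queue pair (qp), completion queue (cq), and a
 * registered local buffer (laddr/lkey); raddr/rkey describe the remote page
 * as advertised by the target's daemon. Link with -libverbs. */
#include <infiniband/verbs.h>
#include <stdint.h>

int fetch_remote_page(struct ibv_qp *qp, struct ibv_cq *cq,
                      void *laddr, uint32_t lkey,
                      uint64_t raddr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)laddr,
        .length = 4096,                      /* one memory page */
        .lkey   = lkey,
    };
    struct ibv_send_wr wr = {0}, *bad = NULL;
    wr.opcode              = IBV_WR_RDMA_READ;   /* one-sided read */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = raddr;
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad))
        return -1;

    /* Busy-poll the completion queue; production code would block or batch. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```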
The University of Michigan team built a 32-node cluster using 56Gbps InfiniBand RNICs, although Ethernet RNICs should operate identically. They tested several memory-intensive applications, including VoltDB running the TPC-C benchmark and Memcached running Facebook workloads. With only 50% of the working set stored in local DRAM and the remainder served by network swap, VoltDB and Memcached delivered 66% and 77%, respectively, of the performance of the same workloads with the complete working set in local DRAM. By comparison, disk-based swap with the 50% working set delivered only 4% and 6%, respectively, of baseline performance. Thus, network swap provided an order-of-magnitude speedup compared with swap to disk.

Other researchers, including teams at Alibaba and Google, advocate for modifying the application to directly access a remote memory pool, leaving the operating system unmodified. This approach can deliver greater performance than the more generalized design presented by the University of Michigan. Hyperscalers have the resources to develop custom applications, whereas the broader market requires support for unmodified applications. Given the implementation complexities of network swap at scale, the application-centric approach will likely be deployed first.

Either way, Ethernet provides low latency and low overhead using RDMA, and its reach easily handles row- or pod-scale fabrics. The fastest available Ethernet-NIC ports can also deliver enough bandwidth to handle one DDR5 DRAM channel. When using a jumbo frame to transfer a 4KB memory page, 400G Ethernet has only 1% overhead, yielding 49GB/s of effective bandwidth. That figure well exceeds the 31GB/s of effective bandwidth delivered by one 64-bit DDR5-4800 channel. Although 400G RNICs represent the leading edge, Nvidia shipped its ConnectX-7 adapter in volume during 2022.
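As a rough check on those figures, the snippet below redoes the arithmetic. The 40-byte per-frame framing cost is an assumption (actual Ethernet/IP/UDP/RoCE overhead depends on the transport), and the 31GB/s DDR5 effective-bandwidth figure is taken from the text rather than derived.

```c
/* Back-of-the-envelope check of the bandwidth comparison above. The per-frame
 * framing cost is an assumed round number; the DDR5 effective figure is the
 * one cited in the text. */
#include <stdio.h>

int main(void)
{
    const double page_bytes    = 4096.0;      /* one 4KB memory page per frame */
    const double framing_bytes = 40.0;        /* assumed per-frame header cost */
    const double eth_raw_GBps  = 400.0 / 8.0; /* 400Gbps line rate = 50GB/s */

    double eth_eff_GBps = eth_raw_GBps * page_bytes / (page_bytes + framing_bytes);

    const double ddr5_peak_GBps = 4800e6 * 8 / 1e9; /* 64-bit DDR5-4800 = 38.4GB/s */
    const double ddr5_eff_GBps  = 31.0;             /* effective figure from the text */

    printf("400G Ethernet: ~%.1fGB/s effective (%.1f%% framing overhead)\n",
           eth_eff_GBps, 100.0 * framing_bytes / (page_bytes + framing_bytes));
    printf("DDR5-4800 channel: %.1fGB/s peak, ~%.0fGB/s effective\n",
           ddr5_peak_GBps, ddr5_eff_GBps);
    return 0;
}
```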
The Long Road to Memory Fabrics

Cloud data centers succeeded in disaggregating storage and network functions from CPUs, but main-memory disaggregation remained elusive. Pooled memory was on the roadmap for Intel's Rack-Scale Architecture a decade ago but never came to fruition. The Gen-Z Consortium formed in 2016 to pursue a memory-centric fabric architecture, but system designs reached only the prototype stage. History tells us that as industry standards add complexity and optional features, their likelihood of volume adoption drops. CXL offers incremental steps along the architectural-evolution path, allowing the technology to ramp quickly while offering future iterations that promise truly composable systems.

Workloads that benefit from memory expansion include in-memory databases such as SAP HANA and Redis, in-memory caches such as Memcached, and large virtual machines, as well as AI training and inference, which must handle ever-growing large-language models. These workloads fall off a performance cliff when their working sets don't fully fit in local DRAM. Memory pooling can alleviate the problem of stranded memory, which impacts the capital expenditures of hyperscale data-center operators. A Microsoft study, detailed in a March 2022 paper, found that up to 25% of server DRAM was stranded in highly utilized Azure clusters. The company modeled memory pooling across different numbers of CPU sockets and estimated it could reduce overall DRAM requirements by about 10%.

The case for pure-play CXL 3.x fabric adoption is less compelling, in part because of GPU-market dynamics. Current data-center GPUs from Nvidia, AMD, and Intel implement proprietary coherent interconnects for GPU-to-GPU communications, alongside PCIe for host connectivity. Nvidia's top-end Tesla GPUs already support memory pooling over the proprietary NVLink interface, solving the stranded-memory problem for high-bandwidth memory (HBM). The market leader is likely to favor NVLink, but it may also support CXL by sharing lanes (serdes) between the two protocols. Similarly, AMD and Intel could adopt CXL in addition to Infinity and Xe-Link, respectively, in future GPUs. The absence of disclosed GPU support, however, creates uncertainty around adoption of advanced CXL 3.0 features, whereas the move to PCIe Gen6 lane rates for existing use cases is undisputed. In any case, we expect it will be 2027 before CXL 3.x shared-memory expanders achieve high-volume shipments.

In the meantime, multiple hyperscalers adopted RDMA to handle storage disaggregation as well as high-performance computing. Although the challenges of deploying RoCE at scale are widely recognized, these large customers are capable of solving the performance and reliability concerns. They can extend this deployed and understood technology into new use cases, such as network-based memory disaggregation. Research has demonstrated that a network-attached capacity tier can deliver strong performance when system architects apply it to appropriate workloads.

We view CXL and RDMA as complementary technologies, with the former delivering the greatest bandwidth and lowest latency and the latter offering greater scale. Enfabrica developed an architecture it calls an Accelerated Compute Fabric (ACF), which collapses CXL/PCIe-switch and RNIC functions into a single device. When instantiated in a multiterabit chip, the ACF can connect coherent local memory while scaling across chassis and racks using up to 800G Ethernet ports. Crucially, this approach removes dependencies on advanced CXL features that will take years to reach the market. Data-center operators will take multiple paths to memory disaggregation, as each has different priorities and unique workloads. Those with well-defined internal workloads will likely lead, whereas others that prioritize public-cloud instances are apt to be more conservative. Early adopters create opportunities for vendors that can solve a particular customer's most pressing need.

Bob Wheeler is an independent industry analyst who has covered semiconductors and networking for more than two decades. He is currently principal analyst at Wheeler's Network, established in 2022. Previously, Wheeler was a principal analyst at The Linley Group and a senior editor for Microprocessor Report. After joining The Linley Group in 2001, he authored articles, reports, and white papers covering a range of chips including Ethernet switches, DPUs, server processors, and embedded processors, as well as emerging technologies. Wheeler's Network offers white papers, strategic consulting, roadmap reviews, and custom reports. Our free blog is available at www.wheelersnetwork.com.