3. Recommendation model for Click-Through Rate Prediction
• Input Features
◦ Dense features
• E.g., age and gender
◦ Sparse features
• E.g., location and advertisement category
• Dense features
◦ May apply a neural feature extractor
• Sparse features
◦ Translated into a dense feature embedding
by looking it up in an embedding table
2024/2/21 3
Introduction
4. Challenges in a CPU-based System
• Embedding table lookups are costly
◦ Induce massive random DRAM accesses on
CPU servers.
• Embedding lookups & computation can be
expensive if one resorts to ML frameworks such
as TensorFlow and PyTorch.
◦ The embedding layer involves 37 types of
operators, resulting in significant time
consumption especially in small batches.
5. Approach in this paper
• Employ more suitable hardware architecture
◦ Hybrid memory system containing High
Bandwidth Memory (HBM)
◦ Deeply pipelined dataflow
• Revisit the data structures used for embedding
tables to reduce the number of memory accesses
◦ Apply Cartesian products to combine some of
the tables
7. System Overview
① Host server streams dense and sparse features to the FPGA.
② Embedding lookup unit translates the sparse features to dense vectors.
③ Computation unit takes the concatenated vector as input and finishes inference.
Method
8. Boost Embedding Lookup Concurrency by Increased Memory Channels
Embedding table lookup operations can be
parallelized when multiple memory channels are
available.
High-Bandwidth Memory (HBM)
Offers improved concurrency and bandwidth
compared to conventional DRAMs.
Utilize HBM with 8 GB and 32 memory banks
Each bank only contains one or a few tables, and
up to 32 tables can be looked up concurrently.
9. Reduce Memory Accesses by Cartesian Products
Reduce the number of memory accesses by combining
tables
Each memory access can retrieve multiple
embedding vectors.
Two embedding tables are joined into a single larger one.
Instead of two separate lookup operations, now only
one access is needed to retrieve the two vectors.
Reducing memory accesses by half can lead to a
speedup of almost 2×.
Help balance the workload on each memory channel.
10. A Rule-based Algorithm for Table Combination and Allocation
Heuristic rule-based search
Rule 1: Large tables are not considered.
Rule 2: Cartesian products are applied to pairs of two tables.
Rule 3: Within the product candidates, the
smallest tables are paired with the largest.
Rule 4: Cache the smallest tables on chip.
12. Reduce Latency by Deeply Pipelined Dataflow
The embedding lookup stage and three computation stages are pipelined. Each DNN computation
module is further divided into three pipeline stages.
Batching latency is eliminated.
13. DNN Computation Module
The computation flow for a single FC layer: input feature broadcasting, general matrix-matrix
multiplication (GEMM) computation, and result gathering.
Partial GEMM is allocated to each processing unit (PE) for better routing design and potentially
higher performance
Today I would like to introduce a paper, MicroRec: Efficient Recommendation Inference by Hardware and Data Structure Solutions.
First is the introduction part.
This Figure illustrates a classical deep recommendation model for Click-Through Rate (CTR) prediction. An input feature vector consists of dense features (e.g., age and gender) and sparse features (e.g., location and advertisement category). Over the dense feature vector, some systems apply a neural feature extractor that consists of multiple fully connected (FC) layers, while some designs do not contain the bottom FC layers. Over the sparse feature vector, the system translates each feature into a dense feature embedding by looking it up in an embedding table. These features are then combined (e.g., concatenated) and fed to a neural classification model consisting of multiple fully connected layers.
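As a rough illustration, this forward pass can be sketched in NumPy. The table names, dimensions, and the single-layer classifier below are toy assumptions for illustration, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding tables (assumed names and sizes, for illustration only).
EMB_DIM = 4
embedding_tables = {
    "location": rng.normal(size=(100, EMB_DIM)),    # 100 possible locations
    "ad_category": rng.normal(size=(20, EMB_DIM)),  # 20 ad categories
}

def ctr_forward(dense, sparse, weights):
    """Toy CTR forward pass: embedding lookup -> concatenate -> FC + sigmoid."""
    # Translate each sparse feature into a dense embedding by table lookup.
    embeddings = [embedding_tables[name][idx] for name, idx in sparse.items()]
    x = np.concatenate([dense] + embeddings)
    # Stand-in for the classification MLP: one fully connected output.
    logit = x @ weights
    return 1.0 / (1.0 + np.exp(-logit))

dense = np.array([0.3, 1.0])                        # e.g., normalized age, gender
sparse = {"location": 42, "ad_category": 7}
w = rng.normal(size=2 + 2 * EMB_DIM)
print(float(ctr_forward(dense, sparse, w)))
```

The output is a predicted click probability in (0, 1); a real model would replace the single weight vector with the FC stack described above.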
First, embedding table lookups are costly because they induce massive random DRAM accesses on CPU servers. Production recommendation models usually consist of at least tens of embedding tables, thus each inference requires the corresponding lookup operations. Due to the tiny size of each embedding vector, the resulting DRAM accesses are nearly random rather than sequential. Since CPU servers have only a few memory channels, these random DRAM accesses are expensive.
Second, both embedding lookups and computation can be expensive if one resorts to ML frameworks such as TensorFlow and PyTorch. According to our observations, on TensorFlow Serving which is optimized for inference, the embedding layer involves 37 types of operators (e.g., concatenation and slice) and these operators are invoked multiple times during inference, resulting in significant time consumption especially in small batches.
So, to address the challenges mentioned before, the methods proposed in this paper are rooted in two sources. First, we employ a more suitable hardware architecture for recommendation with (a) a hybrid memory system containing High Bandwidth Memory (HBM) for highly concurrent embedding lookups; and (b) a deeply pipelined dataflow on FPGA for low-latency neural network inference. Second, we revisit the data structures used for embedding tables to reduce the number of memory accesses. By applying Cartesian products to combine some of the tables, the number of DRAM accesses required to finish the lookups is significantly reduced.
Then we come to some details of this method.
This Figure overviews the hardware design of MicroRec. Embedding tables are distributed over both on-chip memory (BRAM and URAM) and off-chip memory (HBM and DDR). Neural network inference is taken care of by the DNN computation units, which contain both on-chip buffers storing the weights of the model and computation resources for fast inference. To conduct inference, the host server first streams dense and sparse features to the FPGA. Then, the embedding lookup unit translates the sparse features to dense vectors by looking up embedding tables from both on-chip and off-chip memory. Finally, the computation unit takes the concatenated dense vector as input and finishes inference before returning the predicted CTR to the host.
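The three-step flow can be mimicked in plain Python; the function names and the trivial "DNN" (a sum) below are placeholder assumptions, standing in for the host, the lookup unit, and the computation unit:

```python
# Schematic of the inference flow; plain functions stand in for hardware units.
def host_stream(query):
    # Step 1: the host streams dense and sparse features to the FPGA.
    return query["dense"], query["sparse"]

def embedding_lookup_unit(sparse, tables):
    # Step 2: sparse features become dense vectors via table lookups
    # (tables may live on chip in BRAM/URAM or off chip in HBM/DDR).
    return [tables[name][idx] for name, idx in sparse.items()]

def computation_unit(dense, embeddings):
    # Step 3: the concatenated vector feeds the DNN; a sum is a placeholder.
    concatenated = dense + [v for emb in embeddings for v in emb]
    return sum(concatenated)

tables = {"location": {3: [0.1, 0.2]}, "ad_category": {1: [0.3, 0.4]}}
query = {"dense": [0.5], "sparse": {"location": 3, "ad_category": 1}}
dense, sparse = host_stream(query)
score = computation_unit(dense, embedding_lookup_unit(sparse, tables))
print(round(score, 2))  # -> 1.5
```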
The tens of embedding table lookup operations during inference can be parallelized when multiple memory channels are available. MicroRec resorts to high-bandwidth memory as the main force supporting highly concurrent embedding lookups.
As an attractive solution for high-performance systems, HBM offers improved concurrency and bandwidth compared to conventional DRAMs. Specifically, the HBM here consists of 32 memory banks, which can be accessed concurrently by independent pseudo channels. Thus, embedding tables can be distributed to these banks so that each bank only contains one or a few tables, and up to 32 tables can be looked up concurrently.
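As a software analogy (not the FPGA implementation), Python threads can stand in for the independent pseudo channels; the table contents, counts, and round-robin placement below are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: threads model HBM pseudo channels; real concurrency comes
# from independent memory banks, not from Python threading.
NUM_BANKS = 32
tables = {f"table_{i}": {j: [i, j] for j in range(8)} for i in range(16)}

# Distribute tables round-robin so each bank holds one or a few tables.
bank_of = {name: i % NUM_BANKS for i, name in enumerate(tables)}

def lookup(name, key):
    # Each bank serves lookups for its own tables independently.
    return tables[name][key]

# Up to 32 lookups can proceed concurrently, one per bank.
with ThreadPoolExecutor(max_workers=NUM_BANKS) as pool:
    vecs = list(pool.map(lambda nk: lookup(*nk), [(n, 3) for n in tables]))
print(len(vecs))  # 16 lookups issued in parallel
```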
We can reduce the number of memory accesses by combining tables so that each memory access can retrieve multiple embedding vectors.
As shown in this Figure, two embedding tables can be joined into a single larger one through a relational Cartesian product. Since tables A and B each have two entries, the product table ends up with four entries: each of them is a longer vector obtained by concatenating an entry from table A and another from table B. Using such a representation, the number of memory accesses is reduced by half: instead of two separate lookup operations, now only one access is needed to retrieve the two vectors.
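A minimal sketch of this Cartesian-product trick, using two toy two-entry tables:

```python
import itertools

# Two toy embedding tables, two entries each (vectors of length 2).
table_a = [[1, 2], [3, 4]]
table_b = [[5, 6], [7, 8]]

# Cartesian product: every pair (a_i, b_j) concatenated into one row,
# giving 2 x 2 = 4 entries, each holding both vectors.
product_table = [a + b for a, b in itertools.product(table_a, table_b)]
assert len(product_table) == 4

def lookup_combined(i, j):
    # One memory access retrieves both embedding vectors at once.
    row = product_table[i * len(table_b) + j]
    return row[:2], row[2:]

print(lookup_combined(1, 0))  # -> ([3, 4], [5, 6])
```

The cost is storage: the product table has |A| x |B| entries, which is why (per the rules below) only small tables are combined.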
In this way, we reduce the memory accesses by half, which can lead to a speedup of almost 2× and also helps balance the workload on each memory channel.
Which two tables can be combined? Here the authors use a heuristic search, and design four rules.
Algorithm 1 sketches the heuristic-rule-based search for table combination and allocation. It starts by iterating over the number of tables selected as Cartesian product candidates. Within each iteration, the candidates are quickly combined by applying the first three heuristic rules. All tables are then allocated to memory banks efficiently by rule 4. The algorithm ends by returning the searched solution that achieves the lowest embedding lookup latency. Considering the outer loop iterating over Cartesian candidate numbers, the total time complexity of the heuristic algorithm is as low as O(N²).
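A simplified sketch of the rule-based combination step follows; the thresholds, the size-only table model, and the tables-per-bank latency proxy are invented stand-ins, not the paper's actual Algorithm 1:

```python
def combine_and_allocate(sizes, num_banks=32, large_threshold=4096,
                         cache_threshold=64):
    """Toy version of the four heuristic rules over table sizes (row counts)."""
    # Rule 1: large tables are excluded from Cartesian-product candidates.
    small = sorted(s for s in sizes if s <= large_threshold)
    # Rule 4: cache the smallest tables on chip (they need no DRAM access).
    on_chip = [s for s in small if s <= cache_threshold]
    candidates = [s for s in small if s > cache_threshold]
    # Rules 2 and 3: products of exactly two tables, pairing the smallest
    # remaining candidate with the largest.
    pairs = []
    while len(candidates) >= 2:
        pairs.append((candidates.pop(0), candidates.pop(-1)))
    combined = [a * b for a, b in pairs] + candidates
    tables = combined + [s for s in sizes if s > large_threshold]
    # Allocate tables round-robin over banks; model lookup latency as the
    # maximum number of tables on any single bank.
    per_bank = [0] * num_banks
    for i, _ in enumerate(tables):
        per_bank[i % num_banks] += 1
    return pairs, on_chip, max(per_bank)

print(combine_and_allocate([10000, 2000, 1000, 500, 300, 100, 50, 20]))
```

In the real algorithm this inner step runs inside the outer loop over candidate counts, and the solution with the lowest measured lookup latency wins.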
In this section, we describe the implementation of MicroRec on an FPGA with an emphasis on its low inference latency.
As shown in this Figure, the authors apply a highly pipelined accelerator architecture where multiple items are processed by the accelerator concurrently in different stages. In this design, the embedding lookup stage and three computation stages are pipelined. Each DNN computation module is further divided into three pipeline stages: feature broadcasting, computation, and result gathering.
Latency is reduced by this highly pipelined design for two reasons. First, input items are processed item by item instead of batch by batch, thus the time to wait for and aggregate a batch of recommendation queries is removed. Second, the end-to-end inference latency of a single item is much less than that of a large batch.
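A toy latency model makes the second point concrete; the assumption of four equal-time stages is mine, for illustration only:

```python
# One embedding-lookup stage plus three computation stages, each taking
# one time unit (an assumption, not measured numbers).
STAGES = [1, 1, 1, 1]

def batched_latency(batch):
    # Batch execution: every stage processes all items before the next stage.
    return sum(t * batch for t in STAGES)

def pipelined_latency(batch):
    # Pipelined execution: stages overlap, so the first item takes the full
    # stage sum and each following item adds only one slowest-stage interval.
    return sum(STAGES) + (batch - 1) * max(STAGES)

print(batched_latency(8), pipelined_latency(8))  # 32 11
```

The first item's result is ready after 4 time units instead of waiting for the whole batch, which is exactly the item-by-item benefit described above.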
The computation flow for a single FC layer consists of three pipeline stages: input feature broadcasting, general matrix-matrix multiplication (GEMM) computation, and result gathering. Partial GEMM is allocated to each processing unit (PE) for better routing design and potentially higher performance (de Fine Licht et al., 2020). Each PE conducts partial GEMM through parallelized multiplications followed by an add tree.
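The three-stage FC flow can be sketched in NumPy; the dimensions, the number of PEs, and the column-wise partitioning scheme are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
IN_DIM, OUT_DIM, NUM_PE = 8, 6, 3   # toy sizes, chosen for illustration
COLS_PER_PE = OUT_DIM // NUM_PE     # output columns owned by each PE

# Stage 1: the input feature is broadcast to all PEs.
x = rng.normal(size=IN_DIM)
W = rng.normal(size=(IN_DIM, OUT_DIM))  # FC layer weights

# Stage 2: each PE computes a partial GEMM over its slice of output
# columns, via parallelized multiplications followed by an add tree
# (np.sum stands in for the add tree here).
partials = []
for pe in range(NUM_PE):
    cols = range(pe * COLS_PER_PE, (pe + 1) * COLS_PER_PE)
    partials.append([float(np.sum(x * W[:, c])) for c in cols])

# Stage 3: result gathering concatenates the partial outputs of all PEs.
y = np.array([v for part in partials for v in part])
assert np.allclose(y, x @ W)  # matches a full matrix-vector product
```

Splitting the output columns across PEs keeps each unit's wiring local, which is the routing benefit mentioned above.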