MICROREC: EFFICIENT RECOMMENDATION INFERENCE
BY HARDWARE AND DATA STRUCTURE SOLUTIONS
4th MLSys
2024/2/21 1
Introduction
Recommendation model for Click-Through Rate Prediction
• Input Feature
◦ Dense features
• E.g., age and gender
◦ Sparse features
• E.g., location and advertisement category
• Dense feature
◦ May apply a neural feature extractor
• Sparse Feature
◦ Translated into a dense feature embedding
by looking it up in an embedding table
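A minimal sketch of the input path on this slide, assuming toy table sizes and hypothetical feature names (the slides do not give the real tables or dimensions). It skips the optional neural feature extractor on the dense features and only shows how each sparse feature becomes a dense vector through a table lookup before concatenation.

```python
# Toy embedding tables: one row (a dense vector) per category ID.
# Table names, sizes, and values are illustrative assumptions.
embedding_tables = {
    "location": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    "ad_category": [[1.0, 1.1], [1.2, 1.3]],
}


def build_model_input(dense_features, sparse_features):
    """Concatenate dense features with the embedding of each sparse feature."""
    vector = list(dense_features)
    for name, category_id in sparse_features.items():
        # One table lookup per sparse feature.
        vector.extend(embedding_tables[name][category_id])
    return vector


model_input = build_model_input(
    dense_features=[0.35, 1.0],  # e.g. normalized age, encoded gender
    sparse_features={"location": 2, "ad_category": 0},
)
print(model_input)
```

Each sparse feature costs one lookup into a different table, which is why a model with tens of tables issues tens of small, scattered memory reads per inference.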
Challenges in a CPU-based System
• Embedding table lookups are costly
◦ Induce massive random DRAM accesses on
CPU servers.
• Embedding lookups & computation can be
expensive if one resorts to ML frameworks such
as TensorFlow and PyTorch.
◦ On TensorFlow Serving, the embedding layer involves
37 types of operators, which take significant time,
especially with small batches.
Approach in this paper
• Employ a more suitable hardware architecture
◦ Hybrid memory system containing High
Bandwidth Memory (HBM)
◦ Deeply pipelined dataflow on FPGA
• Revisit the data structures used for embedding
tables to reduce the number of memory accesses
◦ Apply Cartesian products to combine some of
the tables
Method: MICROREC
System Overview
① Host server streams dense and sparse features to the FPGA.
② Embedding lookup unit translates the sparse features to dense vectors.
③ Computation unit takes the concatenated vector as input and finishes inference.
Boost Embedding Lookup Concurrency by Increased Memory Channels
Embedding table lookup operations can be
parallelized when multiple memory channels are
available.
High-Bandwidth Memory (HBM)
Offers improved concurrency and bandwidth
compared to conventional DRAM.
MicroRec uses 8 GB of HBM with 32 memory banks.
Each bank contains only one or a few tables, so
up to 32 tables can be looked up concurrently.
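To make the concurrency argument concrete, here is a rough sketch that spreads tables over the 32 banks and counts how many sequential lookup rounds one inference needs. The round-robin placement, the 47-table example, and the assumption that a bank serves one lookup at a time are all illustrative choices, not the allocation used in the paper.

```python
NUM_BANKS = 32  # from the slide: 32 HBM memory banks


def allocate_round_robin(num_tables, num_banks=NUM_BANKS):
    """Assign table IDs to banks in round-robin order (hypothetical policy)."""
    banks = [[] for _ in range(num_banks)]
    for table_id in range(num_tables):
        banks[table_id % num_banks].append(table_id)
    return banks


def lookup_rounds(banks):
    """Banks work in parallel, so the fullest bank sets the number of rounds."""
    return max(len(tables) for tables in banks)


banks = allocate_round_robin(num_tables=47)
rounds = lookup_rounds(banks)
print(rounds)
```

With at most 32 tables every lookup finishes in a single round; once some banks hold two tables, those banks need a second round, which is the cost the table-combination step tries to avoid.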
Reduce Memory Accesses by Cartesian Products
• Reduce the number of memory accesses by combining
tables.
• Each memory access can retrieve multiple
embedding vectors.
• Two embedding tables are joined into a single larger one.
• Instead of two separate lookup operations, only
one access is needed to retrieve the two vectors.
• Halving the number of memory accesses can lead to a
speedup of almost 2×.
• Combining tables also helps balance the workload across memory channels.
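The table combination can be sketched as follows. The function names and the toy values are assumptions made for illustration; the point is only that one row of the product table holds the concatenation of one row from each original table, so a single access returns both vectors.

```python
from itertools import product


def cartesian_table(table_a, table_b):
    """Build the product table: one row per (row of A, row of B) pair."""
    return [row_a + row_b for row_a, row_b in product(table_a, table_b)]


def combined_index(index_a, index_b, size_b):
    """Map the two original indices to the row index in the product table."""
    return index_a * size_b + index_b


table_a = [[1.0, 1.1], [2.0, 2.1]]  # 2 entries, vector length 2
table_b = [[7.0], [8.0]]            # 2 entries, vector length 1
table_ab = cartesian_table(table_a, table_b)

# One access replaces the two separate lookups table_a[1] and table_b[0].
row = table_ab[combined_index(1, 0, len(table_b))]
print(len(table_ab), row)
```

The product table has `len(table_a) * len(table_b)` rows, each as long as the two original vectors together, so storage grows multiplicatively. That is why only small tables are reasonable candidates for combination.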
A Rule-based Algorithm for Table Combination and Allocation
Heuristic rule-based search
Rule 1: Large tables are not considered for Cartesian products.
Rule 2: Cartesian products are applied only to pairs of tables.
Rule 3: Within the product candidates, the
smallest tables are paired with the largest.
Rule 4: The smallest tables are cached on chip.
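One possible reading of rules 1 to 3 is sketched below. It is not the paper's Algorithm 1: the function name, the table sizes, and the interpretation of rule 1 as "only the smallest tables become candidates" are assumptions, and the on-chip caching of rule 4 and the outer search over candidate counts are left out.

```python
def pair_candidates(table_sizes, num_candidates):
    """Pick the smallest tables as candidates, then pair smallest with largest.

    table_sizes maps a table name to its number of entries (hypothetical data).
    """
    # Rule 1 (as interpreted here): large tables never become candidates.
    candidates = sorted(table_sizes, key=table_sizes.get)[:num_candidates]
    pairs = []
    # Rules 2 and 3: combine tables two at a time, smallest with largest.
    while len(candidates) >= 2:
        smallest = candidates.pop(0)
        largest = candidates.pop()
        pairs.append((smallest, largest))
    return pairs  # with an odd count, the middle candidate stays unpaired


table_sizes = {"t0": 4, "t1": 10, "t2": 100, "t3": 1000, "t4": 5_000_000}
pairs = pair_candidates(table_sizes, num_candidates=4)
print(pairs)
```

Pairing the smallest candidate with the largest keeps the product tables at similar sizes, since each product has as many rows as the two sizes multiplied together.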
FPGA IMPLEMENTATION
Reduce Latency by Deeply Pipelined Dataflow
• The embedding lookup stage and three computation stages are pipelined. Each DNN computation
module is further divided into three pipeline stages.
• Latency concerns are eliminated: items are processed one by one, so there is no wait to fill a batch.
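As a rough illustration of why pipelining helps, the sketch below models a linear pipeline in which each stage handles one item at a time. The unit stage latencies and the function name are assumptions; the four stages stand in for the embedding lookup stage plus the three computation stages named on the slide.

```python
def completion_times(num_items, stage_latencies):
    """Completion time of each item in a linear pipeline.

    All items are ready at time 0; a stage starts an item once the item has
    left the previous stage and the stage itself is free. Buffering between
    stages is assumed to be unlimited.
    """
    stage_free_at = [0.0] * len(stage_latencies)
    done = []
    for _ in range(num_items):
        t = 0.0
        for stage, latency in enumerate(stage_latencies):
            start = max(t, stage_free_at[stage])
            t = start + latency
            stage_free_at[stage] = t
        done.append(t)
    return done


# Embedding lookup + three computation stages, one time unit each (assumed).
times = completion_times(num_items=3, stage_latencies=[1, 1, 1, 1])
print(times)
```

In this model the first item finishes after four time units and each later item finishes one unit after the previous one, whereas running the same items strictly one after another would take four units per item.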
DNN Computation Module
• The computation flow for a single FC layer has three stages: input feature broadcasting, general matrix-matrix
multiplication (GEMM) computation, and result gathering.
• A partial GEMM is allocated to each processing unit (PE) for better routing design and potentially
higher performance.
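The three stages can be sketched in plain Python for a single input vector, in which case the GEMM reduces to a matrix-vector product. Splitting the output rows across PEs in an interleaved fashion is an assumption made for illustration; the slides only state that each PE receives a partial GEMM.

```python
def fc_layer(weights, x, num_pes):
    """Compute weights @ x with the work split over num_pes processing units.

    weights is a list of output rows; x is the input vector.
    """
    # Stage 1, broadcast: every PE receives the full input vector x.
    # Stage 2, partial GEMM: each PE computes the dot products for its rows.
    partials = []
    for pe in range(num_pes):
        rows = weights[pe::num_pes]  # rows assigned to this PE (assumed split)
        partials.append([sum(w * v for w, v in zip(row, x)) for row in rows])
    # Stage 3, gather: put the partial outputs back in the original row order.
    out = [0.0] * len(weights)
    for pe, part in enumerate(partials):
        out[pe::num_pes] = part
    return out


weights = [[1, 0], [0, 1], [1, 1]]
result = fc_layer(weights, x=[2, 3], num_pes=2)
print(result)
```

The result does not depend on `num_pes`; only the amount of work per PE changes, which is what makes the split useful for parallel hardware.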
Thank you for listening

Some notes on 'MICROREC: EFFICIENT RECOMMENDATION INFERENCE BY HARDWARE AND DATA STRUCTURE SOLUTIONS', taken while reading the paper.


Editor's notes

  1. Today I would like to introduce the paper "MICROREC: EFFICIENT RECOMMENDATION INFERENCE BY HARDWARE AND DATA STRUCTURE SOLUTIONS".
  2. First, the introduction.
  3. This figure illustrates a classical deep recommendation model for Click-Through Rate (CTR) prediction. An input feature vector consists of dense features (e.g., age and gender) and sparse features (e.g., location and advertisement category). Over the dense feature vector, some systems apply a neural feature extractor that consists of multiple fully connected (FC) layers, while other designs do not contain these bottom FC layers. Over the sparse feature vector, the system translates each feature into a dense feature embedding by looking it up in an embedding table. These features are then combined (e.g., concatenated) and fed to a neural classification model consisting of multiple fully connected layers.
  4. First, embedding table lookups are costly because they induce massive random DRAM accesses on CPU servers. Production recommendation models usually consist of at least tens of embedding tables, and each inference requires a lookup in every one of them. Because each embedding vector is tiny, the resulting DRAM accesses are nearly random rather than sequential. Since CPU servers have only a few memory channels, these random DRAM accesses are expensive. Second, both embedding lookups and computation can be expensive if one resorts to ML frameworks such as TensorFlow and PyTorch. According to the authors' observations, on TensorFlow Serving, which is optimized for inference, the embedding layer involves 37 types of operators (e.g., concatenation and slice). These operators are invoked multiple times during inference, resulting in significant time consumption, especially with small batches.
  5. To address the challenges mentioned before, the method proposed in this paper draws on two sources. First, the authors employ a more suitable hardware architecture for recommendation, with (a) a hybrid memory system containing High Bandwidth Memory (HBM) for highly concurrent embedding lookups; and (b) a deeply pipelined dataflow on FPGA for low-latency neural network inference. Second, they revisit the data structures used for embedding tables to reduce the number of memory accesses. By applying Cartesian products to combine some of the tables, the number of DRAM accesses required to finish the lookups is significantly reduced.
  6. Then we come to some details of this method.
  7. This figure gives an overview of the hardware design of MicroRec. Embedding tables are distributed over both on-chip memory (BRAM and URAM) and off-chip memory (HBM and DDR). Neural network inference is handled by the DNN computation units, which contain both on-chip buffers storing the weights of the model and computation resources for fast inference. To conduct inference, the host server first streams dense and sparse features to the FPGA. Then, the embedding lookup unit translates the sparse features to dense vectors by looking up embedding tables in both on-chip and off-chip memory. Finally, the computation unit takes the concatenated dense vector as input and finishes inference before returning the predicted CTR to the host.
  8. The tens of embedding table lookup operations during inference can be parallelized when multiple memory channels are available. MicroRec relies on high-bandwidth memory as the main force supporting highly concurrent embedding lookups. As an attractive solution for high-performance systems, HBM offers improved concurrency and bandwidth compared to conventional DRAM. Specifically, the HBM here consists of 32 memory banks, which can be accessed concurrently through independent pseudo-channels. Thus, embedding tables can be distributed to these banks so that each bank contains only one or a few tables, and up to 32 tables can be looked up concurrently.
  9. We can reduce the number of memory accesses by combining tables so that each memory access retrieves multiple embedding vectors. As shown in this figure, two embedding tables can be joined into a single larger one through a relational Cartesian product. Since tables A and B each have two entries, the product table ends up with four entries, each a longer vector obtained by concatenating an entry from table A with one from table B. With this representation, the number of memory accesses is halved: instead of two separate lookup operations, only one access is needed to retrieve the two vectors. Halving the memory accesses can lead to a speedup of almost 2× and also helps balance the workload on each memory channel.
  10. Which tables should be combined? The authors use a heuristic search built on four rules. Algorithm 1 sketches the heuristic rule-based search for table combination and allocation. It starts by iterating over the number of tables selected as Cartesian product candidates. Within each iteration, the candidates are quickly combined by applying the first three heuristic rules. All tables are then allocated to memory banks efficiently by rule 4. The algorithm ends by returning the solution found that achieves the lowest embedding lookup latency. Including the outer loop that iterates over the number of Cartesian candidates, the total time complexity of the heuristic algorithm is as low as O(N²).
  11. This section describes the implementation of MicroRec on an FPGA, with an emphasis on its low inference latency.
  12. As shown in this figure, the authors apply a highly pipelined accelerator architecture in which multiple items are processed concurrently in different stages. In this design, the embedding lookup stage and three computation stages are pipelined. Each DNN computation module is further divided into three pipeline stages: feature broadcasting, computation, and result gathering. Latency concerns are eliminated by this highly pipelined design for two reasons. First, input items are processed item by item instead of batch by batch, so the time spent waiting to aggregate a batch of recommendation queries is removed. Second, the end-to-end inference latency of a single item is much lower than that of a large batch.
  13. This figure shows the computation flow for a single FC layer, which consists of three pipeline stages: input feature broadcasting, general matrix-matrix multiplication (GEMM) computation, and result gathering. A partial GEMM is allocated to each processing unit (PE) for better routing design and potentially higher performance (de Fine Licht et al., 2020). Each PE conducts its partial GEMM through parallelized multiplications followed by an add tree.