Project Slides for Website 2020-22.pptx

  1. Micro-architecture of 64-bit 4-wide Out-of-Order Processor core
     • RV64IMAFDCSU instruction set support
     • 9-stage pipeline (Fetch, Predecode, Checking, Decode, Rename, Dispatch, Issue, Execute and Retire)
     • Decode and issue up to 4 instructions per cycle
     • Out-of-order execution and in-order commit using a ROB
     • Six execution units, each with split functional units
     • Branch prediction using a BTB and a 2-bit direction predictor based on the G-Share algorithm; RAS for return address fetching
     • Register renaming to avoid false dependencies (64 integer and 48 floating-point registers), mapped via Register Alias Tables (RATs); branch misprediction recovery by restoring stored RAT states
     • 16 KiB 4-way set-associative VIPT I-Cache & PIPT D-Cache
     • Fully associative I-TLB & D-TLB with 3-level page table walk
     • Supervisor and user mode implementation
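As a rough illustration of the G-Share direction predictor named on this slide, the sketch below models a table of 2-bit saturating counters indexed by the branch PC XORed with a global history register. The table size, the initial counter value, and the PC shift are assumptions chosen for the example; the slide does not give the core's actual parameters.

```python
class GShare:
    """Toy G-Share direction predictor with 2-bit saturating counters.

    index_bits and the initial counter state are illustrative assumptions,
    not values taken from the slides.
    """

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global branch history register
        self.counters = [1] * (1 << index_bits)   # 1 = weakly not-taken

    def _index(self, pc):
        # Drop the always-zero low PC bit, then XOR with the global history.
        return ((pc >> 1) ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.counters[i]
        # Saturating increment on taken, saturating decrement on not-taken.
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = GShare(index_bits=4)
for _ in range(10):                 # train on a branch that is always taken
    predictor.update(0x1000, True)
print(predictor.predict(0x1000))
```

A BTB and RAS would sit alongside this structure to supply target addresses; they are left out here to keep the sketch short.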
  2. Linux-Build Process
     SoC architecture designed around the modified processor core
  3. SoC Design around Dual-Core Processor
     • 64 KiB internal SRAM
     • 16 KiB BootROM hardcoded with a Zeroth Stage BootLoader (ZSBL) that supports loading a binary image via Xmodem
     • Memory-mapped peripherals such as GPIO, UART and I2C using standard Xilinx IPs
     • Implemented on the HTG-K800 board using a Kintex UltraScale FPGA
  4. Results: Throughput and Benchmarking of Multi-Core Compatible Single core
     • A slight reduction in performance was observed compared to the previous design.
     • This performance reduction is expected due to the arbitration between the read and write requests from the processor.
  5. Results: Performance Improvement in Dual-Core Processor
     • The dual-core processor is 1.66 and 1.54 times faster than the single-core processor for the matrix multiplication and quicksort applications, respectively.
  6. RISC-V for AI Applications
     Challenges:
     • Support for the RVV ISA.
     • Hardware support for executing vector instructions.
     • Interface unit for data flow between the scalar and vector units.
     • Stall generation unit to maintain in-order issue.
     Approach:
     • A vector decoder unit is designed to support the ISA; it works in parallel with the scalar decoder.
     • A vector unit is designed to execute the vector operations.
     • A native-bus interface unit is designed to reduce latency.
     • For in-order issue, only the vector unit or the scalar core works at any given point in time.
     Requirements of Hardware:
     • A FIFO is required to store control signals from the vector decoder.
     • A read FSM controller reads control signals from the FIFO.
     • The RVV ISA has 32 vector registers, each of length VLEN.
     • VLEN = Number of Lanes x ELEN.
     • A vector register file is required to support the ISA.
     • Vector execution unit (ALU and LSU) for computation.
     • Vector memory unit for data transfer between registers and memory.
     • Vector write-back unit for writing back to the register file.
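To make the relation VLEN = Number of Lanes x ELEN concrete, the short sketch below evaluates it for the lane counts mentioned later in the deck and derives the size of a 32-register vector register file. ELEN = 32 bits is an assumption made for the example; the slides do not state the element width.

```python
NUM_VECTOR_REGISTERS = 32  # fixed by the RVV ISA


def vlen_bits(lanes, elen_bits=32):
    """VLEN = number of lanes x ELEN (elen_bits=32 is an assumed value)."""
    return lanes * elen_bits


def vrf_bytes(lanes, elen_bits=32):
    """Total vector register file size in bytes for 32 registers of VLEN bits."""
    return NUM_VECTOR_REGISTERS * vlen_bits(lanes, elen_bits) // 8


for lanes in (4, 8, 16, 32):
    print(lanes, "lanes:", vlen_bits(lanes), "bit VLEN,", vrf_bytes(lanes), "byte VRF")
```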
  7. Single Lane Vector Unit
     • Cycle 1: Decode signals are read from the FIFO.
     • Cycle 2: Read indexes are generated.
     • Cycle 3 (RD stage): Data is read from the VRF.
     • Cycle 4 (EX stage): Address generation and operations are performed.
     • Cycle 5 (MEM stage): Used for memory-type instructions.
     • Cycle 6 (WB stage): Data is written back into the VRF.
  8. Design of Reduction Unit
     Challenges:
     • In most instructions, the elements in a single register are independent of each other.
     • For reduction instructions, the elements depend on each other.
     • Example: sum of the elements in an array.
     Solution:
     • Elements are passed through reduction logic to reduce them to one output.
     • Latency is reduced if multiple units are used. Three clock cycles are consumed.
     • Stalling is done in the vector unit.
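One common way to reduce dependent elements to a single output is a pairwise (tree) reduction, where each level halves the number of partial results. The sketch below is a software model of that idea only; the slide does not describe the internal structure of the actual reduction logic, so the tree shape and the function name `tree_reduce` are assumptions.

```python
def tree_reduce(elements, op=lambda a, b: a + b):
    """Reduce a non-empty sequence pairwise; return (result, number of levels)."""
    level = list(elements)
    if not level:
        raise ValueError("tree_reduce needs at least one element")
    levels = 0
    while len(level) > 1:
        # Combine neighbouring pairs; every pair in a level is independent,
        # so hardware could evaluate one level per step.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # odd element passes through unchanged
        level = nxt
        levels += 1
    return level[0], levels


total, levels = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8])
print(total, levels)
```

The same function works for other associative operations, for example `tree_reduce(values, max)`.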
  9. Integration for Vector Unit
  10. Implementation Results

      Timing Analysis
      Processor | Worst Negative Slack | Worst Hold Slack
      4-Lane    | 0.13 ns              | 0.044 ns
      8-Lane    | 0.021 ns             | 0.022 ns
      16-Lane   | 0.031 ns             | 0.01 ns
      32-Lane   | 0.02 ns              | 0.027 ns

      Resource Utilization
      [Figure]

      Power Estimation
      On-chip Power | Static Power | Dynamic Power | Total Power
      8-Lane        | 270 mW       | 429 mW        | 699 mW
      Vectored power estimation using Xilinx Power Analyzer while running a CNN on the CPU.

      Parameter   | Specification
      CPU         | Single-core, single-issue, in-order, 5-stage pipeline
      Frequency   | 50 MHz
      Memory      | I-Cache: 8 KB, 2-way set associative; D-Cache: 8 KB, 2-way set associative; Main Memory: 1 MB
      Peripheral  | UART
      ISA Support | RV32gV
      Accelerator | 4, 8, 16, 32 Lane Vector Unit; Vector Memory: 256 KB/512 KB scratchpad memory
      Implemented on a Xilinx Virtex-7 FPGA (xc7vx485tffg1761-2).

      [Block diagram labels: RISC-V Vector CPU, Main Memory, Peripherals, Core, Scalar Pipeline, I-Cache, D-Cache, TLB, ALU, FPU, Vector Pipeline, VRF, Execution Unit, Vector Memory]
  11. Comparison with other Data-parallel Processors

      Processor            | CONV8_3X3 | CONV16_3X3 | CONV32_3X3 | CONV32_5X5 | MATMUL64X64
      8-Lane RISC-V Vector | 2.1 us    | 9.7 us     | 43.16 us   | 115.59 us  | 1.742 ms
      Klessydra            | 9.91 us   | 21.18 us   | 59.54 us   | 113 us     | 2.741 ms
      CE32V40              | 46.46 us  | 165.07 us  | 252.42 us  | 1969.37 us | 14.88 ms
      ZeroRI5Cy            | 69.2 us   | 623.85 us  | 970.93 us  | 2720.99 us | 2.53 ms

      • The RISC-V Vector CPU performs better than the Klessydra vector processor, but for convolution with a 5x5 filter its performance is in the range of Klessydra.
      • The speedup is because of more packaging to utilize the vector unit completely.
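The comparison with Klessydra can be made quantitative by dividing the Klessydra execution time by the 8-lane RISC-V Vector time for each kernel. The sketch below does only that arithmetic, using the table values converted to microseconds; the dictionary and variable names are chosen for this example.

```python
# Execution times in microseconds, copied from the table on this slide.
RISCV_8_LANE_US = {
    "CONV8_3X3": 2.1, "CONV16_3X3": 9.7, "CONV32_3X3": 43.16,
    "CONV32_5X5": 115.59, "MATMUL64X64": 1742.0,
}
KLESSYDRA_US = {
    "CONV8_3X3": 9.91, "CONV16_3X3": 21.18, "CONV32_3X3": 59.54,
    "CONV32_5X5": 113.0, "MATMUL64X64": 2741.0,
}

# A ratio above 1 means the 8-lane RISC-V Vector CPU is faster than Klessydra.
speedups = {k: KLESSYDRA_US[k] / RISCV_8_LANE_US[k] for k in RISCV_8_LANE_US}

for kernel, ratio in speedups.items():
    print(f"{kernel}: {ratio:.2f}x")
```

The ratio falls slightly below 1 only for CONV32_5X5, which matches the slide's remark that the 5x5 convolution is in the range of Klessydra.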
  12. Hardware Accelerator for Short Read Alignment
  13. Hardware Accelerator for Short Read Alignment
      • 512 elements are provided as input to the sorter.
  14. Hardware Accelerator for Short Read Alignment
      • Performance comparison for a number of test cases implemented on the 4-parallel dual-rate merge tree is given in the table below.
      • The proposed hardware achieves around 2.5 times the sorting performance of an Intel Core i7-10700 CPU operating at 2.90 GHz.

      Number of Input Elements | Hardware Execution Time (μs) | Software Execution Time (μs)
      512                      | 5                            | 13-14
      4,096                    | 54.7                         | 134-137
      32,768                   | 558.6                        | 1290-1320

      • The functionality of the traditional merge tree and the dual-rate merge tree has been tested on hardware.
      • Parallelization of the merge tree is implemented and has been tested on hardware.
      • Performance comparison of the traditional merge tree, dual-rate merge tree and 4-parallel merge tree for 512 random input elements:

      Traditional merge tree | Dual-rate merge tree | 4-parallel dual-rate merge tree
      14.28 μs               | 9.1 μs               | 5 μs
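For readers unfamiliar with merge trees, the sketch below is a plain software model of a traditional merge tree: single-element runs enter at the leaves and each level merges neighbouring sorted runs until one run remains. It does not model the dual-rate or 4-parallel variants, whose details the slides do not give, and the function names are chosen for this example.

```python
import random


def merge_two(a, b):
    """Merge two sorted lists into one sorted list (one merger node)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def merge_tree_sort(values):
    """Sort by repeatedly merging neighbouring runs, one tree level at a time."""
    runs = [[v] for v in values]          # leaves: single-element sorted runs
    while len(runs) > 1:
        runs = [
            merge_two(runs[i], runs[i + 1]) if i + 1 < len(runs) else runs[i]
            for i in range(0, len(runs), 2)
        ]
    return runs[0] if runs else []


random.seed(0)
data = [random.randrange(1 << 16) for _ in range(512)]   # 512 random inputs
result = merge_tree_sort(data)
print(result == sorted(data))
```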
  15. Remote Lab Control and Monitoring Platform By: Animesh Jain Ch Kalyan Kumar Prusty Guided by: Kuruvilla Varghese Prof. L Umanand Haresh Dagale
  16. Remote Lab Control and Monitoring Platform
      [Block diagram labels: USER SIDE, REMOTE LAB, Zybo-Z7, PS SECTION, PL SECTION, AXI, Post Data, DUT]
  17. User Interface
  18. Digital output (DUT's digital input configuration). Programming FPGA Digital input (DUT's digital output configuration).
  19. 1. Write HDL code.
      2. Generate the output file (.bit) for the project.
      3. Send the output .bit file and control bits.
      4. Program (or) control the DUT using the output file.
      5. Sample the output generated by the DUT with the logic analyzer.
      6. Send the sampled output to the user.
      7. Display the waveform on the user system.
      [Diagram labels: User System, Internet, DIGITAL DUT]
  20. Resource Efficient Implementation of Ternary Content Addressable Memories Presentation by: Madhu Ennampelli -17994 Microelectronics and VLSI Design Project Guide: Kuruvilla Varghese Masters Project Presentation
  21. TCAM Block Diagram
      • All N input sub-words of size w bits are applied simultaneously to every layer.
      • From the layer outputs, a priority encoder selects the highest-priority matched address.
  22. Pipelined layer architecture
      • For this design, the data path was divided appropriately to introduce pipelining.
      • Pipelining increases the design's operating frequency.
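The lookup described on the block-diagram slide can be modelled in software as follows: the search key is split into sub-words of w bits, each sub-word indexes a precomputed table that returns a bit vector of the rules it matches, the bit vectors are ANDed, and a priority encoder picks one matching rule. This sketch models a single layer, assumes the lowest rule index has the highest priority, and represents ternary rules as (value, mask) pairs; all of these are assumptions, since the slides do not spell them out.

```python
def build_tables(rules, n_sub, w):
    """Precompute, per sub-word position, which rules each w-bit value matches.

    rules: list of (value, mask) integers; a mask bit of 1 means "care".
    Returns tables[s][sub_word] -> bit vector (bit i set if rule i matches).
    """
    sub_mask = (1 << w) - 1
    tables = []
    for s in range(n_sub):
        shift = s * w
        table = []
        for sub in range(1 << w):
            bits = 0
            for idx, (value, mask) in enumerate(rules):
                v = (value >> shift) & sub_mask
                m = (mask >> shift) & sub_mask
                if (sub ^ v) & m == 0:        # equal on every cared-about bit
                    bits |= 1 << idx
            table.append(bits)
        tables.append(table)
    return tables


def search(tables, key, w):
    """Return the index of the highest-priority matching rule, or None."""
    sub_mask = (1 << w) - 1
    match = -1                                 # all ones before ANDing
    for s, table in enumerate(tables):
        match &= table[(key >> (s * w)) & sub_mask]
    if match == 0:
        return None
    # Priority encoder: lowest set bit wins (assumed priority order).
    return (match & -match).bit_length() - 1


rules = [(0xABCD, 0xFFFF), (0xAB00, 0xFF00), (0x0000, 0x0000)]
tables = build_tables(rules, n_sub=2, w=8)
print(search(tables, 0xABCD, 8), search(tables, 0xAB12, 8), search(tables, 0x1234, 8))
```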
  23. FPGA Implementation Results
      ▪ The resource-efficient TCAM is implemented with a size of 64 x 32, with L = 4 and N = 4.
      ▪ Design parameters improved across the board compared to existing designs: speed by 10.52 %, power by 7.62 %, and resource utilization by 50 %.

      Implementation | BRAMs (18K) | FFs | LUTs | Speed (MHz) | Power (mW)
      Z-TCAM         | 32          | 198 | 447  | 190         | 35.69
      Efficient TCAM | 16          | 164 | 233  | 210         | 33.00
  24. ASIC Implementation Results
      • The ASIC design for the same architecture was implemented using Cadence tools with the GPDK library on the 45-nm CMOS technology node.
      • In the ASIC implementation, design parameters improved across the board compared to existing designs: speed by 121.10 %, power by 70.12 %, and area reduced by a factor of 18.7.

      Implementation | Area (µm²) | Speed (MHz) | Power (W)
      Z-TCAM         | 18707925   | 226         | 0.376
      Efficient TCAM | 1000000    | 493.82      | 0.111

Editor's notes

  1. The PL part of the board has an ADC module but does not contain a DAC, so the ADC should be inside the FPGA and the DAC outside it.