Information Classification: General
December 8-10 | Virtual Event
Klessydra-T: Designing Vector Coprocessors for
Multi-Threaded Edge-Computing Cores
Mauro Olivieri
Professor
Sapienza University of Rome
#RISCVSUMMIT
DIGITAL SYSTEM LAB @ SAPIENZA UNIVERSITY OF ROME
Francesco Lannutti, collaborator @Synopsys
Marcello Barbirotta, PhD candidate
Mauro Olivieri, Associate Professor
Francesco Menichelli, Assistant Professor
Antonio Mastrandrea, Research Fellow
Abdallah Cheikh, Research Fellow
Luigi Blasi, PhD cand. @DSI GmbH
Francesco Vigli, PhD cand. @ELT Spa
Stefano Sordillo, PhD candidate
OUTLINE
INTRODUCTION & MOTIVATION
THE KLESSYDRA-T ARCHITECTURE
• Interleaved Multi-Threading baseline
• Parameterized vector acceleration schemes
• Klessydra vector intrinsic functions
BENCHMARK WORKLOADS
• Convolution, Matmul, FFT
• Homogeneous and composite workload
RESULTS
• Cycle count and absolute execution time
• Maximum clock frequency and hardware resource utilization
• Energy efficiency
CONCLUSIONS
APPLICATION CONTEXT AND MOTIVATION
▪ There are recognized drives towards (extreme) edge computing: availability, energy saving, security, etc., with implications for both SW design and HW design
▪ HW design challenges of extreme edge computing devices:
• Local energy budget
• Cost & size
• Computing power
▪ General setting:
• Possibly taking advantage of inherently multi-threaded application routines
• Inevitability of hardware acceleration support
THE PULPINO-COMPATIBLE KLESSYDRA CORE FAMILY
▪ PULPino featuring Klessydra S0 core
• Starting point
• M mode v1.10
• RV32I user ISA
• single hart
▪ PULPino featuring Klessydra T0 cores
• M mode v1.10
• RV32I user ISA
• Atomic ext. (partial)
• multiple PC & CSR
• multiple interleaved harts
▪ PULPino featuring Klessydra F0 cores
• “space-qualified” core
• T0 microarchitecture
• + configurable HW/SW fault-tolerance support
▪ PULPino featuring Klessydra T1 cores
• “edge computing” core
• extends T0 microarchitecture
• RV32IM
• + configurable multiple scratchpad memories
• + configurable vector unit
• extended ISA
THE KLESSYDRA IMT MICROARCHITECTURE
▪ Baseline Klessydra T03 core features:
• Thread context switch at each clock cycle
• In-order, single-issue instruction execution
• Feed-forward pipeline structure (no hardware support for pipeline hazard handling)
• Bare-metal execution (RISC-V M mode)
▪ The vector-accelerated Klessydra-T13 core has been designed as a superset of the basic Klessydra-T03 microarchitecture.
[Figure: Klessydra T03 block diagram. Labels: Fetch, Decode, Execute, WB; Regfile; PC and CSR; harc Updater; Debug; Prg Mem (program memory); Data Mem (data memory); harts a, b, c.]
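The per-cycle thread switch can be pictured with a small C model. This is only a sketch under stated assumptions: three harts, a strict round-robin hart counter standing in for the "harc" updater shown in the diagram, fixed 4-byte instructions and no branches. The names `imt_core_t` and `imt_step` are hypothetical and are not taken from the Klessydra sources.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_HARTS 3  /* assumption: three interleaved harts, as drawn for the T03 */

/* Hypothetical model state: one program counter per hart plus a hart counter. */
typedef struct {
    uint32_t pc[NUM_HARTS];
    unsigned harc;  /* index of the hart that fetches in the current cycle */
} imt_core_t;

/* Simulate one clock cycle: the selected hart fetches one instruction,
 * then the hart counter advances so the next cycle belongs to another hart.
 * Returns the address fetched in this cycle. */
static uint32_t imt_step(imt_core_t *core)
{
    assert(core->harc < NUM_HARTS);
    uint32_t fetch_addr = core->pc[core->harc];
    core->pc[core->harc] += 4;                  /* assumption: 32-bit instructions, no branches */
    core->harc = (core->harc + 1) % NUM_HARTS;  /* assumption: strict round-robin order */
    return fetch_addr;
}
```

In this model two consecutive instructions of the same hart are always `NUM_HARTS` cycles apart, which is the kind of spacing that can let an interleaved pipeline do without hazard-handling logic.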
THE KLESSYDRA-T1 MICROARCHITECTURE FAMILY
[Figure: Klessydra T13 block diagrams. The T03 pipeline (Fetch, Decode, Execute, WB, Regfile, PC, CSR, harc Updater, Debug, Prg Mem, Data Mem, harts a, b, c) is extended in the execution stage with EXEC, LSU, MFU and SPMI blocks. MFU labels: DSP Initialization, Control / Mapping, Input Mapping, Add, Sub, Shft, Mul, Accum, Relu, Output Mapping, Accl Init, Accl Exec. SPMI labels: Bank Intrlv, Data reorder, Bank0 to BankN, SPM0 to SPMN-1. Signals: MAU_req, MAU_busy. Replication factors: x F, x D, x N.]
Klessydra T13 core features
▪ multiple units in the execution stage
• scalar execution unit (EXEC)
• vector-oriented multi-purpose functional unit (MFU) with Scratchpad Memory support
• Load/Store unit (LSU)
▪ possible concurrent execution of instructions of different types
HARDWARE ACCELERATION PARAMETRIC SCHEMES
The parametric coprocessor architecture in T13 cores, comprising the MFU and the SPMIs, can be configured at synthesis time through the following parameters:
• the number of parallel lanes D in the MFU, which defines the DLP degree and also corresponds to the number of SPM banks in each SPMI block
• the number of MFUs F
• the SPM bank capacity B
• the number of SPMs N
• the number of SPMIs M
• the sharing scheme of MFUs and SPMIs among the harts, i.e. heterogeneous or symmetric
▪ M=1, F=1, D=1: SISD
▪ M=1, F=1, D=2,4,8: Pure SIMD
▪ M=3, F=3, D=1: Symmetric MIMD
▪ M=3, F=3, D=2,4,8: Symmetric MIMD + SIMD
▪ M=3, F=1, D=1: Heterogeneous MIMD
▪ M=3, F=1, D=2,4,8: Heterogeneous MIMD + SIMD
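The mapping from the (M, F, D) parameters to the six named schemes can be written down as a small lookup. The sketch below is illustrative only: the type `coproc_cfg_t`, the enum and the function `coproc_scheme` are hypothetical names, and any combination that the slide does not list is reported as unlisted rather than guessed.

```c
#include <assert.h>

/* Synthesis-time parameters, named as on the slide. */
typedef struct {
    unsigned M;  /* number of SPMIs */
    unsigned F;  /* number of MFUs */
    unsigned D;  /* parallel lanes per MFU (DLP degree) */
} coproc_cfg_t;

typedef enum {
    SCHEME_SISD,
    SCHEME_PURE_SIMD,
    SCHEME_SYM_MIMD,
    SCHEME_SYM_MIMD_SIMD,
    SCHEME_HET_MIMD,
    SCHEME_HET_MIMD_SIMD,
    SCHEME_UNLISTED  /* combination not enumerated on the slide */
} coproc_scheme_t;

/* Classify a parameter triple according to the list of schemes. */
static coproc_scheme_t coproc_scheme(coproc_cfg_t cfg)
{
    assert(cfg.M >= 1 && cfg.F >= 1 && cfg.D >= 1);  /* at least one of each unit */
    int simd = (cfg.D == 2 || cfg.D == 4 || cfg.D == 8);
    if (cfg.D != 1 && !simd)
        return SCHEME_UNLISTED;
    if (cfg.M == 1 && cfg.F == 1)
        return simd ? SCHEME_PURE_SIMD : SCHEME_SISD;
    if (cfg.M == 3 && cfg.F == 3)
        return simd ? SCHEME_SYM_MIMD_SIMD : SCHEME_SYM_MIMD;
    if (cfg.M == 3 && cfg.F == 1)
        return simd ? SCHEME_HET_MIMD_SIMD : SCHEME_HET_MIMD;
    return SCHEME_UNLISTED;
}
```

For example, `coproc_scheme((coproc_cfg_t){3, 1, 4})` returns `SCHEME_HET_MIMD_SIMD`: three SPMIs share a single MFU with four lanes.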
KLESSYDRA VECTOR EXTENSION AND INTRINSIC FUNCTIONS
Assembly syntax               Short description
((r) denotes memory addressing via register r)
kmemld (rd),(rs1),(rs2)       load vector into scratchpad region
kmemstr (rd),(rs1),(rs2)      store vector into main memory
kaddv (rd),(rs1),(rs2)        add vectors in scratchpad region
ksubv (rd),(rs1),(rs2)        subtract vectors in scratchpad region
kvmul (rd),(rs1),(rs2)        multiply vectors in scratchpad region
kvred (rd),(rs1),(rs2)        reduce vector by addition
kdotp (rd),(rs1),(rs2)        vector dot product into register
ksvaddsc (rd),(rs1),(rs2)     add vector + scalar into scratchpad
ksvaddrf (rd),(rs1),rs2       add vector + scalar into register
ksvmulsc (rd),(rs1),(rs2)     multiply vector by scalar into scratchpad
ksvmulrf (rd),(rs1),rs2       multiply vector by scalar into register
kdotpps (rd),(rs1),(rs2)      vector dot product and post scaling
ksrlv (rd),(rs1),rs2          vector logical shift within scratchpad
ksrav (rd),(rs1),rs2          vector arithmetic shift within scratchpad
krelu (rd),(rs1)              vector ReLU within scratchpad
kvslt (rd),(rs1),(rs2)        compare vectors and create mask vector
ksvslt (rd),(rs1),rs2         compare vector-scalar and create mask
kvcp (rd),(rs1)               copy vector within scratchpad region
The instructions supported by the coprocessor sub-system are exposed to the programmer in the form of very simple intrinsic functions, fully integrated in the RISC-V gcc compiler toolchain.
CSR_MVSIZE(Row_size); // set vector length
for (i = Zeropad_offset; i < Row_size - Zeropad_offset; i++) { // scan the Output Matrix rows
    k_element = 0;
    for (FM_row_pointer = -Zeropad_offset; FM_row_pointer <= Zeropad_offset; FM_row_pointer++) {
        for (column_offset = 0; column_offset < kernel_size; column_offset++) {
            FM_offset = (i + FM_row_pointer) * Row_size + column_offset; // set pointer in SPM space
            ksvmulsc(SPM_D, (SPM_A + FM_offset), (SPM_B + k_element++)); // temporary vector result
            ksrav(SPM_D, SPM_D, scaling_factor); // scaling for fixed point alignment
            OM_offset = (Row_size * i) + Zeropad_offset; // set pointer in SPM space
            kaddv((SPM_C + OM_offset), (SPM_C + OM_offset), SPM_D); // update Output Matrix row
        }
    }
}
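To reason about the excerpt without the coprocessor, the three intrinsics it uses can be mimicked with plain scalar loops. The functions below are hypothetical reference models, not the Klessydra library: they assume element-wise operation on 32-bit integers, take the vector length as an explicit argument `n` in place of the MVSIZE setting, and assume that products fit in 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed model of ksvmulsc: dst[i] = src[i] * (*scalar).
 * The scalar is read through a pointer, mirroring the
 * (SPM_B + k_element) operand in the excerpt. */
static void ref_ksvmulsc(int32_t *dst, const int32_t *src,
                         const int32_t *scalar, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * *scalar;  /* assumption: no 32-bit overflow */
}

/* Assumed model of ksrav: arithmetic (sign-preserving) right shift.
 * Negative values are shifted via their complement so the result does
 * not depend on how the compiler shifts negative integers. */
static void ref_ksrav(int32_t *dst, const int32_t *src, unsigned shift, size_t n)
{
    assert(shift < 32);
    for (size_t i = 0; i < n; i++) {
        int32_t x = src[i];
        dst[i] = (x < 0) ? ~(~x >> shift) : (x >> shift);
    }
}

/* Assumed model of kaddv: dst[i] = a[i] + b[i]; dst may alias a or b. */
static void ref_kaddv(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Calling the three models in the order used by the inner loop (multiply by one kernel coefficient, rescale, accumulate) reproduces one step of the fixed-point multiply-accumulate that the excerpt applies to a whole matrix row at a time.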
BENCHMARK WORKLOADS AND EVALUATION SETUP
▪ 2D convolution
• 32-bit data elements in fixed-point representation
• 3x3 filter size
• matrix sizes of 4x4, 8x8, 16x16, and 32x32 elements
• additional analysis of larger than 3x3 filter sizes on 32x32 matrices
▪ FFT
• 256 complex samples
▪ Matmul
• Square matrices of 64x64 elements
▪ Homogeneous workload (3 harts running same program)
▪ Composite workload (3 harts running different programs)
ANALYZED PERFORMANCE FIGURES ON FPGA SOFT-CORE IMPLEMENTATION
• Average total cycle count per hart
• Maximum clock frequency
• Absolute execution time
• Hardware Resource Utilization
• Average energy per algorithmic operation
SUMMARY OF PERFORMANCE RESULTS
▪ 3X cycle count speed-up relative to an RV32IM IMT core without acceleration (Klessydra T03)
▪ 2X cycle count speed-up when compared to the single-threaded, DSP-extended RI5CY core

MICROARCHITECTURE SYNTHESIS RESULTS (FPGA element utilization and maximum frequency)
Core / configuration      DLP  FF     LUT    B-RAM  DSP  LUT-RAM  Max freq (MHz)
T13 SISD                  1    2488   6982   6      11   264      144.4
T13 SIMD                  2    2627   8400   6      15   264      146.0
T13 SIMD                  4    3301   11366  6      23   264      137.2
T13 SIMD                  8    4800   17331  12     39   264      137.7
T13 Sym. MIMD             1    3512   10458  18     19   264      148.2
T13 Sym. MIMD + SIMD      2    4712   15943  18     31   264      131.7
T13 Sym. MIMD + SIMD      4    6753   25089  18     55   264      120.0
T13 Sym. MIMD + SIMD      8    10854  43419  36     103  264      105.1
T13 Het. MIMD             1    3012   10182  18     11   264      117.2
T13 Het. MIMD + SIMD      2    3871   15577  18     15   264      128.9
T13 Het. MIMD + SIMD      4    5015   23282  18     23   264      122.0
T13 Het. MIMD + SIMD      8    7325   42944  36     39   264      108.6
Klessydra T03             -    1418   4281   0      7    176      221.1
RI5CY                     -    2527   7674   0      6    0        91.4
ZeroRiscy                 -    1933   5275   0      1    0        117.2

AVERAGE CYCLE COUNT PER COMPUTATION KERNEL
                               Homogeneous workload                                   Composite workload
Core / configuration      DLP  Conv 4x4  Conv 8x8  Conv 16x16  Conv 32x32  FFT 256  MatMul 64x64  Conv 32x32  FFT 256  MatMul 64x64
T13 SISD                  1    1105      3060      9727        34201       33033    728187        66043       80874    476771
T13 SIMD                  2    895       2245      6261        20374       25647    602458        21976       60019    645705
T13 SIMD                  4    824       1768      4607        13444       22812    543164        16850       29144    431773
T13 SIMD                  8    824       1613      3692        10069       21555    484436        11324       22482    414420
T13 Sym. MIMD             1    626       1493      3887        13536       18726    462066        20953       17824    292564
T13 Sym. MIMD + SIMD      2    629       1190      3123        8681        16827    378748        16144       15839    222370
T13 Sym. MIMD + SIMD      4    560       1190      2543        7148        15993    328962        15868       14942    182580
T13 Sym. MIMD + SIMD      8    560       1152      2543        6006        15726    316270        15581       14613    168031
T13 Het. MIMD             1    663       1521      4153        13565       22839    556463        27155       37111    265567
T13 Het. MIMD + SIMD      2    638       1274      3280        9167        18468    425978        15973       24611    251201
T13 Het. MIMD + SIMD      4    573       1213      2688        7473        16887    360863        16042       19175    181290
T13 Het. MIMD + SIMD      8    573       1079      2580        6285        17604    328178        13921       17298    187877
Klessydra T03             -    1819      5737      20714       79230       47256    2679304       138959      46733    2775779
RI5CY                     -    1377      4247      15088       57020       37344    1360854       81534       37350    1369572
ZeroRiscy                 -    2510      8111      29583       113793      61158    4006241       197010      61163    4043376
• The clock speed exhibited the sharpest drops as the DLP grew larger.
• In the symmetric MIMD scheme, the large HW overhead forced FPGA slices on the same critical path to be placed far from each other, thus increasing interconnect delay.
• Pipelining the heterogeneous MIMD crossbar to reduce the critical path introduces additional HW overhead, compromising the area advantage.
• Small matrix convolutions and FFT on the accelerated core reached up to 2X cycle count reduction over the single-threaded, DSP-extended RI5CY core.
• Large matrix convolutions and MatMul benefit from vector acceleration, reaching a 9X cycle count reduction relative to RI5CY.
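The cycle-count reductions quoted above can be checked directly against the homogeneous-workload cycle counts in the results table. The helper below is a minimal sketch; the function name `cycle_speedup` and the constant names are hypothetical, while the numbers are copied from the RI5CY row and from the T13 symmetric MIMD + SIMD row with DLP 8.

```c
#include <assert.h>

/* Homogeneous-workload cycle counts copied from the results table. */
enum {
    RI5CY_CONV_4X4       = 1377,
    RI5CY_CONV_32X32     = 57020,
    T13_SYM8_CONV_4X4    = 560,   /* T13 Sym. MIMD + SIMD, DLP 8 */
    T13_SYM8_CONV_32X32  = 6006   /* T13 Sym. MIMD + SIMD, DLP 8 */
};

/* Cycle-count speed-up: how many times fewer cycles the accelerated core needs. */
static double cycle_speedup(unsigned long baseline_cycles, unsigned long accel_cycles)
{
    assert(accel_cycles > 0);
    return (double)baseline_cycles / (double)accel_cycles;
}
```

With these values, `cycle_speedup(RI5CY_CONV_4X4, T13_SYM8_CONV_4X4)` is roughly 2.5 and `cycle_speedup(RI5CY_CONV_32X32, T13_SYM8_CONV_32X32)` is roughly 9.5, in line with the 2X and 9X figures.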
ABSOLUTE EXECUTION TIME SPEED-UP
• Assuming maximum clock frequency for each core
• ZeroRiscy core taken as common reference
• In pure SIMD configurations, the speed-up grows linearly with the DLP
• Going from SISD/SIMD to MIMD+SIMD improved the speed-up in all cases, despite the frequency drop associated with the MIMD hardware.
• The symmetric MIMD+SIMD schemes exhibit up to 17X speed-up over ZeroRiscy for Convolution 32x32 and up to 13X speed-up for the composite workload.
• Heterogeneous MIMD configurations maintain an almost perfect overlap with the symmetric MIMD.
• The non-accelerated Klessydra-T03 exhibits an absolute performance gain over RI5CY and ZeroRiscy
[Figure: bar chart of absolute execution time speed-up (vertical axis 0 to 18). Horizontal axis, core configurations: SISD DLP 1; pure SIMD DLP 2, 4, 8; Sym. MIMD DLP 1; Sym. MIMD+SIMD DLP 2, 4, 8; Het. MIMD DLP 1; Het. MIMD+SIMD DLP 2, 4, 8; Klessydra T03 (no accel.); RI5CY (DSP extension); ZeroRiscy (no accel.). Series: Conv.2D 4x4, Conv.2D 8x8, Conv.2D 16x16, Conv.2D 32x32, FFT 256, MatMul 64x64, Composite.]
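Absolute execution time combines the cycle count with the maximum clock frequency of each core, so a configuration with a lower clock can still come out ahead. A minimal sketch of that calculation follows; the function names `exec_time_us` and `time_speedup` are hypothetical, and the inputs in the usage note are taken from the synthesis and cycle-count tables.

```c
#include <assert.h>

/* Execution time in microseconds: cycles divided by clock frequency in MHz. */
static double exec_time_us(unsigned long cycles, double fmax_mhz)
{
    assert(fmax_mhz > 0.0);
    return (double)cycles / fmax_mhz;
}

/* Absolute speed-up of a core over the reference core,
 * each running at its own maximum frequency. */
static double time_speedup(unsigned long ref_cycles, double ref_mhz,
                           unsigned long cycles, double mhz)
{
    return exec_time_us(ref_cycles, ref_mhz) / exec_time_us(cycles, mhz);
}
```

For the homogeneous Convolution 32x32 kernel, ZeroRiscy needs 113793 cycles at 117.2 MHz and the T13 symmetric MIMD + SIMD core with DLP 8 needs 6006 cycles at 105.1 MHz, so `time_speedup(113793, 117.2, 6006, 105.1)` evaluates to about 17, matching the figure quoted for that configuration.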
ENERGY EFFICIENCY
• The result of this analysis is expressed as energy per algorithmic operation, for the FPGA soft-core implementations, normalized to ZeroRiscy, taken as reference.
• The most energy-efficient designs proved to be the T13 symmetric MIMD configurations
• The heterogeneous MIMD approach exhibited an almost complete overlap in energy consumption with the symmetric MIMD
• The pure SIMD schemes resulted in a larger energy consumption than other schemes, due to the impossibility of efficiently exploiting TLP.
[Figure: bar chart of energy per algorithmic operation normalized to ZeroRiscy (vertical axis 0 to 1.3). Horizontal axis, core configurations: SISD DLP 1; pure SIMD DLP 2, 4, 8; Sym. MIMD DLP 1; Sym. MIMD+SIMD DLP 2, 4, 8; Het. MIMD DLP 1; Het. MIMD+SIMD DLP 2, 4, 8; Klessydra T03 (no accel.); RI5CY (DSP extension); ZeroRiscy (no accel.). Series: Conv.2D 4x4, Conv.2D 8x8, Conv.2D 16x16, Conv.2D 32x32, FFT 256, MatMul 64x64, Composite.]
LARGER CONVOLUTION FILTERS
(Cyc = cycle count x1000, T = execution time in us, E = energy in uJ)
                      Filter (5x5)          Filter (7x7)          Filter (9x9)           Filter (11x11)
Core             DLP  Cyc    T     E        Cyc    T     E        Cyc     T     E        Cyc     T      E
T13 SIMD         2    52.7   362   50.6     101.2  694   97.1     165.8   1136  159.1    246.5   1689   236.6
T13 SIMD         8    24.6   179   34.4     46.1   335   64.5     74.7    543   104.7    110.6   803    154.8
T13 Sym MIMD     2    19.5   148   26.9     35.8   272   49.4     57.4    436   79.2     84.4    641    116.5
T13 Sym MIMD     8    11.8   113   28.9     19.2   183   46.9     29.8    284   72.7     42.9    408    104.7
T13 Het MIMD     2    20.5   159   28.3     37.5   291   51.8     60.2    467   83.1     88.5    687    122.1
T03 (no accel.)  -    247    1120  215.5    514.8  2328  447.9    881.2   3985  766.6    1369.1  6191   1191.1
RI5CY            -    180    1971  252.0    385.3  4218  539.4    662.5   7252  927.5    1000.2  10949  1400.3
ZeroRiscy        -    318.9  2721  226.4    674.5  5754  478.9    1129.7  9637  802.1    1697.8  14482  1205.4
• The matrix being convolved is 32x32 elements
• The speed-up and energy efficiency trends continue as the filter dimensions grow, reaching 35X speed-up over the ZeroRiscy reference
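The speed-up and energy trend for larger filters can be recomputed from the T and E columns of the table. The sketch below uses hypothetical names (`filter_result_t`, `filter_speedup`, `energy_saving`, `avg_power_w`) and simply divides the tabulated values; it does not model how the measurements were obtained.

```c
#include <assert.h>

/* One table entry for a given core and filter size. */
typedef struct {
    double time_us;    /* execution time T, in microseconds */
    double energy_uj;  /* energy E, in microjoules */
} filter_result_t;

/* Absolute time speed-up of `core` over the reference `ref`. */
static double filter_speedup(filter_result_t ref, filter_result_t core)
{
    assert(core.time_us > 0.0);
    return ref.time_us / core.time_us;
}

/* Fraction of energy saved by `core` relative to `ref` (0.9 means 90% less). */
static double energy_saving(filter_result_t ref, filter_result_t core)
{
    assert(ref.energy_uj > 0.0);
    return 1.0 - core.energy_uj / ref.energy_uj;
}

/* Average power in watts: microjoules divided by microseconds. */
static double avg_power_w(filter_result_t r)
{
    assert(r.time_us > 0.0);
    return r.energy_uj / r.time_us;
}
```

Taking the 11x11 column, ZeroRiscy is `{14482, 1205.4}` and the T13 symmetric MIMD core with DLP 8 is `{408, 104.7}`: the speed-up comes out at about 35.5 and the energy saving at about 91%.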
CONCLUSIONS
▪ The MIMD-SIMD vector coprocessor schemes enable tuning the TLP and DLP
• >15X absolute time speed-up, -85% energy per operation.
▪ Kernels that are less effectively vectorizable can still benefit from SPMs and TLP in an IMT core
• 2X-3X speed-up.
▪ Fully symmetric MIMD and heterogeneous MIMD give very similar results
• functional unit contention has less impact than SPM contention.
• coprocessor contention can be effectively mitigated by functional unit heterogeneity
▪ Pure DLP acceleration always gives worse results than a balanced TLP/DLP acceleration.
• The IMT microarchitecture benefits from TLP and DLP acceleration in a single core.
▪ In the absence of hardware acceleration, IMT still exhibits a performance advantage over single-thread execution
• Simplified hardware structure philosophy
Thank you for joining
Contribute to the RISC-V conversation on social!
#RISCVSUMMIT #KLESSYDRA @mauro_olivieri_
https://github.com/klessydra
Mauro.Olivieri@uniroma1.it

RISC-V 30946 manuel_offenberg_v3_notesRISC-V 30946 manuel_offenberg_v3_notes
RISC-V 30946 manuel_offenberg_v3_notesRISC-V International
 
RISC-V software state of the union
RISC-V software state of the unionRISC-V software state of the union
RISC-V software state of the unionRISC-V International
 
Ripes tracking computer architecture throught visual and interactive simula...
Ripes   tracking computer architecture throught visual and interactive simula...Ripes   tracking computer architecture throught visual and interactive simula...
Ripes tracking computer architecture throught visual and interactive simula...RISC-V International
 
Open source manufacturable pdk for sky water 130nm process node
Open source manufacturable pdk for sky water 130nm process nodeOpen source manufacturable pdk for sky water 130nm process node
Open source manufacturable pdk for sky water 130nm process nodeRISC-V International
 
Gernot heiser unsw sydney and se l4 foundation
Gernot heiser unsw sydney and se l4 foundationGernot heiser unsw sydney and se l4 foundation
Gernot heiser unsw sydney and se l4 foundationRISC-V International
 
Fueling the datasphere how RISC-V enables the storage ecosystem
Fueling the datasphere   how RISC-V enables the storage ecosystemFueling the datasphere   how RISC-V enables the storage ecosystem
Fueling the datasphere how RISC-V enables the storage ecosystemRISC-V International
 
Easily emulating full systems on amazon fpg as
Easily emulating full systems on amazon fpg asEasily emulating full systems on amazon fpg as
Easily emulating full systems on amazon fpg asRISC-V International
 

Plus de RISC-V International (20)

WD RISC-V inliner work effort
WD RISC-V inliner work effortWD RISC-V inliner work effort
WD RISC-V inliner work effort
 
RISC-V Online Tutor
RISC-V Online TutorRISC-V Online Tutor
RISC-V Online Tutor
 
London Open Source Meetup for RISC-V
London Open Source Meetup for RISC-VLondon Open Source Meetup for RISC-V
London Open Source Meetup for RISC-V
 
RISC-V Introduction
RISC-V IntroductionRISC-V Introduction
RISC-V Introduction
 
Ziptillion boosting RISC-V with an efficient and os transparent memory comp...
Ziptillion   boosting RISC-V with an efficient and os transparent memory comp...Ziptillion   boosting RISC-V with an efficient and os transparent memory comp...
Ziptillion boosting RISC-V with an efficient and os transparent memory comp...
 
Standardizing the tee with global platform and RISC-V
Standardizing the tee with global platform and RISC-VStandardizing the tee with global platform and RISC-V
Standardizing the tee with global platform and RISC-V
 
Security and functional safety
Security and functional safetySecurity and functional safety
Security and functional safety
 
RISC-V 30910 kassem_ summit 2020 - so_c_gen
RISC-V 30910 kassem_ summit 2020 - so_c_genRISC-V 30910 kassem_ summit 2020 - so_c_gen
RISC-V 30910 kassem_ summit 2020 - so_c_gen
 
RISC-V 30906 hex five multi_zone iot firmware
RISC-V 30906 hex five multi_zone iot firmwareRISC-V 30906 hex five multi_zone iot firmware
RISC-V 30906 hex five multi_zone iot firmware
 
RISC-V 30946 manuel_offenberg_v3_notes
RISC-V 30946 manuel_offenberg_v3_notesRISC-V 30946 manuel_offenberg_v3_notes
RISC-V 30946 manuel_offenberg_v3_notes
 
RISC-V software state of the union
RISC-V software state of the unionRISC-V software state of the union
RISC-V software state of the union
 
Ripes tracking computer architecture throught visual and interactive simula...
Ripes   tracking computer architecture throught visual and interactive simula...Ripes   tracking computer architecture throught visual and interactive simula...
Ripes tracking computer architecture throught visual and interactive simula...
 
Porting tock to open titan
Porting tock to open titanPorting tock to open titan
Porting tock to open titan
 
Open j9 jdk on RISC-V
Open j9 jdk on RISC-VOpen j9 jdk on RISC-V
Open j9 jdk on RISC-V
 
Open source manufacturable pdk for sky water 130nm process node
Open source manufacturable pdk for sky water 130nm process nodeOpen source manufacturable pdk for sky water 130nm process node
Open source manufacturable pdk for sky water 130nm process node
 
Gernot heiser unsw sydney and se l4 foundation
Gernot heiser unsw sydney and se l4 foundationGernot heiser unsw sydney and se l4 foundation
Gernot heiser unsw sydney and se l4 foundation
 
Fueling the datasphere how RISC-V enables the storage ecosystem
Fueling the datasphere   how RISC-V enables the storage ecosystemFueling the datasphere   how RISC-V enables the storage ecosystem
Fueling the datasphere how RISC-V enables the storage ecosystem
 
Easily emulating full systems on amazon fpg as
Easily emulating full systems on amazon fpg asEasily emulating full systems on amazon fpg as
Easily emulating full systems on amazon fpg as
 
Developing for polar fire soc
Developing for polar fire socDeveloping for polar fire soc
Developing for polar fire soc
 
Data trustworthiness at the edge
Data trustworthiness at the edgeData trustworthiness at the edge
Data trustworthiness at the edge
 

Dernier

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Dernier (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Klessydra-T: Designing Vector Coprocessors for Multi-Threaded Edge-Computing Cores

  • 1. Information Classification: General. December 8-10 | Virtual Event. Klessydra-T: Designing Vector Coprocessors for Multi-Threaded Edge-Computing Cores. Mauro Olivieri, Professor, Sapienza University of Rome. #RISCVSUMMIT
  • 2. DIGITAL SYSTEM LAB @ SAPIENZA UNIVERSITY OF ROME
      • Francesco Lannutti, collaborator @Synopsys
      • Marcello Barbirotta, PhD candidate
      • Mauro Olivieri, Associate Professor
      • Francesco Menichelli, Assistant Professor
      • Antonio Mastrandrea, Research Fellow
      • Abdallah Cheikh, Research Fellow
      • Luigi Blasi, PhD cand. @DSI Gmbh
      • Francesco Vigli, PhD cand. @ ELT Spa
      • Stefano Sordillo, PhD candidate
  • 3. OUTLINE
      INTRODUCTION & MOTIVATION
      THE KLESSYDRA-T ARCHITECTURE
      • Interleaved Multi-Threading baseline
      • Parameterized vector acceleration schemes
      • Klessydra vector intrinsic functions
      BENCHMARK WORKLOADS
      • Convolution, Matmul, FFT
      • Homogeneous and composite workload
      RESULTS
      • Cycle count and absolute execution time
      • Maximum clock frequency and hardware resource utilization
      • Energy efficiency
      CONCLUSIONS
  • 4. APPLICATION CONTEXT AND MOTIVATION (19/04/2021)
      • There are recognized drives towards (extreme) edge computing: availability, energy saving, security, etc., with implications for both SW design and HW design.
      • HW design challenges of extreme edge computing devices:
          • Local energy budget
          • Cost & size
          • Computing power
      • General setting:
          • Possibly taking advantage of inherently multi-threaded application routines
          • Inevitability of hardware acceleration support
  • 5. THE PULPINO-COMPATIBLE KLESSYDRA CORE FAMILY
      • PULPino feat. Klessydra S0 core: starting point; M mode v1.10; RV32I user ISA; single hart
      • PULPino feat. Klessydra T0 cores: M mode v1.10; RV32I user ISA; Atomic ext. (partial); multiple PC & CSR; multiple interleaved harts
      • PULPino feat. Klessydra F0 cores: “space-qualified” core; T0 microarchitecture; + configurable HW/SW fault-tolerance support
      • PULPino feat. Klessydra T1 cores: “edge computing” core; extends T0 microarchitecture; RV32IM; + configurable multiple scratchpad memories; + configurable vector unit; extended ISA
  • 6. THE KLESSYDRA IMT MICROARCHITECTURE
      • Baseline Klessydra T03 core features:
          • Thread context switch at each clock cycle
          • In-order, single-issue instruction execution
          • Feed-forward pipeline structure (no hardware support for pipeline hazard handling)
          • Bare-metal execution (RISC-V M mode)
      • The vector-accelerated Klessydra-T13 core has been designed as a superset of the basic Klessydra-T03 microarchitecture.
      [Figure: T03 pipeline block diagram. Labels: Fetch, Decode, Execute, WB, Regfile, PC (one per hart), CSR, harc Updater, Debug, Program memory, Data memory; harts a, b, c interleaved in the pipeline.]
  • 7. THE KLESSYDRA-T1 MICROARCHITECTURE FAMILY
      Klessydra T13 core features:
      • Multiple units in the execution stage:
          • scalar execution unit (EXEC)
          • vector-oriented multi-purpose functional unit (MFU) with Scratchpad Memory support
          • Load/Store unit (LSU)
      • Possible concurrent execution of instructions of different types
      [Figure: T13 block diagram. Labels: Fetch, Decode, Execute (EXEC, MFU, LSU), WB, Regfile, PC, CSR, harc Updater, Debug, Program memory, Data memory; MFU operations Add, Sub, Shft, Mul, Accum, Relu with Input Mapping, Output Mapping and Data reorder; SPMI with bank interleaving over Bank0 .. BankN; SPM0 .. SPMN-1, each with D banks; replication factors x F, x D, x N; signals MAU_req, MAU_busy.]
  • 8. HARDWARE ACCELERATION PARAMETRIC SCHEMES
      The parametric coprocessor architecture in T13 cores, comprising the MFU and the SPMIs, can be configured at synthesis level according to the following values:
      • the number of parallel lanes D in the MFU, which defines the DLP degree and also corresponds to the number of SPM banks in each SPMI block
      • the number of MFUs F
      • the SPM bank capacity B
      • the number of SPMs N
      • the number of SPMIs M
      • the sharing scheme of MFUs and SPMIs among the harts, i.e. heterogeneous or symmetric
      Resulting schemes:
      • M=1, F=1, D=1: SISD
      • M=1, F=1, D=2,4,8: Pure SIMD
      • M=3, F=3, D=1: Symmetric MIMD
      • M=3, F=3, D=2,4,8: Symmetric MIMD + SIMD
      • M=3, F=1, D=1: Heterogeneous MIMD
      • M=3, F=1, D=2,4,8: Heterogeneous MIMD + SIMD
  • 9. KLESSYDRA VECTOR EXTENSION AND INTRINSIC FUNCTIONS
      Assembly syntax ((r) denotes memory addressing via register r) and short description:
      kmemld    (rd),(rs1),(rs2)   load vector into scratchpad region
      kmemstr   (rd),(rs1),(rs2)   store vector into main memory
      kaddv     (rd),(rs1),(rs2)   add vectors in scratchpad region
      ksubv     (rd),(rs1),(rs2)   subtract vectors in scratchpad region
      kvmul     (rd),(rs1),(rs2)   multiply vectors in scratchpad region
      kvred     (rd),(rs1),(rs2)   reduce vector by addition
      kdotp     (rd),(rs1),(rs2)   vector dot product into register
      ksvaddsc  (rd),(rs1),(rs2)   add vector + scalar into scratchpad
      ksvaddrf  (rd),(rs1),rs2     add vector + scalar into register
      ksvmulsc  (rd),(rs1),(rs2)   multiply vector by scalar into scratchpad
      ksvmulrf  (rd),(rs1),rs2     multiply vector by scalar into register
      kdotpps   (rd),(rs1),(rs2)   vector dot product and post scaling
      ksrlv     (rd),(rs1),rs2     vector logic shift within scratchpad
      ksrav     (rd),(rs1),rs2     vector arithmetic shift within scratchpad
      krelu     (rd),(rs1)         vector ReLU within scratchpad
      kvslt     (rd),(rs1),(rs2)   compare vectors and create mask vector
      ksvslt    (rd),(rs1),rs2     compare vector-scalar and create mask
      kvcp      (rd),(rs1)         copy vector within scratchpad region
      The instructions supported by the coprocessor subsystem are exposed to the programmer in the form of very simple intrinsic functions, fully integrated in the RISC-V gcc compiler toolchain.
      CSR_MVSIZE(Row_size); // set vector length
      for (i = Zeropad_offset; i < Row_size - Zeropad_offset; i++) { // scan the Output Matrix rows
        k_element = 0;
        for (FM_row_pointer = -Zeropad_offset; FM_row_pointer <= Zeropad_offset; FM_row_pointer++) {
          for (column_offset = 0; column_offset < kernel_size; column_offset++) {
            FM_offset = (i + FM_row_pointer) * Row_size + column_offset; // set pointer in SPM space
            ksvmulsc(SPM_D, (SPM_A + FM_offset), (SPM_B + k_element++)); // temporary vector result
            ksrav(SPM_D, SPM_D, scaling_factor); // scaling for fixed point alignment
            OM_offset = (Row_size * i) + Zeropad_offset; // set pointer in SPM space
            kaddv((SPM_C + OM_offset), (SPM_C + OM_offset), SPM_D); // update Output Matrix row
          }
        }
      }
  • 10. BENCHMARK WORKLOADS AND EVALUATION SETUP
      • 2D convolution
          • 32-bit data elements in fixed-point representation
          • 3x3 filter size
          • matrix sizes of 4x4, 8x8, 16x16, and 32x32 elements
          • additional analysis of filter sizes larger than 3x3 on 32x32 matrices
      • FFT
          • 256 complex samples
      • Matmul
          • Square matrices of 64x64 elements
      • Homogeneous workload (3 harts running the same program)
      • Composite workload (3 harts running different programs)
      ANALYZED PERFORMANCE FIGURES ON FPGA SOFT-CORE IMPLEMENTATION
      • Average total cycle count per hart
      • Maximum clock frequency
      • Absolute execution time
      • Hardware resource utilization
      • Average energy per algorithmic operation
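  • 11. SUMMARY OF PERFORMANCE RESULTS
      • 3X cycle count speed-up relative to an RV32IM IMT core without acceleration (Klessydra T03)
      • 2X cycle count speed-up when compared to the single-threaded, DSP-extended RI5CY core

      MICROARCHITECTURE SYNTHESIS RESULTS (FPGA element utilization and maximum frequency)
      Core          | Configuration    | DLP | FF    | LUT   | B-RAM | DSP | LUT-RAM | Max freq (MHz)
      Klessydra T13 | SISD             | 1   | 2488  | 6982  | 6     | 11  | 264     | 144.4
      Klessydra T13 | SIMD             | 2   | 2627  | 8400  | 6     | 15  | 264     | 146.0
      Klessydra T13 | SIMD             | 4   | 3301  | 11366 | 6     | 23  | 264     | 137.2
      Klessydra T13 | SIMD             | 8   | 4800  | 17331 | 12    | 39  | 264     | 137.7
      Klessydra T13 | Sym. MIMD        | 1   | 3512  | 10458 | 18    | 19  | 264     | 148.2
      Klessydra T13 | Sym. MIMD + SIMD | 2   | 4712  | 15943 | 18    | 31  | 264     | 131.7
      Klessydra T13 | Sym. MIMD + SIMD | 4   | 6753  | 25089 | 18    | 55  | 264     | 120.0
      Klessydra T13 | Sym. MIMD + SIMD | 8   | 10854 | 43419 | 36    | 103 | 264     | 105.1
      Klessydra T13 | Het. MIMD        | 1   | 3012  | 10182 | 18    | 11  | 264     | 117.2
      Klessydra T13 | Het. MIMD + SIMD | 2   | 3871  | 15577 | 18    | 15  | 264     | 128.9
      Klessydra T13 | Het. MIMD + SIMD | 4   | 5015  | 23282 | 18    | 23  | 264     | 122.0
      Klessydra T13 | Het. MIMD + SIMD | 8   | 7325  | 42944 | 36    | 39  | 264     | 108.6
      Klessydra T03 | -                | -   | 1418  | 4281  | 0     | 7   | 176     | 221.1
      RI5CY         | -                | -   | 2527  | 7674  | 0     | 6   | 0       | 91.4
      ZeroRiscy     | -                | -   | 1933  | 5275  | 0     | 1   | 0       | 117.2

      AVERAGE CYCLE COUNT PER COMPUTATION KERNEL (H = homogeneous workload, C = composite workload)
      Core          | Configuration    | DLP | H Conv 4x4 | H Conv 8x8 | H Conv 16x16 | H Conv 32x32 | H FFT 256 | H MatMul 64x64 | C Conv 32x32 | C FFT 256 | C MatMul 64x64
      Klessydra T13 | SISD             | 1   | 1105       | 3060       | 9727         | 34201        | 33033     | 728187         | 66043        | 80874     | 476771
      Klessydra T13 | SIMD             | 2   | 895        | 2245       | 6261         | 20374        | 25647     | 602458         | 21976        | 60019     | 645705
      Klessydra T13 | SIMD             | 4   | 824        | 1768       | 4607         | 13444        | 22812     | 543164         | 16850        | 29144     | 431773
      Klessydra T13 | SIMD             | 8   | 824        | 1613       | 3692         | 10069        | 21555     | 484436         | 11324        | 22482     | 414420
      Klessydra T13 | Sym. MIMD        | 1   | 626        | 1493       | 3887         | 13536        | 18726     | 462066         | 20953        | 17824     | 292564
      Klessydra T13 | Sym. MIMD + SIMD | 2   | 629        | 1190       | 3123         | 8681         | 16827     | 378748         | 16144        | 15839     | 222370
      Klessydra T13 | Sym. MIMD + SIMD | 4   | 560        | 1190       | 2543         | 7148         | 15993     | 328962         | 15868        | 14942     | 182580
      Klessydra T13 | Sym. MIMD + SIMD | 8   | 560        | 1152       | 2543         | 6006         | 15726     | 316270         | 15581        | 14613     | 168031
      Klessydra T13 | Het. MIMD        | 1   | 663        | 1521       | 4153         | 13565        | 22839     | 556463         | 27155        | 37111     | 265567
      Klessydra T13 | Het. MIMD + SIMD | 2   | 638        | 1274       | 3280         | 9167         | 18468     | 425978         | 15973        | 24611     | 251201
      Klessydra T13 | Het. MIMD + SIMD | 4   | 573        | 1213       | 2688         | 7473         | 16887     | 360863         | 16042        | 19175     | 181290
      Klessydra T13 | Het. MIMD + SIMD | 8   | 573        | 1079       | 2580         | 6285         | 17604     | 328178         | 13921        | 17298     | 187877
      Klessydra T03 | -                | -   | 1819       | 5737       | 20714        | 79230        | 47256     | 2679304        | 138959       | 46733     | 2775779
      RI5CY         | -                | -   | 1377       | 4247       | 15088        | 57020        | 37344     | 1360854        | 81534        | 37350     | 1369572
      ZeroRiscy     | -                | -   | 2510       | 8111       | 29583        | 113793       | 61158     | 4006241        | 197010       | 61163     | 4043376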
  • 12. SUMMARY OF PERFORMANCE RESULTS (same summary bullets and results table as slide 11)
      • The clock speed exhibited the sharpest drops as the DLP grew larger.
      • In the symmetric MIMD scheme, the large HW overhead forced FPGA slices on the same critical path to be placed far from each other, thus increasing interconnect delay.
      • Pipelining the heterogeneous MIMD crossbar to reduce the critical path introduces additional HW overhead, compromising the area advantage.
  • 13. SUMMARY OF PERFORMANCE RESULTS (same results table as slide 11)
      • Small matrix convolutions and FFT on the accelerated core reached up to 2X cycle count reduction over the single-threaded, DSP-extended RI5CY core.
      • Large matrix convolutions and MatMul benefit from vector acceleration, reaching 9X cycle count reduction relative to RI5CY.
  • 14. ABSOLUTE EXECUTION TIME SPEED-UP
      • Assuming maximum clock frequency for each core
      • Zeroriscy core taken as common reference
      • In pure SIMD configurations, the speed-up grows linearly with the DLP.
      • Going from SISD/SIMD to MIMD+SIMD improved the speed-up in all cases, despite the frequency drop associated with the MIMD hardware.
      • The symmetric MIMD+SIMD schemes exhibit up to 17X speed-up over Zeroriscy for Convolution 32x32 and up to 13X speed-up for the composite workload.
      • Heterogeneous MIMD configurations maintain an almost perfect overlap with the symmetric MIMD.
      • The non-accelerated Klessydra-T03 exhibits an absolute performance gain over RI5CY and ZeroRiscy.
      [Bar chart: absolute execution time speed-up (axis 0 to 18) for SISD, pure SIMD (DLP 2, 4, 8), Sym. MIMD (DLP 1), Sym. MIMD+SIMD (DLP 2, 4, 8), Het. MIMD (DLP 1), Het. MIMD+SIMD (DLP 2, 4, 8), Klessydra T03 (no accel.), RI5CY (DSP extension) and ZeroRiscy (no accel.); series: Conv.2D 4x4, Conv.2D 8x8, Conv.2D 16x16, Conv.2D 32x32, FFT 256, MatMul 64x64, Composite.]
  • 15. ENERGY EFFICIENCY
    • The result of this analysis is expressed as energy per algorithmic operation for the FPGA soft-core implementations, normalized to Zeroriscy as the reference.
    • The most energy-efficient designs were the T13 symmetric MIMD configurations.
    • The heterogeneous MIMD approach exhibited an almost complete overlap in energy consumption with the symmetric MIMD.
    • The pure SIMD schemes resulted in a larger energy consumption than the other schemes, because they cannot efficiently exploit TLP.
    [Figure: bar chart of energy per algorithmic operation normalized to Zeroriscy (axis 0 to 1.3) for each workload (Conv.2D 4x4, Conv.2D 8x8, Conv.2D 16x16, Conv.2D 32x32, FFT 256, MatMul 64x64, Composite) across the configurations SISD DLP 1; pure SIMD DLP 2, 4, 8; Sym. MIMD DLP 1; Sym. MIMD+SIMD DLP 2, 4, 8; Het. MIMD DLP 1; Het. MIMD+SIMD DLP 2, 4, 8; Klessydra T03 (no accel.); RI5CY (DSP extension); ZeroRiscy (no accel.)]
  • 16. LARGER CONVOLUTION FILTERS

    Each filter size reports cycle count (x1000), execution time T (us), and energy E (uJ).

    | Core | DLP | 5x5 Cycles x1000 | 5x5 T (us) | 5x5 E (uJ) | 7x7 Cycles x1000 | 7x7 T (us) | 7x7 E (uJ) | 9x9 Cycles x1000 | 9x9 T (us) | 9x9 E (uJ) | 11x11 Cycles x1000 | 11x11 T (us) | 11x11 E (uJ) |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | T13 SIMD | 2 | 52.7 | 362 | 50.6 | 101.2 | 694 | 97.1 | 165.8 | 1136 | 159.1 | 246.5 | 1689 | 236.6 |
    | T13 SIMD | 8 | 24.6 | 179 | 34.4 | 46.1 | 335 | 64.5 | 74.7 | 543 | 104.7 | 110.6 | 803 | 154.8 |
    | T13 Sym MIMD | 2 | 19.5 | 148 | 26.9 | 35.8 | 272 | 49.4 | 57.4 | 436 | 79.2 | 84.4 | 641 | 116.5 |
    | T13 Sym MIMD | 8 | 11.8 | 113 | 28.9 | 19.2 | 183 | 46.9 | 29.8 | 284 | 72.7 | 42.9 | 408 | 104.7 |
    | T13 Het MIMD | 2 | 20.5 | 159 | 28.3 | 37.5 | 291 | 51.8 | 60.2 | 467 | 83.1 | 88.5 | 687 | 122.1 |
    | T03 (no accel.) | - | 247 | 1120 | 215.5 | 514.8 | 2328 | 447.9 | 881.2 | 3985 | 766.6 | 1369.1 | 6191 | 1191.1 |
    | RI5CY | - | 180 | 1971 | 252.0 | 385.3 | 4218 | 539.4 | 662.5 | 7252 | 927.5 | 1000.2 | 10949 | 1400.3 |
    | ZeroRiscy | - | 318.9 | 2721 | 226.4 | 674.5 | 5754 | 478.9 | 1129.7 | 9637 | 802.1 | 1697.8 | 14482 | 1205.4 |

    • The matrix being convolved is 32x32 elements.
    • The speed-up and energy efficiency trends continue as the filter dimensions grow, reaching a 35X speed-up over the Zeroriscy reference.
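The 35X figure can be reproduced from the 11x11 filter columns. The sketch below assumes the speed-up is the ratio of the tabulated execution times and, by analogy, takes the energy gain as the ratio of the tabulated energies; `FILTER_11X11` and `gains` are illustrative names introduced here.

```python
# (execution time in us, energy in uJ) for the 11x11 filter on a 32x32 matrix,
# copied from the larger-convolution-filters table.
FILTER_11X11 = {
    "T13 SIMD, DLP 8": (803, 154.8),
    "T13 Sym MIMD, DLP 8": (408, 104.7),
    "T03 (no accel.)": (6191, 1191.1),
    "ZeroRiscy": (14482, 1205.4),
}


def gains(core, reference="ZeroRiscy"):
    """Return (time speed-up, energy reduction factor) of `core` vs `reference`."""
    t_core, e_core = FILTER_11X11[core]
    t_ref, e_ref = FILTER_11X11[reference]
    return t_ref / t_core, e_ref / e_core


if __name__ == "__main__":
    for name in FILTER_11X11:
        s, e = gains(name)
        print(f"{name}: {s:.1f}X faster, {e:.1f}X less energy than ZeroRiscy")
```

For the symmetric MIMD DLP 8 row this gives a time ratio between 35 and 36 and an energy ratio between 11 and 12 relative to Zeroriscy.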
  • 17. CONCLUSIONS
    • The MIMD-SIMD vector coprocessor schemes enable tuning of the TLP and DLP:
      • >15X absolute time speed-up, -85% energy per operation.
    • Kernels that are less effectively vectorizable can still benefit from SPMs and TLP in an IMT core:
      • 2X-3X speed-up.
    • Fully symmetric MIMD and heterogeneous MIMD give very similar results:
      • functional unit contention has less impact than SPM contention;
      • coprocessor contention can be effectively mitigated by functional unit heterogeneity.
    • Pure DLP acceleration always gives inferior results to a balanced TLP/DLP acceleration:
      • the IMT microarchitecture benefits from TLP and DLP acceleration in a single core.
    • In the absence of hardware acceleration, IMT still exhibits a performance advantage over single-thread execution:
      • simplified hardware structure philosophy.
  • 18. Thank you for joining. Contribute to the RISC-V conversation on social! #RISCVSUMMIT #KLESSYDRA @mauro_olivieri_ https://github.com/klessydra Mauro.Olivieri@uniroma1.it