32. DNN Processing Units
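[Figure: processing units placed on a spectrum from flexibility to efficiency: CPUs, GPUs, Soft DPU (FPGA), Hard DPU (ASICs). Inset: CPU block diagram with Control Unit (CU), Registers, and Arithmetic Logic Unit (ALU).]
Hard DPU examples: Cerebras, Google TPU, Graphcore, Groq, Intel Nervana, Movidius, Wave Computing, etc.
Soft DPU examples: BrainWave, Baidu SDA, Deephi Tech, ESE, Teradeep, etc.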
33. Scalable DNN H/W Microservice
[Figure: Pretrained DNN Model (in CNTK, etc.), mapped to a Scalable DNN Hardware Microservice, running on the BrainWave Soft DPU (Instr Decoder & Control, Neural FU). Labels: layers L0 and L1 placed on groups of FPGAs (F); network switches; FPGAs.]
38. A framework-neutral federated compiler and runtime for
compiling pretrained DNN models to soft DPUs
Adaptive ISA for narrow precision DNN inference
Flexible and extensible to support fast-changing AI algorithms
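As a rough illustration of what "narrow precision" can mean for inference, the sketch below quantizes a block of values to small signed integer mantissas that share one power-of-two exponent. The slides do not specify the number format BrainWave uses, so the shared-exponent scheme, the function names `quantize_block` and `dequantize_block`, and the 5-bit mantissa width are all assumptions made for this example.

```python
import math


def quantize_block(values, mantissa_bits=5):
    """Quantize floats to signed integer mantissas sharing one exponent.

    Hypothetical format for illustration only; not taken from the slides.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0] * len(values), 0
    # Largest magnitude a signed mantissa of this width can hold.
    max_mant = (1 << (mantissa_bits - 1)) - 1
    # Smallest power-of-two scale that keeps the largest value in range.
    exponent = math.ceil(math.log2(max_abs / max_mant))
    scale = 2.0 ** exponent
    # Round to the nearest mantissa and clamp against floating-point edge cases.
    mantissas = [max(-max_mant, min(max_mant, round(v / scale))) for v in values]
    return mantissas, exponent


def dequantize_block(mantissas, exponent):
    """Recover approximate floats from mantissas and the shared exponent."""
    scale = 2.0 ** exponent
    return [m * scale for m in mantissas]


if __name__ == "__main__":
    mantissas, exponent = quantize_block([0.5, -0.25, 0.125, 0.0])
    print(mantissas, exponent)
    print(dequantize_block(mantissas, exponent))
```

Each value is stored in only a few bits, and the rounding error per element is at most half of the shared scale, which is the kind of trade-off a narrow-precision design makes.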
39. BrainWave Soft DPU microarchitecture
Highly optimized for narrow precision and low batch
40. Persist model parameters entirely in FPGA on-chip memories
Support large models by scaling across many FPGAs
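The idea of keeping all parameters on-chip and scaling across devices can be sketched as row-wise sharding of a weight matrix: each shard fits within one device's memory budget, and each device computes its slice of a matrix-vector product. This is a minimal sketch under assumptions. The names `shard_rows` and `distributed_matvec`, the capacity measured in number of weight values, and the row-wise split are hypothetical and do not describe how BrainWave actually partitions models.

```python
def shard_rows(weights, capacity_per_device):
    """Split matrix rows into contiguous shards that each fit on one device.

    capacity_per_device is a hypothetical budget: the maximum number of
    weight values one device can hold in on-chip memory.
    """
    cols = len(weights[0])
    rows_per_device = capacity_per_device // cols
    if rows_per_device == 0:
        raise ValueError("a single row does not fit on one device")
    return [
        weights[i:i + rows_per_device]
        for i in range(0, len(weights), rows_per_device)
    ]


def matvec(rows, x):
    """Multiply a list of rows by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def distributed_matvec(shards, x):
    """Compute the full product; each loop iteration stands in for one device."""
    out = []
    for shard in shards:
        out.extend(matvec(shard, x))
    return out


if __name__ == "__main__":
    weights = [[1, 2], [3, 4], [5, 6], [7, 8]]
    shards = shard_rows(weights, capacity_per_device=4)
    print(len(shards))
    print(distributed_matvec(shards, [1, 1]))
```

A larger model simply produces more shards, so the number of devices grows with model size while each device's weights stay resident in its own memory.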
41. Intel FPGAs deployed at scale with HW microservices [MICRO’16]
47. Web search ranking
[Figure: two planes. Traditional software (CPU) server plane, with labels CPU, QPI, FPGA, QSFP, 40Gb/s, and 40Gb/s ToR. Hardware acceleration plane, with workloads: web search ranking, deep neural networks, SDN offload, SQL. Legend: CPUs, FPGAs, Routers.]
Interconnected FPGAs operate decoupled from the traditional software layer
They can be managed and used independently of the CPUs