Vc4c development of opencl compiler for videocore4

VC4C: Development of OpenCL
Compiler for VideoCore4
Current status and challenges of developing an OSS OpenCL
compiler for the Raspberry Pi GPU
2018/11/10, Compiler Study Group @ Fixstars

Who am I
・光のインターネットの闇
　@no_maddo
・Engineer at Idein

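Today's topics
- Introduction to VideoCore IV (hereafter VC4)
  - Architecture
  - Memory characteristics
- Introduction to VC4C
  - An OSS project by a developer in Germany, not by me
  - Apparently it started as a project for a master's thesis
- Considerations specific to VC4C
  - VC4 is tough
  - OpenCL is tough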

Idein’s technology
・Runs MobileNet V2 1.0 (224x224): 8.4 FPS ~= 140ms

How can we achieve this performance?
・Extract the maximum performance of VideoCore IV
・Hand-written parallel GPU assembly code
・Run only on the GPU
・No return to the CPU during inference
・CPU usage is very low
・Even a Pi Zero (a $5 computer!) achieves this performance

For details…
See our president's presentation

Why do we “try” the OSS compiler?
・We don’t use VC4C in production now
・Tuning assembly is “hard”
・Diesel, Tensor Comprehensions
・In the near future, we would be happy to write high-performance
mathematical kernels with a compiler…

QPU / Quad Processing Unit
・general purpose register A/B x32 (=64 registers)
・accumulator register r[0-3] (= 4 registers)

TMU / Texture and Memory Lookup Unit

Assembly Example: Hello World
mov(r0, uniform)          # load a uniform (the destination address) into r0
setup_vpm_write()         # prepare for a VPM write
mov(vpm, 1)               # write one row (16 elements) to the VPM
setup_dma_store(nrows=1)  # declare a DMA store of one row
mov(vpm_st_addr, r0)      # start the store to the address held in r0
wait_dma_store()          # wait for the DMA store to finish
exit()
See the repository: py-videocore

Ex: A = A * 2 + 1
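ldi(ra1, 64)                         # byte stride per iteration (16 floats x 4 bytes)
ldi(rb1, 16)                         # elements processed per iteration
mov(rb0, uniform)                    # 1st uniform: store address
mov(ra0, uniform)                    # 2nd uniform: number of elements
imul24(r1, element_number, 4)        # per-lane byte offset
iadd(r1, uniform, r1)                # 3rd uniform: load address + lane offset
L.loop
iadd(r1, r1, ra1).mov(tmu0_s, r1)    # request a TMU load and advance the load address
mutex_acquire()
setup_vpm_write(nrows=1)
nop(sig='load tmu0')                 # receive the loaded data in r4
fmul(r0, r4, 2.0)
fadd(vpm, r0, 1.0)                   # write A * 2 + 1 to the VPM
setup_dma_store(nrows=1)
mov(vpm_st_addr, rb0)                # start the DMA store to the address in rb0
wait_dma_store()
mutex_release()
isub(ra0, ra0, rb1, set_flags=True)  # decrement the remaining element count
jzc(L.loop)                          # loop while the count is non-zero
iadd(rb0, rb0, ra1); nop(); nop()    # branch delay slots: advance the store address
exit()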
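
To make the control flow of the loop above easier to follow, here is a plain-Python model of it. It is only a sketch, not py-videocore code: the name `mul2_model` is hypothetical, and it assumes the element count is a positive multiple of 16, as in the 128-element example.

```python
# Plain-Python model of the A = A * 2 + 1 loop (illustrative, not py-videocore).
LANES = 16          # elements per QPU vector (rb1 in the listing)
STRIDE_BYTES = 64   # 16 floats * 4 bytes (ra1 in the listing)


def mul2_model(a, n_elements):
    """Compute a * 2 + 1, 16 elements per iteration, like the QPU loop.

    Assumes n_elements is a positive multiple of 16.
    """
    out = list(a)
    remaining = n_elements   # plays the role of ra0
    offset_bytes = 0         # how far the load/store addresses have advanced
    iterations = 0
    while True:
        base = offset_bytes // 4
        for lane in range(LANES):            # element_number = 0..15
            out[base + lane] = a[base + lane] * 2.0 + 1.0
        offset_bytes += STRIDE_BYTES         # iadd(rb0, rb0, ra1)
        remaining -= LANES                   # isub(ra0, ra0, rb1)
        iterations += 1
        if remaining == 0:                   # jzc falls through once the count is zero
            break
    return out, iterations


data = [float(i) for i in range(128)]
result, n_iter = mul2_model(data, 128)
```

With 128 elements the model runs the loop body 8 times, one 16-element row per iteration.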

Flow of execution
・allocate GPU memory
・build uniforms
・for each thread
・run driver
with Driver() as drv:
    n_threads = 12
    r = drv.alloc((n_threads, 128), 'float32')
    a = drv.alloc((n_threads, 128), 'float32')
    ………
    code = drv.program(mul2)
    uniforms = drv.alloc((n_threads, 3), 'uint32')
    uniforms[:, 0] = r.address()[:, 0]
    uniforms[:, 1] = 128
    uniforms[:, 2] = a.address()[0][0]
    drv.execute(n_threads=n_threads, program=code, uniforms=uniforms)

performance example: qmkl
$ sudo ./qmkl/test/sgemm 224 224 224
GPU: 6.17614e+09 [flop/s]
CPU: 9.78483e+08 [flop/s]
NEON: 1.06783e+09 [flop/s]
https://github.com/idein/qmkl
・Mathematical kernels using VC4
・Caution: no-trans, no-trans
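
The numbers above can be put in perspective with a little arithmetic. This sketch assumes the benchmark counts 2 * M * N * K floating-point operations for an SGEMM, which is the usual convention but is not stated on the slide. The helper name `sgemm_flops` is hypothetical.

```python
def sgemm_flops(m, n, k):
    """Floating-point operations of an M x K by K x N matrix product
    (assumed convention: one multiply and one add per inner-product term)."""
    return 2 * m * n * k


# Throughput figures reported by the benchmark run above, in flop/s.
gpu, cpu, neon = 6.17614e9, 9.78483e8, 1.06783e9

flops = sgemm_flops(224, 224, 224)
gpu_time_ms = flops / gpu * 1e3   # time for one 224x224x224 SGEMM on the GPU
speedup_vs_cpu = gpu / cpu
speedup_vs_neon = gpu / neon

print(f"{flops} flop, {gpu_time_ms:.2f} ms on GPU")
print(f"GPU is {speedup_vs_cpu:.1f}x CPU, {speedup_vs_neon:.1f}x NEON")
```

Under that assumption, one such product takes a few milliseconds on the GPU, and the GPU is roughly six times faster than either CPU path.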

Performance issue
・Low memory bandwidth:
  ・4.48 GB/s vs. 98 GB/s on my computer...
・TMU latency (cycles):
  ・TMU cache hit: 9
  ・L2 cache hit: 12
  ・Memory: 20 (if v3d_freq=250 [MHz])
・Cache incoherency
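
As a rough aid for reading these figures, the cycle counts can be converted to time at the stated 250 MHz clock, and the two bandwidth numbers compared. This is just arithmetic on the values listed above. The names are hypothetical.

```python
V3D_FREQ_HZ = 250e6  # v3d_freq = 250 MHz, as stated above


def cycles_to_ns(cycles, freq_hz=V3D_FREQ_HZ):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles / freq_hz * 1e9


tmu_latency_cycles = {"TMU cache hit": 9, "L2 cache hit": 12, "Memory": 20}
tmu_latency_ns = {name: cycles_to_ns(c) for name, c in tmu_latency_cycles.items()}

# Ratio between the two bandwidth figures quoted above (the unit cancels out).
bandwidth_ratio = 98 / 4.48

for name, ns in tmu_latency_ns.items():
    print(f"{name}: {ns:.0f} ns")
print(f"bandwidth ratio: {bandwidth_ratio:.1f}x")
```

At 250 MHz one cycle is 4 ns, so the listed latencies are tens of nanoseconds, while the bandwidth gap to the speaker's computer is a factor of about 22.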

Cache incoherency 1
[Figure: diagram with numbered steps ① to ⑤]

Recap: OpenCL
・Parallel programming framework for
heterogeneous computing (GPU, DSP, FPGA, etc.)
・Supports the data-parallel computing model
kernel void mul2(global float * a)
{
    int id = get_global_id(0);
    a[id] = a[id] * 2 + 1;
}
Host program
clCreateContext
clCreateProgramWithSource (compile at runtime)
clCreateBuffer
clEnqueueWriteBuffer
global_item_size = { 4, 8, 12 };
clEnqueueNDRangeKernel (enqueue kernel)
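
The data-parallel model can be illustrated without any OpenCL installation. The sketch below emulates a one-dimensional NDRange in plain Python: the same kernel body runs once per global ID. `enqueue_nd_range` is a hypothetical stand-in for illustration, not the OpenCL API, and it runs the work-items sequentially rather than in parallel.

```python
def mul2(a, global_id):
    """Python counterpart of the mul2 kernel body for a single work-item."""
    a[global_id] = a[global_id] * 2 + 1


def enqueue_nd_range(kernel, global_size, *args):
    """Hypothetical emulation of a 1-D NDRange launch:
    call the kernel once for every global ID (sequentially here)."""
    for global_id in range(global_size):
        kernel(*args, global_id)


buf = [0.0, 1.0, 2.0, 3.0]
enqueue_nd_range(mul2, len(buf), buf)
print(buf)
```

Each work-item touches only its own element, which is why a real device is free to run them in any order or in parallel.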

VC4C Overview
OpenCL runtime
offline compiler

Asm structure
kernel void mul2(global float * a) {
    int id = get_global_id(0);
    a[id] = a[id] * 2 + 1;
}
・An implicit loop is generated
・OpenCL parameters are
passed via uniforms
・The loop exit condition is passed via
a uniform
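
One way to picture this structure is a loop that consumes a uniform stream. The sketch below is an illustration only: the uniform layout (argument base, first work-item ID, repeat flag) and the names `run_implicit_loop` and `mul2_body` are assumptions made for this example, not VC4C's actual layout.

```python
def mul2_body(memory, a_base, first_id):
    """Kernel body for 16 consecutive work-items (one QPU vector)."""
    for lane in range(16):
        i = a_base + first_id + lane
        memory[i] = memory[i] * 2 + 1


def run_implicit_loop(memory, uniforms, body):
    """Run `body` repeatedly, reading everything it needs from the uniform stream.

    Assumed layout per iteration: [a_base, first_id, repeat_flag].
    """
    stream = iter(uniforms)
    iterations = 0
    while True:
        a_base = next(stream)      # kernel parameter, passed as a uniform
        first_id = next(stream)    # work-item ID information, passed as a uniform
        body(memory, a_base, first_id)
        iterations += 1
        repeat = next(stream)      # loop-exit condition, also passed as a uniform
        if not repeat:
            return iterations


memory = [float(i) for i in range(32)]
# Two iterations: work-items 0..15, then 16..31; the final flag 0 ends the loop.
iterations = run_implicit_loop(memory, [0, 0, 1, 0, 16, 0], mul2_body)
```

The kernel source contains no loop; the loop exists only in the generated code, and the host decides how many times it repeats through the uniforms it supplies.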

VC4C demo
Let’s check the output….

Current status: works if registers are enough
・Works fine if register allocation succeeds
・Lack of register spilling (implementation issue)
・Performance issue
  ・Better instruction scheduling
  ・Adjust clang loop optimizations for VC4
  ・Innermost loop unrolling
  ・Improve DMA transfers
  ・Auto-vectorization

Immediate
・Loading a 32-bit constant requires ldi
・Handling constants is costly
・moveConstantload moves ldi out of loops
・But this increases register pressure…
Before:
ldi(r0, 0)
ldi(r2, 10)
L.loop
ldi(r1, 256)
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flags=True)
bgt(L.loop)
nop(); nop(); nop()
After (ldi moved out of the loop):
ldi(r0, 0)
ldi(r2, 10)
ldi(r1, 256)
L.loop
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flags=True)
bgt(L.loop)
nop(); nop(); nop()
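
The trade-off can be made concrete by counting executed instructions in the two versions. This is a simple counting model, not a cycle-accurate one. It assumes both loops contain the flag-setting isub and run 10 times (the value loaded into r2). The helper name is hypothetical.

```python
def executed_instructions(preamble, loop_body, trip_count):
    """Total instructions executed: the preamble once, the loop body per iteration."""
    return len(preamble) + len(loop_body) * trip_count


TRIP_COUNT = 10  # r2 is initialised to 10 in the listing

# Version with ldi inside the loop.
before = executed_instructions(
    preamble=["ldi", "ldi"],
    loop_body=["ldi", "iadd", "isub", "bgt", "nop", "nop", "nop"],
    trip_count=TRIP_COUNT,
)

# Version with ldi moved in front of the loop.
after = executed_instructions(
    preamble=["ldi", "ldi", "ldi"],
    loop_body=["iadd", "isub", "bgt", "nop", "nop", "nop"],
    trip_count=TRIP_COUNT,
)

saved = before - after
```

Moving the ldi saves one instruction per iteration at the cost of one extra instruction before the loop. The price is that r1 now stays live for the whole loop, which is the register-pressure increase mentioned above.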

Small immediate
・In place of the regfile B read field, a limited set of immediates can be encoded
・-16~15, 1.0, 2.0, 4.0, …
・By combining them, some constants can be produced by an ALU instruction instead of ldi
Before:
ldi(r0, 0)
ldi(r2, 10)
L.loop
ldi(r1, 256)
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flags=True)
bgt(L.loop)
nop(); nop(); nop()
After:
mov(r0, 0)
mov(r2, 10)
imul24(r1, -16, -16)
L.loop
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flags=True)
bgt(L.loop)
nop(); nop(); nop()
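
A compiler first has to decide whether a constant fits the small-immediate encoding at all. A minimal sketch of that check follows. The integer range is the one given above. The float set (powers of two from 1/256 to 128.0) is an assumption based on my reading of the VideoCore IV reference guide and should be checked against it. `is_small_immediate` is a hypothetical helper, not a VC4C function.

```python
# Integers -16..15, as listed above.
SMALL_INTS = set(range(-16, 16))

# Assumed float set: 1/256, 1/128, ..., 1/2, 1.0, 2.0, ..., 128.0.
SMALL_FLOATS = {2.0 ** e for e in range(-8, 8)}


def is_small_immediate(value):
    """Return True if `value` can be encoded as a small immediate.

    Python ints are treated as integer constants, floats as float constants.
    """
    if isinstance(value, int):
        return value in SMALL_INTS
    return value in SMALL_FLOATS


for v in (0, 10, -16, 16, 256, 1.0, 2.0, 3.0):
    print(v, is_small_immediate(v))
```

Constants such as 0 and 10 from the listing pass the check and can be loaded with mov, whereas 256 does not and needs either ldi or a combination of small immediates.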

Fusion of writing VPM (WIP)
Before:
mov(r0, 0)
L.loop
mutex_acquire()
fadd(r1, uniform, 1.0)
iadd(r0, r0, 1).mov(vpm, r1)
wait_dma_store()
mutex_release()
isub(None, r0, 3, set_flags=True)
bne(L.loop)
nop(); nop(); nop()
After full unrolling:
fadd(ra0, uniform, 1.0)
mutex_acquire()
mov(vpm, ra0)
mov(vpm, ra1)
mov(vpm, ra2)
wait_dma_store()
mutex_release()

Hardware limitation
・Cache incoherency is a huge problem
・Register spilling
  ・Also problematic on other GPUs
・Effective TMU loads
  ・If the same region is both read and written, the results are wrong
  ・Using DMA instead discards parallelism entirely

Insufficient use of DMA
kernel void mul2(global float * a)
{
    int id = get_global_id(0);
    a[id] = a[id] * 2 + 1;
}
・Region a is both read and written
・Each element of a is read only once
・Actually, loading via TMU is safe here
・Does this require complex analysis…?
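
To give a feel for what such an analysis has to establish, here is a deliberately simplified sketch. It works on explicit per-work-item access lists, which a real compiler would have to derive from the kernel. The rule it applies (no read of a location written by another work-item, and no read after the same work-item's own write) is an assumption chosen for illustration, not VC4C's actual algorithm.

```python
def tmu_load_is_safe(accesses):
    """Decide whether cached (TMU) loads can be used for a read/write region.

    accesses: {work_item_id: [("r" or "w", index), ...]} in program order.
    Assumed rule: unsafe if any work-item reads a location that another
    work-item writes, or reads a location after writing it itself.
    """
    writers = {}
    for work_item, ops in accesses.items():
        for op, index in ops:
            if op == "w":
                writers.setdefault(index, set()).add(work_item)

    for work_item, ops in accesses.items():
        written_by_me = set()
        for op, index in ops:
            if op == "w":
                written_by_me.add(index)
            elif (writers.get(index, set()) - {work_item}) or index in written_by_me:
                return False
    return True


# mul2: work-item i reads a[i] once, then writes a[i].
mul2_accesses = {i: [("r", i), ("w", i)] for i in range(4)}

# A shifted variant: work-item i reads a[i + 1], which work-item i + 1 writes.
shifted_accesses = {i: [("r", i + 1), ("w", i)] for i in range(4)}

mul2_safe = tmu_load_is_safe(mul2_accesses)
shifted_safe = tmu_load_is_safe(shifted_accesses)
```

For mul2 the check passes, matching the observation above that a TMU load is safe there. The hard part in practice is proving the access pattern from index expressions.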

Complex iteration via OpenCL IDs
・Implicit loops (over IDs) are hard to convert to natural loops
・global_id + worker_id + local_id ……
・We want to eliminate such parameters through offline compilation
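
For reference, in one dimension OpenCL defines the global ID as the global offset plus group ID times local size plus local ID. The sketch below, with hypothetical helper names, shows that iterating over groups and local IDs visits the same indices as a single natural loop, which is the equivalence a compiler would like to recover.

```python
def global_id(group_id, local_size, local_id, global_offset=0):
    """1-D OpenCL relation: global_offset + group_id * local_size + local_id."""
    return global_offset + group_id * local_size + local_id


def ids_from_implicit_loops(num_groups, local_size, global_offset=0):
    """IDs visited when iterating over work-groups and the work-items inside them."""
    return [
        global_id(g, local_size, l, global_offset)
        for g in range(num_groups)
        for l in range(local_size)
    ]


ids = ids_from_implicit_loops(num_groups=3, local_size=4)
natural_loop = list(range(12))   # the equivalent single loop: for i in range(12)
```

When the sizes are known at compile time, the nested ID arithmetic collapses into one induction variable. When they arrive as runtime parameters, the compiler has to keep the general form.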

Fusion of kernels(WIP)
・Fusion of several kernels (GEMM + ReLU + bias, etc.)
・To reduce memory transfers
・Diesel (an NVIDIA compiler project)
reported the impact
[Figure from "Diesel: DSL for linear algebra and neural net computations on GPUs"]
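
The memory-traffic argument can be sketched in plain Python by counting how many output elements each variant stores. This is an illustration of the idea only, not VC4C or Diesel code. The function names are hypothetical and the store count stands in for real memory traffic.

```python
def dot(a, b, i, j):
    """Inner product of row i of a and column j of b."""
    return sum(a[i][p] * b[p][j] for p in range(len(b)))


def gemm_bias_relu_unfused(a, b, bias):
    """Three separate kernels: each one writes the full output matrix."""
    n, m = len(a), len(b[0])
    c = [[dot(a, b, i, j) for j in range(m)] for i in range(n)]       # GEMM
    c = [[c[i][j] + bias[j] for j in range(m)] for i in range(n)]     # bias
    c = [[max(c[i][j], 0.0) for j in range(m)] for i in range(n)]     # ReLU
    return c, 3 * n * m


def gemm_bias_relu_fused(a, b, bias):
    """One fused kernel: each output element is written exactly once."""
    n, m = len(a), len(b[0])
    c = [[max(dot(a, b, i, j) + bias[j], 0.0) for j in range(m)] for i in range(n)]
    return c, n * m


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, -10.0]

unfused_out, unfused_stores = gemm_bias_relu_unfused(a, b, bias)
fused_out, fused_stores = gemm_bias_relu_fused(a, b, bias)
```

Both variants produce the same result, but the fused one writes a third as many elements, which matters on a device with memory bandwidth as low as VC4's.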

Software pipelining(WIP)?
・Probably not effective…
・Due to the limited instruction cache
・We rerolled some kernels…

Conclusion?
・Introduced VC4
  ・Dual-issue in-order processor
  ・You can write its assembly freely
・Introduced VC4C
  ・Heavily under development
  ・Compiler lovers, here is an immature compiler!

Reference
・VideoCore® IV 3D Architecture Reference Guide
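・Raspberry PiのGPUで行列乗算(その1) (Matrix multiplication on the Raspberry Pi GPU, part 1)
・Raspberry PiのGPUで行列乗算(その2) (Matrix multiplication on the Raspberry Pi GPU, part 2)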
・Hacking the Raspberry Pi's VideoCore IV GPU
・GPU_FFT
・blog@ysugi
