VC4C: Development of OpenCL
Compiler for VideoCore4
Current status and challenges of developing an OSS OpenCL
compiler that uses the Raspberry Pi GPU
2018/11/10 Compiler Study Meetup @fixstars
Who am I
・光のインターネットの闇 (roughly "the darkness of the internet of light")
 @no_maddo
・Engineer at Idein
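Today's topics
- Introduction to VideoCore IV (hereafter VC4)
  - Architecture
  - Memory characteristics
- Introduction to VC4C
  - An OSS project by a developer in Germany, not by me
  - Apparently it began as a project for a master's thesis
- Considerations specific to VC4C
  - VC4 is tough
  - OpenCL is tough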
Idein’s technology
・Runs MobileNet v2 1.0 (224x224): 8.4 FPS ~= 140ms
Why can we achieve this performance?
・Draw out the maximum performance of VideoCore IV
・Hand-assembled parallel GPU code
・Runs only on the GPU
・No return during execution of the inference
・CPU usage is very low
・Even the Pi Zero (a $5 computer!!!) achieves this performance
For the details…
See our president's presentation
Why do we “try” the OSS compiler?
・We don’t use VC4C in production now
・Tuning assembly is “hard”
・Diesel, Tensor Comprehensions
・In the near future, we would be happy to write high-performance
mathematical kernels with a compiler…
Architecture Introduction
VC4 overview
QPU / Quad Processing Unit
・general-purpose register files A/B, 32 registers each (= 64 registers)
・accumulator registers r[0-3] (= 4 registers)
TMU / Texture and Memory Lookup Unit
Uniform cache
VPM / Vertex Pipe Memory
Efficient data transfer
Assembly Example: Hello World
mov(r0, uniform)          # load a value from the uniforms into r0
setup_vpm_write()         # prepare for a VPM write
mov(vpm, 1)               # write 1 row (16 elements) to the VPM
setup_dma_store(nrows=1)  # declare a DMA store of 1 row
mov(vpm_st_addr, r0)      # start the store to the address held in r0
wait_dma_store()          # wait for the DMA store to finish
exit()
See the repository: py-videocore
Ex: A = A * 2 + 1
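ldi(ra1, 64)                         # byte stride per iteration: 16 floats * 4 bytes
ldi(rb1, 16)                         # elements processed per iteration
mov(rb0, uniform)                    # 1st uniform: output address
mov(ra0, uniform)                    # 2nd uniform: number of elements
imul24(r1, element_number, 4)        # per-lane byte offset (lane index * 4)
iadd(r1, uniform, r1)                # 3rd uniform: input address, plus the lane offset
L.loop
iadd(r1, r1, ra1).mov(tmu0_s, r1)    # request a TMU load at r1 and advance r1 by 64 bytes
mutex_acquire()
setup_vpm_write(nrows=1)
nop(sig='load tmu0')                 # receive the TMU result in r4
fmul(r0, r4, 2.0)                    # r0 = a * 2
fadd(vpm, r0, 1.0)                   # write a * 2 + 1 to the VPM
setup_dma_store(nrows=1)
mov(vpm_st_addr, rb0)                # start the DMA store to the output address
wait_dma_store()
mutex_release()
isub(ra0, ra0, rb1, set_flags=True)  # remaining -= 16 and update the flags
jzc(L.loop)                          # loop again while the zero flag is clear
iadd(rb0, rb0, ra1); nop(); nop()    # branch delay slots: advance the output address
exit()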
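As a cross-check of what the loop computes, here is a rough pure-Python model of one QPU thread. It is only a sketch under assumptions: the name `mul2_thread` and the `lanes` parameter are made up for illustration, the element count is assumed to be a multiple of 16, and the TMU, VPM, DMA and mutex are not modeled at all. Only the 16-elements-per-iteration arithmetic and the loop counter are.

```python
def mul2_thread(a_row, lanes=16):
    """Toy model of one QPU thread computing a * 2 + 1, `lanes` elements per pass."""
    out = []
    remaining = len(a_row)   # plays the role of ra0 (element count from the uniforms)
    offset = 0               # plays the role of the advancing input/output pointers
    while True:
        chunk = a_row[offset:offset + lanes]       # one TMU load: 16 floats
        out.extend(x * 2.0 + 1.0 for x in chunk)   # fmul by 2.0, then fadd 1.0
        offset += lanes                            # pointers advance by 64 bytes
        remaining -= lanes                         # isub(ra0, ra0, rb1)
        if remaining == 0:                         # jzc: keep looping while non-zero
            break
    return out


row = [float(i) for i in range(128)]   # 128 elements per thread, as in the driver code
result = mul2_thread(row)
print(result[:4])
```

With 128 elements per thread the loop body runs 8 times, which matches the stride of 64 bytes and the decrement of 16 in the assembly.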
Flow of execution
・allocate GPU memory
・build uniforms
・for each thread
・run driver
with Driver() as drv:
    n_threads = 12
    r = drv.alloc((n_threads, 128), 'float32')
    a = drv.alloc((n_threads, 128), 'float32')
    # ……… (elided on the slide)
    code = drv.program(mul2)
    uniforms = drv.alloc((n_threads, 3), 'uint32')
    uniforms[:, 0] = r.address()[:, 0]
    uniforms[:, 1] = 128
    uniforms[:, 2] = a.address()[0][0]
    drv.execute(n_threads=n_threads,
                program=code, uniforms=uniforms)
performance example: qmkl
$ sudo ./qmkl/test/sgemm 224 224 224
GPU: 6.17614e+09 [flop/s]
CPU: 9.78483e+08 [flop/s]
NEON: 1.06783e+09 [flop/s]
https://github.com/idein/qmkl
・mathematical kernels using VC4
・Note: no-trans, no-trans
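To put the three sgemm figures side by side, the following sketch derives the speedups and a rough time per call. It assumes the usual 2*M*N*K flop count for a matrix product, which the slide does not state, and the variable names are illustrative only.

```python
# Throughput figures copied from the sgemm 224 224 224 run, in flop/s.
gpu, cpu, neon = 6.17614e9, 9.78483e8, 1.06783e9

# Assumption: an M x N x K matrix product costs 2*M*N*K floating-point
# operations (one multiply and one add per inner-product term).
m = n = k = 224
flops = 2 * m * n * k

speedup_vs_cpu = gpu / cpu
speedup_vs_neon = gpu / neon
gpu_time_ms = flops / gpu * 1e3

print(f"GPU vs CPU : {speedup_vs_cpu:.2f}x")
print(f"GPU vs NEON: {speedup_vs_neon:.2f}x")
print(f"GPU time per sgemm call: {gpu_time_ms:.2f} ms")
```

Under that assumption the GPU is roughly six times faster than either CPU path for this problem size.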
Performance issue
・low memory bandwidth:
  ・4.48 GBPS vs. 98 GBPS in my computer...
・TMU latency (cycles):
  ・TMU cache hit: 9
  ・L2 cache hit: 12
  ・Memory: 20 (if v3d_freq=250 [MHz])
・cache incoherency
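The cycle counts are easier to compare with other hardware once converted to time. This small sketch does the conversion at the stated 250 MHz clock and also computes the bandwidth gap. The dictionary keys and variable names are illustrative, and the numbers are taken from the slide as-is.

```python
V3D_FREQ_HZ = 250e6  # v3d_freq = 250 MHz, as stated on the slide

latency_cycles = {"TMU cache hit": 9, "L2 cache hit": 12, "Memory": 20}

# cycles / frequency gives seconds; scale to nanoseconds.
latency_ns = {name: cycles / V3D_FREQ_HZ * 1e9
              for name, cycles in latency_cycles.items()}

# Ratio between the desktop bandwidth and the VC4 bandwidth quoted above.
bandwidth_ratio = 98 / 4.48

for name, ns in latency_ns.items():
    print(f"{name}: {ns:.0f} ns")
print(f"bandwidth gap: {bandwidth_ratio:.1f}x")
```

At 250 MHz one cycle is 4 ns, so a miss all the way to memory costs about 80 ns per TMU request in this model.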
Cache incoherency
[Figure: diagram with numbered steps ① to ⑤; not recoverable from the extracted text]
VC4C: OpenCL compiler for VC4
Recap: OpenCL
・Parallel programming framework for
heterogeneous computing (GPU, DSP, FPGA, etc...)
・Supports the data-parallel computing model
kernel void mul2(global float * a)
{
    int id = get_global_id(0);
    a[id] = a[id] * 2 + 1;
}
Host program
clCreateContext
clCreateProgramWithSource    (compile at runtime)
clCreateBuffer
clEnqueueWriteBuffer
global_item_size = { 4, 8, 12 };
clEnqueueNDRangeKernel       (enqueue kernel)
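To make the data-parallel model concrete without an OpenCL device, here is a minimal Python emulation of enqueueing `mul2` over a 1-D range. It is a sketch only: `enqueue_nd_range_1d` is a hypothetical helper, not an OpenCL API, work-items run sequentially rather than in parallel, and only dimension 0 is modeled because the kernel only calls `get_global_id(0)`.

```python
def mul2(a, global_id):
    """Body of the mul2 kernel for a single work-item."""
    a[global_id] = a[global_id] * 2 + 1


def enqueue_nd_range_1d(kernel, global_size, *args):
    """Run `kernel` once per work-item, one after another, in place of a device."""
    for gid in range(global_size):
        kernel(*args, gid)


a = [0.0, 1.0, 2.0, 3.0]
enqueue_nd_range_1d(mul2, len(a), a)
print(a)
```

Each work-item touches exactly one element, so the order in which work-items run does not matter here.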
VC4C Overview
OpenCL runtime
offline compiler
Asm structure
kernel void mul2(global float * a) {
    int id = get_global_id(0);
    a[id] = a[id] * 2 + 1;
}
・makes an implicit loop
・OpenCL parameters are
passed via uniforms
・The loop-exit condition is passed via
a uniform
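The three points can be pictured with a toy Python model in which the uniforms are a stream read strictly in order. The layout used here (base index, work-item id, repeat flag) is a hypothetical one chosen for illustration and is not claimed to be VC4C's actual uniform layout. `run_qpu_kernel` is likewise a made-up name.

```python
def run_qpu_kernel(memory, uniforms):
    """Toy model: kernel arguments and loop control both arrive via uniforms."""
    stream = iter(uniforms)          # uniforms are consumed strictly in order
    while True:                      # the implicit loop around the kernel body
        base = next(stream)          # kernel argument: start index of `a`
        global_id = next(stream)     # work-item id for this pass
        memory[base + global_id] = memory[base + global_id] * 2 + 1
        repeat = next(stream)        # loop-exit flag: 0 means stop
        if not repeat:
            break


memory = [0.0, 1.0, 2.0]
# Three passes; the last one carries repeat = 0 to leave the loop.
uniforms = [0, 0, 1,
            0, 1, 1,
            0, 2, 0]
run_qpu_kernel(memory, uniforms)
print(memory)
```

The point of the sketch is that the kernel itself contains no loop bound: the host decides how many passes run by what it writes into the uniforms.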
VC4C demo
Let’s check the output….
Current status: works if registers are enough
・Works fine if register-allocation is successful
・Lack of register-spilling
・Performance issue
・better instruction scheduling
・adjust clang loop-optimizations for VC4
・innermost loop unrolling
・improve DMA transportation
・auto-vectorization
Implementation issue
VC4 specific optimization
immediate
・To load a 32-bit constant, ldi is required
・Dealing with constants is costly
・moveConstantload moves ldi out of loops
・But this increases register pressure…
Before:
ldi(r0, 0)
ldi(r2, 10)
L.loop
ldi(r1, 256)
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flag=True)
bgt(L.loop)
nop(); nop(); nop()
After:
ldi(r0, 0)
ldi(r2, 10)
ldi(r1, 256)
L.loop
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flag=True)
bgt(L.loop)
nop(); nop(); nop()
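The transformation is ordinary loop-invariant code motion restricted to constant loads. The sketch below applies it to a toy instruction list. It is not VC4C's implementation: the tuple-based IR, the `label`/`branch` pseudo-instructions and the function name `hoist_constant_loads` are assumptions made for this example, and it handles a single loop only.

```python
def hoist_constant_loads(program):
    """Move loop-invariant `ldi` instructions in front of the loop label.

    Instructions are tuples: (opcode, destination, operand, ...).
    """
    start = next(i for i, ins in enumerate(program) if ins[0] == "label")
    end = next(i for i, ins in enumerate(program) if ins[0] == "branch")
    body = program[start + 1:end]

    hoisted, kept = [], []
    for idx, ins in enumerate(body):
        dst = ins[1]
        # Safe to hoist only if nothing else in the loop writes the register...
        written_once = sum(1 for other in body if other[1] == dst) == 1
        # ...and nothing reads it before the load on the first iteration.
        read_earlier = any(dst in earlier[2:] for earlier in body[:idx])
        if ins[0] == "ldi" and written_once and not read_earlier:
            hoisted.append(ins)
        else:
            kept.append(ins)
    return program[:start] + hoisted + [program[start]] + kept + program[end:]


before = [
    ("ldi", "r0", 0),
    ("ldi", "r2", 10),
    ("label", "loop"),
    ("ldi", "r1", 256),
    ("iadd", "r0", "r0", "r1"),
    ("isub", "r2", "r2", 1),
    ("branch", "loop"),
]
after = hoist_constant_loads(before)
for ins in after:
    print(ins)
```

After hoisting, `r1` stays live across the whole loop instead of only inside one iteration, which is the register-pressure cost mentioned in the bullets.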
small immediate
・In place of the regfile B (rb) field, a limited set of immediates can be encoded
・-16~15, 1.0, 2.0, 4.0, …
・By combining them, some immediates can be produced with an ALU instruction
Before:
ldi(r0, 0)
ldi(r2, 10)
L.loop
ldi(r1, 256)
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flag=True)
bgt(L.loop)
nop(); nop(); nop()
After:
mov(r0, 0)
mov(r2, 10)
imul24(r1, -16, -16)
L.loop
iadd(r0, r0, r1)
isub(r2, r2, 1, set_flag=True)
bgt(L.loop)
nop(); nop(); nop()
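Finding such a combination can be done by brute force over the small integer immediates. The sketch below is a hypothetical search, not VC4C's table. It models the ALU operations as ordinary Python integer arithmetic, which is an assumption: operand widths and any hardware truncation are not modeled, and only the integer range -16 to 15 from the slide is searched.

```python
from itertools import product

SMALL_INTS = range(-16, 16)  # integer small immediates: -16 .. 15

OPS = {
    "iadd": lambda x, y: x + y,
    "isub": lambda x, y: x - y,
    # Assumption: plain integer product; operand width is not modeled.
    "imul24": lambda x, y: x * y,
}


def find_small_immediate_combo(target):
    """Return (opcode, x, y) whose result equals `target`, or None if none exists."""
    for name, fn in OPS.items():
        for x, y in product(SMALL_INTS, repeat=2):
            if fn(x, y) == target:
                return name, x, y
    return None


print(find_small_immediate_combo(256))
print(find_small_immediate_combo(1000))
```

Constants for which the search returns `None` would still need `ldi` in this model.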
Fusion of writing VPM (WIP)
Before:
mov(r0, 0)
L.loop
setup_vpm_write(nrows=1)
mutex_acquire()
fadd(r1, uniform, 1.0)
iadd(r0, r0, 1).mov(vpm, r1)
mov(vpm_st_addr, rb0)
wait_dma_store()
mutex_release()
isub(None, r0, 3, set_flag=True)
bne(L.loop)
nop(); nop(); nop()
After (full unrolling):
setup_vpm_write(nrows=3)
fadd(ra0, uniform, 1.0)
fadd(ra1, uniform, 1.0)
fadd(ra2, uniform, 1.0)
mutex_acquire()
mov(vpm, ra0)
mov(vpm, ra1)
mov(vpm, ra2)
mov(vpm_st_addr, rb0)
wait_dma_store()
mutex_release()
Hardware limitation
・Cache incoherency is a huge problem
・Register spilling
  ・also problematic on other GPUs
・Effective TMU loads
  ・If the same region is both read and written, the result can be wrong
  ・Using DMA discards parallelism entirely
Insufficient use of DMA
kernel void mul2(global float * a)
{
int id = get_global_id(0);
a[id] = a[id] * 2 + 1;
}
・region a is both read and written
・a is read just once
・Actually, loading via TMU is safe
・requires complex analysis…???
Complex iteration via OpenCL IDs
・implicit loops (by IDs) are hard to convert to natural loops
・global_id + worker_id + local_id ……
・we want to remove such parameters by offline compilation
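For reference, OpenCL defines the global ID in one dimension as group ID times local size, plus local ID, plus the global offset. The sketch below shows, for the 1-D case only, that iterating groups and local IDs visits the same indices as a single natural loop. The function names are illustrative, and real kernels that mix several ID functions are harder to rewrite than this.

```python
def global_ids_nested(num_groups, local_size, global_offset=0):
    """Enumerate global IDs the way the ID functions define them."""
    ids = []
    for group_id in range(num_groups):
        for local_id in range(local_size):
            ids.append(group_id * local_size + local_id + global_offset)
    return ids


def global_ids_flat(num_groups, local_size, global_offset=0):
    """The same index set written as one natural loop."""
    return list(range(global_offset, global_offset + num_groups * local_size))


print(global_ids_nested(3, 4))
print(global_ids_flat(3, 4))
```

When the sizes are known at offline-compile time, the flat form needs no ID parameters at all, which is the simplification the last bullet asks for.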
Fusion of kernels (WIP)
・Fusion of some kernels (GEMM + ReLU + bias, etc…)
・For reducing memory transfer
・Diesel (NVIDIA compiler project)
reported the impact
from Diesel: DSL for linear algebra and neural net computations on GPUs
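The memory-transfer argument can be seen in a small pure-Python sketch. The unfused version materializes a full intermediate matrix after each step, while the fused version applies bias and ReLU to each accumulator before it is stored. All function names are illustrative, and applying the bias before the ReLU is a choice made for this sketch, not something the slide specifies.

```python
def gemm(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def add_bias(x, bias):
    return [[v + bias[j] for j, v in enumerate(row)] for row in x]


def relu(x):
    return [[max(v, 0.0) for v in row] for row in x]


def gemm_bias_relu_fused(a, b, bias):
    """One pass: each output element is finished before it is written out."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            acc = sum(a[i][k] * b[k][j] for k in range(inner))
            row.append(max(acc + bias[j], 0.0))
        out.append(row)
    return out


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, -10.0]

unfused = relu(add_bias(gemm(a, b), bias))   # two intermediate matrices
fused = gemm_bias_relu_fused(a, b, bias)     # no intermediate matrices
print(unfused)
print(fused)
```

On a device with low memory bandwidth, skipping the two intermediate matrices removes two full write and read round trips over the output.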
Software pipelining (WIP)?
・Probably not effective…
・Due to the instruction cache limitation
・We rerolled some kernels…..
Conclusion?
・Introduced VC4
  ・Dual-issue in-order processor
  ・You can write its assembly freely
・Introduced VC4C
  ・heavily under development
  ・compiler lovers, here is an immature compiler!!!!

Vc4c development of opencl compiler for videocore4

  • 1. VC4C: Development of OpenCL Compiler for VideoCore4 RaspberryPiのGPUを使うOSS OpenCL コンパイラ開発の現状と課題 2018/11/10 コンパイラ勉強会@fixstars
  • 3. 本日のトピック - VideoCore IV(以下VC4)の紹介 - Architecture - Memory characteristics - VC4Cの紹介 - ドイツの方のOSSプロジェクト、私じゃないよ - Master論文のためのプロジェクトだったらしい - VC4Cならではの考慮点 - VC4きつい - OpenCLきつい
  • 4. Idein’s technology ・Execute mobilenet v2 1.0 224x224: 8.4 FPS ~= 140ms
  • 5. Why we can archive the performance? ・Performance maximam performance of VideoCore IV ・Hand-assembling parallel GPU code ・Run only on GPU ・No return during execution of the inference ・CPU usage is very low ・Pi Zero ($ 5 computer!!!) archive the performance
  • 6. For the detail….. See our president presentation
  • 7. Why we “try” the OSS compiler? ・We don’t use VC4C in production now ・Tunning assembly is “hard” ・Diesel, TensorComprehension ・In near future, happy to write good performance mathematical kernels in compiler…...
  • 11. QPU / Quad Processing Unit
  • 12. QPU / Quad Processing Unit ・general purpose register A/B x32 (=64 registers) ・accumulator register r[0-3] (= 4 registers)
  • 14. TMU / Texture and Memory Lookup Unit
  • 15. TMU / Texture and Memory Lookup Unit
  • 16. TMU / Texture and Memory Lookup Unit
  • 22. VPM / Vertex Pipe Memory
  • 23. VPM / Vertex Pipe Memory
  • 25. Assembly Example: Hello World mov(r0, uniform) # load from uniform & set it to `r0 setup_vpm_write() # prepare for vpm write mov(vpm, 1) # write 1 row (16 elements) to vpm setup_dma_store(nrows=1)# declaration to output 1 row mov(vpm_st_addr, r0) # start write to the address of `r0 wait_dma_store() # sync dma_store exit() See the repository: py-videocore
  • 26. Ex: A = A * 2 + 1 ldi(ra1, 64) ldi(rb1, 16) mov(rb0, uniform) mov(ra0, uniform) imul24(r1, element_number,4) iadd(r1, uniform, r1) L.loop iadd(r1, r1, ra1).mov(tmu0_s, r1) mutex_acquire() setup_vpm_write(nrows=1) nop(sig=’load tmu0’) fmul(r0, r4, 2.0) fadd(vpm, r0, 1.0) setup_dma_store(nrows=1) mov(vpm_st_addr, rb0) wait_dma_store() mutex_release() isub(ra0, ra0, rb1, set_flags=True) jzc(L.loop) iadd(rb0, rb0, ra1); nop(); nop() exit()
  • 27. Flow of execution ・allocate GPU memory ・build uniforms ・for each thread ・run driver with Driver () as drv : n_threads = 12 r = drv.alloc((n_threads, 128), ’float32’) a = drv.alloc((n_threads, 128), ’float32’) ……… code = drv.program(mul2) uniforms = drv.alloc((n_threads, 3), ‘uint32’) uniforms[:, 0] = r.address()[:, 0] uniforms[:, 1] = 128 uniforms[:, 2] = a.address()[0][0] drv.execute(n_threads=n_threads, program=code, uniforms=uniforms)
  • 28. Performance example: qmkl ・mathematical kernels using VC4 ・https://github.com/idein/qmkl
      $ sudo ./qmkl/test/sgemm 224 224 224
      GPU: 6.17614e+09 [flop/s]
      CPU: 9.78483e+08 [flop/s]
      NEON: 1.06783e+09 [flop/s]
    Caution: no-trans, no-trans
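To put the throughput figures above in perspective, the following sketch converts them into a time per call and into ratios. It assumes the conventional count of 2·M·N·K floating-point operations for a matrix product; that convention and the helper names are assumptions for illustration and are not taken from qmkl.

```python
# Throughput figures copied from the sgemm output above, in flop/s.
RATES = {"GPU": 6.17614e9, "CPU": 9.78483e8, "NEON": 1.06783e9}


def sgemm_flops(m, n, k):
    """Assumed operation count: one multiply and one add per inner-product term."""
    return 2 * m * n * k


def seconds_per_call(rate, m=224, n=224, k=224):
    """Time for one M x N x K product at the given throughput."""
    return sgemm_flops(m, n, k) / rate


if __name__ == "__main__":
    for name, rate in RATES.items():
        ms = seconds_per_call(rate) * 1e3
        ratio = RATES["GPU"] / rate
        print(f"{name}: {ms:.2f} ms per call (GPU/{name} throughput ratio {ratio:.2f})")
```

Under that assumption, the GPU figure corresponds to a few milliseconds per 224×224×224 product, roughly six times faster than either CPU path.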
  • 29. Performance issues ・Low memory bandwidth: ・4.48 GBPS vs. 98 GBPS on my computer... ・TMU latency (cycles): ・TMU cache hit: 9 ・L2 cache hit: 12 ・Memory: 20 (if v3d_freq=250 [MHz]) ・Cache incoherency
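The latencies above are given in cycles at v3d_freq=250 MHz. As a small worked conversion, the sketch below turns them into nanoseconds and computes the bandwidth gap; the function names are made up for this illustration.

```python
V3D_FREQ_HZ = 250e6  # v3d_freq=250 MHz, as stated on the slide

# Latencies in cycles, copied from the slide.
LATENCY_CYCLES = {"TMU cache hit": 9, "L2 cache hit": 12, "Memory": 20}


def cycles_to_ns(cycles, freq_hz=V3D_FREQ_HZ):
    """Convert a cycle count into nanoseconds at the given clock frequency."""
    return cycles / freq_hz * 1e9


def bandwidth_ratio(fast=98.0, slow=4.48):
    """How many times larger the desktop bandwidth is than the VC4 figure."""
    return fast / slow


if __name__ == "__main__":
    for name, cycles in LATENCY_CYCLES.items():
        print(f"{name}: {cycles} cycles = {cycles_to_ns(cycles):.0f} ns")
    print(f"bandwidth gap: {bandwidth_ratio():.1f}x")
```

At 250 MHz one cycle is 4 ns, so even a load that misses both caches costs on the order of 80 ns; the bandwidth gap of roughly 22x is the larger constraint.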
  • 34. Recap: OpenCL ・Parallel programming framework for heterogeneous computing (GPU, DSP, FPGA, etc.) ・Supports the data-parallel computing model
      kernel void mul2(global float * a) {
          int id = get_global_id(0);
          a[id] = a[id] * 2 + 1;
      }
    Host program (compile at runtime, then enqueue kernel): clCreateContext, clCreateProgramWithSource, clCreateBuffer, clEnqueueWriteBuffer, global_item_size = { 4, 8, 12 };, clEnqueueNDRangeKernel
  • 37. Asm structure ・An implicit loop is generated ・OpenCL parameters are passed via uniforms ・The loop-exit condition is passed via uniforms
      kernel void mul2(global float * a) {
          int id = get_global_id(0);
          a[id] = a[id] * 2 + 1;
      }
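The structure described above can be modelled in plain Python as a toy: the kernel body is wrapped in a loop, and both the kernel argument and the loop-exit flag are read from a uniform stream. The uniform layout used here (argument first, then one global ID and one repeat flag per iteration) is a hypothetical layout chosen for illustration, not VC4C's actual one, and at least one work-item is assumed.

```python
def mul2_body(memory, base, global_id):
    """Body of the OpenCL kernel: a[id] = a[id] * 2 + 1."""
    memory[base + global_id] = memory[base + global_id] * 2 + 1


def build_uniforms(base, global_ids):
    """Hypothetical layout: the kernel argument once, then for every
    iteration a global ID followed by a repeat flag (0 on the last one)."""
    uniforms = [base]
    for i, gid in enumerate(global_ids):
        uniforms += [gid, 1 if i + 1 < len(global_ids) else 0]
    return uniforms


def run(memory, uniforms):
    stream = iter(uniforms)      # uniforms are consumed strictly in order
    base = next(stream)          # kernel parameter passed via uniform
    while True:                  # the loop that is implicit in OpenCL
        global_id = next(stream)
        mul2_body(memory, base, global_id)
        if next(stream) == 0:    # loop-exit flag passed via uniform
            break
    return memory


if __name__ == "__main__":
    print(run([0.0, 1.0, 2.0, 3.0], build_uniforms(0, range(4))))
```

The point of the model is that the kernel source contains no loop at all; the iteration and its termination are driven entirely by what the host places in the uniforms.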
  • 38. VC4C demo Let’s check the output….
  • 39. Current status: works if there are enough registers ・Works fine if register allocation succeeds ・Register spilling is not implemented ・Performance issues ・better instruction scheduling ・adjust clang loop optimizations for VC4 ・innermost loop unrolling ・improve DMA transfers ・auto-vectorization Implementation Issue
  • 41. immediate ・To load 32-bit constants, ldi is required ・Dealing with constants is costly ・moveConstantload moves ldi out of loops ・But it increases register pressure…
    Before:
      ldi(r0, 0)
      ldi(r2, 10)
      L.loop
      ldi(r1, 256)
      iadd(r0, r0, r1)
      isub(r2, r2, 1, set_flag=True)
      bgt(L.loop)
      nop(); nop(); nop()
    After:
      ldi(r0, 0)
      ldi(r2, 10)
      ldi(r1, 256)
      L.loop
      iadd(r0, r0, r1)
      isub(r2, r2, 1, set_flag=True)
      bgt(L.loop)
      nop(); nop(); nop()
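The transformation above is a form of loop-invariant code motion restricted to constant loads. A minimal sketch of such a pass over a toy instruction list follows; the tuple-based IR, the function name, and the assumption that the loop runs from its label to the end of the list are all made up for this example and do not reflect VC4C's internals.

```python
def hoist_constant_loads(instrs, loop_label):
    """Move `ldi` instructions out of the loop starting at `loop_label`.

    Toy IR: ("label", name), ("branch", cond, target) or (op, dst, *srcs).
    Assumption: the loop body extends from the label to the end of the list.
    """
    start = instrs.index(("label", loop_label))
    preheader, body = instrs[:start], instrs[start + 1:]

    def is_alu(ins):
        return ins[0] not in ("label", "branch", "nop")

    # Count how often each register is written inside the loop.
    writes = {}
    for ins in body:
        if is_alu(ins):
            writes[ins[1]] = writes.get(ins[1], 0) + 1

    hoisted, kept, read_so_far = [], [], set()
    for ins in body:
        # Hoisting is only safe if this ldi is the sole writer of its
        # register in the loop and the register was not read before it.
        if ins[0] == "ldi" and writes[ins[1]] == 1 and ins[1] not in read_so_far:
            hoisted.append(ins)
            continue
        if is_alu(ins):
            read_so_far.update(s for s in ins[2:] if isinstance(s, str))
        kept.append(ins)
    return preheader + hoisted + [("label", loop_label)] + kept


PROGRAM = [
    ("ldi", "r0", 0),
    ("ldi", "r2", 10),
    ("label", "L.loop"),
    ("ldi", "r1", 256),
    ("iadd", "r0", "r0", "r1"),
    ("isub", "r2", "r2", 1),
    ("branch", "gt", "L.loop"),
]

if __name__ == "__main__":
    for ins in hoist_constant_loads(PROGRAM, "L.loop"):
        print(ins)
```

The hoisted register stays live across the whole loop, which is exactly the register-pressure cost the slide mentions.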
  • 42. small immediate ・In place of the rb regfile field, a limited set of immediates can be encoded ・-16 to 15, 1.0, 2.0, 4.0, … ・By combining them, some immediates can be produced with an ALU instruction
    Before:
      ldi(r0, 0)
      ldi(r2, 10)
      L.loop
      ldi(r1, 256)
      iadd(r0, r0, r1)
      isub(r2, r2, 1, set_flag=True)
      bgt(L.loop)
      nop(); nop(); nop()
    After:
      mov(r0, 0)
      mov(r2, 10)
      imul24(r1, -16, -16)
      L.loop
      iadd(r0, r0, r1)
      isub(r2, r2, 1, set_flag=True)
      bgt(L.loop)
      nop(); nop(); nop()
  • 43. Fusion of VPM writes (WIP)
    Before:
      mov(r0, 0)
      L.loop
      setup_vpm_write(nrows=1)
      mutex_acquire()
      fadd(r1, uniform, 1.0)
      iadd(r0, r0, 1).mov(vpm, r1)
      mov(vpm_st_addr, rb0)
      wait_dma_store()
      mutex_release()
      isub(None, r0, 3, set_flag=True)
      bne(L.loop)
      nop(); nop(); nop()
    After full unrolling:
      setup_vpm_write(nrows=3)
      fadd(ra0, uniform, 1.0)
      fadd(ra1, uniform, 1.0)
      fadd(ra2, uniform, 1.0)
      mutex_acquire()
      mov(vpm, ra0)
      mov(vpm, ra1)
      mov(vpm, ra2)
      mov(vpm_st_addr, rb0)
      wait_dma_store()
      mutex_release()
  • 44. Hardware limitations ・Cache incoherency is a huge problem ・Register spilling ・also problematic on other GPUs ・Effective TMU loads ・If the same region is both read and written, the results are wrong ・Using DMA instead discards parallelism altogether
  • 45. Insufficient use of DMA ・Region a is both read and written ・a is read just once ・Actually, loading via TMU is safe ・This requires complex analysis…???
      kernel void mul2(global float * a) {
          int id = get_global_id(0);
          a[id] = a[id] * 2 + 1;
      }
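To illustrate the kind of analysis the slide alludes to, here is a deliberately simplified check over per-work-item access sets. It assumes each work-item performs all of its reads before any of its writes, so that a load can only observe stale data if some other work-item writes the same element. The data layout and function names are hypothetical; this is not VC4C's analysis.

```python
def tmu_load_is_safe(accesses):
    """accesses maps work_item_id -> (reads, writes), both sets of element
    indices. Returns True when no work-item reads an element that a
    different work-item writes.

    Assumption: within a work-item, reads happen before writes.
    """
    writers = {}
    for item, (_, writes) in accesses.items():
        for idx in writes:
            writers.setdefault(idx, set()).add(item)
    for item, (reads, _) in accesses.items():
        for idx in reads:
            if writers.get(idx, set()) - {item}:
                return False  # another work-item may overwrite what we read
    return True


def mul2_accesses(n):
    """Access sets of the mul2 kernel: work-item i reads and writes only a[i]."""
    return {i: ({i}, {i}) for i in range(n)}


def shifted_accesses(n):
    """Counter-example: work-item i reads a[i + 1] but writes a[i]."""
    return {i: ({i + 1}, {i}) for i in range(n)}


if __name__ == "__main__":
    print(tmu_load_is_safe(mul2_accesses(16)))     # mul2 touches disjoint elements
    print(tmu_load_is_safe(shifted_accesses(16)))  # neighbours overlap
```

For a real kernel the access sets are not known up front, since they depend on index expressions built from the OpenCL IDs, which is where the complexity arises.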
  • 46. Complex iteration via OpenCL IDs ・Implicit loops (driven by IDs) are hard to convert to natural loops ・global_id + worker_id + local_id …… ・We want to remove such parameters through offline compilation
  • 47. Fusion of kernels (WIP) ・Fuse several kernels (GEMM + ReLU + bias, etc.) ・To reduce memory transfers ・Diesel (an NVIDIA compiler project) reported the impact (from “Diesel: DSL for linear algebra and neural net computations on GPUs”)
  • 48. Software pipelining (WIP)? ・Probably not effective… ・due to the instruction cache limitation ・We re-rolled some kernels…
  • 49. Conclusion? ・Introduced VC4 ・Dual-issue in-order processor ・You can write its assembly freely ・Introduced VC4C ・Heavily under development ・Compiler lovers, here is an immature compiler!
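  • 50. References ・VideoCore® IV 3D Architecture Reference Guide ・Raspberry PiのGPUで行列乗算(その1) [Matrix multiplication on the Raspberry Pi GPU, part 1] ・Raspberry PiのGPUで行列乗算(その2) [Matrix multiplication on the Raspberry Pi GPU, part 2] ・Hacking the Raspberry Pi's VideoCore IV GPU ・GPU_FFT ・blog@ysugi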