17. Host
(diagram: the OpenCL object model, with a card-game analogy)
● Platform (the type of game)
● Devices (the players)
● Context (the table)
● Programs (the deck of cards)
● Kernel (your hand)
● Command Queue (your hands, literally)
Typical host-side flow:
1. Query / get the Platform (via the OpenCL SDK)
2. Query / get the Device
3. Create a Context
4. Create Programs (the files where your CL code exists)
5. Build the Program
6. Create a Kernel by specifying a function name, e.g. “SayHello”
7. Create a Command Queue
8. Enqueue the kernel-execution command
9. Get the result back
18. A Kernel is
● A function that can be executed on a single core (data parallel or task parallel)
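The data-parallel case can be sketched in plain Python: the same function (the “kernel”) is invoked once per work item, and each invocation differs only in its global ID. On a real device these invocations run in parallel, one per core/lane; the sequential loop here only stands in for the device's scheduler:

```python
# The "kernel": one invocation per work item, parameterized by global id.
def kernel(gid, a, b, out):
    out[gid] = a[gid] + b[gid]

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
out = [0] * 4
for gid in range(4):   # the device launches these concurrently
    kernel(gid, a, b, out)
print(out)  # [11, 22, 33, 44]
```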
20. Setup Environment
1. OpenCL ICD Runtime
2. Device OpenCL Driver
3. Python venv
4. PyOpenCL libs
5. Run Samples
(diagram arrows: Windows starts from here; Mac OS X starts from here; Linux starts from here)
22. Ubuntu 16.04 / Intel OpenCL SDK 2017
0. Install driver
a. Scripts to download & patch the Linux 4.7 kernel (~10 GB of disk space needed)
b. The default 4.8 kernel of Ubuntu 16.04.2 works fairly well, but without certain core features, namely OpenCL 2.x device-side enqueue, shared virtual memory, and VTune GPU support.
1. Install SDK prerequisites & SDK (experimental & optional)
a. Scripts to install the SDK prerequisites.
b. Install SDK for OpenCL 2017_7.0.0.2511 via install_GUI.sh
https://software.intel.com/en-us/articles/sdk-for-opencl-gsg
23. Windows 10 / Intel OpenCL
2. Download and install the SDK:
○ https://software.intel.com/en-us/intel-opencl/download
24. Windows 10 / Nvidia OpenCL
2. Download and install the Nvidia driver from the official website:
○ http://www.nvidia.com/Download/index.aspx
○ For some old models, we need to install CUDA toolkits
■ https://developer.nvidia.com/cuda-downloads
54. Memory Model (v1.2)

Region     | Host           | Kernel
Global     | DA, R/W        | NA, R/W
Constant   | DA, R/W        | SA, RO
Local      | DA, no access  | SA, R/W
Private    | NA, no access  | SA, R/W

DA: Dynamic Allocation
NA: No Allocation
SA: Static Allocation
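The four rows of the table correspond to the four address-space qualifiers of OpenCL C. A hypothetical kernel annotated with each (the names `demo`, `data`, and `scratch` are illustrative; the host allocates the Global buffer dynamically and only sets the size of the Local one):

```python
# OpenCL C source annotated with the address spaces from the table.
KERNEL_SRC = """
__constant float scale = 2.0f;               // Constant: kernel read-only
__kernel void demo(__global float *data,     // Global: host DA, kernel R/W
                   __local  float *scratch)  // Local: host no access, kernel R/W
{
    float tmp = data[get_global_id(0)] * scale;  // Private: per work item
    scratch[get_local_id(0)] = tmp;
    barrier(CLK_LOCAL_MEM_FENCE);
    data[get_global_id(0)] = scratch[get_local_id(0)];
}
"""
print(all(q in KERNEL_SRC
          for q in ("__global", "__constant", "__local")))
```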
71. Example
● 4-1 Global / local work items & work groups
● 4-1-ext - How global / local work size affects performance
● 4-2 Accelerated histogram
○ Global: 7680 * 4320,
○ Local: 64, 1
● 4-3 k-means clustering
72. Example 4-1 Define Work Groups
● Global: 3, 2
● Local: 2, 1
● Offset: None
Important: make sure that the resulting work-group size and the work-item size in each dimension do not exceed the device limits!!
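That check can be sketched as follows. The device limits here are illustrative; on a real device query CL_DEVICE_MAX_WORK_GROUP_SIZE and CL_DEVICE_MAX_WORK_ITEM_SIZES (in PyOpenCL: `device.max_work_group_size`, `device.max_work_item_sizes`):

```python
from math import prod

def local_size_ok(local_size, max_group_size, max_item_sizes):
    """True iff a work group of this shape fits the device limits."""
    return (prod(local_size) <= max_group_size and
            all(l <= m for l, m in zip(local_size, max_item_sizes)))

# Illustrative device limits:
MAX_GROUP = 256
MAX_ITEMS = (256, 256, 64)

print(local_size_ok((2, 1), MAX_GROUP, MAX_ITEMS))     # True
print(local_size_ok((128, 32), MAX_GROUP, MAX_ITEMS))  # False: 128*32 = 4096 > 256
```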
73. Example 4-1-ext Work-Group Size vs. Performance
● 1st round
○ Global: 7680 * 4320,
○ Local: 1,
● 2nd round
○ Global: 7680, 4320,
○ Local: 128, 32
74. Example 4-2: Accelerated Histogram
(diagram: the global range divided into work groups of 64 work items each)
75. Example 4-2: Accelerated Histogram
● Total pixels: 7,680 * 4,320 = 33,177,600 (pixels)
● Pixels per work item: 256
● Total work items: 33,177,600 / 256 = 129,600
● Work items per work group: 64
● Work groups: 129,600 / 64 = 2,025
● Pixels per work group: 33,177,600 / 2,025 = 16,384 (pixels)
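The partitioning arithmetic, spelled out (note that pixels per work group also equals pixels per item times items per group, 256 * 64):

```python
total_pixels = 7680 * 4320                 # 33,177,600 pixels
pixels_per_item = 256
items_per_group = 64

total_items = total_pixels // pixels_per_item         # 129,600 work items
work_groups = total_items // items_per_group          # 2,025 work groups
pixels_per_group = pixels_per_item * items_per_group  # 16,384 pixels

print(total_items, work_groups, pixels_per_group)
```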
76. Example 4-2: Accelerated Histogram
● Flow
○ Use the first work item to clear local memory
○ Use the global ID to compute each work item's start and end pixel
○ Use the first work item to copy local memory back to global memory
unsigned int pixel_start_index = gid * PIXELS_PER_ITEM;
unsigned int pixel_end_index = pixel_start_index + PIXELS_PER_ITEM;
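The three-step flow can be simulated in plain Python for a single work group. A toy sketch (a real kernel needs barrier() between the phases and atomic increments on the shared counters; the bin count and sizes here are made-up small values):

```python
PIXELS_PER_ITEM = 4
ITEMS_PER_GROUP = 2
BINS = 4

pixels = [0, 1, 1, 3, 2, 2, 0, 1]     # one work group's slice of the image
global_hist = [0] * BINS
local_hist = None

for lid in range(ITEMS_PER_GROUP):    # step 1: first item clears local memory
    if lid == 0:
        local_hist = [0] * BINS
for lid in range(ITEMS_PER_GROUP):    # step 2: each item scans its pixel range
    start = lid * PIXELS_PER_ITEM
    for p in pixels[start:start + PIXELS_PER_ITEM]:
        local_hist[p] += 1
for lid in range(ITEMS_PER_GROUP):    # step 3: first item flushes local -> global
    if lid == 0:
        for b in range(BINS):
            global_hist[b] += local_hist[b]
print(global_hist)  # [2, 3, 2, 1]
```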