This slide deck introduces BFC, which plays a central role in GPU memory management (allocation and deallocation) in TensorFlow. It also covers StreamExecutor and how it takes part in allocating GPU memory for an op's output tensor.
1. ITRI CONFIDENTIAL DOCUMENT. DO NOT COPY OR DISTRIBUTE.
TensorFlow Study (Part II)
The GPU Part
Danny Liu (劉得彥)
Information and Communications Research Laboratories (ICL)
2. GPU Options
• We can change the GPU options as follows:
• message GPUOptions
▪ double per_process_gpu_memory_fraction = 1;
a. Fraction of the available GPU memory to pre-allocate for each process.
▪ string allocator_type = 2;
a. "BFC": a "Best-fit with coalescing" algorithm
▪ int64 deferred_deletion_bytes = 3;
a. Delay deletion of up to this many bytes to reduce the number of interactions with GPU driver code.
▪ bool allow_growth = 4;
a. If true, the allocator does not pre-allocate the entire specified memory region, but starts small and grows as needed.
▪ string visible_device_list = 5;
a. For instance:
» import os
» os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
▪ int32 polling_active_delay_usecs = 6;
a. In the event polling loop, sleep this many microseconds between PollEvents calls when the queue is not empty.
▪ int32 polling_inactive_delay_msecs = 7;
a. In the event polling loop, sleep this many milliseconds between PollEvents calls when the queue is empty.
▪ bool force_gpu_compatible = 8;
a. Force all tensors to be gpu_compatible. On a GPU-enabled TensorFlow build, enabling this option forces all CPU tensors to be allocated with CUDA pinned memory.
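A minimal sketch of how these options are set in practice, assuming TensorFlow 1.x, where they are exposed through tf.GPUOptions and tf.ConfigProto (the 0.5 fraction is an arbitrary illustrative value):

```python
import os
import tensorflow as tf  # assumes a TensorFlow 1.x installation

# visible_device_list can also be controlled from the environment:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Build a session config that exercises the GPUOptions fields above.
gpu_options = tf.GPUOptions(
    per_process_gpu_memory_fraction=0.5,  # pre-allocate at most 50% of GPU memory
    allocator_type="BFC",                 # best-fit with coalescing
    allow_growth=True,                    # grow the region on demand
)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    ...
```

This is a config fragment: it needs a GPU-enabled TensorFlow build to actually take effect.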
3. BFC (best-fit with coalescing)
• Chunks point to memory.
▪ Their prev/next pointers form a doubly-linked list of chunks sorted by base address; adjacent chunks in the list must be contiguous in memory.
▪ Each chunk records whether it is in use or free, and stores the number of the bin it belongs to.
Diagram: GPU memory as one AllocationRegion holding three chunks. Each chunk records size, requested_size, allocation_id, prev, next, bin_num, and in_use; the region's chunk handles map addresses back to chunks, and the free chunks inside a bin are ordered by size.
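The chunk fields and the coalescing step can be sketched in Python. This is a simplified toy model, not the actual TensorFlow C++ implementation; the field names follow the diagram:

```python
class Chunk:
    """Simplified model of a BFC chunk (field names follow the diagram)."""
    def __init__(self, ptr, size):
        self.ptr = ptr                # base address within the region
        self.size = size              # usable bytes in this chunk
        self.requested_size = 0       # bytes the client actually asked for
        self.allocation_id = -1       # -1 while the chunk is free
        self.prev = None              # neighbor immediately before in memory
        self.next = None              # neighbor immediately after in memory
        self.bin_num = None           # bin holding this chunk while free
        self.in_use = False

def coalesce(chunk):
    """Merge a freed chunk with its free neighbors (the 'coalescing' in BFC)."""
    # Absorb the next neighbor if it is free.
    if chunk.next is not None and not chunk.next.in_use:
        nxt = chunk.next
        chunk.size += nxt.size
        chunk.next = nxt.next
        if nxt.next is not None:
            nxt.next.prev = chunk
    # Let a free previous neighbor absorb this chunk.
    if chunk.prev is not None and not chunk.prev.in_use:
        prv = chunk.prev
        prv.size += chunk.size
        prv.next = chunk.next
        if chunk.next is not None:
            chunk.next.prev = prv
        return prv
    return chunk

# Carve a 1024-byte region into three contiguous free chunks, then free-merge.
a, b, c = Chunk(0, 256), Chunk(256, 256), Chunk(512, 512)
a.next, b.prev, b.next, c.prev = b, a, c, b
merged = coalesce(b)  # all three are free, so they merge into one 1024-byte chunk
```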
4. BFC (best-fit with coalescing)
• The BFC Concept:
▪ Operations for bin: Search, Insert, and Delete
http://blog.csdn.net/qq_33096883/article/details/77479647
Diagram: bins and region growth.
▪ Bins: bin 0 holds free chunks of size 256 * 2^0 = 256 bytes, bin 1 of 256 * 2^1, bin 2 of 256 * 2^2, ..., up to bin 20 at 256 * 2^20 = 256 MB. Free-memory management works on these bins via Search and Insert.
▪ (1) BFC tries to allocate memory but cannot find a chunk in the bins, so it calls Extend().
▪ (2) If curr_region_allocation_bytes_ is smaller than the allocation size, it is multiplied by powers of two until it is sufficient.
▪ (3) If the allocation fails, the size is reduced by a factor of 0.9 and retried.
▪ (4) One large chunk is created for the whole memory space of the new AllocationRegion; it will be split into smaller chunks later.
▪ The RegionManager manages all regions (regions_) together with chunks_ and free_chunks_list_; each AllocationRegion keeps an array of chunk handles (handles_[]).
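The bin sizing and the Extend() growth rules can be sketched as follows. This is a simplified model whose constants follow the slide (256-byte base, 21 bins, x2 growth, x0.9 backoff), not the exact TensorFlow source:

```python
BIN_COUNT = 21   # bins 0..20; bin 20 holds chunks of 256 * 2**20 = 256 MB and up
MIN_CHUNK = 256  # bin b holds free chunks of at least 256 * 2**b bytes

def bin_num_for_size(size):
    """Bin index ~= floor(log2(size / 256)), capped at the last bin."""
    b = (size // MIN_CHUNK).bit_length() - 1
    return max(0, min(b, BIN_COUNT - 1))

def grow_region(curr_region_allocation_bytes, alloc_size):
    """Step (2): double the planned region size until it covers the request."""
    while curr_region_allocation_bytes < alloc_size:
        curr_region_allocation_bytes *= 2
    return curr_region_allocation_bytes

def shrink_on_failure(region_bytes):
    """Step (3): if the GPU allocation fails, back off by a factor of 0.9."""
    return int(region_bytes * 0.9)

bin_small = bin_num_for_size(256)       # 256 B lands in bin 0
bin_large = bin_num_for_size(1 << 20)   # 1 MB = 256 * 2**12, so bin 12
region = grow_region(1 << 20, 3 << 20)  # 1 MB doubles twice to cover 3 MB
```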
5. When the output tensor is allocated
• A custom operation (Op) wants to allocate memory for its output tensor.
6. Tensor GPU memory allocation
• When an operation tries to create an output tensor during its computation, a GPU memory allocation for that tensor takes place.
• GPUBFCAllocator is a BFC implementation class.
Diagram: the allocation call chain for Tensor A (which holds an Allocator* and a Buffer*).
▪ BFCAllocator::AllocateRaw() → retry_helper_.AllocateRaw() → GPUBFCAllocator::AllocateInternal()
▪ When no suitable chunk exists, BFCAllocator::Extend() calls suballocator_->Alloc(); this is where the memory allocation happens, because GPUMemAllocator inherits from SubAllocator.
▪ GPUMemAllocator calls stream_exec_->AllocateArray().opaque(), which goes through StreamExecutor::Allocate() and DeviceMemory::MakeFromByteSize() and yields a DeviceMemory*.
▪ StreamExecutor delegates to CUDAExecutor, which calls CUDADriver::DeviceAllocate() to obtain GPU memory; the pointer is then returned back up the chain to GPUBFCAllocator.
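The layering in this call chain can be sketched as a toy Python model of the delegation. The class and method names only mirror the diagram; this is not the TensorFlow C++ implementation:

```python
class SubAllocator:
    """Interface: the object that really obtains memory (cf. SubAllocator)."""
    def alloc(self, num_bytes):
        raise NotImplementedError

class FakeGPUMemAllocator(SubAllocator):
    """Stands in for GPUMemAllocator, which would go through
    stream_exec->AllocateArray(...) down to the CUDA driver."""
    def __init__(self):
        self.next_ptr = 0x1000  # pretend device address space
    def alloc(self, num_bytes):
        ptr = self.next_ptr     # pretend device pointer
        self.next_ptr += num_bytes
        return ptr

class BFCAllocatorSketch:
    """Hands out sub-ranges of a region obtained from the sub-allocator."""
    def __init__(self, sub_allocator, region_size=1 << 20):
        self.sub = sub_allocator
        self.region_size = region_size
        self.base = None
        self.offset = 0
    def allocate_raw(self, num_bytes):
        if self.base is None:   # Extend(): grab a whole region once
            self.base = self.sub.alloc(self.region_size)
        ptr = self.base + self.offset
        self.offset += num_bytes
        return ptr

# Tensors ask BFC for memory; only the first request reaches the device layer.
bfc = BFCAllocatorSketch(FakeGPUMemAllocator())
p1 = bfc.allocate_raw(256)
p2 = bfc.allocate_raw(256)
```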
7. StreamExecutor Runtime Library
• A unified wrapper around the CUDA and OpenCL host-side
programming models (runtimes).
• Supports cuBLAS and cuDNN
(tensorflow/stream_executor/blas.h and dnn.h)
• It lets host code target either CUDA or OpenCL devices with
identically-functioning data-parallel kernels.
8. StreamExecutor Runtime Library
• Contrast with OpenMP
▪ OpenMP generates both the kernel code that runs on the
device and the host-side code needed to launch the kernel
▪ StreamExecutor only generates the host-side code.
Diagram: class layering of StreamExecutor, StreamExecutorImpl, and StreamExecutorInterface.