This slide deck introduces BFC, which plays a central role in GPU memory management (allocation and deallocation) in TensorFlow. It also covers StreamExecutor and how it takes part in allocating GPU memory for an op's output tensor.
1. ITRI CONFIDENTIAL DOCUMENT. DO NOT COPY OR DISTRIBUTE.
TensorFlow Study (Part II)
The GPU Part
Danny Liu (劉得彥)
Information and Communications Research Laboratories (ICL)
2. GPU Options
• We can change the GPU options as follows:
• message GPUOptions
▪ double per_process_gpu_memory_fraction = 1;
a. Fraction of the available GPU memory to pre-allocate for each process.
▪ string allocator_type = 2;
a. "BFC": a "Best-fit with coalescing" algorithm
▪ int64 deferred_deletion_bytes = 3;
a. Delay deletion of up to this many bytes to reduce the number of interactions with GPU driver code.
▪ bool allow_growth = 4;
a. If true, the allocator does not pre-allocate the entire specified memory region, but starts small and grows as needed.
▪ string visible_device_list = 5;
a. For instance:
» import os
» os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
▪ int32 polling_active_delay_usecs = 6;
a. In the event polling loop, sleep this many microseconds between PollEvents calls when the queue is not empty.
▪ int32 polling_inactive_delay_msecs = 7;
a. In the event polling loop, sleep this many milliseconds between PollEvents calls when the queue is empty.
▪ bool force_gpu_compatible = 8;
a. Force all tensors to be gpu_compatible. On a GPU-enabled TensorFlow build, enabling this option forces all CPU tensors to be allocated with CUDA pinned memory.
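A minimal sketch of how these options are set in practice, assuming TensorFlow 1.x, where they are exposed through tf.GPUOptions and tf.ConfigProto (the 0.5 fraction is an arbitrary illustrative value):

```python
import os
import tensorflow as tf  # assumes a TensorFlow 1.x installation

# visible_device_list can also be controlled from the environment:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Build a session config that exercises the GPUOptions fields above.
gpu_options = tf.GPUOptions(
    per_process_gpu_memory_fraction=0.5,  # pre-allocate at most 50% of GPU memory
    allocator_type="BFC",                 # best-fit with coalescing
    allow_growth=True,                    # grow the region on demand
)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    ...
```

This is a config fragment: it needs a GPU-enabled TensorFlow build to actually take effect.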
3. BFC (best-fit with coalescing)
• Chunks point to memory.
▪ Their prev/next pointers form a doubly-linked list of chunks sorted by base address; adjacent chunks in the list must be contiguous in memory.
▪ Each chunk records whether it is in use or free, and stores the number of the bin it belongs to.
Diagram: GPU memory as one AllocationRegion holding three chunks. Each chunk records size, requested_size, allocation_id, prev, next, bin_num, and in_use; the region's chunk handles map addresses back to chunks, and the free chunks inside a bin are ordered by size.
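The chunk fields and the coalescing step can be sketched in Python. This is a simplified toy model, not the actual TensorFlow C++ implementation; the field names follow the diagram:

```python
class Chunk:
    """Simplified model of a BFC chunk (field names follow the diagram)."""
    def __init__(self, ptr, size):
        self.ptr = ptr                # base address within the region
        self.size = size              # usable bytes in this chunk
        self.requested_size = 0       # bytes the client actually asked for
        self.allocation_id = -1       # -1 while the chunk is free
        self.prev = None              # neighbor immediately before in memory
        self.next = None              # neighbor immediately after in memory
        self.bin_num = None           # bin holding this chunk while free
        self.in_use = False

def coalesce(chunk):
    """Merge a freed chunk with its free neighbors (the 'coalescing' in BFC)."""
    # Absorb the next neighbor if it is free.
    if chunk.next is not None and not chunk.next.in_use:
        nxt = chunk.next
        chunk.size += nxt.size
        chunk.next = nxt.next
        if nxt.next is not None:
            nxt.next.prev = chunk
    # Let a free previous neighbor absorb this chunk.
    if chunk.prev is not None and not chunk.prev.in_use:
        prv = chunk.prev
        prv.size += chunk.size
        prv.next = chunk.next
        if chunk.next is not None:
            chunk.next.prev = prv
        return prv
    return chunk

# Carve a 1024-byte region into three contiguous free chunks, then free-merge.
a, b, c = Chunk(0, 256), Chunk(256, 256), Chunk(512, 512)
a.next, b.prev, b.next, c.prev = b, a, c, b
merged = coalesce(b)  # all three are free, so they merge into one 1024-byte chunk
```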
4. BFC (best-fit with coalescing)
• The BFC Concept:
▪ Operations for bin: Search, Insert, and Delete
http://blog.csdn.net/qq_33096883/article/details/77479647
Diagram: bins and region growth.
▪ Bins: bin 0 holds free chunks of size 256 * 2^0 = 256 bytes, bin 1 of 256 * 2^1, bin 2 of 256 * 2^2, ..., up to bin 20 at 256 * 2^20 = 256 MB. Free-memory management works on these bins via Search and Insert.
▪ (1) BFC tries to allocate memory but cannot find a chunk in the bins, so it calls Extend().
▪ (2) If curr_region_allocation_bytes_ is smaller than the allocation size, it is multiplied by powers of two until it is sufficient.
▪ (3) If the allocation fails, the size is reduced by a factor of 0.9 and retried.
▪ (4) One large chunk is created for the whole memory space of the new AllocationRegion; it will be split into smaller chunks later.
▪ The RegionManager manages all regions (regions_) together with chunks_ and free_chunks_list_; each AllocationRegion keeps an array of chunk handles (handles_[]).
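The bin sizing and the Extend() growth rules can be sketched as follows. This is a simplified model whose constants follow the slide (256-byte base, 21 bins, x2 growth, x0.9 backoff), not the exact TensorFlow source:

```python
BIN_COUNT = 21   # bins 0..20; bin 20 holds chunks of 256 * 2**20 = 256 MB and up
MIN_CHUNK = 256  # bin b holds free chunks of at least 256 * 2**b bytes

def bin_num_for_size(size):
    """Bin index ~= floor(log2(size / 256)), capped at the last bin."""
    b = (size // MIN_CHUNK).bit_length() - 1
    return max(0, min(b, BIN_COUNT - 1))

def grow_region(curr_region_allocation_bytes, alloc_size):
    """Step (2): double the planned region size until it covers the request."""
    while curr_region_allocation_bytes < alloc_size:
        curr_region_allocation_bytes *= 2
    return curr_region_allocation_bytes

def shrink_on_failure(region_bytes):
    """Step (3): if the GPU allocation fails, back off by a factor of 0.9."""
    return int(region_bytes * 0.9)

bin_small = bin_num_for_size(256)       # 256 B lands in bin 0
bin_large = bin_num_for_size(1 << 20)   # 1 MB = 256 * 2**12, so bin 12
region = grow_region(1 << 20, 3 << 20)  # 1 MB doubles twice to cover 3 MB
```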
5. When the output tensor is allocated
• A custom operation (Op) wants to allocate memory for its output tensor.
6. Tensor GPU memory allocation
• When an operation tries to create an output tensor during its computation, a GPU memory allocation for that tensor takes place.
• GPUBFCAllocator is a BFC implementation class.
Diagram: the allocation call chain for Tensor A (which holds an Allocator* and a Buffer*).
▪ BFCAllocator::AllocateRaw() → retry_helper_.AllocateRaw() → GPUBFCAllocator::AllocateInternal()
▪ When no suitable chunk exists, BFCAllocator::Extend() calls suballocator_->Alloc(); this is where the memory allocation happens, because GPUMemAllocator inherits from SubAllocator.
▪ GPUMemAllocator calls stream_exec_->AllocateArray().opaque(), which goes through StreamExecutor::Allocate() and DeviceMemory::MakeFromByteSize() and yields a DeviceMemory*.
▪ StreamExecutor delegates to CUDAExecutor, which calls CUDADriver::DeviceAllocate() to obtain GPU memory; the pointer is then returned back up the chain to GPUBFCAllocator.
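The layering in this call chain can be sketched as a toy Python model of the delegation. The class and method names only mirror the diagram; this is not the TensorFlow C++ implementation:

```python
class SubAllocator:
    """Interface: the object that really obtains memory (cf. SubAllocator)."""
    def alloc(self, num_bytes):
        raise NotImplementedError

class FakeGPUMemAllocator(SubAllocator):
    """Stands in for GPUMemAllocator, which would go through
    stream_exec->AllocateArray(...) down to the CUDA driver."""
    def __init__(self):
        self.next_ptr = 0x1000  # pretend device address space
    def alloc(self, num_bytes):
        ptr = self.next_ptr     # pretend device pointer
        self.next_ptr += num_bytes
        return ptr

class BFCAllocatorSketch:
    """Hands out sub-ranges of a region obtained from the sub-allocator."""
    def __init__(self, sub_allocator, region_size=1 << 20):
        self.sub = sub_allocator
        self.region_size = region_size
        self.base = None
        self.offset = 0
    def allocate_raw(self, num_bytes):
        if self.base is None:   # Extend(): grab a whole region once
            self.base = self.sub.alloc(self.region_size)
        ptr = self.base + self.offset
        self.offset += num_bytes
        return ptr

# Tensors ask BFC for memory; only the first request reaches the device layer.
bfc = BFCAllocatorSketch(FakeGPUMemAllocator())
p1 = bfc.allocate_raw(256)
p2 = bfc.allocate_raw(256)
```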
7. StreamExecutor Runtime Library
• A unified wrapper around the CUDA and OpenCL host-side
programming models (runtimes).
• Supports cuBLAS and cuDNN
(tensorflow/stream_executor/blas.h and dnn.h)
• It lets host code target either CUDA or OpenCL devices with
identically-functioning data-parallel kernels.
8. StreamExecutor Runtime Library
• Contrast with OpenMP
▪ OpenMP generates both the kernel code that runs on the
device and the host-side code needed to launch the kernel
▪ StreamExecutor only generates the host-side code.
Diagram: class layering of StreamExecutor, StreamExecutorImpl, and StreamExecutorInterface.