1. A Lock-Free Algorithm of Tree-Based
Reduction for Large Scale Clustering on
GPGPU
National Institute of Informatics, Japan
Ruo Ando
2019 2nd International Conference on Artificial
Intelligence and Pattern Recognition
North China University of Technology (NCUT) / 北方工业大学
August 17th, 2019, 11:35-12:15
Slideshare version rev.2019.08.19
2. Abstract
• Recently, the art of concurrency and parallelism
has advanced rapidly. However, conventional
techniques still suffer from the drawback of lock
contention.
• This talk reports on the current state of massively
parallel computing.
• Against this background, a lock-free technique of
tree-based reduction for large-scale clustering on
GPGPU is illustrated.
• In the experiments, the performance of a native GPU
kernel with atomic instructions, the CUDA Thrust
template library, and the proposed method is
compared and evaluated.
3. Bottlenecks for massive parallelism
• Lock contention: threads should spend as little
time inside a critical section as possible, to reduce
the time other threads sit idle waiting to
acquire the lock, a state known as "lock
contention".
• On the other hand, using a multitude of small, separate critical
sections introduces the system overhead of
acquiring and releasing each separate lock.
In many cases, contention for
locks reduces parallel efficiency
and hurts scalability.
4. ❑ "The Tao begets One; One begets Two; Two begets Three; Three begets all things." - Laozi (道生一、一生二、二生三、三生萬物)
❑ Unreasonable Effectiveness of Data
If a machine learning program cannot work with a training set of
a million examples, the intuitive conclusion is
that it cannot work at all.
However, it has become clear that machine learning over a
huge dataset with a trillion items can be highly effective on
tasks for which machine learning over a sanitized (clean)
dataset with only a million items is NOT useful.
Chen Sun, Abhinav Shrivastava, Saurabh Singh, Abhinav Gupta, “Revisiting
Unreasonable Effectiveness of Data in Deep Learning Era”, ICCV 2017
https://arxiv.org/abs/1707.02968
Scalability
5. Reduction pattern
A reduction combines every element
in a collection into a single element
using an associative combiner
function.
Given the associativity of the
combiner function, many different
orderings are possible, but with
different spans.
If the combiner function is also
commutative, additional orderings
are possible.
The tree structure depends on a
reordering of the combiner
operations by associativity.
"2019/07/02 00:00:00.867","841","25846”
"2019/07/02 03:03:00.511","784","52326”
"2019/07/02 00:00:00.867",“700",“40000”
"2019/07/02 11:11:37.872","336","50346”
"2019/07/02 00:00:00.867",“1541",“65846”
6. Proposed method (2) - large-scale clustering
• Fine reduction
- New cluster assignment
- Calculating sums of each cluster
const int fine_shared_memory = 3 * threads * sizeof(float);
fine_reduce<<<blocks, threads, fine_shared_memory>>>(/* kernel arguments */);
• Coarse reduction
- Calculating centroids (new means)
const int coarse_shared_memory = 2 * k * blocks * sizeof(float);
coarse_reduce<<<1, k * blocks, coarse_shared_memory>>>(/* kernel arguments */);
7. Overview and Grid layout
[Diagram: grid layout - the fine stage launches `blocks` thread blocks of `threads` threads each along X/Y; the coarse stage launches a single block of k * blocks threads]
fine_reduce<<<blocks, threads, fine_shared_memory>>>
coarse_reduce<<<1, k * blocks, coarse_shared_memory>>>
8. Input and output: fine reduction
fine_reduce<<<blocks, threads, fine_shared_memory>>>
9. Shared memory layout - Fine reduction
[Diagram: per-block shared-memory layout for the fine reduction - 3 * threads floats per block]
fine_reduce<<<blocks, threads, fine_shared_memory>>>
const int fine_shared_memory = 3 * threads * sizeof(float);
11. Shared memory layout - coarse reduction
[Diagram: shared-memory layout for the coarse reduction - 2 * k * blocks floats in a single block]
coarse_reduce<<<1, k * blocks, coarse_shared_memory>>>
const int coarse_shared_memory = 2 * k * blocks * sizeof(float);
12. Experimental results
① By using the atomicAdd function, the programmer can rewrite the incr kernel.
This instruction atomically adds a value V[i] to the value stored at memory
location M.
__global__ void incr(int *ptr) { int temp = atomicAdd(ptr, 1); /* temp = value before the add */ }
② Thrust provides two vector containers, host_vector and device_vector. The
host_vector is stored in host memory, while the device_vector lives in GPU device
memory.
③ Reductions in serial execution, like the averaging performed during the update step,
scale linearly. However, parallel reductions can be implemented efficiently by using a
two-stage tree reduction: fine and coarse. The key point in the fine-coarse
reduction is that the averaging is not performed over all the data. Instead, for each cluster,
only the points assigned to that cluster are averaged.
thrust::device_vector<float> d_mean_x(h_x.begin(), h_x.begin() + k);
thrust::device_vector<float> d_mean_y(h_y.begin(), h_y.begin() + k);
14. Conclusion
• Recently, the art of concurrency and parallelism
has advanced rapidly. However, conventional
techniques still suffer from the drawback of lock
contention.
• This talk reported on the current state of massively
parallel computing.
• Against this background, a lock-free technique of
tree-based reduction for large-scale clustering on
GPGPU was illustrated.
• In the experiments, the performance of a native GPU
kernel with atomic instructions, the CUDA Thrust
template library, and the proposed method was
compared and evaluated.