The document discusses K-Means clustering and how to parallelize it using the Bulk Synchronous Parallel (BSP) model. It explains that K-Means clustering groups data points into K clusters based on their features, and that the standard algorithm iteratively assigns points to the nearest cluster center and then updates the centers. To parallelize it with BSP, the data is partitioned across processes; each process performs local computations and exchanges partial results through message passing and barrier synchronization, and a new set of cluster centers is calculated globally at each superstep. Benchmark results showed that the parallel BSP approach scaled roughly logarithmically, outperforming the linear scaling of a comparable MapReduce implementation.
9. What is K-Means Clustering?
Unsupervised Learning
Huge number of input vectors
k initial centers
Two step iterative algorithm
Assignment
Update
9/33
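The slides show no code; as an illustration only, the two-step iterative algorithm above can be sketched in Python (the function name `kmeans` and the random choice of the k initial centers are my own assumptions, not from the talk):

```python
import math
import random

def kmeans(vectors, k, iterations=10, seed=0):
    """Two-step iterative K-Means: assignment, then update."""
    rng = random.Random(seed)
    # k initial centers, picked from the input vectors.
    centers = rng.sample(vectors, k)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: math.dist(v, centers[c]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster.
        for c, assigned in enumerate(clusters):
            if assigned:
                centers[c] = tuple(sum(d) / len(assigned) for d in zip(*assigned))
    return centers
```

This is the sequential baseline; the rest of the deck distributes exactly these two steps over BSP processes.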
11. What is BSP?
BSP = Bulk Synchronous Parallel
Paradigm to design parallel algorithms
Two basic operations
Send message
Barrier synchronization
11/33
12. What is BSP?
[Diagram: processes P1, P2, P3 in one superstep: computation phase, then communication, then barrier synchronization]
12/33
13. What is BSP?
During the computation phase, messages are only queued
Between two barrier synchronizations, the queued messages are exchanged in bulk
Messages from the previous superstep are available in the next superstep
13/33
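A minimal Python sketch of these message semantics, using a toy single-machine `BSP` class (my own stand-in, not a real BSP runtime): a message sent during computation is only queued, and becomes visible after the barrier.

```python
from collections import defaultdict

class BSP:
    """Toy, single-machine stand-in for BSP message passing."""

    def __init__(self):
        self.inbox = defaultdict(list)   # messages delivered at the last barrier
        self.outbox = defaultdict(list)  # messages queued this superstep

    def send(self, peer, msg):
        # Computation phase: sending only queues the message.
        self.outbox[peer].append(msg)

    def messages(self, peer):
        # Only messages from the previous superstep are visible.
        return self.inbox[peer]

    def barrier(self):
        # Barrier synchronization: exchange all queued messages in bulk.
        self.inbox, self.outbox = self.outbox, defaultdict(list)

bsp = BSP()
bsp.send(peer=1, msg="sum=25,count=5")
assert bsp.messages(1) == []                     # not visible yet
bsp.barrier()
assert bsp.messages(1) == ["sum=25,count=5"]     # visible next superstep
```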
15. K-Means with BSP
Put the centers into RAM on each process
Iterate sequentially over the vectors on disk
Sum each assigned vector into a new temporary center object
15/33
17. K-Means with BSP
Sums per center:
• Center 1: Sum = 25, summed 5 times
• Center 2: Sum = 50, summed 10 times
• Center 3: Sum = 10, summed 5 times
17/33
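Plugging the numbers from this slide into the update rule, each mean is the sum divided by how often it was summed (the slide shows scalar sums for illustration; with real feature vectors the sum is per dimension):

```python
# Partial results from slide 17: (sum, times summed) per center.
partials = {"Center 1": (25, 5), "Center 2": (50, 10), "Center 3": (10, 5)}
means = {name: s / n for name, (s, n) in partials.items()}
# → {'Center 1': 5.0, 'Center 2': 5.0, 'Center 3': 2.0}
```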
18. K-Means with BSP
[Diagram: each process sends its local sums ("Send the sum") to all other processes, alongside its copy of the centers]
20. K-Means with BSP
The same calculation runs on every process
Divide each total sum by the total number of increments; the resulting means are the new centers
Floating-point error can be corrected by synchronizing when it exceeds a given threshold
20/33
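A Python sketch of this update step, assuming messages of the form (center id, partial sum, count). Folding the messages in sorted order is my own addition to make the floating-point result identical on every process; the talk instead proposes re-synchronizing once the drift exceeds a threshold.

```python
def merge_messages(messages):
    """Total the (center_id, partial_sum, count) messages received from
    all peers, then divide each total sum by its total increment count."""
    totals = {}
    # Sorted fold order (my addition, not from the talk) keeps the
    # floating-point result bit-identical across processes.
    for center_id, partial_sum, count in sorted(messages):
        s, n = totals.get(center_id, (0.0, 0))
        totals[center_id] = (s + partial_sum, n + count)
    return {cid: s / n for cid, (s, n) in totals.items()}

# Messages from two peers, reusing the sums from slide 17.
new_centers = merge_messages([(1, 25.0, 5), (2, 50.0, 10), (1, 5.0, 1)])
# → {1: 5.0, 2: 5.0}
```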
22. K-Means with BSP
Partition the vectors into equally sized blocks
# blocks = # tasks
Put the centers in RAM
Assignment phase
Iterate over the vectors on disk sequentially
Sum up temporary centers with their assigned vectors
Message all tasks with each sum and how often it was summed
Update phase
Calculate the total sum over all received messages and average it
Replace the old centers with the new centers and calculate convergence
22/33
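The two phases of this recipe can be sketched in Python as two functions: `superstep` plays the assignment phase of one task on its block of vectors, and `update` plays the update phase on the received messages (both function names are mine; the message exchange and barrier between them are left implicit):

```python
import math

def superstep(partition, centers):
    """Assignment phase of one task: sum the vectors assigned to each
    center; the return value is the message broadcast to all tasks."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for v in partition:
        nearest = min(range(k), key=lambda c: math.dist(v, centers[c]))
        for d in range(dim):
            sums[nearest][d] += v[d]
        counts[nearest] += 1
    return sums, counts

def update(messages, old_centers):
    """Update phase: total the received messages, average, and measure
    convergence as the largest center movement."""
    k, dim = len(old_centers), len(old_centers[0])
    total = [[0.0] * dim for _ in range(k)]
    n = [0] * k
    for sums, counts in messages:
        for c in range(k):
            n[c] += counts[c]
            for d in range(dim):
                total[c][d] += sums[c][d]
    new = [tuple(total[c][d] / n[c] for d in range(dim)) if n[c] else old_centers[c]
           for c in range(k)]
    shift = max(math.dist(a, b) for a, b in zip(old_centers, new))
    return new, shift
```

Each task would run `superstep` on its own block, send the result to every peer, hit the barrier, and then run `update` on the messages it received; iteration stops once `shift` falls below a convergence threshold.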
23. Benchmark
16 servers, 256 cores, 10G network
80 seconds!
Possible starvation: add more servers