K-Means Clustering with BSP
      Thomas Jungblut, Testberichte.de, 2012
    Study assignment 4th semester, HWR Berlin
Content


 What is K-Means Clustering?
 What is BSP?
 K-Means with BSP




What is K-Means Clustering?






   Unsupervised Learning
   Huge number of input vectors
   k initial centers
   Two step iterative algorithm
     Assignment
     Update
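The two steps can be sketched in a few lines. This is a minimal single-process illustration, not the author's implementation: the function names `nearest_center` and `kmeans`, the squared Euclidean distance, and the fixed iteration count are assumptions made for this sketch.

```python
def nearest_center(vector, centers):
    """Index of the center closest to `vector` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((x - c) ** 2 for x, c in zip(vector, centers[i])),
    )


def kmeans(vectors, centers, iterations=10):
    """Plain k-means: alternate assignment and update for a fixed number of rounds."""
    centers = [list(c) for c in centers]
    for _ in range(iterations):
        # Assignment: put every vector into the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in vectors:
            clusters[nearest_center(v, centers)].append(v)
        # Update: move each center to the mean of its cluster.
        # A center with no assigned vectors keeps its position (an assumption).
        for i, members in enumerate(clusters):
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


print(kmeans([[1.0], [2.0], [10.0], [11.0]], [[0.0], [5.0]]))  # [[1.5], [10.5]]
```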



How do we parallelize K-Means?




What is BSP?


 BSP = Bulk Synchronous Parallel
 Paradigm to design parallel algorithms
 Two basic operations
   Send message
   Barrier synchronization




What is BSP?

[Figure: three processes P1, P2, P3 within one superstep: computation, sync, communication, sync.]
What is BSP?


 The computation phase queues messages
 Between two barrier synchronizations, messages are
  exchanged in bulk
 Messages from the previous superstep are available in the
  next superstep
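These rules can be imitated with a toy, single-threaded stand-in for a BSP runtime. The class `ToyBSP` and its methods `send`, `broadcast`, `sync` and `received` are hypothetical names chosen for this sketch. They are not the API of Hama or of any other BSP framework.

```python
from collections import defaultdict


class ToyBSP:
    """Single-threaded imitation of BSP messaging (hypothetical, for illustration)."""

    def __init__(self, num_tasks):
        self.num_tasks = num_tasks
        self.outgoing = defaultdict(list)  # messages queued in the current superstep
        self.inbox = defaultdict(list)     # messages delivered at the last barrier

    def send(self, target, message):
        # Sending only queues the message; the target cannot see it yet.
        self.outgoing[target].append(message)

    def broadcast(self, message):
        for target in range(self.num_tasks):
            self.send(target, message)

    def sync(self):
        # Barrier: everything queued so far is exchanged in bulk.
        self.inbox = self.outgoing
        self.outgoing = defaultdict(list)

    def received(self, task):
        return list(self.inbox[task])


bsp = ToyBSP(num_tasks=3)
for task in range(3):
    bsp.broadcast(f"sum from task {task}")
print(bsp.received(0))  # [] because no barrier has been passed yet
bsp.sync()
print(bsp.received(0))  # ['sum from task 0', 'sum from task 1', 'sum from task 2']
```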




K-Means with BSP

Partition the dataset into equal-sized blocks
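A minimal sketch of such a split, assuming the vectors fit in one in-memory list. On a real cluster the framework or file system would normally provide the blocks. The helper name `partition` is an assumption of this sketch.

```python
def partition(vectors, num_tasks):
    """Split `vectors` into `num_tasks` contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(len(vectors), num_tasks)
    blocks, start = [], 0
    for task in range(num_tasks):
        # The first `extra` blocks take one additional vector each.
        size = base + (1 if task < extra else 0)
        blocks.append(vectors[start:start + size])
        start += size
    return blocks


print(partition(list(range(10)), 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```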




K-Means with BSP

 Put centers into RAM on each process
 Iterate sequentially over vectors on disk
 Sum assigned vectors to a new temporary center object
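The local part of this step might look as follows. This is a sketch under assumptions: the name `assign_block` and the squared Euclidean distance are not taken from the slides, and `block` stands for any iterable that yields one vector at a time, for example a generator reading a file.

```python
def assign_block(block, centers):
    """One sequential pass over a block; returns per-center sums and counts."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]  # the temporary "sum" centers
    counts = [0] * len(centers)            # how often each center was summed
    for vector in block:
        nearest = min(
            range(len(centers)),
            key=lambda i: sum((x - c) ** 2 for x, c in zip(vector, centers[i])),
        )
        counts[nearest] += 1
        for d, x in enumerate(vector):
            sums[nearest][d] += x
    return sums, counts


print(assign_block([[1.0], [2.0], [10.0]], [[0.0], [8.0]]))  # ([[3.0], [10.0]], [2, 1])
```

Only the centers and the running sums are held in memory, so the block itself can stay on disk and be streamed once per iteration.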




K-Means with BSP


[Figure: a copy of the centers on each process.]
K-Means with BSP

 Sum centers
   • Center 1: Sum=25, 5 times summed
   • Center 2: Sum=50, 10 times summed
   • Center 3: Sum=10, 5 times summed
K-Means with BSP
[Figure: each process sends its sum centers to the other processes ("Send the sum").]
K-Means with BSP

 Add up the sums received for each center to a total sum
 Divide the total sum by the total increments to get the means
 The means are the new centers
 The same calculation runs on every process
 Floating point error can be corrected by synchronizing when it
  exceeds a given threshold
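The update step can be sketched as below. The function name `update_centers` and the message layout, one `(sums, counts)` pair per task, are assumptions of this sketch. In the example, the first message reuses the sums and counts from the earlier slide, while the second message is made up for illustration.

```python
def update_centers(messages, old_centers):
    """Combine the (sums, counts) pairs from all tasks into new centers."""
    k, dim = len(old_centers), len(old_centers[0])
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    for sums, counts in messages:
        for i in range(k):
            total_counts[i] += counts[i]
            for d in range(dim):
                total_sums[i][d] += sums[i][d]
    # Divide by the total increments. A center that received nothing keeps
    # its old position (an assumption, the slides do not cover this case).
    return [
        [s / total_counts[i] for s in total_sums[i]] if total_counts[i] else list(old_centers[i])
        for i in range(k)
    ]


messages = [
    ([[25.0], [50.0], [10.0]], [5, 10, 5]),  # sums and counts from the earlier slide
    ([[5.0], [0.0], [20.0]], [5, 0, 5]),     # made-up message from a second task
]
print(update_centers(messages, [[0.0], [0.0], [0.0]]))  # [[3.0], [5.0], [3.0]]
```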



K-Means with BSP



[Figure: cycle of update and assignment phases, separated by a sync.]
K-Means with BSP


 Partition vectors into equal-sized blocks
   # Blocks = # Tasks
 Put centers in RAM
 Assignment phase
   Iterate over the vectors on disk sequentially
   Sum up temporary centers with assigned vectors
   Message all tasks with the sum and how often something was
    summed
 Update phase
   Calculate the total sum over all received messages and average
   Replace old centers with new centers and check convergence
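The slides do not say how convergence is measured. One common choice, shown here purely as an assumption, is to stop once no center has moved farther than a small tolerance. The name `has_converged` and the default tolerance are hypothetical.

```python
import math


def has_converged(old_centers, new_centers, tolerance=1e-6):
    """True when no center moved farther than `tolerance` (Euclidean distance)."""
    return all(
        math.dist(old, new) <= tolerance
        for old, new in zip(old_centers, new_centers)
    )


print(has_converged([[0.0, 0.0]], [[3.0, 4.0]]))  # False, the center moved by 5.0
print(has_converged([[1.5], [10.5]], [[1.5], [10.5]]))  # True
```

Because every task computes the new centers from the same messages, each task can in principle evaluate this check locally and arrive at the same decision.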

Benchmark

 16 servers, 256 cores, 10G network
 80 seconds!
 Possible starvation: add more servers
Benchmark


 Logarithmic scaling
 Much better than linear scaling of MapReduce




Misc


 Implementation on GitHub
 https://github.com/thomasjungblut/thomasjungblut-common/blob/master/src/de/jungblut/clustering/KMeansBSP.java
 Will be committed to Hama's ML package soon
 https://issues.apache.org/jira/browse/HAMA-547


