C-Cube: Elastic Continuous Clustering in the Cloud

  1. C-Cube: Elastic Continuous Clustering in the Cloud
     Speaker: LIN Qian, http://www.comp.nus.edu.sg/~linqian
  2. Problem & Objective
     • Existing solutions for continuous clustering are not elastic
       – Central server
       – Distributed setting with a fixed number of dedicated servers
     • Objective
       – An elastic algorithm for real-time, continuous clustering analysis
     Note: C-Cube is somewhat tricky on this point: it instead maintains a fixed number of VMs.
  3. Clustering
     • Divide a set of unlabeled objects into groups that are not predefined
       – Objects in the same group → similar
       – Objects in different groups → dissimilar
     • C-Cube's elastic solution
       – Dynamically adjust the amount of computational resources based on the current workload
     Note: C-Cube is actually doing workload balancing.
  4. C-Cube
     • A general and elastic streaming framework to support a variety of clustering algorithms
     Note: The streaming framework is provided by Storm, and only the distance-based clustering algorithm is discussed.
  5. Elastic Operator
     • Achieve elasticity by dynamically adjusting the number of processing units
     • Diagram labels: Mapper / Spout; Worker nodes / Intermediate Bolts; Reducer / Last Bolt
  6. Verification-Reclustering
     • Scheme
       – Verify the clustering results computed at a previous timestamp, and
       – Only re-run the clustering algorithm when the verifier module determines that the previous results no longer fit the current data distribution
     • Verification module
       – Performed by an elastic operator
     • Distance-based clustering criteria
  7. Distance-based Clustering
     • Goal
       – Partition the objects into clusters to minimize the sum of distances from all objects in a cluster to the cluster center
     • Distance functions
       – K-Means and its approximations
       – K-Median
  8. C-Cube Architecture (diagram)
  9. Implementation
     • 9 PCs
       – 2 GB memory, 1.8 GHz CPU (2 cores)
       – Ubuntu 10.0.4
     • Storm 0.6.2
       – Zookeeper (1 PC)
       – Nimbus node (1 PC)
       – Kestrel message queue server (1 PC)
       – Supervisor nodes (6 PCs)
  10. Scaling Strategy
     • Start a maximal number of virtual machines at the beginning
     • Use only a fraction of the virtual machines and keep the others idle
     • Activate the virtual machines on demand according to the workload
     Note: Starting the maximal number of virtual machines up front is still a limitation.
  11. System Performance
     • Number of clusters
     • Approximation factor
     • Number of verifiers used in C-Cube
     • Workload change rate
     • Number of machines in the cluster
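
The verification-reclustering scheme on slide 6, combined with the distance-based goal on slide 7, can be illustrated with a minimal single-process sketch. It is not C-Cube's implementation: the slides do not give the verification criterion, so the `tolerance` ratio test, the function names, and the plain Lloyd-style `kmeans` below are all assumptions made for illustration. The sketch keeps the previous centers as long as their k-means cost on the new data stays within `tolerance` times the cost recorded at the last reclustering.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def clustering_cost(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd iterations (a stand-in for any distance-based algorithm)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def verify(points, centers, reference_cost, tolerance):
    """Hypothetical criterion: old centers still fit if the cost has not grown too much."""
    return clustering_cost(points, centers) <= tolerance * reference_cost


def process_timestamp(points, state, k, tolerance=1.5):
    """Return (state, reclustered). Recluster only when verification fails."""
    if state is not None and verify(
        points, state["centers"], state["reference_cost"], tolerance
    ):
        return state, False
    centers = kmeans(points, k)
    new_state = {
        "centers": centers,
        "reference_cost": clustering_cost(points, centers),
    }
    return new_state, True


# Three timestamps: initial data, a small drift, then a shifted distribution.
batch1 = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
batch2 = [(0.1, 0.0), (0.0, 0.9), (10.0, 10.1), (9.9, 11.0)]
batch3 = [(5.0, 5.0), (5.0, 6.0), (20.0, 20.0), (20.0, 21.0)]

state1, reclustered1 = process_timestamp(batch1, None, k=2)
state2, reclustered2 = process_timestamp(batch2, state1, k=2)
state3, reclustered3 = process_timestamp(batch3, state2, k=2)
```

In this toy run the first timestamp always clusters, the small drift passes verification and reuses the old centers, and the shifted batch triggers reclustering. Verification is one pass over the data per timestamp, which is the cheap step the scheme relies on; in C-Cube it is the part the slides assign to an elastic operator.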
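
The scaling strategy on slide 10 can be sketched as a pre-started pool in which only part of the machines are active. This is a hedged illustration, not the deck's code: the `VmPool` class, the `capacity_per_vm` parameter, and the rule "one active VM per unit of capacity" are hypothetical, since the slides only say that machines are activated on demand according to the workload.

```python
import math


class VmPool:
    """Pool of pre-started VMs; only `active` of them receive work."""

    def __init__(self, total_vms, capacity_per_vm, initial_active=1):
        if not 1 <= initial_active <= total_vms:
            raise ValueError("initial_active must be between 1 and total_vms")
        self.total_vms = total_vms              # all VMs are started up front
        self.capacity_per_vm = capacity_per_vm  # assumed workload one VM can absorb
        self.active = initial_active

    @property
    def idle(self):
        """VMs that are running but currently receive no work."""
        return self.total_vms - self.active

    def rebalance(self, workload):
        """Activate or idle VMs so that active capacity covers the workload."""
        needed = max(1, math.ceil(workload / self.capacity_per_vm))
        # The pool cannot grow beyond the VMs started at the beginning.
        self.active = min(self.total_vms, needed)
        return self.active


pool = VmPool(total_vms=6, capacity_per_vm=100.0)
active_moderate = pool.rebalance(250.0)   # 3 VMs active, 3 idle
idle_moderate = pool.idle
active_peak = pool.rebalance(1000.0)      # capped at 6, the pool size
active_quiet = pool.rebalance(0.0)        # never drops below 1
```

The cap in `rebalance` makes the trade-off visible: activation is fast because the machines already run, but the pool size fixed at startup bounds how far the system can scale.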