Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices




  1. Distributed Machine Learning and Graph Processing with Sparse Matrices. Speaker: LIN Qian (http://www.comp.nus.edu.sg/~linqian/)
  2. Big Data, Complex Algorithms: PageRank (dominant eigenvector), recommendations (matrix factorization), anomaly detection (top-K eigenvalues), user importance (vertex centrality). Machine learning + graph algorithms.
  3. Large-Scale Processing Frameworks. Data-parallel frameworks, e.g. MapReduce/Dryad (2004): process each record in parallel; use case: computing sufficient statistics, analytics queries. Graph-centric frameworks, e.g. Pregel/GraphLab (2010): process each vertex in parallel; use case: graphical models. Array-based frameworks, e.g. MadLINQ (2012): process blocks of an array in parallel; use case: linear algebra operations.
  4. PageRank Using Matrices. Power method: repeatedly apply p = M*p, where M is the web-graph matrix and p is the PageRank vector; p converges to the dominant eigenvector. These are linear algebra operations on sparse matrices (see the plain-R power-method sketch after the slide list).
  5. Statistical software: moderately sized datasets, on a single server, entirely in memory.
  6. Work-arounds for massive datasets: vertical scalability, sampling.
  7. MapReduce: limited to aggregation processing.
  8. Data analytics: deep vs. scalable. Statistical software (R, MATLAB, SPSS, SAS) vs. MapReduce.
  9. Ways to improve: 1) statistical software += large-scale data management; 2) MapReduce += statistical functionality; 3) combine both existing technologies.
  10. Parallel MATLAB, pR
  11. HAMA, SciHadoop
  12. MadLINQ [EuroSys’12]: a linear algebra platform on Dryad, but not efficient for sparse matrix computation.
  13. Ricardo [SIGMOD’10]: R sends aggregation-processing queries to Hadoop, and Hadoop returns aggregated data to R, but this ends up inheriting the inefficiencies of the MapReduce interface.
  14. Array-based software is single-threaded, with limited support for scaling.
  15. Challenge 1: Sparse Matrices
  16. Challenge 1 – Sparse Matrices. [Plot: normalized block density vs. block ID for LiveJournal, Netflix, and ClueWeb-1B.] Some blocks hold up to 1000x more data than others → computation imbalance (a sketch for measuring this skew follows the slide list).
  17. Challenge 2 – Data Sharing. Sharing data through pipes or the network is time-inefficient (sending copies) and space-inefficient (extra copies): each process on each server works on its own local or network copy of the data. For sparse matrices → communication overhead.
  18. Extend R: make it scalable and distributed, for large-scale machine learning and graph processing on sparse matrices.
  19. Presto architecture
  20. Presto architecture. [Diagram: a Master coordinates multiple Workers; each worker runs several R instances that share the server's DRAM.]
  21. Distributed array (darray): partitioned, shared, dynamic.
  22. foreach: parallel execution of the loop body f(x), followed by a barrier; call update() to publish changes.
  23. PageRank Using Presto. Create the distributed arrays: M (the web-graph matrix) and P (the PageRank vector, split into partitions P1, P2, ..., PN/s):
      M <- darray(dim=c(N,N), blocks=c(s,N))
      P <- darray(dim=c(N,1), blocks=c(s,1))
      while(..){
        foreach(i, 1:len,
          calculate(m=splits(M,i), x=splits(P), p=splits(P,i)) {
            p <- m*x
          })
      }
  24. PageRank Using Presto (same code as slide 23). Execute the function in the cluster, passing array partitions to each task (a consolidated sketch follows the slide list).
  25. Dynamic repartitioning: to address load imbalance, while preserving correctness.
  26. Repartitioning matrices: profile execution, then repartition.
  27. Invariants: compatibility in array sizes.
  28. Maintaining Size Invariants: invariant(mat, vec, type=ROW) (a usage sketch follows the slide list).
  29. Data sharing for multi-core: zero-copy sharing across cores.
  30. Data sharing challenges: 1) garbage collection; 2) header conflict. [Diagram: two R instances referencing the same R object data part, each expecting its own R object header.]
  31. Overriding R's allocator: allocate process-local headers and map the data part in shared memory. [Diagram: a local R object header per process, with the shared R object data part aligned to page boundaries.]
  32. Immutable partitions → safe sharing: only read-only data is shared.
  33. Versioning arrays: to ensure correctness when arrays are shared across machines.
  34. Fault tolerance. Master: primary-backup replication. Worker: heartbeat-based failure detection.
  35. Presto applications: Presto roughly doubles the lines of code relative to programming purely in R.
  36. Evaluation: faster than Spark and Hadoop on in-memory data.
  37. Multi-core support benefits.
  38. Data sharing benefits. [Bar charts: compute vs. transfer time at 10, 20, and 40 cores, without and with data sharing.]
  39. Repartitioning benefits. [Bar charts: per-worker transfer and compute time, without and with repartitioning.]
  40. Repartitioning benefits. [Plot: time to convergence (s) and cumulative partitioning time (s) versus the number of repartitions.]
  41. Limitations: 1) in-memory computation; 2) one writer per partition; 3) array-based programming.
  42. Conclusion: Presto, a large-scale array-based framework that extends R; challenges with sparse matrices; repartitioning and sharing of versioned arrays.
  43. IMDb rating: 8.5. Release date: 27 June 2008. Director: Doug Sweetland. Studio: Pixar. Runtime: 5 min. Brief: a stage magician’s rabbit gets into a magical onstage brawl against his neglectful guardian with two magic hats.
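
For slide 4, here is a minimal single-machine sketch of the power method in plain R. Everything in it (the helper name power_method, the tolerance, the iteration cap, and the tiny 3-page example graph) is an illustrative assumption, not code from the talk.

    # Power method: repeatedly apply p <- M %*% p until p stops changing.
    # M is assumed to be a column-stochastic web-graph matrix.
    power_method <- function(M, tol = 1e-8, max_iter = 100) {
      p <- rep(1 / nrow(M), nrow(M))     # start from the uniform vector
      for (iter in seq_len(max_iter)) {
        p_new <- as.vector(M %*% p)
        p_new <- p_new / sum(p_new)      # re-normalize every iteration
        if (sum(abs(p_new - p)) < tol) break
        p <- p_new
      }
      p_new                              # approximates the dominant eigenvector
    }

    # Example: a 3-page cycle (each column sums to 1); PageRank is uniform.
    M <- matrix(c(0, 0, 1,
                  1, 0, 0,
                  0, 1, 0), nrow = 3, byrow = TRUE)
    power_method(M)                      # ~ c(1/3, 1/3, 1/3)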
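
For slide 16, a sketch of how per-block sparsity skew could be measured in R with the Matrix package. The synthetic random matrix and the block size are stand-ins for the LiveJournal/Netflix/ClueWeb-1B graphs on the slide; real graphs show the heavy skew, a uniform random matrix will not.

    library(Matrix)

    A <- rsparsematrix(10000, 10000, density = 1e-4)  # stand-in sparse matrix
    s <- 1000                                         # rows per block
    starts <- seq(1, nrow(A), by = s)

    # Non-zeros per row block; a heavily skewed distribution is what causes
    # the computation imbalance across equally sized partitions.
    nnz_per_block <- sapply(starts, function(r) {
      nnzero(A[r:min(r + s - 1, nrow(A)), , drop = FALSE])
    })
    summary(nnz_per_block)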
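
For slides 23-24, the slide code gathered into one hedged sketch using the Presto primitives named in the talk (darray, foreach, splits, update). The %*% operator, the explicit update(p) call (slide 22: updates must be published), the fixed iteration count, and the placeholders N, s, len, num_iters are assumptions added for readability; this only runs inside the Presto runtime, not in stock R.

    # PageRank loop in Presto-style R (sketch).
    M <- darray(dim = c(N, N), blocks = c(s, N))   # web-graph matrix
    P <- darray(dim = c(N, 1), blocks = c(s, 1))   # PageRank vector

    for (iter in 1:num_iters) {
      foreach(i, 1:len,
        calculate <- function(m = splits(M, i),    # i-th row block of M
                              x = splits(P),       # the whole current vector
                              p = splits(P, i)) {  # i-th partition of P (output)
          p <- m %*% x       # local block of the matrix-vector product
          update(p)          # publish the new partition (slide 22)
        })
    }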
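
For slide 28, a sketch of declaring the size invariant from the slide so that dynamic repartitioning (slide 25) keeps the row blocks of the matrix aligned with the partitions of the vector. The darray definitions and the placeholders N and s are reused from the PageRank sketch above; the invariant(mat, vec, type=ROW) call itself is the one shown on the slide.

    # Row partitions of M must stay compatible with partitions of P, so the
    # runtime can repartition either array without breaking m %*% x.
    M <- darray(dim = c(N, N), blocks = c(s, N))
    P <- darray(dim = c(N, 1), blocks = c(s, 1))
    invariant(M, P, type = ROW)   # the declaration shown on the slide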

Editor's notes

  • MapReduce excels in massively parallel processing, scalability, and fault tolerance. In terms of analytics, however, such systems have been limited primarily to aggregation processing, i.e., computation of simple aggregates such as SUM, COUNT, and AVERAGE, after using filtering, joining, and grouping operations to prepare the data for the aggregation step. Although most DMSs provide hooks for user-defined functions and procedures, they do not deliver the rich analytic functionality found in statistical packages.
  • Virtually all prior work attempts to get along with only one type of system, either adding large-scale data management capability to statistical packages or adding statistical functionality to DMSs. This approach leads to solutions that are often cumbersome, unfriendly to analysts, or wasteful in that a great deal of well established technology is needlessly re-invented or re-implemented.
  • Convert matrix operations to MapReduce functions.
  • R sending aggregation-processing queries to Hadoop (written in the high-level Jaql query language), and Hadoop sending aggregated data to R for advanced statistical processing or visualization.
  • R has serious limitations when applied to very large datasets: limited support for distributed processing, no strategy for load balancing, no fault tolerance, and is constrained by a server’s DRAM capacity.
  • Large-scale machine learning and graph processing on sparse matrices
  • Distributed array (darray) provides a shared, in-memory view of multi-dimensional data stored across multiple servers.
  • Repartitioning can be used to subdivide an array into a specified number of parts. Repartitioning is an optional performance optimization which helps when there is load imbalance in the system.
  • Note that for programs with general data structures (e.g., trees) writing invariants is difficult. However, for matrix computation, arrays are the only data structure and the relevant invariant is the compatibility in array sizes.
  • Qian’s comment: same concept as snapshot isolation.
