Spark Meetup July 2015


  1. Spark Meetup
     Big Data Analytics, Verizon Lab, Palo Alto
     July 28th, 2015
     Copyright 2015 Verizon. All Rights Reserved. Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners.
  2. Similarity Computation
  3. Similarity Computation Flows
     • Column based flow for tall-skinny matrices (60M users, 100K items)
       – Mapper: emit (item-i, item-j), score-ij
       – Reducer: reduce over (item-i, item-j) to get similarity-ij
       – Spark 1.2 RowMatrix.columnSimilarities
     • Row based flow: https://issues.apache.org/jira/browse/SPARK-4823
       – Column similarity in tall-wide matrices: 60M users, 1M-10M items from advertising use-cases
       – Kernel generation for tall-skinny matrices
         – 60M users, 50-400 latent factors from advertising use-cases
         – 10M devices, skinny features from IoT use-cases
  4. Row Based Flow
     • Preprocess
       – Column similarity in tall-wide matrices: transpose the data matrix
       – Kernel generation for tall-skinny matrices: use the input data matrix as is
     • Algorithm
       – Distributed matrix multiply using a blocked cartesian pattern
       – Shuffle space control using topK and a similarity threshold
       – User-specified kernel for the vector dot product
       – Supported kernels: Cosine, Euclidean, RBF, ScaledProduct
     • Code optimization
       – Norm caching for efficiency (the kernel abstraction differs from scikit-learn)
       – DGEMM for dense vectors: Spark 1.4 recommendForAll
       – BLAS.dot for sparse vectors: https://github.com/apache/spark/pull/6213
  5. Kernel Examples
     CosineKernel: item->item similarity
         case class CosineKernel(rowNorms: Map[Long, Double], threshold: Double) extends Kernel {
           override def compute(vi: Vector, indexi: Long, vj: Vector, indexj: Long): Double = {
             val similarity = BLAS.dot(vi, vj) / rowNorms(indexi) / rowNorms(indexj)
             if (similarity <= threshold) return 0.0
             similarity
           }
         }
     ScaledProductKernel: memory based recommendation
         case class ScaledProductKernel(rowNorms: Map[Long, Double]) extends Kernel {
           override def compute(vi: Vector, indexi: Long, vj: Vector, indexj: Long): Double = {
             BLAS.dot(vi, vj) / rowNorms(indexi)
           }
         }
  6. Runtime Analysis
     Dataset details:
                  ML-1M   ML-10M   ML-20M   Netflix
         ratings  1M      10M      20M      100M
         users    6040    69878    138493   480189
         items    3706    10677    26744    17770
     • Production examples
       – Data matrix: 60M x 2.5M
       – minSupport: 500
       – itemThreshold: 1000
       – Runtime: ~4 hrs
     [Chart: Runtime (s) by dataset (ML-1M, ML-10M, ML-20M, Netflix); series "col 1e-2" and "row 1e-2"; x-axis labelled "items"]
  7. Shuffle Write Analysis
     [Chart: Shuffle write (MB) by dataset (ML-1M, ML-10M, ML-20M, Netflix); series "col 1e-2" and "row 1e-2"; x-axis labelled "movies"]
  8. TopK Shuffle Write Analysis For Row Based Flow
     [Chart: Shuffle write (MB) against topk (50, 100, 200, 400, 1000) for the row based flow at threshold 1e-2; one series per dataset (ML-1M, ML-10M, ML-20M, Netflix)]
  9. Recommendation Engine
  10. Recommendation Algorithms
      • Memory based: kNN based recommendation algorithm using the similarity engine
      • Model based: ALS based implicit feedback formulation
      • Datasets
        – MovieLens 1M
        – Netflix
      • Mapped ratings to binary features for comparison
      • Evaluate recommendation performance using
        – RMSE
        – Precision @ k
  11. kNN Based Formulation
          val similarItems = SimilarityMatrix.rowSimilarities(
            itemFeatures, numNeighbors, threshold)
          val kernel = new ScaledProductKernel(rowNorms)
          val recommendation = SimilarityMatrix.multiply(
            similarItems, userFeatures, kernel, k)
      Predicted rating: [formula not captured in the transcript]
  12. ALS Formulation
      • Implicit feedback datasets: unobserved items are treated as 0
      • Minimize: [objective not captured in the transcript]
      • Needs Gram matrix aggregation for 0-ratings
          val als = new ALSQp()
            .setRank(params.rank)
            .setIterations(params.numIterations)
            .setUserConstraint(Constraints.SMOOTH)
            .setItemConstraint(Constraints.SMOOTH)
            .setImplicitPrefs(true)
            .setLambda(params.lambda)
            .setAlpha(params.alpha)
          val mfModel = als.run(training)
          RankingUtils.recommendItemsForUsers(mfModel, k, skipItems)
  13. Comparing kNN and ALS on RMSE
      • RMSE on MovieLens (bars: kNN 30 neighbors, ALS): 0.561, 0.62
      • RMSE on Netflix (bars: kNN 30 neighbors, ALS): 0.571, 0.661
  14. Comparing kNN and ALS on Prec@k (Netflix)
  15. Segmentation Engine
  16. Segmentation Feature Extraction
      • Input data contains location and time information along with other features
      • Extract time-unit features for each location (zip code)
      Raw data:
          Id    Time (Hour)   Zip Code   websites
          abc   10            94301      website1
          abc   15            94085      website2
          def   10            94301      website1
          ...   ...           ...        ...
      Sparse Website Matrix:
                  website1            website2            …
          94301   # of hours (1-24)   # of hours (1-24)   …
          94085   # of hours (1-24)   # of hours (1-24)   …
          ...     ...                 ...                 ...
      Column counts:
          Zip codes   31516
          Websites    11646
          Ratings     45M
  17. ALS with Positive Constraints
      • ZipCode x Website sparse matrix (n×m) is factored as W^T (n×k) times H (k×m)
      • Each row of W^T represents ZipCode factors
      • Each column of H represents Website factors
      • What do columns of W^T and rows of H represent?
      • minimize: [objective not captured in the transcript]
  18. Segment Analysis I
      [Figure panels: Local websites; Global websites]
  19. Segment Analysis II
      [Figure: Segments]
      Most factors display geographic affinity.
  20. Use ALSQp for Nonnegative Matrix Factorization
          val als = new ALSQp()
            .setRank(params.rank)
            .setIterations(params.numIterations)
            .setUserConstraint(Constraints.POSITIVE)
            .setItemConstraint(Constraints.POSITIVE)
            .setImplicitPrefs(true)
            .setLambda(params.lambda)
          val mfModel = als.run(training)
      Other constraints:
          .setItemConstraint(Constraints.SIMPLEX) // 1^T w = s, w >= 0, s constant
      https://github.com/apache/spark/pull/3221
  21. Q and A
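
The kernel abstraction on slides 4, 5 and 11 can be tried out locally with a small sketch. It assumes plain Scala maps in place of MLlib vectors and a nested scan in place of the blocked cartesian multiply, so it only shows the kernel, threshold and topK logic, not the distributed flow. `SparseRow`, `dot`, `norm` and this `rowSimilarities` signature are hypothetical names for illustration and are not the API proposed in SPARK-4823. The zero-norm guard is also an addition that the slides do not show.

```scala
object KernelSketch {
  // Sparse row: column index -> value. Stand-in for an MLlib Vector (assumption).
  type SparseRow = Map[Int, Double]

  // Sparse dot product: iterate over the smaller map, look up in the larger.
  def dot(a: SparseRow, b: SparseRow): Double = {
    val (small, large) = if (a.size <= b.size) (a, b) else (b, a)
    small.iterator.map { case (i, v) => v * large.getOrElse(i, 0.0) }.sum
  }

  def norm(a: SparseRow): Double = math.sqrt(dot(a, a))

  // Same shape as the Kernel on slide 5, over SparseRow instead of Vector.
  trait Kernel {
    def compute(vi: SparseRow, indexi: Long, vj: SparseRow, indexj: Long): Double
  }

  // Cosine similarity with cached row norms; values at or below the
  // threshold are reported as 0.0, as on the slide.
  final case class CosineKernel(rowNorms: Map[Long, Double], threshold: Double) extends Kernel {
    override def compute(vi: SparseRow, indexi: Long, vj: SparseRow, indexj: Long): Double = {
      val ni = rowNorms(indexi)
      val nj = rowNorms(indexj)
      if (ni == 0.0 || nj == 0.0) 0.0 // guard added for this sketch
      else {
        val similarity = dot(vi, vj) / ni / nj
        if (similarity <= threshold) 0.0 else similarity
      }
    }
  }

  // Dot product scaled by the norm of the first row only.
  final case class ScaledProductKernel(rowNorms: Map[Long, Double]) extends Kernel {
    override def compute(vi: SparseRow, indexi: Long, vj: SparseRow, indexj: Long): Double = {
      val ni = rowNorms(indexi)
      if (ni == 0.0) 0.0 else dot(vi, vj) / ni
    }
  }

  // For every row, score all other rows with the kernel, drop zero scores,
  // and keep the topK highest (ties broken by row index).
  def rowSimilarities(rows: Map[Long, SparseRow],
                      kernel: Kernel,
                      topK: Int): Map[Long, Seq[(Long, Double)]] =
    rows.map { case (i, vi) =>
      val scored = rows.iterator
        .collect { case (j, vj) if j != i => (j, kernel.compute(vi, i, vj, j)) }
        .filter(_._2 > 0.0)
        .toSeq
        .sortBy { case (j, s) => (-s, j) }
        .take(topK)
      i -> scored
    }
}
```

To use it, build the norm cache once with `rows.map { case (i, v) => i -> KernelSketch.norm(v) }` and pass it to either kernel. Caching the norms this way mirrors the "norm caching" point on slide 4, since each norm is computed once rather than once per pair.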
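
The objective on slide 12 did not survive in the transcript. For orientation, the standard implicit-feedback ALS objective is written out below. This is an assumption about what the slide showed: the deck's own formula may differ, for example in how the regularization is weighted or how the SMOOTH constraints enter. Here $r_{ui}$ is the observed count or rating, $x_u$ and $y_i$ are the user and item factors, and $\lambda$ and $\alpha$ correspond to `setLambda` and `setAlpha`.

```latex
\min_{X,\,Y}\; \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
  + \lambda \Bigl( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Bigr),
\qquad
p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases},
\qquad
c_{ui} = 1 + \alpha\, r_{ui}

% Closed-form update for one user, with C^u = \mathrm{diag}(c_{u1}, \dots, c_{um}):
x_u = \bigl( Y^{\top} Y + Y^{\top} (C^u - I)\, Y + \lambda I \bigr)^{-1} Y^{\top} C^u\, p_u
```

Under this formulation the sum runs over all user-item pairs, including the unobserved ones with $r_{ui} = 0$. The term $Y^{\top} Y$ is the Gram matrix of the item factors. It is the same for every user, so it can be aggregated once per iteration, while $Y^{\top} (C^u - I)\, Y$ only involves the items that user $u$ actually interacted with. That is one reading of the "Gram matrix aggregation for 0-ratings" bullet.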
