
Parallel and Iterative Processing for Machine Learning Recommendations with Spark

Recommendation systems help narrow your choices to those that best meet your particular needs. They are among the most popular applications of big data processing. In this Free Code Friday session, you’ll learn how to build a recommendation model from movie ratings using an iterative algorithm and parallel processing with Apache Spark MLlib.

Published in: Technology

  1. © 2014 MapR Technologies. Parallel and Iterative Processing for Machine Learning Recommendations with Spark
  2. Agenda
     • Collaborative Filtering with Spark
     • Model training
     • Alternating Least Squares
     • The code
  3. Collaborative Filtering with Spark
     • Recommend items
       – (filtering)
     • Based on user preferences data
       – (collaborative)
  4. Train a Model to Make Predictions
     [Diagram: training data → algorithm → model; new data → model → predictions]
     • Training data: Ted and Carol like Movies B and C
     • New data: Bob likes Movie B. What might he like?
     • Prediction: Bob likes Movie B, so predict Movie C
  5. Alternating Least Squares
     • Approximates the sparse user-item rating matrix
       – as the product of two dense matrices, the User and Item factor matrices
       – tries to learn the hidden features of each user and item
       – the algorithm alternately fixes one factor matrix and solves for the other
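The alternating step can be sketched in plain Scala (no Spark) for the simplest case, where each user and each item has a single hidden feature (rank 1). This is only an illustration of the idea, not MLlib's distributed implementation; the object name `Rank1AlsSketch`, the sample ratings, and the regularization value `lambda` are all made up for the example.

```scala
object Rank1AlsSketch {
  // Observed ratings as (user, item, rating); unrated pairs are simply absent.
  val ratings: Seq[(Int, Int, Double)] = Seq(
    (0, 0, 5.0), (0, 1, 4.0),
    (1, 0, 4.0), (1, 1, 3.0), (1, 2, 1.0),
    (2, 1, 4.0), (2, 2, 2.0)
  )
  val lambda = 0.1 // regularization strength (illustrative value)

  // Closed-form regularized least-squares update for one factor while the
  // other side is held fixed. For rank 1 this reduces to
  //   sum(r * f) / (lambda + sum(f * f))  over the observed ratings.
  def solve(obs: Seq[(Int, Double)], fixed: Map[Int, Double]): Double = {
    val num = obs.map { case (j, r) => r * fixed(j) }.sum
    val den = lambda + obs.map { case (j, _) => fixed(j) * fixed(j) }.sum
    num / den
  }

  // Sum of squared differences between observed and reconstructed ratings.
  def squaredError(x: Map[Int, Double], y: Map[Int, Double]): Double =
    ratings.map { case (u, i, r) => math.pow(r - x(u) * y(i), 2) }.sum

  def run(iterations: Int): (Map[Int, Double], Map[Int, Double]) = {
    val byUser = ratings.groupBy(_._1).map { case (u, rs) => (u, rs.map(t => (t._2, t._3))) }
    val byItem = ratings.groupBy(_._2).map { case (i, rs) => (i, rs.map(t => (t._1, t._3))) }
    var x: Map[Int, Double] = byUser.keys.map(u => (u, 1.0)).toMap // user factors
    var y: Map[Int, Double] = byItem.keys.map(i => (i, 1.0)).toMap // item factors
    for (_ <- 1 to iterations) {
      x = byUser.map { case (u, obs) => (u, solve(obs, y)) } // fix items, solve for users
      y = byItem.map { case (i, obs) => (i, solve(obs, x)) } // fix users, solve for items
    }
    (x, y)
  }
}
```

Calling `Rank1AlsSketch.run(10)` returns the learned user and item factors; the predicted rating for a pair is the product `x(user) * y(item)`. A real model would use a higher rank, as the deck does with rank=20.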
  6. ML Cross Validation Process
     [Diagram: data is split into a training set and a test set; the training set feeds model training/building; the model is tested against the test set to produce predictions; the train/test loop repeats]
  7. Typical MapReduce Workflows
     [Diagram: a chain of jobs (Job 1, Job 2, ..., last job), each with map and reduce phases; the output of each job is written to HDFS as a SequenceFile and read back as the input to the next job]
     • Iteration is slow because it writes/reads data to disk
  8. Resilient Distributed Datasets (RDD)
     Spark revolves around RDDs
     • read-only collection of elements
     • operated on in parallel
     • partitions cached in memory
  9. Ratings Data
  10. Parse Input
      import org.apache.spark.mllib.recommendation.Rating

      // parse input UserID::MovieID::Rating
      def parseRating(str: String): Rating = {
        val fields = str.split("::")
        Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
      }

      // create an RDD of Rating objects
      val ratingsRDD = ratingText.map(parseRating).cache()
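The parser on the slide throws an exception if a line is malformed. A minimal sketch of a more forgiving variant, in plain Scala so it runs without Spark, might look like the following. `ParsedRating` and `parseRatingOpt` are hypothetical names introduced here; `ParsedRating` is only a local stand-in for MLlib's `Rating`.

```scala
import scala.util.Try

object ParseRatingSketch {
  // Local stand-in for MLlib's Rating so the sketch runs without Spark.
  final case class ParsedRating(user: Int, product: Int, rating: Double)

  // Returns None instead of throwing when a line has too few fields
  // or a field is not numeric. Extra trailing fields are ignored.
  def parseRatingOpt(line: String): Option[ParsedRating] =
    line.split("::") match {
      case Array(user, product, rating, _*) =>
        Try(ParsedRating(user.toInt, product.toInt, rating.toDouble)).toOption
      case _ => None
    }
}
```

For example, `ParseRatingSketch.parseRatingOpt("1::1193::5")` yields a parsed rating, while a line such as `"1::x::5"` yields `None`, so bad input can be filtered out rather than failing the whole job.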
  11. Build Model
      [Diagram: data is split into a training set and a test set; the training set is used to build the model]
      • split ratings RDD into training data RDD (80%) and test data RDD (20%)
      • build a user product matrix model
  12. Create Model
      import org.apache.spark.mllib.recommendation.ALS

      // Randomly split ratings RDD into training data RDD (80%) and test data RDD (20%)
      val splits = ratingsRDD.randomSplit(Array(0.8, 0.2), 0L)
      val trainingRatingsRDD = splits(0).cache()
      val testRatingsRDD = splits(1).cache()

      // build an ALS user product matrix model with rank=20, iterations=10
      val model = new ALS().setRank(20).setIterations(10).run(trainingRatingsRDD)
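The idea behind a seeded random split can be sketched in plain Scala. This sketch assumes a simple per-element random draw, which means the 80/20 proportions are only approximate and the same seed reproduces the same split. It is an illustration of the concept, not Spark's implementation; `SplitSketch` and its parameters are names made up for the example.

```scala
import scala.util.Random

object SplitSketch {
  // Assigns each element to the first (training) side with probability
  // trainFraction, and to the second (test) side otherwise.
  def randomSplit[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed) // fixed seed makes the split repeatable
    data.partition(_ => rng.nextDouble() < trainFraction)
  }
}
```

With `SplitSketch.randomSplit((1 to 1000).toVector, 0.8, 0L)`, roughly 800 elements land in the first collection and the rest in the second, and every element appears in exactly one of the two.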
  13. Get predictions
      // get predicted ratings to compare to test ratings
      // call model.predict with test UserID, MovieID input data
      // (testUserProductRDD is not defined on the slides; it is assumed to hold the test set's (user, movie) pairs)
      val predictionsForTestRDD = model.predict(testUserProductRDD)

      [Diagram: user, movie test data → model → predicted ratings]
  14. Compare predictions to Tests
      Join predicted ratings to test ratings in order to compare:
        test (Key, Value):        ((user, product), test rating)
        predictions (Key, Value): ((user, product), predicted rating)
        joined (Key, Value):      ((user, product), (test rating, predicted rating))
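The keyed join described above can be illustrated with ordinary Scala maps. This is a sketch of the semantics only, assuming each (user, product) key appears at most once on each side; `JoinSketch` and `innerJoin` are hypothetical names, and no Spark is involved.

```scala
object JoinSketch {
  type Key = (Int, Int) // (user, product)

  // Inner join: keep only keys present on both sides and pair up
  // the test rating with the predicted rating for that key.
  def innerJoin(test: Map[Key, Double],
                predicted: Map[Key, Double]): Map[Key, (Double, Double)] =
    test.collect {
      case (key, testRating) if predicted.contains(key) =>
        (key, (testRating, predicted(key)))
    }
}
```

Keys that appear on only one side are dropped, which is why the joined result contains exactly the (user, product) pairs that have both a test rating and a prediction.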
  15. Test Model
      // prepare predictions for comparison
      val predictionsKeyedByUserProductRDD = predictionsForTestRDD.map {
        case Rating(user, product, rating) => ((user, product), rating)
      }
      // prepare test for comparison
      val testKeyedByUserProductRDD = testRatingsRDD.map {
        case Rating(user, product, rating) => ((user, product), rating)
      }
      // join the test with predictions
      val testAndPredictionsJoinedRDD = testKeyedByUserProductRDD
        .join(predictionsKeyedByUserProductRDD)
  16. Compare predictions to Tests
      Find false positives: where test rating <= 1 and predicted rating >= 4
        Key, Value: ((user, product), (test rating, predicted rating))
  17. Test Model
      val falsePositives = testAndPredictionsJoinedRDD.filter {
        case ((user, product), (ratingT, ratingP)) => ratingT <= 1 && ratingP >= 4
      }
      falsePositives.take(2)

      Array[((Int, Int), (Double, Double))] =
        ((3842,2858),(1.0,4.106488210964762)),
        ((6031,3194),(1.0,4.790778049100913))
  18. Test Model Mean Absolute Error
      // evaluate the model using Mean Absolute Error (MAE) between test and predictions
      val meanAbsoluteError = testAndPredictionsJoinedRDD.map {
        case ((user, product), (testRating, predRating)) =>
          val err = testRating - predRating
          Math.abs(err)
      }.mean()

      meanAbsoluteError: Double = 0.7244940545944053
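As a small worked example of the metric, the same calculation can be run on a handful of (test rating, predicted rating) pairs in plain Scala. The object name `MaeSketch` and the sample pairs are invented for illustration.

```scala
object MaeSketch {
  // Mean Absolute Error: the average of |test - predicted| over all pairs.
  def meanAbsoluteError(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (test, predicted) pair")
    pairs.map { case (testRating, predRating) => math.abs(testRating - predRating) }.sum / pairs.size
  }
}
```

For the pairs `(4.0, 3.5)`, `(1.0, 2.0)`, `(5.0, 4.5)` the absolute errors are 0.5, 1.0, and 0.5, so the MAE is 2.0 / 3, about 0.667. Read this way, the 0.72 reported on the slide means predictions were off by roughly three quarters of a rating point on average.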
  19. Soon to Come
      • Spark On Demand Training
        – https://www.mapr.com/services/mapr-academy/
      • Blogs and Tutorials:
        – Movie Recommendations with Collaborative Filtering
        – Spark Streaming
  20. Machine Learning Blog
      • https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark
  21. Spark on MapR
      • Certified Spark Distribution
      • Fully supported and packaged by MapR in partnership with Databricks
        – mapr-spark package with Spark, Shark, Spark Streaming today
        – Spark-python, GraphX and MLlib soon
      • YARN integration
        – Spark can then allocate resources from the cluster when needed
  22. References
      • Spark web site: http://spark.apache.org/
      • https://databricks.com/
      • Spark on MapR:
        – http://www.mapr.com/products/apache-spark
      • Spark SQL and DataFrame Guide
      • Apache Spark vs. MapReduce – Whiteboard Walkthrough
      • Learning Spark - O'Reilly Book
      • Apache Spark
  23. Q&A
      Engage with us!
      • @mapr
      • maprtech
      • MapR
      • mapr-technologies
