Ahlia University
College of Information Technology
By:
Khadija Atiya Al-Mohsen 201110493
Supervised by:
Dr. Huda AlJobori
• Overview of Big Data
– 6 Vs of Big Data
– Big Data Technologies
• Recommender system
– Definition
– Problem Formulation
– Approaches; CF in particular
– Challenges of CF
– Singular Value Decomposition (SVD) approach
• My Experiments
• Conclusion
• Future work
[Diagram: Stage 1 → Stage 2 → Stage 3]
[Diagram: the 6 Vs of Big Data, introduced cumulatively — Volume, Velocity, Variety, Veracity, Variability, Value]
• Recommender System (RS)
– For every user, the RS suggests new items that might
be of interest to that user
– It analyzes users' purchase history, their
profiles and their activities on the website
• Collaborative Filtering (CF) RS
– Assumption:
• Users may have similar opinions
• User opinion is stable
– Two variations:
• User-based RS
• Item-based RS
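The assumptions above can be made concrete with a tiny sketch: user-based CF measures how similarly two users have rated items, typically with cosine similarity between their rating rows. The matrix below and the user indices are illustrative, not from the thesis data.

```python
import numpy as np

# Toy user-item rating matrix (rows = users, columns = items); values are illustrative.
R = np.array([[5.0, 3.0, 4.0, 0.0],
              [4.0, 2.0, 5.0, 1.0],
              [1.0, 5.0, 1.0, 4.0]])

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# User-based CF rests on the first assumption above: users whose past ratings
# agree (high similarity) are expected to agree on future items too.
sim_01 = cosine(R[0], R[1])  # users 0 and 1 rate alike
sim_02 = cosine(R[0], R[2])  # users 0 and 2 do not
```

Item-based CF applies the same similarity to columns of R instead of rows.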
• Collaborative Filtering (CF) RS
• User-based RS
• Item-based RS
[Example user–item matrix: rows are users 1–3, columns are movies 1–5; a ✓ marks a movie the user has rated]
• Scalability
• Data Sparsity
• Synonymy
• Cold Start
• Grey sheep
R = U · S · Vᵀ
(m×n) = (m×m) · (m×n) · (n×n)
U: holds the left singular vectors of R in its columns
S: holds the singular values of R in its diagonal entries, in decreasing order
V: holds the right singular vectors of R in its columns
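A minimal sketch of this factorization with numpy (the matrix values are illustrative): `np.linalg.svd` returns U, the singular values in decreasing order, and Vᵀ, and their product recovers R.

```python
import numpy as np

# Small illustrative user-item matrix R (m = 3, n = 4).
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0, 5.0]])
m, n = R.shape

# Full SVD: U is m x m, Vt is n x n, s holds the singular values.
U, s, Vt = np.linalg.svd(R, full_matrices=True)

# Rebuild the m x n matrix S with the singular values on its diagonal.
S = np.zeros((m, n))
S[:len(s), :len(s)] = np.diag(s)
```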
R = U · S · Vᵀ
(m×n) = (m×m) · (m×n) · (n×n)
In the recommender setting:
U: holds user features
S: holds the feature strengths
V: holds movie features
[Diagram: the user–item matrix R (users × movies) factored into a user-feature matrix, a feature matrix and an item-feature matrix; the latent features correspond to genres such as comedy, horror and romantic]
Rank-k approximation:
R ≈ Uk · Sk · Vkᵀ
(m×n) ≈ (m×k) · (k×k) · (k×n)
[Diagram: the user–item matrix approximated by the product of the truncated user-feature, feature and item-feature matrices; an unknown entry "?" is predicted from this product]
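A short sketch of the rank-k prediction step, assuming the matrix has already been imputed (the values below are illustrative): keep only the k largest singular values and vectors, and every entry of the product is a predicted rating.

```python
import numpy as np

# Already-imputed user-item matrix (illustrative values).
R = np.array([[5.0, 3.0, 2.5, 1.0],
              [4.0, 3.5, 2.0, 1.0],
              [1.0, 1.5, 3.0, 5.0],
              [1.0, 1.0, 3.5, 4.5]])

k = 2  # number of latent dimensions kept

# Thin SVD, then truncate to the k largest singular values/vectors.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Every entry of R_hat is a predicted rating, including cells that were
# originally unknown before imputation.
R_hat = Uk @ Sk @ Vtk
```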
• Preprocessing:
– Imputation:
• By zero (ByZero)
• By item average rating (ItemAvgRating)
• By user average rating (UserAvgRating)
• By mean of ItemAvgRating and UserAvgRating
(Mean_ItemAvgRating&UserAvgRating)
– Normalization
• Offsetting the difference in rating scale between the users
• Post-processing:
– De-normalization
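The preprocessing steps above can be sketched in a few lines (illustrative data; 0 marks a missing rating): compute per-user and per-item averages over the observed entries, impute each missing cell with the mean of the two, then subtract each user's average to offset rating-scale differences.

```python
import numpy as np

# 0 marks a missing rating in this illustrative matrix.
R = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0, 5.0]])
observed = R > 0
Rn = np.where(observed, R, np.nan)

user_avg = np.nanmean(Rn, axis=1, keepdims=True)  # UserAvgRating, per row
item_avg = np.nanmean(Rn, axis=0, keepdims=True)  # ItemAvgRating, per column

# Mean_ItemAvgRating&UserAvgRating: fill a missing cell with the mean of the
# corresponding user average and item average.
fill = (user_avg + item_avg) / 2.0            # broadcasts to R's shape
R_filled = np.where(observed, R, fill)

# Normalization: subtract each user's average rating; post-processing
# (de-normalization) adds it back to the predictions afterwards.
R_norm = R_filled - user_avg
```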
• SVD-based RS:
– Research started around 2000.
– Data sizes have kept increasing.
– Good results: accuracy, speed and scalability
– Variants: incremental SVD, combinations with neural networks and with traditional CF approaches
– Tested on "Big Data": relatively small data sets
• Questions:
– Are these findings still valid for TODAY'S BIG DATA?
• Appropriate value of k
• Appropriate imputation technique
– Do new Big Data tools and frameworks (Hadoop & Spark) help?
• Experimental Environment:
– MacBook Pro with OS X 10.9.3
– Apache Hadoop 2.4.0
– Apache Spark 1.0.2
• Data Set:
– MovieLens 1M data set
– Divided into a training set and a test set
• Training ratio x
• Evaluation Metric:
– Mean Absolute Error (MAE)
• Experimental steps:
– Load the data set into Hadoop
– Construct the user-item matrix
– Several experiments: imputation by
Mean_ItemAvgRating&UserAvgRating
– Normalization
– Compute the SVD using Apache Spark
• Several experiments: k = 20
• Result: Uk, Sk and Vkᵀ
– Evaluation:
• Multiply Uk, Sk and Vkᵀ to obtain the predictions
• Compute the MAE between the test set and the predictions
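The evaluation step can be sketched as follows; the helper name `mae` and the toy values are illustrative, not from the thesis code. MAE averages the absolute differences between held-out test ratings and the corresponding entries of the predicted matrix.

```python
import numpy as np

def mae(test_ratings, predictions):
    """Mean Absolute Error between held-out ratings and predicted ratings.

    test_ratings: iterable of (user, item, rating) triples from the test set.
    predictions:  2-D array R_hat with a predicted rating for every cell.
    """
    errors = [abs(r - predictions[u, i]) for u, i, r in test_ratings]
    return sum(errors) / len(errors)

# Toy check: two held-out ratings with absolute errors 0.2 and 0.2.
preds = np.array([[4.8, 3.1],
                  [3.9, 1.2]])
test = [(0, 0, 5.0), (1, 1, 1.0)]
score = mae(test, preds)
```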
• Determining the appropriate number of dimensions k

MAE by number of dimensions k and training ratio x:

k    | x = 40%     | x = 60%     | x = 80%
10   | 0.752063387 | 0.740626631 | 0.729999025
20   | 0.751235952 | 0.737954414 | 0.724461829
30   | 0.751605786 | 0.738107001 | 0.723582636
40   | 0.752364535 | 0.738714968 | 0.723803209
50   | 0.753327523 | 0.739710017 | 0.725192112
60   | 0.754025958 | 0.740751258 | 0.726154802
70   | 0.754724263 | 0.741471676 | 0.727282644
80   | 0.755378978 | 0.742729653 | 0.728420107
90   | 0.755930363 | 0.743485881 | 0.729623117
100  | 0.756510921 | 0.744293948 | 0.730338578
150  | 0.759248743 | 0.748479721 | 0.735940524
200  | 0.761302063 | 0.752022533 | 0.740869085
250  | 0.762834111 | 0.754352648 | 0.744583552
300  | 0.764142161 | 0.756565002 | 0.747871699
[Chart: SVD prediction quality (MAE) vs. number of dimensions k, for training ratios x = 40%, 60% and 80%]
• Determining the best imputation technique

MAE by imputation technique and number of dimensions k:

k    | ByZero      | UserAvgRating | ItemAvgRating | Mean_UserAvgRating&ItemAvgRating
10   | 1.287082589 | 0.763900832   | 0.738045498   | 0.729999025
20   | 1.340593039 | 0.758800163   | 0.732242767   | 0.724461829
30   | 1.351316283 | 0.758681267   | 0.731619885   | 0.723582636
40   | 1.342134276 | 0.760268678   | 0.731981172   | 0.723803209
50   | 1.331709429 | 0.762213489   | 0.733055102   | 0.725192112
60   | 1.317791963 | 0.764543289   | 0.733818382   | 0.726154802
70   | 1.303016117 | 0.766207466   | 0.734838138   | 0.727282644
80   | 1.288611141 | 0.768863197   | 0.736044939   | 0.728420107
90   | 1.274236198 | 0.770501176   | 0.737138793   | 0.735940524
100  | 1.261225796 | 0.772165966   | 0.738514469   | 0.730338578
150  | 1.198043117 | 0.780474068   | 0.744252067   | 0.735940524
[Chart: MAE vs. number of dimensions k for the four imputation techniques — ByZero, UserAvgRating, ItemAvgRating, Mean_UserAvgRating&ItemAvgRating]
• Evaluating the accuracy of SVD on different training ratios

MAE by training ratio x, for k = 20 and k = 100:

x    | k = 20      | k = 100
0.35 | 0.7549804   | 0.759679233
0.4  | 0.751235952 | 0.756510921
0.45 | 0.747814989 | 0.753365509
0.5  | 0.744018997 | 0.749880733
0.55 | 0.741011898 | 0.747432557
0.6  | 0.737954414 | 0.744293948
0.65 | 0.734098595 | 0.740644033
0.7  | 0.731331182 | 0.737610654
0.75 | 0.727741033 | 0.734113758
0.8  | 0.724461829 | 0.730338578
0.85 | 0.721460479 | 0.727461993
0.9  | 0.72096045  | 0.726117655
0.95 | 0.733633592 | 0.740065812
[Chart: SVD prediction quality (MAE) vs. training ratio x, for k = 20 and k = 100]
• Results and Discussion:
– Easy parallelization
– Power of Hadoop and Spark
– Results comparable with previous work:
– k = 20 (previously k = 14, k = 15)
– MAE between 0.724 and 0.730 (previously 0.732 and 0.748)
– Importance of the preprocessing steps
– Sensitivity of prediction quality to data sparsity
– Applicability of SVD to today's Big Data
• Big Data → challenges for RS
– Sparseness of the data
– Scalability
– Prediction quality
• Goals of RS in the Big Data era
– Process millions of records + good predictions
• SVD is a promising CF approach
• New tools and frameworks are useful for building
large-scale systems
• Experiments carried out: SVD + Apache Hadoop
and Spark
• A careful implementation of a scalable SVD-based
collaborative filtering recommender system is
effective when the right parameters and the
appropriate frameworks and APIs are chosen.
• Deploy the system on a multi-node cluster
• Incremental SVD / SSVD
• Expectation-Maximization approach
• A hybrid of these approaches
• Compare the Spark implementation with other APIs
and tools such as Apache Mahout.