1. Ahlia University
College of Information Technology
By:
Khadija Atiya Al-Mohsen 201110493
Supervised by:
Dr. Huda AlJobori
2. • Overview of Big Data
– 6 Vs of Big Data
– Big Data Technologies
• Recommender system
– Definition
– Problem Formulation
– Approaches; Collaborative Filtering (CF) in particular
– Challenges of CF
– Singular Value Decomposition (SVD) approach
• My Experiments
• Conclusion
• Future work
13. • Recommender System (RS)
– For every user, the RS suggests new items that
might be of interest to that user
– It analyzes the user's purchase history, profile
and activities on the website
18. R = U · S · Vᵀ
(m×n) = (m×m) · (m×n) · (n×n)
U: holds the left singular vectors of R in its columns
S: holds the singular values of R in its diagonal entries, in decreasing order
V: holds the right singular vectors of R in its columns
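The decomposition above can be checked with NumPy (an illustrative sketch only; the toy matrix is mine, and note that `np.linalg.svd` returns the singular values as a vector, already in decreasing order):

```python
import numpy as np

# A small example rating matrix R (m = 4 users, n = 3 items).
R = np.array([[5.0, 3.0, 1.0],
              [4.0, 3.0, 2.0],
              [1.0, 1.0, 5.0],
              [2.0, 1.0, 4.0]])

# Full SVD: U is m x m, Vt is n x n, and s holds the
# singular values of R in decreasing order.
U, s, Vt = np.linalg.svd(R, full_matrices=True)

# Rebuild the m x n diagonal matrix S and verify R = U.S.Vt.
S = np.zeros(R.shape)
np.fill_diagonal(S, s)
assert np.allclose(R, U @ S @ Vt)
assert np.all(np.diff(s) <= 0)   # decreasing order
```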
19. R = U · S · Vᵀ
(m×n) = (m×m) · (m×n) · (n×n)
U: holds the user features
S: holds the features
V: holds the movie features
[Diagram: the user-item matrix (rows = users, columns = movies) factored into a user-feature matrix, a feature matrix and an item-feature matrix; the latent features correspond to genres such as comedy, horror and romantic]
21. R ≈ Uₖ · Sₖ · Vₖᵀ
(m×n) ≈ (m×k) · (k×k) · (k×n)
[Diagram: the user-item matrix approximated by truncated user-feature, feature and item-feature matrices]
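The rank-k approximation can be sketched in NumPy (the actual experiments used Spark's SVD; this stand-alone version, with a toy matrix of my own, only illustrates the shapes involved):

```python
import numpy as np

def truncated_svd(R, k):
    """Rank-k approximation of R: keep only the k largest
    singular values, so that R ~ Uk . Sk . Vtk."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    Uk = U[:, :k]           # m x k: user-feature matrix
    Sk = np.diag(s[:k])     # k x k: top-k singular values
    Vtk = Vt[:k, :]         # k x n: item-feature matrix
    return Uk, Sk, Vtk

R = np.array([[5.0, 4.0, 1.0, 1.0],
              [4.0, 5.0, 1.0, 2.0],
              [1.0, 1.0, 5.0, 4.0],
              [1.0, 2.0, 4.0, 5.0]])
Uk, Sk, Vtk = truncated_svd(R, k=2)
R_hat = Uk @ Sk @ Vtk       # m x n prediction matrix
```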
22. • Preprocessing:
– Imputation:
• By zero (ByZero)
• By item average rating (ItemAvgRating)
• By user average rating (UserAvgRating)
• By mean of ItemAvgRating and UserAvgRating
(Mean_ItemAvgRating&UserAvgRating)
– Normalization
• Offsetting the difference in rating scale between the users
• Post-processing:
– De-normalization
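The imputation and normalization steps above can be sketched as follows (a minimal NumPy sketch; function names are mine, 0 marks a missing rating, and only the Mean_ItemAvgRating&UserAvgRating imputation variant is shown):

```python
import numpy as np

def impute_and_normalize(R):
    """Fill missing ratings (0s) with the mean of the item's and
    the user's average rating, then subtract each user's average
    to offset the difference in rating scale between users."""
    R = R.astype(float)
    mask = R > 0
    user_avg = R.sum(1) / np.maximum(mask.sum(1), 1)
    item_avg = R.sum(0) / np.maximum(mask.sum(0), 1)
    fill = (user_avg[:, None] + item_avg[None, :]) / 2
    filled = np.where(mask, R, fill)
    normalized = filled - user_avg[:, None]     # normalization
    return normalized, user_avg

def denormalize(P, user_avg):
    """Post-processing: add the user averages back."""
    return P + user_avg[:, None]

R = np.array([[5.0, 0.0, 1.0],
              [0.0, 4.0, 2.0]])
N, ua = impute_and_normalize(R)
P = denormalize(N, ua)
# Missing entries got (user_avg + item_avg) / 2:
# R[0,1] -> 3.5 and R[1,0] -> 4.0.
```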
23. • SVD-based RS:
– Research started in 2000.
– Data sizes have kept increasing since.
– Good results: accuracy, speed and scalability
– Variants: incremental SVD, combining it with neural
networks or with traditional CF approaches
– Tested on "Big Data": relatively small data sets by today's standards
• Questions:
– Are these results still valid for TODAY'S BIG DATA?
• What is the appropriate value of k?
• What is the appropriate imputation technique?
– Do the new Big Data tools and frameworks (Hadoop & Spark) help?
24. • Experimental Environment:
– MacBook Pro running OS X 10.9.3
– Apache Hadoop 2.4.0
– Apache Spark 1.0.2
• Data Set:
– MovieLens 1M data set
– Divided into a training set and a test set
• Training ratio x
• Evaluation Metric:
– Mean Absolute Error (MAE)
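The evaluation metric can be sketched as follows (my own minimal version; only the ratings actually present in the test set count toward the error):

```python
import numpy as np

def mae(predictions, test_ratings):
    """Mean Absolute Error over the observed test-set ratings
    (test_ratings uses 0 for 'no rating')."""
    mask = test_ratings > 0
    return np.abs(predictions[mask] - test_ratings[mask]).mean()

# Toy example: two predicted ratings, off by 0.5 and 1.0.
pred = np.array([[3.5, 2.0], [4.0, 1.0]])
test = np.array([[4.0, 0.0], [3.0, 0.0]])   # only column 0 is rated
print(mae(pred, test))   # -> 0.75
```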
25. • Experimental steps:
– Load the data set into Hadoop
– Construct the user-item matrix
– Imputation (several experiments; by
Mean_ItemAvgRating&UserAvgRating)
– Normalization
– Compute the SVD using Apache Spark
• Several experiments; k = 20
• Result: Uₖ, Sₖ and Vₖᵀ
– Evaluation:
• Multiply Uₖ, Sₖ and Vₖᵀ to obtain the predictions
• Compute the MAE between the test set and the predictions
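The steps above can be sketched end to end (a local NumPy stand-in for the Spark computation, on a random toy matrix of my own; the user-average imputation variant is used here to keep the sketch short):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the MovieLens user-item matrix (0 = unrated).
R = rng.integers(0, 6, size=(30, 20)).astype(float)

# Split the observed ratings into training and test sets (ratio x).
x = 0.8
obs = np.argwhere(R > 0)
rng.shuffle(obs)
n_train = int(x * len(obs))
train, test = np.zeros_like(R), np.zeros_like(R)
for i, j in obs[:n_train]:
    train[i, j] = R[i, j]
for i, j in obs[n_train:]:
    test[i, j] = R[i, j]

# Preprocessing: impute missing entries with the user average,
# then normalize by subtracting it.
mask = train > 0
user_avg = train.sum(1) / np.maximum(mask.sum(1), 1)
filled = np.where(mask, train, user_avg[:, None])
normalized = filled - user_avg[:, None]

# Truncated SVD (k = 20 in the experiments; k = 5 here
# because the toy matrix is small).
k = 5
U, s, Vt = np.linalg.svd(normalized, full_matrices=False)
pred = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Post-processing (de-normalization) and evaluation by MAE.
pred = pred + user_avg[:, None]
test_mask = test > 0
mae = np.abs(pred[test_mask] - test[test_mask]).mean()
```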
30. 0.7
0.8
0.9
1
1.1
1.2
1.3
1.4
0 20 40 60 80 100 120 140 160
MAE
number of dimension k
Comparing the MAE of different imputation technique
ByZero
UserAvgRating
ItemAvgRating
Mean_ UserAvgRating&ItemAvgRating
31. • Evaluating the accuracy of SVD at different
training ratios

Training ratio x | MAE (k=20)  | MAE (k=100)
0.35             | 0.7549804   | 0.759679233
0.4              | 0.751235952 | 0.756510921
0.45             | 0.747814989 | 0.753365509
0.5              | 0.744018997 | 0.749880733
0.55             | 0.741011898 | 0.747432557
0.6              | 0.737954414 | 0.744293948
0.65             | 0.734098595 | 0.740644033
0.7              | 0.731331182 | 0.737610654
0.75             | 0.727741033 | 0.734113758
0.8              | 0.724461829 | 0.730338578
0.85             | 0.721460479 | 0.727461993
0.9              | 0.72096045  | 0.726117655
0.95             | 0.733633592 | 0.740065812
33. • Results and Discussion:
– Easy parallelization
– Power of Hadoop and Spark
– Results comparable with previous work:
• k = 20 (vs. k = 14 and k = 15 previously)
• MAE between 0.724 and 0.730 (vs. 0.732 and 0.748)
– Importance of the preprocessing steps
– Sensitivity of prediction quality to data sparsity
– Applicability of SVD to today's Big Data
34. • Big Data challenges for RS:
– Sparseness of the data
– Scalability
– Prediction quality
• Goals of RS in the Big Data era:
– Process millions of records + make good predictions
• SVD is a promising CF approach
• New tools and frameworks are useful for building
large-scale systems
• Carried out experiments: SVD + Apache Hadoop
and Spark
35. • A careful implementation of a scalable
SVD-based collaborative filtering recommender
system is effective when the right parameters
and the appropriate frameworks and APIs are
chosen.
36. • Deploying the system on a multi-node cluster
• Incremental SVD / SSVD
• An Expectation-Maximization approach
• A hybrid of these approaches
• Comparing the Spark implementation with other APIs
and tools such as Apache Mahout