4. What is Apache Giraph?
Iterative and graph processing on massive datasets
Billion vertices, trillion edges
Data mapped to a graph
•Vertex ids and values
•Edges and edge values
“Think like a vertex”
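The "think like a vertex" model can be illustrated without any Giraph dependency. The following is a minimal plain-Java simulation, not Giraph code: the class name, the method name and the choice of max-value propagation as the example are assumptions made for illustration. Each vertex only sees its own value and the messages it received, and computation stops when no messages are in flight.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; it simulates supersteps in one JVM.
public class ThinkLikeAVertex {
    /** Spreads the largest vertex value to every connected vertex, superstep by superstep. */
    static Map<Integer, Integer> propagateMax(Map<Integer, Integer> values,
                                              Map<Integer, List<Integer>> neighbors) {
        Map<Integer, Integer> current = new HashMap<>(values);
        // Superstep 0: every vertex sends its own value to all of its neighbors.
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        for (int v : current.keySet()) {
            for (int n : neighbors.getOrDefault(v, List.of())) {
                inbox.computeIfAbsent(n, k -> new ArrayList<>()).add(current.get(v));
            }
        }
        // Later supersteps: only vertices with incoming messages do any work.
        while (!inbox.isEmpty()) {
            Map<Integer, List<Integer>> next = new HashMap<>();
            for (Map.Entry<Integer, List<Integer>> e : inbox.entrySet()) {
                int v = e.getKey();
                int best = Collections.max(e.getValue());
                if (best > current.get(v)) {
                    // The value improved, so adopt it and tell the neighbors.
                    current.put(v, best);
                    for (int n : neighbors.getOrDefault(v, List.of())) {
                        next.computeIfAbsent(n, k -> new ArrayList<>()).add(best);
                    }
                }
            }
            inbox = next; // no messages left means every vertex has halted
        }
        return current;
    }
}
```

In a real Giraph job the loop over supersteps and the message delivery are handled by the framework; only the per-vertex logic in the inner loop would be written by the user.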
5. What is Apache Giraph?
Runs on top of Hadoop
Map-only jobs
Keeps data in memory
Mappers communicate over the network
11. Neighborhood based CF
Start from user item ratings
Calculate item similarities
For each item pair:
•Users who rated first item
•Users who rated second item
•Users who rated both items
[Diagram: users (u) linked to items I1 and I2]
12. Neighborhood based CF
Calculate user recommendations
For every user:
•Items rated by user
•Most similar items to these items
[Diagram: user u, items I1 to I3 and candidate items I4 to I7]
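A minimal sketch of the recommendation step for a single user follows. The scoring rule (summing the similarities between a candidate item and the items the user rated) is an assumption chosen for illustration; the slides state that the actual score formula is configurable. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class.
public class NeighborhoodRecommender {
    /**
     * Scores every unrated item by the sum of its similarities to the user's rated items
     * and returns the topN highest scoring items (ties broken by item id).
     */
    static List<String> recommend(Set<String> ratedItems,
                                  Map<String, Map<String, Double>> similarities,
                                  int topN) {
        Map<String, Double> scores = new HashMap<>();
        for (String rated : ratedItems) {
            Map<String, Double> similar = similarities.getOrDefault(rated, Map.of());
            for (Map.Entry<String, Double> e : similar.entrySet()) {
                if (!ratedItems.contains(e.getKey())) { // never recommend what was already rated
                    scores.merge(e.getKey(), e.getValue(), Double::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort(Comparator.comparingDouble((String item) -> scores.get(item)).reversed()
                .thenComparing(Comparator.naturalOrder()));
        return ranked.subList(0, Math.min(topN, ranked.size()));
    }
}
```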
13. Configurable formulas
Accommodating different use cases
Each calculation step is configurable
•User’s contribution to item similarities
•Item similarities based on all users’ contributions
•User to item recommendation score
Passing a piece of Java code through configuration
intersection / Math.sqrt(degree1 * degree2)
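To make the example formula concrete, here is a small sketch that evaluates it for one item pair. It assumes that degree1 and degree2 are the numbers of users who rated each item and that intersection is the number of users who rated both, which matches the item-pair statistics listed for the similarity step. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class.
public class ItemSimilarity {
    /** Evaluates intersection / Math.sqrt(degree1 * degree2) for two sets of raters. */
    static double similarity(Set<String> usersOfItem1, Set<String> usersOfItem2) {
        int degree1 = usersOfItem1.size();
        int degree2 = usersOfItem2.size();
        if (degree1 == 0 || degree2 == 0) {
            return 0.0; // no raters, so no evidence of similarity
        }
        Set<String> both = new HashSet<>(usersOfItem1);
        both.retainAll(usersOfItem2); // users who rated both items
        double intersection = both.size();
        return intersection / Math.sqrt((double) degree1 * degree2);
    }
}
```

For example, two items with four raters each, two of them shared, get a similarity of 2 / sqrt(4 * 4) = 0.5.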
14. Users to items edges
Preprocessing:
•Filter out low-degree vertices
•Calculate global item stats
Users send item lists to items
•Items need other items’ global stats to calculate similarities
Our solution
[Diagram: user (u) and item (i) vertices distributed across Worker 1, Worker 2 and Worker 3]
15. Optimizations
Make item info globally available
•Using the reduce/broadcast API
Striping technique
•Split computation across multiple supersteps
•In each stripe process one subset of items
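The striping idea can be sketched as follows. This is a single-JVM illustration, not the Giraph implementation: assigning an item to a stripe by its id modulo the number of stripes is an assumption, and each pass of the outer loop stands in for one superstep. Only pairs whose first item belongs to the current stripe are handled in a pass, so less intermediate state is live at any one time while the final counts stay the same.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class.
public class StripedCooccurrence {
    /** Counts, per item pair, the users who rated both items, one stripe of items per pass. */
    static Map<String, Integer> countPairs(Map<String, List<Integer>> userItems, int numStripes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int stripe = 0; stripe < numStripes; stripe++) { // one pass per stripe
            for (List<Integer> items : userItems.values()) {
                for (int first : items) {
                    if (Math.floorMod(first, numStripes) != stripe) {
                        continue; // this item is handled in another stripe
                    }
                    for (int second : items) {
                        if (first < second) {
                            counts.merge(first + "-" + second, 1, Integer::sum);
                        }
                    }
                }
            }
        }
        return counts;
    }
}
```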
22. Standard approach
A bipartite graph:
•Users and items are vertices
•Known ratings are edges
•Feature vectors sent through edges
Problems:
•Data sent per iteration: #knownRatings * #features
•Memory
•Large degree items
•SGD updates differ from those in the sequential solution
[Diagram: items I1 to I4 distributed across Worker 1, Worker 2 and Worker 3]
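For reference, a sequential SGD step for one known rating can be sketched as below. This is the textbook matrix factorization update with L2 regularization, given as an assumption about what "the sequential solution" refers to; the class name, method names and parameters are hypothetical. Both the user vector and the item vector are updated together from the same error, which is the behavior that is hard to reproduce when feature vectors travel along edges as messages.

```java
// Hypothetical illustration class.
public class SgdStep {
    /** Predicted rating: inner product of the user and item feature vectors. */
    static double predict(double[] user, double[] item) {
        double sum = 0.0;
        for (int k = 0; k < user.length; k++) {
            sum += user[k] * item[k];
        }
        return sum;
    }

    /** Updates both vectors in place for one observed rating. */
    static void update(double[] user, double[] item, double rating,
                       double learningRate, double regularization) {
        double error = rating - predict(user, item);
        for (int k = 0; k < user.length; k++) {
            double u = user[k]; // keep the old values so both updates use the same state
            double i = item[k];
            user[k] += learningRate * (error * i - regularization * u);
            item[k] += learningRate * (error * u - regularization * i);
        }
    }
}
```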
24. Our solution - rotational approach
[Diagram: Worker 1, Worker 2 and Worker 3, with item set 1, item set 2 and item set 3]
•Network traffic?
•Memory?
•Skewed item degrees?
•SGD calculation?
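One way to picture the rotation is as a schedule that says which item set each worker holds in each superstep. The sketch below assumes that items are split into as many sets as there are workers and that every set moves to the next worker after each superstep; the slides do not spell out the direction or the indexing, so the formula and all names here are illustrative only.

```java
// Hypothetical illustration class.
public class RotationSchedule {
    /** Index of the item set held by the given worker in the given superstep (0-based). */
    static int itemSetFor(int worker, int superstep, int numWorkers) {
        return Math.floorMod(worker + superstep, numWorkers);
    }

    /** schedule[s][w] is the item set that worker w processes in superstep s. */
    static int[][] schedule(int numWorkers) {
        int[][] result = new int[numWorkers][numWorkers];
        for (int s = 0; s < numWorkers; s++) {
            for (int w = 0; w < numWorkers; w++) {
                result[s][w] = itemSetFor(w, s, numWorkers);
            }
        }
        return result;
    }
}
```

Under this schedule no two workers hold the same item set in a superstep, and after as many supersteps as there are workers every worker has seen every item set once.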
25. Recommendations
Finding top inner products
Scoring every (user, item) pair is infeasible
Creating Ball Tree from item vectors
•Greedy tree traversal
•Pruning subtrees
•100-1000x faster
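The pruning idea can be sketched with a small ball tree. For a query vector q and a ball with center c and radius r, no item vector inside the ball can have an inner product with q larger than q·c + |q|·r, so a subtree whose bound does not beat the best product found so far can be skipped. The construction below (split around two far-apart points) and all names are assumptions for illustration and are not taken from the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class.
public class BallTreeSketch {
    static final class Node {
        double[] center;
        double radius;
        List<double[]> points; // set only on leaves
        Node left, right;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static double[] farthest(List<double[]> pts, double[] from) {
        double[] best = pts.get(0);
        for (double[] p : pts) if (dist(p, from) > dist(best, from)) best = p;
        return best;
    }

    /** Builds a ball tree over a non-empty list of equally sized vectors. */
    static Node build(List<double[]> pts, int leafSize) {
        Node node = new Node();
        node.center = new double[pts.get(0).length];
        for (double[] p : pts)
            for (int i = 0; i < p.length; i++) node.center[i] += p[i] / pts.size();
        double[] a = farthest(pts, node.center);
        node.radius = dist(a, node.center);
        double[] b = farthest(pts, a);
        List<double[]> l = new ArrayList<>(), r = new ArrayList<>();
        for (double[] p : pts) (dist(p, a) <= dist(p, b) ? l : r).add(p);
        if (pts.size() <= leafSize || r.isEmpty()) {
            node.points = pts; // small or degenerate set: keep as a leaf
        } else {
            node.left = build(l, leafSize);
            node.right = build(r, leafSize);
        }
        return node;
    }

    /** Upper bound on q·p over all points p inside the node's ball. */
    static double bound(Node n, double[] q) {
        return dot(q, n.center) + Math.sqrt(dot(q, q)) * n.radius;
    }

    static double search(Node n, double[] q, double best) {
        if (bound(n, q) <= best) return best; // prune: nothing here can beat the current best
        if (n.points != null) {
            for (double[] p : n.points) best = Math.max(best, dot(q, p));
            return best;
        }
        // Greedy traversal: visit the more promising child first.
        Node first = bound(n.left, q) >= bound(n.right, q) ? n.left : n.right;
        Node second = first == n.left ? n.right : n.left;
        return search(second, q, search(first, q, best));
    }

    static double maxInnerProduct(Node root, double[] q) {
        return search(root, q, Double.NEGATIVE_INFINITY);
    }
}
```

The sketch returns only the single best inner product; a top-k variant would keep a bounded heap and prune against its smallest element.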
26. Additional features
Tracking RMSE, average rank and precision/recall
Combining SGD & ALS
Using other objective functions
•CF for implicit feedback
•Biases
•Degree based regularization
•Optimizing ranks
27. Applications
Add user and item feature vectors in ranking
Get user to item score in realtime
Direct user recommendations
29. Comparison with Spark MLlib
Performance of Spark MLlib ALS CF published in July 2014
On scaled copies of the Amazon reviews dataset
[Chart: CPU minutes (0 to 600) vs. millions of examples (0 to 1200); series: Standard (in Spark), Rotational (in Giraph)]
32. Conclusion
Scalable implementation of Collaborative Filtering
On top of Apache Giraph
Highly performant (>100 billion ratings)
Neighborhood-based models
Matrix factorization
Group and Page recommendations at Facebook