CuMF is a matrix factorization library for recommendation systems that accelerates ALS training using GPUs. It addresses two main challenges: irregular memory access when aggregating the many feature vectors involved in each update, and the inability of a single GPU to handle large matrices. CuMF optimizes memory usage through register tiling and achieves scale-out by distributing computation across multiple GPUs in parallel. Experimental results show CuMF trains up to 10x faster and is 100x more cost-efficient than competing algorithms on CPUs or multiple machines.
S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs
1. Wei Tan, IBM T. J. Watson Research Center
wtan@us.ibm.com
2. Agenda
Why accelerate recommendation (matrix factorization) using GPUs?
What are the challenges?
How does cuMF tackle the challenges?
So what is the result?
3. Why do we need a fast and scalable recommender system?
Recommendation is pervasive
Drives 80% of Netflix’s watch hours1
US digital ads market: $37.3 billion2
Need: fast, scalable, and economical
4. ALS (alternating least square) for collaborative filtering
Input: users’ ratings on some items
Output: all missing ratings
How: factorize the rating matrix R ≈ X ΘT (row xu of X is user u’s feature vector; column θv of ΘT is item v’s feature vector) and minimize the loss on observed ratings:
  min Σ_(u,v observed) (r_uv − xuT θv)^2 + λ (Σu ||xu||^2 + Σv ||θv||^2)
ALS: iteratively solve for X with Θ fixed, then for Θ with X fixed
5. Matrix factorization/ALS is versatile
The same factorization R ≈ X ΘT fits many problems:
– Recommendation: R is the user × item rating matrix
– Word embedding: R is the word × word co-occurrence matrix
– Topic model: R is the word × document matrix
MF is a very important algorithm to accelerate
6. Challenges of fast and scalable ALS
• ALS needs to solve, for each user u:
  xu = (Σ_(v rated by u) θv θvT + λI)^(-1) Σ_(v rated by u) r_uv θv
  – forming the Gram matrices is an spmm-like step (cuBLAS); the solve uses LU or Cholesky decomposition (cuBLAS)
• Challenge 1: accessing and aggregating many θvs is irregular (R is sparse) and memory intensive
• Challenge 2: a single GPU can NOT handle big m, n and nnz
13. Recap: cuMF tackled two challenges
• ALS needs to solve, for each user u:
  xu = (Σ_(v rated by u) θv θvT + λI)^(-1) Σ_(v rated by u) r_uv θv
  – forming the Gram matrices is an spmm-like step (cuBLAS); the solve uses LU or Cholesky decomposition (cuBLAS)
• Challenge 1: accessing and aggregating many θvs is irregular (R is sparse) and memory intensive
• Challenge 2: a single GPU can NOT handle big m, n and nnz
14. Connect cuMF to Spark MLlib
Spark applications relying on mllib/ALS need no change
Modified mllib/ALS detects GPUs and offloads computation
Leverages the best of Spark (scale-out) and GPU (scale-up)
Stack: ALS apps → mllib/ALS → cuMF (via JNI)
15. Connect cuMF to Spark MLlib
[Architecture diagram: on each node (1 Power 8 node + 2 K40s), RDD partitions on the CPU feed CUDA kernels on GPU1 and GPU2; parameters are exchanged across nodes via shuffle]
RDDs on CPU: to distribute rating data and shuffle parameters
Solver on GPU: to form and solve the per-user/per-item least-squares problems
Able to run on multiple nodes, with multiple GPUs per node
16. Implementation
In C (ca. 10k LOC)
CUDA 7.0/7.5, GCC OpenMP v3.0
Baselines:
– libmf: SGD on 1 node [RecSys 2014]
– NOMAD: SGD on >1 node [VLDB 2014]
– SparkALS: ALS on Spark
– FactorBird: SGD + parameter server for MF
– Facebook: enhanced Giraph
17. CuMF performance
X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on test set
• With 1 GPU vs. 30 cores, cuMF is slightly faster than libmf and NOMAD
• cuMF scales well on 1, 2, and 4 GPUs
18. Effectiveness of memory optimization
X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on test set
• Aggressively using registers: ~2x faster
• Using texture memory: 25%–35% faster
20. CuMF accelerated Spark on Power 8
• cuMF @ 2 K40s achieves a 6+x speedup in training (1.3k sec → 193 sec)
*GUI designed by Amir Sanjar
21. Conclusion
Why accelerate recommendation (matrix factorization) using GPUs?
– Need to be fast, scalable, and economical
What are the challenges?
– Memory access; scaling to multiple GPUs
How does cuMF tackle the challenges?
– Optimize memory access, parallelism, and communication
So what is the result?
– Up to 10x as fast, 100x as cost-efficient
– Use cuMF standalone or with Spark
– GPUs can tackle ML problems beyond deep learning!
22. Wei Tan, IBM T. J. Watson Research Center
wtan@us.ibm.com
http://ibm.biz/wei_tan
Thank you, questions?
Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs.
Wei Tan, Liangliang Cao, Liana Fong. HPDC 2016,
http://arxiv.org/abs/1603.03820
Source code available soon.