To download the slides please go here:
http://www.intelligentmining.com/category/knowledge-base/
Alex's approach to the R package recommender system Kaggle competition, where he placed 4th. Slides presented to NYCPA.
1. An Approach to R Package Recommendation Engine Alex Lin alin@intelligentmining.com Twitter: @alinatwork
2. Initial Thoughts The data set is expected to have very strong package-package relationships (dependencies and related package functionality). The data set (training + test) is not sparse. Most matrix factorization (MF) techniques in the recommender field optimize squared error on the predicted user ratings; they do not directly optimize for AUC.
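To make the squared-error versus AUC distinction concrete, here is a minimal sketch of AUC computed as the normalized Wilcoxon-Mann-Whitney statistic: the fraction of (installed, not installed) pairs that the scores rank in the right order, with ties counted as half. The function name `auc` and the toy inputs are illustrative and not taken from the slides.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0   # positive ranked above negative
            elif sp == sn:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))
```

Only the ordering of the scores matters here, so a model can lower its squared error without changing its AUC at all, and the reverse.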
3. Steps Modified k-Nearest Neighbor algorithm. User average & package average as prior bias. User-specific package Maintainer Affinity. Matrix factorization (MF) to post-process the residuals. Other rules.
4. Modified k-Nearest Neighbor algorithm Calculate cosine similarity for each pkg-pkg pair. Scale the cosine similarity by the “squared user support”, i.e. cosine * (support / ttl_user_cnt)**2. Unlike regular kNN, which is only additive, we use the same kNN rules to penalize a package if other related packages were not installed. For unknown records, we take the ZAN approach: we treat the unknown entries as negative. k=all
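The scoring described on this slide might be sketched roughly as follows. This is an interpretation, not the author's code: it assumes that "support" means the number of users who installed both packages, that installs are binary, and that every entry not marked as installed counts as negative. The names `scaled_similarity` and `knn_score` are hypothetical.

```python
from math import sqrt

def scaled_similarity(installs, n_pkgs):
    """installs: one set of installed package ids (0..n_pkgs-1) per user."""
    n_users = len(installs)
    count = [0] * n_pkgs                       # users per package
    co = [[0] * n_pkgs for _ in range(n_pkgs)]  # users per package pair
    for pkgs in installs:
        for p in pkgs:
            count[p] += 1
            for q in pkgs:
                if p != q:
                    co[p][q] += 1
    sim = [[0.0] * n_pkgs for _ in range(n_pkgs)]
    for p in range(n_pkgs):
        for q in range(n_pkgs):
            if p != q and co[p][q] > 0:
                # Cosine similarity of two binary install columns.
                cosine = co[p][q] / sqrt(count[p] * count[q])
                # Assumption: support = number of co-installing users.
                sim[p][q] = cosine * (co[p][q] / n_users) ** 2
    return sim

def knn_score(user_pkgs, target, sim):
    """k = all: every other package votes for or against the target."""
    score = 0.0
    for q, s in enumerate(sim[target]):
        if q == target:
            continue
        # Reward related packages the user has, penalize those missing.
        score += s if q in user_pkgs else -s
    return score
```

For example, with `installs = [{0, 1}, {0, 1}, {2}, {0, 1, 2}]`, a user who has package 1 gets a positive score for package 0, while a user who has only package 2 gets a negative one.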
5. User average and Package average as prior bias User average = user installed pkg count / user observation count. Package average = pkg installed by users count / pkg observation count. Add them into the kNN result score.
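A small sketch of the two averages, assuming the training data is available as `(user, package, installed)` triples with `installed` equal to 0 or 1. That record layout and the name `prior_biases` are assumptions made for illustration.

```python
from collections import defaultdict

def prior_biases(observations):
    """observations: iterable of (user, pkg, installed) with installed in {0, 1}."""
    user_inst, user_obs = defaultdict(int), defaultdict(int)
    pkg_inst, pkg_obs = defaultdict(int), defaultdict(int)
    for user, pkg, installed in observations:
        user_obs[user] += 1
        user_inst[user] += installed
        pkg_obs[pkg] += 1
        pkg_inst[pkg] += installed
    # Installed count divided by observation count, per user and per package.
    user_avg = {u: user_inst[u] / user_obs[u] for u in user_obs}
    pkg_avg = {p: pkg_inst[p] / pkg_obs[p] for p in pkg_obs}
    return user_avg, pkg_avg
```

The score for a (user, package) pair would then be the kNN score plus `user_avg[user]` plus `pkg_avg[pkg]`. The slide does not say whether the two averages are weighted before being added.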
6. User-specific Package Maintainer Affinity This metric is measured as the percentage of a given maintainer's packages that a user has installed. We use the percentage to predict how likely the user is to install other packages from the same maintainer. Combine with the kNN result score with a weight of 0.25.
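One possible reading of the maintainer affinity, as a sketch: for each (user, maintainer) pair, take the share of that maintainer's observed packages that the user installed. Using only the packages observed for that user as the denominator is an assumption, as are the names `maintainer_affinity`, `maintainer_of`, and `combine`.

```python
from collections import defaultdict

def maintainer_affinity(observations, maintainer_of):
    """observations: (user, pkg, installed) triples; maintainer_of: pkg -> maintainer."""
    installed = defaultdict(int)
    seen = defaultdict(int)
    for user, pkg, flag in observations:
        key = (user, maintainer_of[pkg])
        seen[key] += 1
        installed[key] += flag
    # Fraction of the maintainer's observed packages that the user installed.
    return {key: installed[key] / seen[key] for key in seen}

def combine(knn_score, affinity, weight=0.25):
    """Blend the affinity into the kNN score using the slide's weight of 0.25."""
    return knn_score + weight * affinity
```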
7. So Far – baseline model Very heuristic Public AUC = 0.976x
8. Matrix Factorization Analyze the residuals only. The goal is to find structural errors in our baseline prediction. prediction := baseline_output + residual; residual := pkg_bias + user_bias + pkgFactors . userFactors. The residuals are related to the Wilcoxon-Mann-Whitney (WMW) statistic.
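The residual model on this slide, pkg_bias + user_bias + pkgFactors . userFactors, can be fitted in many ways. The sketch below uses plain stochastic gradient descent on squared error as a stand-in. The slides do not state which objective or optimizer was used, and the reference to WMW suggests something closer to a ranking criterion, so this should be read as an illustration of the model form only. All hyperparameters and function names are assumptions.

```python
import random
from math import sqrt

def train_mf(residuals, n_users, n_pkgs, k=2, lr=0.05, reg=0.02,
             epochs=200, seed=0):
    """residuals: list of (user_idx, pkg_idx, residual of the baseline)."""
    rng = random.Random(seed)
    user_bias = [0.0] * n_users
    pkg_bias = [0.0] * n_pkgs
    user_f = [[rng.uniform(-0.01, 0.01) for _ in range(k)] for _ in range(n_users)]
    pkg_f = [[rng.uniform(-0.01, 0.01) for _ in range(k)] for _ in range(n_pkgs)]
    for _ in range(epochs):
        for u, p, r in residuals:
            dot = sum(a * b for a, b in zip(user_f[u], pkg_f[p]))
            err = r - (user_bias[u] + pkg_bias[p] + dot)
            # Gradient steps with L2 regularization (assumed, not from the slides).
            user_bias[u] += lr * (err - reg * user_bias[u])
            pkg_bias[p] += lr * (err - reg * pkg_bias[p])
            for i in range(k):
                uf, pf = user_f[u][i], pkg_f[p][i]
                user_f[u][i] += lr * (err * pf - reg * uf)
                pkg_f[p][i] += lr * (err * uf - reg * pf)
    return user_bias, pkg_bias, user_f, pkg_f

def predict_residual(model, u, p):
    user_bias, pkg_bias, user_f, pkg_f = model
    dot = sum(a * b for a, b in zip(user_f[u], pkg_f[p]))
    return user_bias[u] + pkg_bias[p] + dot

def rmse(model, residuals):
    sq = [(r - predict_residual(model, u, p)) ** 2 for u, p, r in residuals]
    return sqrt(sum(sq) / len(sq))
```

The final prediction would then be `baseline_output + predict_residual(model, u, p)`, matching the decomposition on the slide.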
10. Other Rules For duplicate records that exist in both the testing and training sets, copy the answers from the training set. Assume that when a user installs a package P, the user also installs the packages that P depends on.
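These two rules could be applied as a final override on the model score, roughly as sketched here. Following dependencies transitively is an assumption (the slide only says "the packages that P depends on"), and the data structures and the names `dependency_closure` and `apply_rules` are hypothetical.

```python
def dependency_closure(pkg, depends):
    """All packages reachable from pkg via the depends mapping (pkg -> list of deps)."""
    seen = set()
    stack = [pkg]
    while stack:
        cur = stack.pop()
        for dep in depends.get(cur, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

def apply_rules(score, user, pkg, train_labels, user_installed, depends):
    """Override a model score with the two rules from the slide."""
    # Rule 1: a record duplicated in the training set takes its known answer.
    if (user, pkg) in train_labels:
        return float(train_labels[(user, pkg)])
    # Rule 2: if the user installed something that depends on pkg, predict installed.
    for p in user_installed.get(user, ()):
        if pkg in dependency_closure(p, depends):
            return 1.0
    return score
```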