Crab: A Python Framework for Building Recommender Systems
1. Crab
A Python Framework for Building
Recommendation Engines
PythonBrasil 2011, São Paulo, SP
Marcel Caraciolo Ricardo Caspirro Bruno Melo
@marcelcaraciolo @ricardocaspirro @brunomelo
2. What is Crab ?
A Python framework for building recommendation engines
A Scikit module for collaborative, content and hybrid filtering
Mahout Alternative for Python Developers :D
Open-Source under the BSD license
https://github.com/muricoca/crab
3. When did it start ?
It began one year ago
Community-driven, 4 members
In April 2011 the open-source lab Muriçoca incorporated it
Since April 2011 it has been rewritten as a Scikit
https://github.com/muricoca/
4. Knowing Scikits
Scikits are SciPy Toolkits: independent projects hosted under a common namespace.
Scikits Image
Scikits MlabWrap
Scikits AudioLab
Scikit Learn
....
http://scikits.appspot.com/scikits
5. Knowing Scikits
Scikit-Learn
Machine Learning Algorithms + scientific Python packages
(Numpy, Scipy and Matplotlib)
http://scikit-learn.sourceforge.net/
Our goal: turn Crab into a Scikit and contribute some parts of it to Scikit-learn
7. Why Recommendations
We are overloaded
Thousands of news articles and blog posts each day
Millions of movies, books and music tracks online
Several places, offers and events
Even with friends, we are sometimes overloaded !
8. Why Recommendations ?
We really need and consume only a few of them!
“A lot of times, people don’t know what
they want until you show it to them.”
Steve Jobs
“We are leaving the Information age, and
entering into the Recommendation age.”
Chris Anderson, in the book The Long Tail
9. Why Recommendations ?
Can Google help ?
Yes, but only when we really know what we are looking for
But what does “interesting” mean ?
Can Facebook help ?
Yes, I tend to find my friends’ stuff interesting
But what if I have only a few friends, and what they like does not always attract me ?
Can experts help ?
Yes, but it won’t scale well.
And it is what they like, not what I like! Everyone gets exactly the same advice!
10. Why Recommendations ?
Recommendation Systems
Systems designed to recommend to me something I may like
12. The current Crab
Collaborative Filtering algorithms
User-Based, Item-Based and Matrix Factorization (SVD)
Evaluation of the Recommender Algorithms
Precision, Recall, F1-Score, RMSE
Precision-Recall Charts
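As a rough illustration of the metrics listed on this slide, here is a minimal sketch of precision, recall and F1 for a single recommendation list. The function name and the sample data are hypothetical and are not taken from Crab's API.

```python
def precision_recall_f1(recommended, relevant):
    """Score one recommendation list against the set of relevant items."""
    recommended = list(recommended)
    relevant = set(relevant)
    hits = sum(1 for item in recommended if item in relevant)
    # Precision: share of recommended items that are relevant.
    precision = hits / len(recommended) if recommended else 0.0
    # Recall: share of relevant items that were recommended.
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data: 4 recommended items, 3 items the user actually liked.
p, r, f1 = precision_recall_f1([5, 1, 6, 9], {1, 5, 7})
print(p, r, f1)
```

Two of the four recommended items are relevant, so precision is 2/4 and recall is 2/3.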
37. The current Crab
>>> from crab.recommenders.knn import UserBasedRecommender
>>> recsys = UserBasedRecommender(model=m, similarity=similarity,
...                               capper=True, with_preference=True)
>>> recsys.recommend(5)
array([[ 5.        ,  3.45712869],
       [ 1.        ,  2.78857832],
       [ 6.        ,  2.38193068]])
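The idea behind a user-based recommender can be sketched in plain Python. This is a simplified illustration under stated assumptions (cosine similarity over co-rated items, similarity-weighted mean of neighbour ratings), not Crab's actual implementation. The ratings, user names and function names are hypothetical.

```python
from math import sqrt

# Hypothetical ratings: user -> {item_id: rating}
ratings = {
    "marcel": {1: 4.0, 2: 3.0, 3: 5.0},
    "ricardo": {1: 4.0, 2: 3.0, 3: 5.0, 5: 4.0},
    "bruno": {1: 1.0, 3: 2.0, 5: 2.0, 6: 5.0},
}


def cosine(a, b):
    """Cosine similarity computed over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, how_many=5):
    """Rank unseen items by the similarity-weighted mean of neighbour ratings."""
    totals, weights = {}, {}
    for other, prefs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], prefs)
        if sim <= 0:
            continue
        for item, rating in prefs.items():
            if item not in ratings[user]:
                totals[item] = totals.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    scored = [(item, totals[item] / weights[item]) for item in totals]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:how_many]


recs = recommend("marcel", ratings)
print(recs)
```

As in the slide's output, the result is a list of (item, estimated preference) pairs sorted by score.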
40. The current Crab
Using REST APIs to deploy the recommender
django-piston, django-rest, django-tastypie
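To give a feel for what a REST deployment involves, here is a minimal sketch that uses only the standard library's WSGI conventions instead of the Django packages named above. The endpoint path, the JSON layout and the precomputed scores are all assumptions made for illustration.

```python
import json
from urllib.parse import parse_qs

# Hypothetical precomputed recommendations: user -> [(item_id, score), ...]
RECOMMENDATIONS = {"marcel": [(5, 3.46), (1, 2.79), (6, 2.38)]}


def app(environ, start_response):
    """Answer GET /recommendations?user=<id> with a JSON document."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    user = query.get("user", [""])[0]
    known = environ.get("PATH_INFO") == "/recommendations" and user in RECOMMENDATIONS
    if not known:
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [json.dumps({"error": "unknown resource"}).encode("utf-8")]
    items = [{"item": item, "score": score} for item, score in RECOMMENDATIONS[user]]
    start_response("200 OK", [("Content-Type", "application/json")])
    return [json.dumps({"user": user, "items": items}).encode("utf-8")]


def call(path, query):
    """Invoke the WSGI app in-process, the way a server would."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({"PATH_INFO": path, "QUERY_STRING": query}, start_response))
    return captured["status"], json.loads(body.decode("utf-8"))


print(call("/recommendations", "user=marcel"))
```

The same `app` callable could be served with `wsgiref.simple_server.make_server` for local experiments. A Django-based API would expose the same kind of resource through its own URL routing.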
41. Crab is already in production
Recommending news from the publisher Abril!
Collecting over 10 magazines, 20 books and 100+ articles
Running on Python + Scipy + Django
Content-Based-Filtering
Easy-to-use interface
Still in development
42. Content Based Filtering
[Diagram: items (Duro de Matar, Armagedon, O Vento Levou, Toy Story) and users (Marcel). A user likes an item, and similar items are recommended to that user.]
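The diagram can be sketched in code as follows. This is a minimal illustration, assuming that items are described by sets of genre tags and compared with Jaccard similarity. The tags and the choice of liked film are invented for the example, since the slide only names the films and the user.

```python
# Hypothetical genre tags; the slide only names the films.
items = {
    "Duro de Matar": {"action", "thriller"},
    "Armagedon": {"action", "sci-fi", "thriller"},
    "O Vento Levou": {"drama", "romance"},
    "Toy Story": {"animation", "comedy"},
}
# Hypothetical preference: which film Marcel likes is an assumption.
likes = {"Marcel": ["Duro de Matar"]}


def jaccard(a, b):
    """Share of tags two items have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, how_many=3):
    """Recommend unseen items that resemble something the user liked."""
    liked = likes[user]
    scores = {}
    for candidate, features in items.items():
        if candidate in liked:
            continue
        scores[candidate] = max(jaccard(features, items[seen]) for seen in liked)
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return [(title, score) for title, score in ranked if score > 0][:how_many]


recs = recommend("Marcel")
print(recs)
```

Only items that share at least one tag with a liked item are returned, which mirrors the "similar items" arrow in the diagram.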
43. Crab is already in production
PythonBrasil keynotes Recommender
Recommending keynotes based on a hybrid approach
Running on Python + Scipy + Django
Content-Based-Filtering + Collaborative Filtering
Schedule your keynotes
Still in development
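The slides do not say how the two filters are combined. One common option is a weighted blend of both scores, sketched below as an assumption and not as the approach used for the keynotes recommender. The function name, the weight and the sample scores are hypothetical.

```python
def hybrid_scores(content, collaborative, weight=0.5):
    """Blend two {item: score} dicts; both are assumed to be scaled to [0, 1]."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    blended = {}
    for item in set(content) | set(collaborative):
        # An item missing from one recommender contributes 0 from that side.
        blended[item] = (weight * content.get(item, 0.0)
                         + (1.0 - weight) * collaborative.get(item, 0.0))
    return sorted(blended.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for three keynotes.
content = {"keynote-a": 0.9, "keynote-b": 0.2}
collaborative = {"keynote-b": 0.8, "keynote-c": 0.6}
ranking = hybrid_scores(content, collaborative, weight=0.5)
print(ranking)
```

Raising `weight` favours the content-based scores, which may help when few ratings are available.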
44. Crab is already in production
Hybrid Meta Approach
[Slide shows an excerpt from a paper on a mobile recommender system. Figures: Fig. 1, Meta Recommender Architecture; Fig. 2, User Reviews from Foursquare Social Network; Fig. 3, Mobile Recommender System Architecture.]
Since one of the goals of this work is to incorporate different data sources of user opinions and descriptions, we have adopted a meta recommendation architecture. By using a meta recommender architecture, the system would provide a personalized control over the generated recommendation list.
The content-based filtering approach will be used to filter the product/service repository, while the collaborative-based approach will derive the product review recommendations. In addition, text mining techniques will be used to distinguish the polarity of each user review as positive or negative. This summarized information contributes to the computation of the product recommendation score. The final product recommendation score is computed by integrating the results of both recommenders.
45. Crab is already in production
Brazilian social network Atepassar.com
Educational network with more than 60,000 students and 120 video classes
Running on Python + Numpy + Scipy and Django
Backend for Recommendations
MongoDB - mongoengine
Daily Recommendations
with Explanations
46. Evaluating your recommender
Crab implements the most widely used recommender metrics.
Precision, Recall, F1-Score, RMSE
Using matplotlib for a plotter utility
Implement new metrics
Possibly simulation support (to be decided)
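For illustration, RMSE and the points of a precision-recall chart could be computed along these lines. This is a hedged sketch with hypothetical function names and data, not Crab's evaluation module.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error between paired rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared) / len(squared))


def precision_recall_points(ranked, relevant):
    """Return (recall, precision) after each cut-off k = 1..len(ranked)."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("need at least one relevant item")
    points, hits = [], 0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points


# Hypothetical predictions and a hypothetical ranked list.
error = rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0])
points = precision_recall_points([5, 1, 6, 9], {1, 5, 7})
print(error, points)
```

The list of (recall, precision) pairs is what a plotting utility would draw as a precision-recall chart.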
54. Distributing the recommendation computations
Use Hadoop and Map-Reduce intensively
Investigating the Yelp mrjob framework https://github.com/pfig/mrjob
Implement the techniques used for Netflix and other novel state-of-the-art approaches
Matrix Factorization, Singular Value Decomposition (SVD), Boltzmann machines
The most commonly used is the Slope One technique.
Simple algebra: predictors of the form y = a*x + b with the slope fixed at a = 1, i.e. y = x + b
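A minimal sketch of weighted Slope One follows, assuming ratings are stored as nested dicts. The function names and the tiny dataset are hypothetical. For every item pair it learns the average offset b, then predicts with x + b, weighting each offset by the number of users who rated both items.

```python
from collections import defaultdict


def slope_one_fit(ratings):
    """Average rating difference (the offset b) for every ordered item pair."""
    diffs = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for prefs in ratings.values():
        for i, r_i in prefs.items():
            for j, r_j in prefs.items():
                if i != j:
                    diffs[i][j] += r_i - r_j
                    counts[i][j] += 1
    for i in diffs:
        for j in diffs[i]:
            diffs[i][j] /= counts[i][j]
    return diffs, counts


def slope_one_predict(prefs, item, diffs, counts):
    """Each known rating x votes x + b for the target item, weighted by support."""
    numerator = denominator = 0.0
    for other, rating in prefs.items():
        support = counts.get(item, {}).get(other, 0)
        if other != item and support:
            numerator += (rating + diffs[item][other]) * support
            denominator += support
    return numerator / denominator if denominator else None


# Hypothetical ratings for three users and three items.
ratings = {
    "john": {"a": 5.0, "b": 3.0, "c": 2.0},
    "mark": {"a": 3.0, "b": 4.0},
    "lucy": {"b": 2.0, "c": 5.0},
}
diffs, counts = slope_one_fit(ratings)
prediction = slope_one_predict(ratings["lucy"], "a", diffs, counts)
print(prediction)
```

The function returns None when no other user has rated the target item together with any item the user knows.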
61. Cache/Parallelism with joblib
http://packages.python.org/joblib/index.html
import numpy as np
from joblib import Memory

# Newer joblib releases name this argument `location` instead of `cachedir`.
memory = Memory(cachedir='', verbose=0)

class UserSimilarity(BaseSimilarity):
    ...
    @memory.cache
    def get_similarity(self, source_id, target_id):
        source_preferences = self.model.preferences_from_user(source_id)
        target_preferences = self.model.preferences_from_user(target_id)
        ...
        # Return NaN when either user has no preferences to compare.
        return (self.distance(source_preferences, target_preferences)
                if not source_preferences.shape[1] == 0
                and not target_preferences.shape[1] == 0
                else np.array([[np.nan]]))

    def get_similarities(self, source_id):
        return [(other_id, self.get_similarity(source_id, other_id))
                for other_id, v in self.model]

>>> # Without memory.cache
>>> timeit similarity.get_similarities('marcel_caraciolo')
100 loops, best of 3: 978 ms per loop
>>> # With memory.cache
>>> timeit similarity.get_similarities('marcel_caraciolo')
100 loops, best of 3: 434 ms per loop
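A self-contained sketch of the same caching idea, assuming a recent joblib release where the cache directory is passed as `location`. The similarity function, the call counter and the temporary directory are hypothetical stand-ins for Crab's classes.

```python
import tempfile
from math import sqrt

from joblib import Memory

# Hypothetical cache directory; any writable path works.
memory = Memory(location=tempfile.mkdtemp(), verbose=0)
calls = {"count": 0}  # counts real computations, for illustration only


@memory.cache
def cosine_similarity(source, target):
    """Stand-in for an expensive step: cosine similarity of two rating tuples."""
    calls["count"] += 1
    dot = sum(a * b for a, b in zip(source, target))
    norm = sqrt(sum(a * a for a in source)) * sqrt(sum(b * b for b in target))
    return dot / norm if norm else float("nan")


first = cosine_similarity((4.0, 3.0, 5.0), (4.0, 3.0, 4.0))
second = cosine_similarity((4.0, 3.0, 5.0), (4.0, 3.0, 4.0))  # served from the cache
print(first == second, calls["count"])
```

Caching a plain function keeps the cache key limited to the two rating tuples. Decorating a method, as on the slide, also hashes `self`, which may be costly for large models.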
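62. Cache/Parallelism with joblib
http://packages.python.org/joblib/index.html
Investigate how to use multiprocessing and parallel packages with similarities computation
from joblib import Parallel, delayed
...
def get_similarities(self, source_id):
    # Parallel expects an iterable of delayed calls, so the ids are paired
    # with the results afterwards.
    other_ids = [other_id for other_id, v in self.model]
    similarities = Parallel(n_jobs=3)(
        delayed(self.get_similarity)(source_id, other_id)
        for other_id in other_ids)
    return list(zip(other_ids, similarities))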
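A runnable sketch of this pattern with joblib's `Parallel` and `delayed`. The model dictionary and the similarity function are hypothetical, and the threading backend is chosen here only to keep the example lightweight.

```python
from math import sqrt

from joblib import Parallel, delayed

# Hypothetical model: user -> rating vector over the same items.
model = {
    "marcel": (1.0, 0.0),
    "ricardo": (1.0, 0.0),
    "bruno": (0.0, 1.0),
}


def cosine(a, b):
    """Cosine similarity of two equally long rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def get_similarities(source_id, model, n_jobs=2):
    """Compute the similarity to every other user with a pool of workers."""
    other_ids = [user_id for user_id in model if user_id != source_id]
    # Parallel consumes an iterable of delayed calls and returns results in order.
    scores = Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(cosine)(model[source_id], model[other_id])
        for other_id in other_ids)
    return list(zip(other_ids, scores))


similarities = get_similarities("marcel", model)
print(similarities)
```

For CPU-bound similarity functions written in pure Python, a process-based backend is likely to scale better than threads.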
66. Distributed Computing with mrJob
https://github.com/Yelp/mrjob
It supports Amazon’s Elastic MapReduce (EMR) service, your own Hadoop cluster or local (for testing)
"""The classic MapReduce job: count the frequency of words."""
from mrjob.job import MRJob
import re

# Matches runs of word characters and apostrophes.
WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word found in the input line.
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        # Sum the partial counts collected for each word.
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()
67. Distributed Computing with mrJob
https://github.com/Yelp/mrjob
Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
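The approach from that paper can be emulated in plain Python to see the data flow: an indexing step builds postings per term, and a pairwise step sums partial products for every document pair that shares a term. This sketch runs in a single process, uses raw term frequencies as weights (an assumption) and does not use mrjob. All function names and documents are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_index(doc_id, text):
    """Indexing mapper: emit (term, (doc_id, term_frequency))."""
    counts = defaultdict(int)
    for term in text.lower().split():
        counts[term] += 1
    for term, tf in counts.items():
        yield term, (doc_id, tf)


def map_pairs(term, postings):
    """Pairwise mapper: two documents sharing a term contribute tf_a * tf_b."""
    for (doc_a, tf_a), (doc_b, tf_b) in combinations(sorted(postings), 2):
        yield (doc_a, doc_b), tf_a * tf_b


def group(pairs):
    """Stand-in for the shuffle step: collect values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def pairwise_similarity(docs):
    """Run both phases and sum the partial products per document pair."""
    index = group(kv for doc_id, text in docs.items()
                  for kv in map_index(doc_id, text))
    partial = group(kv for term, postings in index.items()
                    for kv in map_pairs(term, postings))
    return {pair: sum(values) for pair, values in partial.items()}


# Hypothetical documents.
docs = {
    "d1": "crab python recommender",
    "d2": "python recommender python",
    "d3": "hadoop cluster",
}
scores = pairwise_similarity(docs)
print(scores)
```

Document pairs that share no term never appear in the output, which is what keeps the computation tractable on sparse collections.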
71. Future studies with Sparse Matrices
Real datasets come with lots of empty values
http://aimotion.blogspot.com/2011/05/evaluating-recommender-systems.html
Solutions:
scipy.sparse package
Sharding operations
Matrix Factorization techniques (SVD)
Crab implements a Matrix Factorization with Expectation Maximization algorithm
scikits.crab.svd package
Apontador Reviews Dataset
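As a small illustration of the scipy.sparse option, the sketch below stores only the known ratings in a CSR matrix and measures how dense the data is. The triples are hypothetical and much smaller than any real review dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical (user, item, rating) triples; every other cell is an empty value.
users = np.array([0, 0, 1, 2, 2])
items = np.array([0, 3, 1, 0, 2])
values = np.array([5.0, 3.0, 4.0, 1.0, 2.0])

# Only the 5 known ratings are stored, not the full 3 x 4 grid.
ratings = csr_matrix((values, (users, items)), shape=(3, 4))
density = ratings.nnz / (ratings.shape[0] * ratings.shape[1])

# Dot products between user rows touch only the stored entries.
overlap = ratings.dot(ratings.T).toarray()
print(density, overlap)
```

Entry `overlap[u, v]` is the dot product of the ratings of users u and v, a building block for cosine similarity.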
73. Optimizations with Cython
http://cython.org/
Cython is a superset of Python that lets developers annotate functions with C types so they can be compiled to C.
# setup.py
# Note: distutils was removed from the standard library in Python 3.12;
# setuptools provides equivalent setup and Extension names.
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

# for notes on compiler flags see:
# http://docs.python.org/install/index.html
setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension("spearman_correlation_cython",
                           ["spearman_correlation_cython.pyx"])]
)
http://aimotion.blogspot.com/2011/09/high-performance-computation-with_17.html
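The extension built above is named after the Spearman correlation. As a point of reference, a pure-Python version of that computation might look like the sketch below. It is an assumption about what such a module computes, not the code from the linked post. Loops like these are typical candidates for Cython type annotations.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman_correlation(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


print(spearman_correlation([1, 2, 3, 4], [10, 20, 30, 40]))
```

Two sequences with the same ordering correlate perfectly, whatever their actual values.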
76. Benchmarks
Time elapsed (recommend 5 items)
Dataset          Old Crab (pure Python w/ dicts)    New Crab (Python w/ Scipy and Numpy)
MovieLens 100k   15.32 s                            9.56 s
http://www.grouplens.org/node/73
[Bar chart: Old Crab vs. New Crab, time axis from 0 to 16 s]
79. Why migrate ?
Old Crab ran on pure Python only
Recommendations demand heavy math calculations and lots of processing
Compatible with Numpy and Scipy libraries
High-standard, popular libraries optimized for scientific calculations in Python
Scikits projects are amazing!
Active communities, scientific conferences and updated projects (e.g. scikit-learn)
Make the Crab framework visible to the community
Join the scientific researchers and machine learning developers around the globe coding with Python to help us in this project
Be Fast and Furious
81. How are we working ?
Sprints, Online Discussions and Issues
https://github.com/muricoca/crab/wiki/UpcomingEvents
82. How are we working ?
Our Project’s Home Page
http://muricoca.github.com/crab
83. Future Releases
Planned Release 0.1
Collaborative Filtering Algorithms working, sample datasets to load and test
Planned Release 0.11
Sparse matrices and database models support
Planned Release 0.12
Slope One algorithm, new factorization techniques implemented
....
84. Join us!
1. Read our Wiki Page
https://github.com/muricoca/crab/wiki/Developer-Resources
2. Check out our current sprints and open issues
https://github.com/muricoca/crab/issues
3. Forks, Pull Requests mandatory
4. Join us at irc.freenode.net #muricoca or at our
discussion list
http://groups.google.com/group/scikit-crab