Crab: A Python Framework for Building Recommender Systems

Crab
A Python Framework for Building
Recommendation Engines
PythonBrasil 2011, São Paulo, SP

Marcel Caraciolo Ricardo Caspirro Bruno Melo
@marcelcaraciolo @ricardocaspirro @brunomelo

What is Crab?

A Python framework for building recommendation engines
A Scikit module for collaborative, content-based and hybrid filtering
A Mahout alternative for Python developers :D
Open source, under the BSD license

https://github.com/muricoca/crab

When did it start?

It began one year ago
Community-driven, 4 members
Since April 2011, the open-source lab Muriçoca has incorporated it
Since April 2011, it is being rewritten as a Scikit

https://github.com/muricoca/

Knowing Scikits
Scikits are Scipy Toolkits: independent projects hosted
under a common namespace.

Scikits Image
Scikits MlabWrap
Scikits AudioLab
Scikit Learn
....

http://scikits.appspot.com/scikits

Knowing Scikits

Scikit-Learn

Machine Learning Algorithms + scientific Python packages
(Numpy, Scipy and Matplotlib)

http://scikit-learn.sourceforge.net/

Our goal: turn Crab into a Scikit and contribute
some parts of it to Scikit-learn

Why Recommendations?
The world is an over-crowded place

We are overloaded
Thousands of news articles and blog posts each day
Millions of movies, books and music tracks online
Several places, offers and events
Even with friends, we are sometimes overloaded!

We really need and consume only a few of them!

“A lot of times, people don’t know what
they want until you show it to them.”
Steve Jobs

“We are leaving the Information age, and
entering into the Recommendation age.”
Chris Anderson, from book Long Tail

Can Google help?
Yes, but only when we really know what we are looking for
But what does it mean by “interesting”?
Can Facebook help?
Yes, I tend to find my friends’ stuff interesting
But what if I have only a few friends, and what they like does not always
attract me?
Can experts help?
Yes, but it won’t scale well
And it is what they like, not what I like! Everyone gets exactly the same advice!

Recommendation Systems
Systems designed to recommend to me something I may like

Graph Representation

The current Crab

Collaborative Filtering algorithms
User-Based, Item-Based and Matrix Factorization (SVD)

Evaluation of the Recommender Algorithms
Precision, Recall, F1-Score, RMSE

Precision-Recall Charts

Collaborative Filtering

Items: Thor, Armagedon, O Vento Levou, Toy Store
Users: Marcel, Rafael, Amanda

Users like items; items liked by similar users are recommended.

The current Crab

>>> # load the dataset
>>> from crab.datasets import load_sample_movies
>>> data = load_sample_movies()
>>> data
{'DESCR': 'sample_movies data set was collected by the book called
\nProgramming the Collective Intelligence by Toby Segaran \n\nNotes\n-----
\nThis data set consists of\n\t* n ratings with (1-5) from n users to n movies.',
'data': {1: {1: 3.0, 2: 4.0, 3: 3.5, 4: 5.0, 5: 3.0},
  2: {1: 3.0, 2: 4.0, 3: 2.0, 4: 3.0, 5: 3.0, 6: 2.0},
  3: {2: 3.5, 3: 2.5, 4: 4.0, 5: 4.5, 6: 3.0},
  4: {1: 2.5, 2: 3.5, 3: 2.5, 4: 3.5, 5: 3.0, 6: 3.0},
  5: {2: 4.5, 3: 1.0, 4: 4.0},
  6: {1: 3.0, 2: 3.5, 3: 3.5, 4: 5.0, 5: 3.0, 6: 1.5},
  7: {1: 2.5, 2: 3.0, 4: 3.5, 5: 4.0}},
'item_ids': {1: 'Lady in the Water',
  2: 'Snakes on a Planet',
  3: 'You, Me and Dupree',
  4: 'Superman Returns',
  5: 'The Night Listener',
  6: 'Just My Luck'},
'user_ids': {1: 'Jack Matthews',
  2: 'Mick LaSalle',
  3: 'Claudia Puig',
  4: 'Lisa Rose',
  5: 'Toby',
  6: 'Gene Seymour',
  7: 'Michael Phillips'}}

The current Crab

>>> from crab.models import MatrixPreferenceDataModel
>>> m = MatrixPreferenceDataModel(data.data)

>>> print m
MatrixPreferenceDataModel (7 by 6)
1 2 3 4 5 ...
1 3.000000 4.000000 3.500000 5.000000 3.000000
2 3.000000 4.000000 2.000000 3.000000 3.000000
3 --- 3.500000 2.500000 4.000000 4.500000
4 2.500000 3.500000 2.500000 3.500000 3.000000
5 --- 4.500000 1.000000 4.000000 ---
6 3.000000 3.500000 3.500000 5.000000 3.000000
7 2.500000 3.000000 --- 3.500000 4.000000

The current Crab

>>> # import pairwise distance
>>> from crab.metrics.pairwise import euclidean_distances
>>> # import similarity
>>> from crab.similarities import UserSimilarity
>>> similarity = UserSimilarity(m, euclidean_distances)
>>> similarity[1]
[(1, 1.0),
(6, 0.66666666666666663),
(4, 0.34054242658316669),
(3, 0.32037724101704074),
(7, 0.32037724101704074),
(2, 0.2857142857142857),
(5, 0.2674788903885893)]
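
The scores above are consistent with a similarity of 1 / (1 + d), where d is the Euclidean distance over the items both users rated. The sketch below is plain Python and is not Crab's code; the function names are hypothetical, and whether Crab computes the value exactly this way is an assumption. On the sample ratings it reproduces the ordering printed above for user 1.

```python
from math import sqrt

# Ratings copied from the sample_movies output: {user_id: {item_id: rating}}
ratings = {
    1: {1: 3.0, 2: 4.0, 3: 3.5, 4: 5.0, 5: 3.0},
    2: {1: 3.0, 2: 4.0, 3: 2.0, 4: 3.0, 5: 3.0, 6: 2.0},
    3: {2: 3.5, 3: 2.5, 4: 4.0, 5: 4.5, 6: 3.0},
    4: {1: 2.5, 2: 3.5, 3: 2.5, 4: 3.5, 5: 3.0, 6: 3.0},
    5: {2: 4.5, 3: 1.0, 4: 4.0},
    6: {1: 3.0, 2: 3.5, 3: 3.5, 4: 5.0, 5: 3.0, 6: 1.5},
    7: {1: 2.5, 2: 3.0, 4: 3.5, 5: 4.0},
}


def euclidean_similarity(a, b):
    """Map the Euclidean distance over co-rated items into (0, 1]."""
    common = set(a) & set(b)
    if not common:
        return 0.0  # no shared items: treat the users as unrelated
    distance = sqrt(sum((a[i] - b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + distance)


def most_similar(user_id):
    """Return (other_user, similarity) pairs, most similar first."""
    scores = [(other, euclidean_similarity(ratings[user_id], prefs))
              for other, prefs in ratings.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


for other, score in most_similar(1):
    print(other, round(score, 4))
```

Users 1 and 6 differ only on item 2 (4.0 versus 3.5), so d = 0.5 and the similarity is 1 / 1.5, the 0.666... shown on the slide.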

The current Crab

>>> from crab.recommenders.knn import UserBasedRecommender
>>> recsys = UserBasedRecommender(model=m, similarity=similarity,
...                               capper=True, with_preference=True)
>>> recsys.recommend(5)
array([[ 5. , 3.45712869],
       [ 1. , 2.78857832],
       [ 6. , 2.38193068]])

>>> recsys.recommended_because(user_id=5,item_id=1)
array([[ 2. , 3. ],
       [ 1. , 3. ],
       [ 6. , 3. ],
       [ 7. , 2.5],
       [ 4. , 2.5]])
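
A user-based recommender of this kind can be sketched as a similarity-weighted average of the neighbours' ratings. This is a minimal plain-Python illustration, not Crab's implementation: the helper names (`predict`, `recommend`, `because`) are hypothetical and the 1 / (1 + distance) similarity is an assumption. On the sample data it ranks the unseen items for user 5 in the same order as above (5, 1, 6), although the scores need not match Crab's.

```python
from math import sqrt

# Ratings copied from the sample_movies output: {user_id: {item_id: rating}}
ratings = {
    1: {1: 3.0, 2: 4.0, 3: 3.5, 4: 5.0, 5: 3.0},
    2: {1: 3.0, 2: 4.0, 3: 2.0, 4: 3.0, 5: 3.0, 6: 2.0},
    3: {2: 3.5, 3: 2.5, 4: 4.0, 5: 4.5, 6: 3.0},
    4: {1: 2.5, 2: 3.5, 3: 2.5, 4: 3.5, 5: 3.0, 6: 3.0},
    5: {2: 4.5, 3: 1.0, 4: 4.0},
    6: {1: 3.0, 2: 3.5, 3: 3.5, 4: 5.0, 5: 3.0, 6: 1.5},
    7: {1: 2.5, 2: 3.0, 4: 3.5, 5: 4.0},
}


def similarity(a, b):
    """Assumed similarity: 1 / (1 + Euclidean distance over co-rated items)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    return 1.0 / (1.0 + sqrt(sum((a[i] - b[i]) ** 2 for i in common)))


def predict(user_id, item_id):
    """Similarity-weighted average of other users' ratings for item_id."""
    num = den = 0.0
    for other, prefs in ratings.items():
        if other != user_id and item_id in prefs:
            weight = similarity(ratings[user_id], prefs)
            num += weight * prefs[item_id]
            den += weight
    return num / den if den else None


def recommend(user_id):
    """Score every item the user has not rated, best first."""
    seen = set(ratings[user_id])
    all_items = {i for prefs in ratings.values() for i in prefs}
    scored = [(i, predict(user_id, i)) for i in sorted(all_items - seen)]
    scored = [(i, s) for i, s in scored if s is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def because(user_id, item_id):
    """Other users (and their ratings) that contributed to a prediction."""
    return [(other, prefs[item_id]) for other, prefs in ratings.items()
            if other != user_id and item_id in prefs]


print(recommend(5))
print(because(5, 1))
```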

The current Crab

Using REST APIs to deploy the recommender
django-piston, django-rest, django-tastypie

Crab is already in production

News recommendations for the publisher Abril!
Collecting over 10 magazines, 20 books and 100+ articles

Running on Python + Scipy + Django

Content-Based-Filtering

Easy-to-use interface

Still in development

Content Based Filtering

Items: Duro de Matar, Armagedon, O Vento Levou, Toy Store
Users: Marcel

Marcel likes an item; similar items are recommended to him.


PythonBrasil keynotes Recommender
Recommending keynotes based on a hybrid approach

Running on Python + Scipy + Django
Content-Based Filtering + Collaborative Filtering

Schedule your keynotes

Still in development

Hybrid Meta Approach

(Excerpt from a paper page shown on the slide. The page also carries three figures: Fig. 1. Meta Recommender Architecture; Fig. 2. User Reviews from Foursquare Social Network; Fig. 3. Mobile Recommender System Architecture.)

We aim at integrating the previously mentioned hybrid product recommendation approach in a mobile application so that users can benefit from useful and logical recommendations. Moreover, we aim at providing a suitable explanation for each recommendation, since the current approaches only deliver product recommendations with an overall score, without pointing out the appropriateness of such a recommendation [13]. Besides the basic information provided by the suppliers, the system will deliver the explanation by providing relevant reviews from similar users. We believe that this will increase confidence in the buying decision process and the product acceptance rate.

Since one of the goals of this work is to incorporate different data sources of user opinions and descriptions, we have adopted a meta recommendation architecture. By using a meta recommender architecture, the system would provide personalized control over the generated recommendation list. Figure 1 shows an overview of our meta recommender approach. By combining content-based filtering and collaborative filtering into a hybrid recommender system, it would use the services/products repositories and the review repository that contains the user opinions about those services. All this data can be extracted from data sources on the web, such as the location-based social network Foursquare [17], as displayed in Figure 2, and the location recommendation engine from Google: Google HotPot [18].

The content-based filtering approach will be used to filter the product/service repository, while the collaborative approach will derive the product review recommendations. In addition, we will use text mining techniques to distinguish the polarity of each user review (positive or negative). This summarized information would contribute to the computation of the product recommendation score. The final product recommendation score is computed by integrating the results of both recommenders. For now, we are considering different options for this integration; one is the symbolic data analysis (SDA) approach [19].

In our mobile product/service recommender, the user could filter some products or services and get a list of recommendations. The user can also enter his preferences or give feedback on an offered product recommendation.


Brazilian social network Atepassar.com
Educational network with more than 60,000 students and 120 video classes

Running on Python + Numpy + Scipy and Django

Backend for Recommendations
MongoDB - mongoengine

Daily Recommendations
with Explanations

Evaluating your recommender
Crab implements the most used recommender metrics.
Precision, Recall, F1-Score, RMSE

Using matplotlib
for a plotter utility

Implement new metrics

Simulations support maybe (??)
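
The metrics named above can be sketched in a few lines of plain Python. This is an illustration of the standard definitions only; the function names are hypothetical and it does not show how Crab's evaluator is implemented.

```python
from math import sqrt


def precision_recall_f1(recommended, relevant):
    """Precision, recall and F1 for one recommendation list.

    recommended: item ids the system suggested
    relevant:    item ids the user actually liked
    """
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def rmse(predicted, actual):
    """Root mean squared error between predicted and true ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared) / len(squared))


print(precision_recall_f1([1, 2, 3, 4], [2, 4, 6]))
print(rmse([3.0, 4.0], [3.0, 2.0]))
```

Precision, recall and F1 judge a ranked list of suggestions, while RMSE judges how close the predicted ratings are to the real ones, which is why an evaluator reports them as separate groups.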

>>> from crab.metrics.classes import CfEvaluator
>>> evaluator = CfEvaluator()
>>> evaluator.evaluate(recommender=recsys, metric='rmse')
{'rmse': 0.69467177857026907}
>>> evaluator.evaluate_on_split(recommender=recsys, at=2)
({'error': [{'mae': 0.345, 'nmae': 0.4567, 'rmse': 0.568},
{'mae': 0.456, 'nmae': 0.356778, 'rmse': 0.6788},
{'mae': 0.456, 'nmae': 0.356778, 'rmse': 0.6788}],
'ir': [{'f1score': 0.456, 'precision': 0.78557, 'recall':0.55677},
{'f1score': 0.64567, 'precision': 0.67865, 'recall': 0.785955},
{'f1score': 0.45070, 'precision': 0.74744, 'recall': 0.858585}]},
{'final_score': {'avg': {'f1score': 0.495955,
'mae': 0.429292,
'nmae': 0.373739,
'precision': 0.63932929,
'recall': 0.729939393,
'rmse': 0.3466868},
'stdev': {'f1score': 0.09938383 ,
'mae': 0.0593933,
'nmae': 0.03393939,
'precision': 0.0192929,
'recall': 0.031293939,
'rmse': 0.234949494}}})

Distributing the recommendation computations

Use Hadoop and Map-Reduce intensively
Investigating the Yelp mrjob framework https://github.com/pfig/mrjob

Develop the Netflix and newer state-of-the-art techniques
Matrix Factorization, Singular Value Decomposition (SVD), Boltzmann machines

Slope One is one of the most commonly used techniques.
Simple algebra: a linear predictor y = a*x + b with the slope a fixed to one, i.e. y = x + b
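
Weighted Slope One can be sketched as follows. For every pair of items it stores the average rating difference b, then predicts y = x + b from each item the user has rated, weighting by how many users rated both items. The code is plain Python with hypothetical function names and a toy dataset invented for the illustration; it is not taken from Crab.

```python
from collections import defaultdict


def slope_one_deviations(ratings):
    """Average difference (rating of j - rating of i) for each item pair.

    Returns (devs, counts) where devs[j][i] is the mean deviation and
    counts[j][i] is the number of users who rated both j and i.
    """
    diffs = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for prefs in ratings.values():
        for j, r_j in prefs.items():
            for i, r_i in prefs.items():
                if i != j:
                    diffs[j][i] += r_j - r_i
                    counts[j][i] += 1
    devs = {j: {i: diffs[j][i] / counts[j][i] for i in diffs[j]}
            for j in diffs}
    return devs, counts


def predict(ratings, devs, counts, user_id, item_id):
    """Weighted Slope One prediction of user_id's rating for item_id."""
    num = den = 0.0
    for i, r_i in ratings[user_id].items():
        if i != item_id and i in devs.get(item_id, {}):
            weight = counts[item_id][i]
            num += (r_i + devs[item_id][i]) * weight  # y = x + b
            den += weight
    return num / den if den else None


# Toy data, made up for this example
toy = {
    "john": {"A": 5.0, "B": 3.0, "C": 2.0},
    "mark": {"A": 3.0, "B": 4.0},
    "lucy": {"B": 2.0, "C": 5.0},
}
devs, counts = slope_one_deviations(toy)
print(predict(toy, devs, counts, "lucy", "A"))
```

For lucy and item A: the deviation of A over B is 0.5 (two users), and of A over C is 3.0 (one user), so the prediction is ((2 + 0.5) * 2 + (5 + 3) * 1) / 3 = 13/3.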

Cache/Parallelism with joblib
http://packages.python.org/joblib/index.html

import numpy as np
from joblib import Memory

# newer joblib releases call this argument `location`
memory = Memory(cachedir='', verbose=0)

class UserSimilarity(BaseSimilarity):
    ...

    @memory.cache
    def get_similarity(self, source_id, target_id):
        source_preferences = self.model.preferences_from_user(source_id)
        target_preferences = self.model.preferences_from_user(target_id)
        ...
        # fall back to NaN when either user has no preferences
        return (self.distance(source_preferences, target_preferences)
                if not source_preferences.shape[1] == 0
                and not target_preferences.shape[1] == 0
                else np.array([[np.nan]]))

    def get_similarities(self, source_id):
        return [(other_id, self.get_similarity(source_id, other_id))
                for other_id, v in self.model]



>>> # Without memory.cache
>>> timeit similarity.get_similarities('marcel_caraciolo')
100 loops, best of 3: 978 ms per loop

>>> # With memory.cache
>>> timeit similarity.get_similarities('marcel_caraciolo')
100 loops, best of 3: 434 ms per loop


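Investigate how to use multiprocessing and parallel packages for the similarity
computation

from joblib import Parallel, delayed
...

# each delayed(...) call is one task; results come back in input order
other_ids = [other_id for other_id, v in self.model]
scores = Parallel(n_jobs=3)(delayed(self.get_similarity)(source_id, other_id)
                            for other_id in other_ids)
return list(zip(other_ids, scores))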

Distributed Computing with mrjob
https://github.com/Yelp/mrjob


It supports Amazon’s Elastic MapReduce (EMR) service, your own Hadoop cluster, or
local runs (for testing)


"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()


Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
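
The core idea of that paper, as understood here, is to index the data by term, emit a partial product for every pair of documents sharing the term, and sum the partial products per pair. The sketch below applies the same pattern to users and items, which is an assumption about how it would map onto a recommender. It simulates the map, shuffle and reduce phases in plain Python instead of using mrjob, and all names and data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def mapper(item_id, postings):
    """postings: [(user_id, rating), ...] for one item.

    Emit one partial product per pair of users who rated this item.
    """
    for (u, r_u), (v, r_v) in combinations(sorted(postings), 2):
        yield (u, v), r_u * r_v


def reducer(pair, partial_products):
    """Sum the partial products: the dot product of the two users' ratings."""
    yield pair, sum(partial_products)


def run(item_postings):
    """Local stand-in for the MapReduce runtime (map, shuffle, reduce)."""
    grouped = defaultdict(list)
    for item_id, postings in item_postings.items():
        for key, value in mapper(item_id, postings):
            grouped[key].append(value)  # shuffle: group values by key
    results = {}
    for key, values in grouped.items():
        for pair, total in reducer(key, values):
            results[pair] = total
    return results


# Toy inverted index, made up for this example: item -> [(user, rating)]
toy_index = {
    "a": [(1, 3.0), (2, 4.0)],
    "b": [(1, 2.0), (2, 1.0), (3, 5.0)],
}
print(run(toy_index))
```

Because each mapper call only sees one item's postings, the work splits naturally across machines; only pairs of users that share at least one item are ever generated.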

Future studies with Sparse Matrices
Real datasets come with lots of empty values
http://aimotion.blogspot.com/2011/05/evaluating-recommender-systems.html

Solutions:

scipy.sparse package
Sharding operations
Matrix Factorization techniques (SVD)

Apontador Reviews Dataset

Crab implements a Matrix Factorization with Expectation Maximization algorithm
scikits.crab.svd package

Optimizations with Cython
http://cython.org/

Cython is a Python extension that lets developers annotate functions so they can be compiled to C.

http://aimotion.blogspot.com/2011/09/high-performance-computation-with_17.html

# setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

# for notes on compiler flags see:
# http://docs.python.org/install/index.html

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension("spearman_correlation_cython",
                           ["spearman_correlation_cython.pyx"])]
)

Benchmarks

Time elapsed (recommend 5 items)

Dataset          Old Crab (pure Python w/ dicts)   New Crab (Python w/ Scipy and Numpy)
MovieLens 100k   15.32 s                           9.56 s

http://www.grouplens.org/node/73

Why migrate?
Old Crab ran on pure Python only
Recommendations demand heavy math and lots of processing

Compatible with the Numpy and Scipy libraries
High-quality, popular libraries optimized for scientific computing in Python

Scikits projects are amazing!
Active communities, scientific conferences and up-to-date projects (e.g. scikit-learn)

Make the Crab framework visible to the community
Invite scientific researchers and machine learning developers around the globe who code in
Python to help us with this project

Be Fast and Furious

Why migrate ?

Numpy optimized with PyPy

2x - 48x faster

http://morepypy.blogspot.com/2011/05/numpy-in-pypy-status-and-roadmap.html

How are we working ?
Sprints, Online Discussions and Issues

https://github.com/muricoca/crab/wiki/UpcomingEvents

How are we working ?
Our Project’s Home Page

http://muricoca.github.com/crab

Future Releases
Planned Release 0.1
Collaborative Filtering Algorithms working, sample datasets to load and test

Planned Release 0.11
Sparse Matrices and Database Models support

Planned Release 0.12
Slope One Algorithm, new factorization techniques implemented

....

Join us!

1. Read our Wiki Page
https://github.com/muricoca/crab/wiki/Developer-Resources

2. Check out our current sprints and open issues
https://github.com/muricoca/crab/issues

3. Forks, Pull Requests mandatory
4. Join us at irc.freenode.net #muricoca or at our
discussion list
http://groups.google.com/group/scikit-crab

Recommended Books

Toby Segaran, Programming Collective Intelligence, O'Reilly, 2007
Satnam Alag, Collective Intelligence in Action, Manning Publications, 2009

ACM RecSys, KDD , SBSC...

Crab
A Python Framework for Building
Recommendation Engines

https://github.com/muricoca/crab

Marcel Caraciolo Ricardo Caspirro Bruno Melo
@marcelcaraciolo @ricardocaspirro @brunomelo

{marcel, ricardo,bruno}@muricoca.com
