2. Agenda
Meet Apache Mahout
Part 1: Recommendation
Part 2: Clustering
Part 3: Classification
3. Meet Apache Mahout
It is an open source machine learning library from Apache
It is scalable
It is a Java library
It can be used with Hadoop to deal with large-scale data.
4. Famous Engines
Recommender engines:
Amazon.com
Netflix
Dating sites like Líbímseti
Social networking sites like Facebook
Clustering engines:
Google News
Search engines like Clusty
Classification engines:
Spam email filters
Google’s Picasa
Optical character recognition software
Apple’s Genius feature in iTunes
17. User-based Recommender
The algorithm
for every item i that u has no preference for yet
  for every other user v that has a preference for i
    compute a similarity s between u and v
    incorporate v's preference for i, weighted by s, into a running average
return the top items, ranked by weighted average
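The steps above can be sketched in plain Java without any Mahout classes. This is only an illustration under stated assumptions: the class name UserBasedSketch, the nested-map data layout, and the pluggable similarity function are all hypothetical, and users with an undefined or non-positive similarity are simply skipped. The loops are swapped relative to the pseudocode (users outside, items inside) so that each similarity is computed once; the resulting weighted averages are the same. Map.of requires Java 9 or later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

final class UserBasedSketch {

    /** Ranks items that user u has not rated by a similarity-weighted average of other users' ratings. */
    static List<Long> recommend(long u,
                                Map<Long, Map<Long, Double>> prefs,
                                ToDoubleBiFunction<Map<Long, Double>, Map<Long, Double>> similarity,
                                int howMany) {
        Map<Long, Double> own = prefs.get(u);
        Map<Long, Double> weightedSum = new HashMap<>();
        Map<Long, Double> weightTotal = new HashMap<>();

        for (Map.Entry<Long, Map<Long, Double>> other : prefs.entrySet()) {
            if (other.getKey() == u) {
                continue; // skip the target user
            }
            double s = similarity.applyAsDouble(own, other.getValue());
            if (Double.isNaN(s) || s <= 0.0) {
                continue; // assumption: ignore undefined or non-positive similarities
            }
            for (Map.Entry<Long, Double> p : other.getValue().entrySet()) {
                if (own.containsKey(p.getKey())) {
                    continue; // u already has a preference for this item
                }
                weightedSum.merge(p.getKey(), s * p.getValue(), Double::sum);
                weightTotal.merge(p.getKey(), s, Double::sum);
            }
        }

        // Turn the running sums into weighted averages.
        Map<Long, Double> estimate = new HashMap<>();
        for (Long item : weightedSum.keySet()) {
            estimate.put(item, weightedSum.get(item) / weightTotal.get(item));
        }

        // Highest estimate first; ties broken by item ID to keep the order deterministic.
        List<Long> ranked = new ArrayList<>(estimate.keySet());
        ranked.sort((a, b) -> {
            int c = Double.compare(estimate.get(b), estimate.get(a));
            return c != 0 ? c : Long.compare(a, b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> prefs = Map.of(
                1L, Map.of(101L, 5.0, 102L, 3.0),
                2L, Map.of(101L, 5.0, 102L, 3.0, 103L, 4.0, 104L, 1.0),
                3L, Map.of(101L, 1.0, 103L, 2.0, 104L, 5.0));
        // Toy similarity: treat every other user as equally similar.
        System.out.println(recommend(1L, prefs, (a, b) -> 1.0, 2));
    }
}
```

In practice the similarity argument would be one of the metrics covered on the later slides; a constant is used in main only to keep the sketch short.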
18. Recommender Components
Data model, implemented via DataModel
User-user similarity metric, implemented via UserSimilarity
User neighborhood definition, implemented via UserNeighborhood
Recommender engine, implemented via a Recommender (here,
21. Similarity metrics
Pearson correlation–based similarity
− It is a number between –1 and 1 that measures the tendency of two series of numbers, paired up one-to-one, to move together
− Problems:
  − It doesn’t take into account the number of items in which two users’ preferences overlap, which is probably a weakness in the context of recommender engines.
  − If two users overlap on only one item, no correlation can be computed because of how the computation is defined
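Both problems are easy to see in a small computation. The sketch below is a plain-Java illustration, not Mahout's implementation: the class name PearsonSketch and the map-based data layout are hypothetical, and the correlation is taken only over items that both users have rated. When the users overlap on a single item, both variances are zero and the method returns NaN.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class PearsonSketch {

    /** Pearson correlation over the items both users rated; NaN when it is undefined. */
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        // Collect the paired-up preference values for co-rated items.
        List<double[]> pairs = new ArrayList<>();
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                pairs.add(new double[] {e.getValue(), other});
            }
        }
        int n = pairs.size();
        if (n == 0) {
            return Double.NaN; // no overlap at all
        }

        double meanA = 0.0;
        double meanB = 0.0;
        for (double[] p : pairs) {
            meanA += p[0];
            meanB += p[1];
        }
        meanA /= n;
        meanB /= n;

        double cov = 0.0;
        double varA = 0.0;
        double varB = 0.0;
        for (double[] p : pairs) {
            double da = p[0] - meanA;
            double db = p[1] - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }

        // With one co-rated item (or constant ratings) the denominator is zero.
        double denom = Math.sqrt(varA * varB);
        return denom == 0.0 ? Double.NaN : cov / denom;
    }

    public static void main(String[] args) {
        // Overlap on two items vs. overlap on a single item.
        System.out.println(pearson(Map.of(1L, 1.0, 2L, 2.0), Map.of(1L, 3.0, 2L, 5.0)));
        System.out.println(pearson(Map.of(1L, 5.0, 2L, 3.0), Map.of(1L, 4.0, 3L, 2.0)));
    }
}
```

Note that two users who agree on just two items get the same perfect score as two users who agree on many, which is the first weakness listed above.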
22. Similarity metrics
Euclidean distance similarity
− 1 / (1+euclidean distance)
Cosine measure similarity
− between –1 and 1
Tanimoto coefficient similarity
− The ratio of the size of the intersection to the size of the union of their preferred items
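The three metrics on this slide can be written down directly from their definitions. The following is a minimal plain-Java sketch, not Mahout's code: the class name SimilaritySketch is hypothetical, restricting the Euclidean and cosine computations to co-rated items is an assumption, and each method returns NaN when the value is undefined.

```java
import java.util.Map;
import java.util.Set;

final class SimilaritySketch {

    /** 1 / (1 + d), where d is the Euclidean distance over co-rated items. */
    static double euclidean(Map<Long, Double> a, Map<Long, Double> b) {
        double sumSquares = 0.0;
        int n = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue;
            }
            double d = e.getValue() - other;
            sumSquares += d * d;
            n++;
        }
        return n == 0 ? Double.NaN : 1.0 / (1.0 + Math.sqrt(sumSquares));
    }

    /** Cosine of the angle between the two preference vectors, between -1 and 1. */
    static double cosine(Map<Long, Double> a, Map<Long, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue;
            }
            dot += e.getValue() * other;
            normA += e.getValue() * e.getValue();
            normB += other * other;
        }
        double denom = Math.sqrt(normA * normB);
        return denom == 0.0 ? Double.NaN : dot / denom;
    }

    /** Size of the intersection divided by size of the union; preference values are ignored. */
    static double tanimoto(Set<Long> a, Set<Long> b) {
        int intersection = 0;
        for (Long item : a) {
            if (b.contains(item)) {
                intersection++;
            }
        }
        int union = a.size() + b.size() - intersection;
        return union == 0 ? Double.NaN : (double) intersection / union;
    }

    public static void main(String[] args) {
        Map<Long, Double> a = Map.of(1L, 1.0, 2L, 1.0);
        Map<Long, Double> b = Map.of(1L, 4.0, 2L, 5.0);
        System.out.println(euclidean(a, b));
        System.out.println(cosine(a, b));
        System.out.println(tanimoto(a.keySet(), b.keySet()));
    }
}
```

The Tanimoto coefficient only needs to know which items each user expressed a preference for, so it also works when no preference values are available.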
23. Item-based recommendation
The algorithm
for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average
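A plain-Java sketch of these steps follows. It is illustrative only: the class name ItemBasedSketch, the explicit set of all items, and the pluggable item-item similarity function are hypothetical, and item pairs with an undefined or non-positive similarity are skipped by assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

final class ItemBasedSketch {

    /** Estimated preference for every item the user has not rated yet. */
    static Map<Long, Double> estimates(Map<Long, Double> userPrefs,
                                       Set<Long> allItems,
                                       ToDoubleBiFunction<Long, Long> itemSimilarity) {
        Map<Long, Double> result = new HashMap<>();
        for (Long i : allItems) {
            if (userPrefs.containsKey(i)) {
                continue; // the user already has a preference for i
            }
            double weightedSum = 0.0;
            double weightTotal = 0.0;
            for (Map.Entry<Long, Double> p : userPrefs.entrySet()) {
                double s = itemSimilarity.applyAsDouble(i, p.getKey());
                if (Double.isNaN(s) || s <= 0.0) {
                    continue; // assumption: ignore undefined or non-positive similarities
                }
                weightedSum += s * p.getValue();
                weightTotal += s;
            }
            if (weightTotal > 0.0) {
                result.put(i, weightedSum / weightTotal);
            }
        }
        return result;
    }

    /** Top items ranked by estimated preference, ties broken by item ID. */
    static List<Long> recommend(Map<Long, Double> userPrefs,
                                Set<Long> allItems,
                                ToDoubleBiFunction<Long, Long> itemSimilarity,
                                int howMany) {
        Map<Long, Double> estimate = estimates(userPrefs, allItems, itemSimilarity);
        List<Long> ranked = new ArrayList<>(estimate.keySet());
        ranked.sort((a, b) -> {
            int c = Double.compare(estimate.get(b), estimate.get(a));
            return c != 0 ? c : Long.compare(a, b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }

    public static void main(String[] args) {
        Map<Long, Double> userPrefs = Map.of(101L, 5.0, 102L, 1.0);
        Set<Long> allItems = Set.of(101L, 102L, 103L, 104L);
        // Toy similarity: every pair of items is equally similar.
        System.out.println(recommend(userPrefs, allItems, (i, j) -> 1.0, 2));
    }
}
```

Unlike the user-based version, the similarity here is between items, so it depends only on the catalog and can be precomputed once and reused for every user.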
25. Slope-one recommender
The algorithm
for every item i the user u expresses no preference for
  for every item j that user u expresses a preference for
    find the average preference difference between j and i
    add this diff to u's preference value for j
    add this to a running average
return the top items, ranked by these averages
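The steps above can be sketched in plain Java as follows. This is an illustration under stated assumptions rather than Mahout's implementation: the class name SlopeOneSketch and the nested-map layout are hypothetical, the average differences are recomputed on the fly instead of being precomputed, and the running average is unweighted, exactly as in the pseudocode. Implementations often weight each difference by the number of users it was averaged over.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SlopeOneSketch {

    /** Estimated preference of user u for every item u has not rated. */
    static Map<Long, Double> estimates(long u, Map<Long, Map<Long, Double>> prefs) {
        Map<Long, Double> own = prefs.get(u);

        // Candidate items: everything anyone has rated, minus what u has rated.
        Set<Long> candidates = new HashSet<>();
        for (Map<Long, Double> p : prefs.values()) {
            candidates.addAll(p.keySet());
        }
        candidates.removeAll(own.keySet());

        Map<Long, Double> result = new HashMap<>();
        for (Long i : candidates) {
            double total = 0.0;
            int terms = 0;
            for (Map.Entry<Long, Double> ownPref : own.entrySet()) {
                Long j = ownPref.getKey();
                // Average of (preference for i - preference for j) over users who rated both.
                double diffSum = 0.0;
                int count = 0;
                for (Map<Long, Double> p : prefs.values()) {
                    Double vi = p.get(i);
                    Double vj = p.get(j);
                    if (vi != null && vj != null) {
                        diffSum += vi - vj;
                        count++;
                    }
                }
                if (count == 0) {
                    continue; // no user rated both i and j
                }
                total += ownPref.getValue() + diffSum / count;
                terms++;
            }
            if (terms > 0) {
                result.put(i, total / terms);
            }
        }
        return result;
    }

    /** Top items ranked by estimated preference, ties broken by item ID. */
    static List<Long> recommend(long u, Map<Long, Map<Long, Double>> prefs, int howMany) {
        Map<Long, Double> estimate = estimates(u, prefs);
        List<Long> ranked = new ArrayList<>(estimate.keySet());
        ranked.sort((a, b) -> {
            int c = Double.compare(estimate.get(b), estimate.get(a));
            return c != 0 ? c : Long.compare(a, b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> prefs = Map.of(
                1L, Map.of(101L, 5.0, 102L, 3.0, 103L, 2.0),
                2L, Map.of(101L, 3.0, 102L, 4.0),
                3L, Map.of(102L, 2.0, 103L, 5.0));
        System.out.println(estimates(3L, prefs));
    }
}
```

No similarity metric appears anywhere in this sketch: slope-one relies only on average rating differences between pairs of items.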