Mahout Taste Engine
- S.V.Giri
Apache Mahout: provides implementations of scalable machine learning algorithms
-- Wikipedia
Machine Learning Algorithms
- Collaborative Filtering
- Clustering
- Classification
- Dimensionality reduction
- Anomaly detection
Similarity – Number of Common Movies between users
SIM(US1, US2)= 0 , SIM(US1, US3)= 3
Threshold for Similarity
Drawback: the more movies a user watches, the more similar that user appears to everyone else
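A minimal sketch of the common-movie count, assuming hypothetical watch histories (the slide's own figure is not reproduced here, so the sets below were chosen only to match SIM(US1,US2) = 0 and SIM(US1,US3) = 3). US9 is an invented heavy viewer, added to show why a raw count favors users who watch a lot.

```python
# Hypothetical watch histories, chosen to reproduce the two SIM values above.
watched = {
    "US1": {"SW1", "SW2", "LOTR1"},
    "US2": {"Notting Hill", "Mean Girls"},
    "US3": {"SW1", "SW2", "LOTR1", "Notting Hill"},
    # Invented heavy viewer who has seen every movie.
    "US9": {"SW1", "SW2", "LOTR1", "Notting Hill", "Mean Girls"},
}


def common_count(a, b):
    """Similarity as the number of movies both users have watched."""
    return len(watched[a] & watched[b])


def neighbors(user, threshold):
    """Users whose common-movie count with `user` meets the threshold."""
    return sorted(
        other for other in watched
        if other != user and common_count(user, other) >= threshold
    )


print(common_count("US1", "US2"))  # 0
print(common_count("US1", "US3"))  # 3
print(neighbors("US2", 2))         # ['US9']: only the heavy viewer qualifies
```

With a threshold of 2, the only neighbor of US2 is the user who has watched everything, which is the bias the slide points out.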
Cosine Similarity
Tanimoto Coefficient
Pearson Correlation Coefficient
Euclidean Distance
LogLikelihood Similarity
Spearman Rank Correlation
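Two of the listed measures can be sketched in a few lines. This is an illustrative plain-Python version, not Mahout's own implementation; the function names `tanimoto` and `pearson` are chosen here for the example.

```python
from math import sqrt


def tanimoto(a, b):
    """Tanimoto coefficient of two item sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(x, y):
    """Pearson correlation of two equal-length rating lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant rating list has no spread, so correlation is undefined;
    # returning 0.0 in that case is a convention chosen for this sketch.
    return cov / (spread_x * spread_y) if spread_x and spread_y else 0.0


print(tanimoto({"SW1", "SW2"}, {"SW2", "LOTR1"}))  # one shared item out of three
print(pearson([5, 4, 5, 0, 1], [5, 5, 5, 2, 1]))   # two users with similar taste
```

Unlike a raw common-item count, the Tanimoto coefficient divides by the size of the union, so a user who has watched everything is not automatically similar to everyone.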
- A measure of similarity between 2 vectors
- Values from 0 to 1 for non-negative ratings (-1 to 1 in general)
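$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$$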
Cos(US1,US2) = (5*0 + 4*2 + 0*4 + 1*5) / (8.19*6.71) = 0.24
Cos(US1,US3) = (5*5 + 4*5 + 5*5 + 0*2 + 1*1) / (8.19*8.94) = 0.97
Cos(US2,US4) = (0*1 + 4*4 + 5*3) / (6.71*5.10) = 0.91
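The arithmetic can be checked with a short sketch. It assumes each user vector lists ratings in the order SW1, SW2, LOTR1, Notting Hill, Mean Girls, and that a missing rating is stored as 0 (which is what the products above imply).

```python
from math import sqrt

# Ratings in the assumed order: SW1, SW2, LOTR1, Notting Hill, Mean Girls.
# A missing rating is stored as 0 for this sketch.
ratings = {
    "US1": [5, 4, 5, 0, 1],
    "US2": [0, 2, 0, 4, 5],
    "US3": [5, 5, 5, 2, 1],
    "US4": [1, 0, 0, 4, 3],
}


def cosine(x, y):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # no ratings at all: treat as no similarity
    return dot / (norm_x * norm_y)


for a, b in [("US1", "US2"), ("US1", "US3"), ("US2", "US4")]:
    print(a, b, round(cosine(ratings[a], ratings[b]), 2))
# US1 US2 0.24
# US1 US3 0.97
# US2 US4 0.91
```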
Netflix Prize: October 2006 – 1 million dollar prize
Training Data Set
Users – 480,000
Movies – 18,000
Pairs – 100 Million
Ratings : 1- 5
Test Data Set
Ratings to be predicted – 1.5 Million Pairs
Metric – RMSE (root mean squared error)
Cinematch (Netflix's own system) – 0.9514
Best RMSE – 0.8563 (achieved by BellKor's Pragmatic Chaos)
Actual Values –
(us1,mv1,5)(u2,mv1,3)(u3,mv2,1)(u5,mv6,3)
Predicted Values –
(us1,mv1,4)(u2,mv1,3)(u3,mv2,2)(u5,mv6,2)
RMSE = √(((5-4)² + (3-3)² + (1-2)² + (3-2)²) / 4) = √(3/4)
= 0.87
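The same RMSE calculation as a small sketch, using the four (user, movie, rating) triples from the slide keyed by (user, movie). The helper name `rmse` is chosen here for illustration.

```python
from math import sqrt

# (user, movie) -> rating, copied from the actual and predicted triples above.
actual = {("us1", "mv1"): 5, ("u2", "mv1"): 3, ("u3", "mv2"): 1, ("u5", "mv6"): 3}
predicted = {("us1", "mv1"): 4, ("u2", "mv1"): 3, ("u3", "mv2"): 2, ("u5", "mv6"): 2}


def rmse(actual, predicted):
    """Square root of the mean squared difference over all actual pairs."""
    squared_errors = [(actual[key] - predicted[key]) ** 2 for key in actual]
    return sqrt(sum(squared_errors) / len(squared_errors))


print(round(rmse(actual, predicted), 2))  # 0.87
```

The division by the number of pairs happens inside the square root; taking the root first and dividing afterwards would give a different, smaller number.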
(US4, SW2) = ??
Average of all the other users' ratings for this movie
= (4+2+5)/3 = round(3.67) = 4
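A sketch of this baseline, assuming the three known ratings of 4, 2 and 5 come from US1, US2 and US3 for the same movie. No similarity is involved: every user who has not rated the movie gets the same prediction.

```python
# Known ratings for the movie being predicted (values from the slide).
movie_ratings = {"US1": 4, "US2": 2, "US3": 5}


def item_average(ratings_for_movie):
    """Predict a missing rating as the mean of everyone else's rating."""
    return sum(ratings_for_movie.values()) / len(ratings_for_movie)


prediction = item_average(movie_ratings)
print(round(prediction, 2))  # 3.67
print(round(prediction))     # 4
```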
Sim(US4,US1) = 0.19
Sim(US4,US2) = 0.91
Sim(US4,US3)= 0.35
US4 is most similar to US2
Hence Rating(US4,SW2) = Rating(US2,SW2) = 2
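A sketch of the nearest-neighbor step, assuming cosine similarity over the deck's example ratings with unrated movies simply left out of each user's dictionary. The function name `predict_from_nearest` is illustrative only.

```python
from math import sqrt

# user -> {movie: rating}; movies a user has not rated are omitted.
ratings = {
    "US1": {"SW1": 5, "SW2": 4, "LOTR1": 5, "Notting Hill": 0, "Mean Girls": 1},
    "US2": {"SW1": 0, "SW2": 2, "Notting Hill": 4, "Mean Girls": 5},
    "US3": {"SW1": 5, "SW2": 5, "LOTR1": 5, "Notting Hill": 2, "Mean Girls": 1},
    "US4": {"SW1": 1, "Notting Hill": 4, "Mean Girls": 3},
}


def cosine(u, v):
    """Cosine similarity of two sparse rating dictionaries."""
    dot = sum(r * v.get(movie, 0) for movie, r in u.items())
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_from_nearest(user, movie):
    """Copy the rating of the most similar user who has rated the movie."""
    candidates = [
        (cosine(ratings[user], ratings[other]), other)
        for other in ratings
        if other != user and movie in ratings[other]
    ]
    if not candidates:
        return None  # nobody else has rated this movie
    _, nearest = max(candidates)
    return ratings[nearest][movie]


print(predict_from_nearest("US4", "SW2"))  # 2, copied from US2
```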
Sim(US5,US2) = 0.955
Direct copy: Rating(US5,SW2) = Rating(US2,SW2) = 2
Avg(US2) = 3, Avg(US5) = 2
Mean-adjusted: Rating(US5,SW2) = Rating(US2,SW2) + Avg(US5) - Avg(US2) = 2 + 2 - 3 = 1
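The adjustment corrects for the fact that US5 rates lower on average than US2. A minimal sketch follows; clamping the result to the 1 to 5 rating scale is an assumption added here, not something the slide states.

```python
def mean_adjusted_rating(neighbor_rating, neighbor_avg, target_avg, lo=1, hi=5):
    """Shift the neighbor's rating by the difference in average ratings.

    The clamp to [lo, hi] is an assumption of this sketch so that the
    prediction stays on the rating scale.
    """
    raw = neighbor_rating + (target_avg - neighbor_avg)
    return max(lo, min(hi, raw))


# US2 rated SW2 as 2; Avg(US2) = 3 and Avg(US5) = 2.
print(mean_adjusted_rating(2, neighbor_avg=3, target_avg=2))  # 1
```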
Training Data Set
Users – 480,000
Movies – 18,000
Ratings – 100 Million
Sparse Matrix
Possible (user, movie) pairs – 480,000 * 18,000 = 8.64 Billion
Pairs present ≈ 1.2%
Best representation:
(Key, Value) pair
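A sketch of the density arithmetic and of a (key, value) layout, assuming a plain dictionary keyed by (user, movie). Only ratings that exist take up space; the three sample entries reuse values from the deck's example users.

```python
N_USERS, N_MOVIES, N_RATINGS = 480_000, 18_000, 100_000_000

possible_pairs = N_USERS * N_MOVIES   # 8,640,000,000 cells in a dense matrix
density = N_RATINGS / possible_pairs  # fraction of cells that hold a rating
print(f"{possible_pairs:,} possible pairs, {density:.2%} present")

# Sparse storage: (user, movie) -> rating, nothing stored for missing pairs.
ratings = {
    ("US1", "SW1"): 5,
    ("US1", "SW2"): 4,
    ("US2", "SW2"): 2,
}


def get_rating(user, movie):
    """Return the stored rating, or None when the pair was never rated."""
    return ratings.get((user, movie))


print(get_rating("US1", "SW2"))    # 4
print(get_rating("US2", "LOTR1"))  # None
```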
Similarity Matrix Computation
Time Complexity
User based Similarity :
For all user pairs: Sim(UserVector, UserVector)
Number of users = 480,000
Number of user pairs = 480,000 * 480,000 = 230 Billion (ordered pairs; about 115 Billion distinct pairs)
Number of comparisons for one similarity value = 18,000
Total computations = 230 Billion * 18,000 = 4,140 Trillion operations
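The estimate can be reproduced directly. The sketch also counts distinct (unordered) pairs, since Sim(a, b) equals Sim(b, a) for the symmetric measures listed earlier; that halving is an observation added here and does not change the order of magnitude.

```python
N_USERS, N_MOVIES = 480_000, 18_000

ordered_pairs = N_USERS * N_USERS              # every user against every user
distinct_pairs = N_USERS * (N_USERS - 1) // 2  # each pair once, no self-pairs
total_ops = ordered_pairs * N_MOVIES           # one comparison per movie

print(f"{ordered_pairs:,}")   # 230,400,000,000
print(f"{distinct_pairs:,}")  # 115,199,760,000
print(f"{total_ops:,}")       # 4,147,200,000,000,000
```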
Dimensionality Reduction:
SVD (Singular Value Decomposition)
MinHashing
Locality Sensitive Hashing (LSH)
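As a rough illustration of MinHashing, the sketch below replaces each user's set of watched movies with a short signature and estimates set overlap from the fraction of matching signature positions. It is a toy version under stated assumptions (SHA-1 with an integer seed stands in for a family of hash functions, and `num_hashes` is an arbitrary choice); it is not Mahout's implementation.

```python
import hashlib


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the smallest hash over the set.

    `items` must be non-empty, otherwise min() has nothing to compare.
    """
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_overlap(sig_a, sig_b):
    """Fraction of positions where two signatures agree.

    This approximates |A & B| / |A | B| for the underlying sets.
    """
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


us1 = minhash_signature({"SW1", "SW2", "LOTR1"})
us3 = minhash_signature({"SW1", "SW2", "LOTR1", "Notting Hill"})
us2 = minhash_signature({"Notting Hill", "Mean Girls"})

print(estimate_overlap(us1, us3))  # close to 3/4, the true overlap
print(estimate_overlap(us1, us2))  # 0.0, the sets share nothing
```

Comparing two signatures costs `num_hashes` operations regardless of how many movies exist, which is the point of reducing dimensionality before computing similarities.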
US1 SW1 5
US1 SW2 4
US1 LOTR1 5
US1 Notting Hill 0
US1 Mean Girls 1
US2 SW1 0
US2 SW2 2
US2 LOTR1 -
…
User Based – Similarity Between Users
Product Based – Similarity Between Products
Click Based – Based on user Clicks/Likes
Content Based – Based on tags, reviews, ratings.
Cos(SW1,SW2)= 0.94
Cos(SW1, Notting Hill)= 0.33
Cos(Mean Girls, Notting Hill)= 0.94
              US1  US2  US3  US4
SW1             5    0    5    1
SW2             4    2    5    -
LOTR1           5    -    5    -
Notting Hill    0    4    2    4
Mean Girls      1    5    1    3
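Item-based similarity uses the rows of this table as vectors instead of the columns. A minimal sketch, assuming "-" is stored as 0; with these values it gives about 0.94 for (SW1, SW2), 0.33 for (SW1, Notting Hill) and 0.94 for (Mean Girls, Notting Hill).

```python
from math import sqrt

# One vector per movie, in the order US1, US2, US3, US4.
# A "-" in the table is stored as 0 for this sketch.
item_vectors = {
    "SW1":          [5, 0, 5, 1],
    "SW2":          [4, 2, 5, 0],
    "LOTR1":        [5, 0, 5, 0],
    "Notting Hill": [0, 4, 2, 4],
    "Mean Girls":   [1, 5, 1, 3],
}


def cosine(x, y):
    """Cosine similarity of two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def most_similar(movie):
    """The other movie whose rating vector is closest to `movie`."""
    others = (m for m in item_vectors if m != movie)
    return max(others, key=lambda m: cosine(item_vectors[movie], item_vectors[m]))


print(round(cosine(item_vectors["SW1"], item_vectors["SW2"]), 2))  # 0.94
print(most_similar("Notting Hill"))                                # Mean Girls
```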
The Firm ∼ The Rainmaker
The Bourne Identity ∼ The Bourne Ultimatum
- Uniform Weight
- Weighted Parameters
Title                  Author         Category  Year
The Firm               John Grisham   Thriller  1991
The Bourne Identity    Robert Ludlum  Thriller  1980
The Bourne Ultimatum   Robert Ludlum  Thriller  1990
The Rainmaker          John Grisham   Thriller  1995
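One way to read "uniform weight" versus "weighted parameters" is as a weighted average of per-attribute match scores. The sketch below is built on assumptions: author and category score 1 on an exact match, year closeness decays linearly to 0 over 10 years, and the weight values are invented for illustration.

```python
books = {
    "The Firm":             {"author": "John Grisham",  "category": "Thriller", "year": 1991},
    "The Bourne Identity":  {"author": "Robert Ludlum", "category": "Thriller", "year": 1980},
    "The Bourne Ultimatum": {"author": "Robert Ludlum", "category": "Thriller", "year": 1990},
    "The Rainmaker":        {"author": "John Grisham",  "category": "Thriller", "year": 1995},
}

UNIFORM = {"author": 1, "category": 1, "year": 1}
WEIGHTED = {"author": 3, "category": 1, "year": 1}  # assumed: author matters most


def attribute_scores(a, b):
    """Per-attribute similarity in [0, 1] between two book records."""
    return {
        "author": 1.0 if a["author"] == b["author"] else 0.0,
        "category": 1.0 if a["category"] == b["category"] else 0.0,
        # Assumed closeness rule: linear decay to 0 over 10 years.
        "year": max(0.0, 1.0 - abs(a["year"] - b["year"]) / 10),
    }


def similarity(title_a, title_b, weights):
    """Weighted average of the attribute scores."""
    scores = attribute_scores(books[title_a], books[title_b])
    return sum(weights[k] * scores[k] for k in weights) / sum(weights.values())


for weights in (UNIFORM, WEIGHTED):
    print(round(similarity("The Bourne Identity", "The Bourne Ultimatum", weights), 2),
          round(similarity("The Bourne Ultimatum", "The Firm", weights), 2))
```

With uniform weights the two Bourne books are only slightly closer to each other than The Bourne Ultimatum is to The Firm (published a year later); weighting the author more heavily separates the pairs clearly.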
Problem:
- User reads a news article
- Find similar news articles
- Do not return the same news article
How to convert a document into a vector?
- Extract all the words
- Remove stop words
- Identify named entities
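A minimal sketch of the first two steps (word extraction and stop-word removal) followed by cosine similarity on word counts. The stop-word list is a tiny illustrative sample, the thresholds are arbitrary, and named-entity identification is left out because it needs an NLP library. Articles whose similarity is near 1 are dropped so the same article is not recommended back.

```python
import re
from collections import Counter
from math import sqrt

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "to", "for"}


def to_vector(text):
    """Lower-case the text, split into words, drop stop words, count the rest."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity of two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_articles(query, articles, min_sim=0.3, max_sim=0.99):
    """Articles similar to `query`, excluding near-identical copies."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(text)), text) for text in articles]
    return [text for sim, text in sorted(scored, reverse=True)
            if min_sim <= sim < max_sim]


articles = [
    "Mahout releases new recommender engine",
    "Apache Mahout recommender engine gets new release",
    "Local team wins football final",
]
print(similar_articles("Mahout releases new recommender engine", articles))
```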
New Movie
- No views (or few views)
- No similar movies
New User
- No ratings (or few ratings)
- No similar users
Thank you